[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes

Thu Dec 15 16:57:11 CET 2011

MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:

> Chris Webb wrote:
> > 
> > If the failed node is just partitioned away from the rest of the cluster
> > rather than failing, what's supposed to happen to the sheep instances and
> > the qemus on it? I saw operations hang indefinitely, which is the intended
> 
> Sheepdog cannot distinguish a temporarily disconnected node from a
> failed one, so the sheep instances will abort and qemus will hang
> forever.

That's fine: it's safe behaviour, and presumably I can also detect it on
the host and reboot. I don't particularly need them to be able to restart
automatically; I just want to be reasonably sure that they won't
spontaneously restart without intervention if I have automatically restarted
them elsewhere when the node vanished!

So just to be clear, is the expected behaviour here that the sheep on the
isolated node will exit as they can no longer continue? I think I might have
seen a hang rather than an exit, but I'll recheck with a recent corosync as
I think I accidentally ran my previous test with a relatively elderly one.

Best wishes,

Chris.