[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Thu Dec 15 21:35:06 CET 2011


At Thu, 15 Dec 2011 15:57:11 +0000,
Chris Webb wrote:
> 
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
> 
> > Chris Webb wrote:
> > > 
> > > If the "failed" node is just partitioned away from the rest of the
> > > cluster rather than actually failing, what's supposed to happen to
> > > the sheep instances and the qemus on it? I saw operations hang
> > > indefinitely; is that the intended behaviour?
> > 
> > Sheepdog cannot distinguish a temporarily disconnected node from a
> > failed one, so the sheep instances will abort and the qemus will hang
> > forever.
> 
> That's fine: it's a safe behaviour, and presumably I can also detect it
> on the host and reboot. I don't particularly need to be able to restart
> them automatically, just to be reasonably sure that they won't
> spontaneously restart without intervention when I automatically restart
> them elsewhere after the node vanishes!
> 
> So just to be clear, is the expected behaviour here that the sheep on the
> isolated node will exit as they can no longer continue? I think I might have

Yes, the sheep should exit in that case.
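
To illustrate the general idea, here is a rough sketch (not Sheepdog's
actual code) of a daemon on top of the corosync CPG API that exits as
soon as it finds itself cut off from the rest of the cluster.  The
confchg callback reports every membership change, and when the local
node is isolated, all the other members appear to leave at once.
EXPECTED_NODES and the group name are made up for the example, and a
real daemon would have to treat startup and deliberate cluster shrink
more carefully:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <corosync/cpg.h>

#define EXPECTED_NODES 3	/* made up: the configured cluster size */

static void confchg_cb(cpg_handle_t handle,
		       const struct cpg_name *group,
		       const struct cpg_address *members, size_t nr_members,
		       const struct cpg_address *left, size_t nr_left,
		       const struct cpg_address *joined, size_t nr_joined)
{
	/*
	 * When this node is cut off, corosync reports all the other
	 * members as having left at once.  Losing sight of a majority
	 * is far more likely to mean "we were partitioned away" than
	 * "everyone else died", so exit instead of serving stale data.
	 */
	if (nr_members < EXPECTED_NODES / 2 + 1) {
		fprintf(stderr, "only %zu of %d nodes visible, exiting\n",
			nr_members, EXPECTED_NODES);
		exit(1);
	}
}

static cpg_callbacks_t callbacks = {
	.cpg_confchg_fn = confchg_cb,
};

int main(void)
{
	cpg_handle_t handle;
	struct cpg_name group;

	strcpy(group.value, "example");
	group.length = strlen(group.value);

	if (cpg_initialize(&handle, &callbacks) != CS_OK ||
	    cpg_join(handle, &group) != CS_OK)
		return 1;

	/* Block forever, running confchg_cb on each membership change. */
	while (cpg_dispatch(handle, CS_DISPATCH_BLOCKING) == CS_OK)
		;
	return 0;
}

A supervisor on the host can then see that the process has exited and
fence or reboot the machine, instead of waiting on a hang.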

> seen a hang rather than an exit, but I'll recheck with a recent corosync as
> I think I accidentally ran my previous test with a relatively elderly one.

It is probably a bug in Sheepdog.  Is there an easy way to reproduce
it with a small cluster?  I'd like to try to test it, too.
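
(One completely untested idea, in case it helps: on a three-node
cluster, firewall the corosync traffic on one node, e.g.

  iptables -A INPUT -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -p udp --dport 5405 -j DROP

where 5405 is corosync's default mcastport, and then watch whether the
isolated sheep exits or hangs while collie commands run against the
surviving nodes.)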

Thanks,

Kazutaka


