At Thu, 15 Dec 2011 15:57:11 +0000,
Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > Chris Webb wrote:
> > >
> > > If the failed node is just partitioned away from the rest of the cluster
> > > rather than failing, what's supposed to happen to the sheep instances and
> > > the qemus on it? I saw operations hang indefinitely, which is the intended
> >
> > Sheepdog cannot distinguish the temporary disconnected node from the
> > failed one, so the sheep instances will abort and qemus will hang
> > forever.
>
> That's fine, it's a safe behaviour and presumably I can also detect it on
> the host and reboot. I don't particularly want to be able to restart
> automatically, just to be reasonably sure that they won't spontaneously
> restart without intervention if I automatically restart them elsewhere when
> the node vanishes!
>
> So just to be clear, is the expected behaviour here that the sheep on the
> isolated node will exit as they can no longer continue? I think I might have

Yes, the sheep should exit in that case.

> seen a hang rather than an exit, but I'll recheck with a recent corosync as
> I think I accidentally ran my previous test with a relatively elderly one.

Probably, it is a bug of Sheepdog. Is there an easy way to reproduce
it with a small cluster? I'd like to try to test it, too.

Thanks,

Kazutaka