[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes

Tue Dec 13 13:27:03 CET 2011

MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:

> This patchset makes collie's I/Os like QEMU's ones, and adds support
> for automatic retry of collie commands.
> Chris, can you try the devel branch?

Hi. This seems to work very nicely. I made a three-node cluster, created and
started writing to a VDI, and disconnected a node. A collie vdi list on one
of the remaining nodes hung for the corosync-dependent timeout, until the
failed node was kicked out of the cluster, then sprang into life
with a correct vdi listing. collie node list showed the cluster now had two
nodes instead of three, and everything continued to work fine.

After I rebooted and reattached the disconnected node, the cluster grew back
to full size again automatically. Very nice!

If the failed node is just partitioned away from the rest of the cluster
rather than failing, what's supposed to happen to the sheep instances and
the qemus on it? I saw operations hang indefinitely, which is the intended
behaviour I imagine? The case I wondered about is where the failed node is
later reattached to the rest of the cluster. I think it continues to hang in
that case rather than recovering and allowing the local VMs to proceed?

Cheers,

Chris.