MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:

> This patchset makes collie's I/Os like QEMU's ones, and adds support
> for automatic retry of collie commands.

> Chris, can you try the devel branch?

Hi. This seems to work very nicely.

I made a three-node cluster, created and started writing to a VDI, and
disconnected a node, whereupon a collie vdi list on one of the remaining
nodes hung for the corosync-dependent timeout until the failed node was
kicked out of the cluster, then sprang into life with a correct vdi
listing. collie node list showed the cluster now had two nodes instead
of three, and everything continued to work fine.

After rebooting and reattaching the disconnected node, the cluster grew
back to the full size again automatically. Very nice!

If a node is just partitioned away from the rest of the cluster rather
than failing outright, what's supposed to happen to the sheep instances
and the qemus on it? I saw operations hang indefinitely, which I imagine
is the intended behaviour?

The case I wondered about is where the partitioned node is later
reattached to the rest of the cluster. I think it continues to hang in
that case rather than recovering and allowing the local VMs to proceed?

Cheers,

Chris.