At Sun, 11 Dec 2011 17:57:54 +0000, Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > It looks a bit difficult to handle collie commands gracefully during
> > node membership changes. I think of showing an error message to force
> > users to retry the commands, and leaving this problem as a future work.
>
> Hi Kazutaka. If we defined an extra 'temporary failure; please retry' exit
> code for collie, automated systems would be able to detect this case and
> automatically wait and retry themselves too. It's probably fine to do
> something like that and rely on the layer that's calling collie to retry
> if that's easier to implement.
>
> Just a thought, but what happens to qemu VMs accessing sheepdog block
> devices when this happens?

In that case, the gateway sheep daemon, which is localhost by default, will
retry I/O requests automatically.

> Presumably they do hang, and then restart once the node membership
> is sorted?

Yes, until new node membership is established, qemu I/Os will be blocked.
The time depends on your corosync.conf and the TCP connect timeout.

> But (also presumably), new qemu VMs that try to start during the
> change will fail?

Yes. But after reading your mail, I guess it might be better to retry
collie's I/O requests in the gateway, like qemu's. It needs only a slight
change, but makes automatic retry simple. I'll send the patch soon to check
how it works.

Thanks,

Kazutaka

> It would be nice to know that this has happened for a temporary reason
> too, but that might be harder to propagate out of qemu.
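
[Editor's note: the remark that the blocking time "depends on your corosync.conf" refers to the totem timeouts. A minimal illustrative fragment, with assumed values, not recommendations:]

```
totem {
        version: 2
        # Milliseconds to wait for the token before declaring a node lost.
        # A smaller value shortens the window during which qemu I/O blocks,
        # at the cost of more spurious membership changes.
        token: 3000
        token_retransmits_before_loss_const: 10
}
```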
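
[Editor's note: the "temporary failure; please retry" idea above can be sketched as a thin shell wrapper around collie. The exit code 125 and the retry limits here are assumptions for illustration; sheepdog had not defined such an exit code at the time of this mail.]

```shell
#!/bin/sh
# Hypothetical wrapper: retry a collie command while it reports a
# "temporary failure; please retry" exit code. TEMP_FAIL=125 is an
# assumed value, not one defined by collie.
TEMP_FAIL=125

retry_collie() {
    attempts=0
    while [ "$attempts" -lt 10 ]; do
        collie "$@"
        status=$?
        # Any exit code other than the temporary-failure one is final.
        [ "$status" -ne "$TEMP_FAIL" ] && return "$status"
        attempts=$((attempts + 1))
        sleep 1   # wait for node membership to settle before retrying
    done
    return "$status"
}
```

An automated caller would then invoke e.g. `retry_collie vdi list` and only see a failure once the retries are exhausted or a permanent error occurs.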