At Sun, 11 Dec 2011 17:57:54 +0000, Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > It looks a bit difficult to handle collie commands gracefully during
> > node membership changes. I think of showing an error message to force
> > users to retry the commands, and leaving this problem as a future work.
>
> Hi Kazutaka. If we defined an extra 'temporary failure; please retry' exit
> code for collie, automated systems would be able to detect this case and
> automatically wait and retry themselves too. It's probably fine to do
> something like that and rely on the layer that's calling collie to retry
> if that's easier to implement.
>
> Just a thought, but what happens to qemu VMs accessing sheepdog block
> devices when this happens?

In that case, the gateway sheep daemon, which is localhost by default, will
retry I/O requests automatically.

> Presumably they do hang, and then restart once the node membership
> is sorted?

Yes, until new node membership is established, qemu I/Os will be blocked.
The time depends on your corosync.conf and the TCP connect timeout.

> But (also presumably), new qemu VMs that try to start during the
> change will fail?

Yes. But after reading your mail, I guess it might be better to retry
collie's I/O requests in the gateway, like qemu's. It needs only a slight
change, but makes automatic retry simple. I'll send the patch soon to check
how it works.

Thanks,

Kazutaka

> It would be nice to know that this has happened for a temporary reason
> too, but that might be harder to propagate out of qemu.
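
[Editor's note: the remark that the blocking time "depends on your corosync.conf" refers to the totem timeouts. A minimal illustrative fragment, with assumed values, not recommendations:]

```
totem {
        version: 2
        # Milliseconds to wait for the token before declaring a node lost.
        # A smaller value shortens the window during which qemu I/O blocks,
        # at the cost of more spurious membership changes.
        token: 3000
        token_retransmits_before_loss_const: 10
}
```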
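
[Editor's note: the "temporary failure; please retry" idea above can be sketched as a thin shell wrapper around collie. The exit code 125 and the retry limits here are assumptions for illustration; sheepdog had not defined such an exit code at the time of this mail.]

```shell
#!/bin/sh
# Hypothetical wrapper: retry a collie command while it reports a
# "temporary failure; please retry" exit code. TEMP_FAIL=125 is an
# assumed value, not one defined by collie.
TEMP_FAIL=125

retry_collie() {
    attempts=0
    while [ "$attempts" -lt 10 ]; do
        collie "$@"
        status=$?
        # Any exit code other than the temporary-failure one is final.
        [ "$status" -ne "$TEMP_FAIL" ] && return "$status"
        attempts=$((attempts + 1))
        sleep 1   # wait for node membership to settle before retrying
    done
    return "$status"
}
```

An automated caller would then invoke e.g. `retry_collie vdi list` and only see a failure once the retries are exhausted or a permanent error occurs.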