[Sheepdog] Sheepdog reliability

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Sun Nov 21 18:14:34 CET 2010


At Thu, 18 Nov 2010 15:48:01 +0100,
Dennis Jacobfeuerborn wrote:
> 
> On 11/18/2010 09:45 AM, MORITA Kazutaka wrote:
> > Hi,
> >
> > At Wed, 17 Nov 2010 14:44:34 +0100,
> > Dennis Jacobfeuerborn wrote:
> >>
> >> Hi,
> >> I've been following Sheepdog for a while and now that patches are being
> >> sent to include it in libvirt I want to start testing it. One question I
> >> have is how I can ensure the reliability of the Sheepdog cluster as a
> >> whole. Specifically I'm looking at two cases:
> >>
> >> Let's assume a setup with 4 nodes and a redundancy of 3.
> >>
> >> If one node fails, what are the effects for both the cluster and the
> >> clients (e.g. potential I/O delays, messages, etc.)?
> >
> > Until Sheepdog starts a new round of membership, the cluster suspends
> > any requests to data objects, so client I/O stalls.  How long the wait
> > lasts depends on the value of totem/consensus in corosync.conf; the
> > default is 1200 ms.  If you want to run Sheepdog with a large number
> > of nodes, that value needs to be larger, and the delay grows
> > accordingly.
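For reference, the setting lives in the totem section of corosync.conf;
the timeouts and addresses below are just an example, not a
recommendation:

    totem {
        version: 2
        # token loss timeout in ms; consensus defaults to 1.2 * token
        token: 1000
        # how long to wait for consensus before forming a new membership;
        # this bounds the I/O suspension described above
        consensus: 1200
        interface {
            ringnumber: 0
            # placeholder network and multicast addresses
            bindnetaddr: 192.168.0.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }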
> 
> Wouldn't it be better to decouple the client requests from these cluster 
> timings? This looks like an unnecessary bottleneck that gets worse as the 
> cluster gets larger. Why not let the client request have its own timeout 
> of, say, 1 second, and if no response arrives, retry the request against 
> one of the nodes that carries a redundant copy of the blocks?
> That way a node failure would have less of an impact on the applications 
> and delays for application requests would become independent of the 
> cluster size.

The delay is necessary for the clients to preserve Sheepdog's strong
consistency.  Sheepdog stores objects together with the version number
of the node membership, and that number must be the same across all
replicas.  If one of the target nodes still has an old membership, we
cannot guarantee that, so the clients need to wait until all target
nodes have reached the same membership round.
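For illustration only (the structure and field names below are made up
for this example, not taken from Sheepdog's code), the rule boils down
to something like this:

    /* Sketch: forward an I/O request only when every replica target
     * reports the same membership version as the cluster; otherwise the
     * request stays suspended until the next membership round completes. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct node {
        uint32_t epoch;  /* membership version this node has reached */
    };

    static bool can_forward(const struct node *replicas, int nr,
                            uint32_t cluster_epoch)
    {
        for (int i = 0; i < nr; i++)
            if (replicas[i].epoch != cluster_epoch)
                return false;  /* a target is behind: keep the I/O suspended */
        return true;
    }

    int main(void)
    {
        /* two replicas are at membership version 5, one is still at 4 */
        struct node replicas[3] = { { 5 }, { 5 }, { 4 } };

        if (can_forward(replicas, 3, 5))
            printf("forward I/O\n");
        else
            printf("suspend I/O until all replicas reach version 5\n");
        return 0;
    }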


> 
> >> and what needs to be done once
> >> the node is replaced to get the cluster back into a healthy state?
> >
> > All you need to do is start a sheep daemon again.  If it
> > doesn't work, please let me know.
> 
> So when a node goes down, will the cluster automatically copy all of the 
> lost blocks to other nodes to re-establish the redundancy requirement of 3 
> copies?

Yes.

> 
> If a new node is added to the cluster, will it stay empty or will the 
> cluster rebalance the blocks according to some load criterion?

The data will be rebalanced automatically according to the consistent
hashing algorithm.
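To picture what "rebalanced automatically" means, here is a toy sketch
of consistent hashing; the hash function, node IDs, and ring size are
invented for the example and are not Sheepdog's actual placement code:

    /* Toy consistent-hashing ring: nodes and object IDs are hashed onto
     * the same ring, and an object is stored on the first node found
     * clockwise from its hash.  When a node is added, an object moves only
     * if the new node now sits between the object and its previous owner,
     * so only a fraction of the data has to be copied to the newcomer. */
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 64  /* tiny ring, just for illustration */

    static uint32_t toy_hash(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x45d9f3bU;  /* stand-in for the real hash function */
        x ^= x >> 16;
        return x % RING_SLOTS;
    }

    /* index of the node that stores the first copy of an object */
    static int primary_node(uint32_t oid, const uint32_t *node_ids, int nr)
    {
        uint32_t pos = toy_hash(oid);
        uint32_t best_dist = RING_SLOTS;
        int best = -1;

        for (int i = 0; i < nr; i++) {
            /* clockwise distance from the object to this node */
            uint32_t dist = (toy_hash(node_ids[i]) - pos + RING_SLOTS)
                            % RING_SLOTS;
            if (dist < best_dist) {
                best_dist = dist;
                best = i;
            }
        }
        return best;
    }

    int main(void)
    {
        uint32_t four[] = { 11, 22, 33, 44 };      /* 4-node cluster */
        uint32_t five[] = { 11, 22, 33, 44, 55 };  /* after adding a node */

        for (uint32_t oid = 1; oid <= 8; oid++)
            printf("object %u: node %d -> node %d\n", oid,
                   primary_node(oid, four, 4), primary_node(oid, five, 5));
        return 0;
    }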

> 
> >>
> >> What happens if *all* nodes fail due to e.g. a power outage? What needs to
> >> be done to bring the cluster back up again?
> >
> > If no VM is running when all nodes fail, all you need to do is
> > start all the sheep daemons again.  However, if I/O requests are
> > being processed when all nodes fail, Sheepdog needs to recover the
> > objects whose replicas are in inconsistent states (and this is not
> > implemented yet).
> >
> 
> What is the timeframe for this implementation? After all, this has to be 
> implemented before Sheepdog can go into productive use.

I'd like to release the next version at the end of this year and
include this feature in it.  I'll announce the next release plan
next weekend.


> 
> Regards,
>    Dennis
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog


