[Sheepdog] Sheepdog reliability

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Thu Nov 18 09:45:57 CET 2010


Hi,

At Wed, 17 Nov 2010 14:44:34 +0100,
Dennis Jacobfeuerborn wrote:
> 
> Hi,
> I've been following Sheepdog for a while and now that patches are being 
> sent to include it in libvirt I want to start testing it. One question I 
> have is how I can ensure the reliability of the Sheepdog cluster as a 
> whole. Specifically I'm looking at two cases:
> 
> Lets assume a setup with 4 nodes and a redundancy of 3.
> 
> If one node fails what are the effects both for the cluster and the clients 
> (e.g. potential i/o delays, messages, etc.)

Until Sheepdog establishes a new round of cluster membership, it
suspends all requests to data objects, so client I/O is blocked during
that window.  How long this takes depends on the totem/consensus value
in corosync.conf; the default is 1200 ms.  If you want to run Sheepdog
with a large number of nodes, this value needs to be larger, so the
delay also becomes longer.
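For reference, a minimal totem section of corosync.conf looks like the
sketch below; the numbers are only the defaults, not a recommendation
(corosync derives the 1200 ms consensus timeout as 1.2 times the token
timeout):

```
totem {
        version: 2
        # token: how long corosync waits before declaring a node lost
        token: 1000
        # consensus: how long to wait for consensus before starting a
        # new round of membership; must be at least 1.2 * token
        consensus: 1200
}
```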

> and what needs to be done once 
> the node is replaced to get the cluster back into a healthy state?

All you need to do is start the sheep daemon again.  If that doesn't
work, please let me know.
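For example, on the replaced node you would just run something like
the following (the store directory /var/lib/sheepdog is an assumption
here; use whatever path your other nodes are using):

```shell
# Start the sheep daemon again, pointing it at the node's store
# directory.  Sheepdog will rejoin the cluster and re-replicate
# objects as needed.
sheep /var/lib/sheepdog
```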

> 
> What happens if *all* nodes fail due to e.g. a power outage? What needs to 
> be done to bring the cluster back up again?

If no VM is running when all nodes fail, all you need to do is start
all the sheep daemons again.  However, if I/O requests are in flight
when all nodes fail, Sheepdog needs to recover the objects whose
replicas are in inconsistent states (and that is not implemented
yet).

> 
> Since one of the goals of Sheepdog is to make the storage highly available 
> I'm trying to think of the scenarios that the cluster needs to be able to 
> handle.

Thanks!  That helps us a lot.


Kazutaka


> 
> Regards,
>    Dennis
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
