[sheepdog-users] Cluster recover after losing 2 devices

Tue Jun 17 18:34:10 CEST 2014

If you reweight automatically whenever a disk is unplugged because of an I/O error, rather than keeping the current internal rebalance, you effectively put an upper limit on the size of any one cluster.  A reweight forces a cluster-wide recovery, during which you are vulnerable to further loss.  Losing a drive is more common than losing an entire server.  A blind reweight on unplug should perhaps be an option, but not the default.

Recall that vdis are split into 4 MB files that are balanced across servers based on stable hashing and weighting.  In your case, with 4 servers and a '-c 2', you can survive a single failure _at a time_ down to 2 total machines with strict, or down to 1 machine with unsafe, assuming enough storage is available.  When you lose two machines at a time you lose access to the vdi blocks whose copies were all on those two machines, which is some subset of all the blocks making up all vdis.  Due to zoning, no machine has more than a single copy of any particular vdi block.  With '-c 3', even with two simultaneously failed machines, there is still a third machine holding a copy of any given block, meaning the vdi is reconstructible.  That doesn't prevent three machines from failing and causing data loss, any more than RAID 6 makes you immune to drive loss: it only protects you against up to 2 failed drives.
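To put a rough number on that, here is a back-of-the-envelope sketch.  It assumes every block's copies land on distinct machines chosen uniformly at random, which is only an approximation of sheepdog's hash- and weight-based placement; the function name and parameters are made up for illustration.

```python
from math import comb


def fraction_lost(nodes: int, copies: int, failed: int) -> float:
    """Fraction of blocks whose every copy sat on a failed machine.

    Assumption: each block's copies are placed on `copies` distinct
    machines chosen uniformly at random out of `nodes`.
    """
    if failed < copies:
        # At least one copy always survives on a healthy machine.
        return 0.0
    # Placements that fall entirely inside the failed set, over all placements.
    return comb(failed, copies) / comb(nodes, copies)


# The 4-machine cluster from this thread, two machines failing at once.
for copies in (2, 3):
    lost = fraction_lost(nodes=4, copies=copies, failed=2)
    print(f"-c {copies}: {lost:.1%} of blocks unrecoverable")
```

Under that assumption, '-c 2' on 4 machines loses about 1 block in 6 when two machines fail together, while '-c 3' loses none.  Since a vdi is made of many blocks, a loss of that size will likely touch most vdis of any real size.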

That's why erasure coding is an exciting addition to sheepdog 0.8: it gives more options short of straight copies.  The larger the operational cluster, the more redundancy you're able to use and therefore the better your odds of avoiding a cluster-killing event.

The question is how far sheepdog is expected to scale.  The larger the cluster, the more frequent failures become, by sheer numbers.  In a large cluster today, a single failed drive doesn't trigger a cluster-wide recovery, but COULD cause that node to run out of disk space when the inaccessible vdi blocks are copied over.  If you force a reweight on I/O error, you will be forcing much more frequent recovery events across a large cluster.  For significant amounts of data, that means a large performance impact for significant periods of time.  All for the case of a -c 2 cluster over 4 nodes running out of disk space...
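A quick sketch of the "pure numbers" argument.  The 3% annual drive failure rate and 8 drives per node are assumptions picked only for illustration, not measurements from any sheepdog cluster.

```python
def expected_failures_per_year(drives: int, annual_failure_rate: float) -> float:
    """Expected number of drive failures per year, assuming independent drives."""
    return drives * annual_failure_rate


ASSUMED_AFR = 0.03          # hypothetical annual failure rate per drive
ASSUMED_DRIVES_PER_NODE = 8  # hypothetical node layout

for nodes in (4, 40, 400):
    drives = nodes * ASSUMED_DRIVES_PER_NODE
    failures = expected_failures_per_year(drives, ASSUMED_AFR)
    # With auto-reweight on I/O error, each of these failures would
    # become a cluster-wide recovery instead of a node-local rebalance.
    print(f"{nodes:4d} nodes, {drives:5d} drives: ~{failures:.1f} drive failures/year")
```

The count grows linearly with cluster size, so whatever a cluster-wide recovery costs, a large cluster would pay it far more often.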

On 06/17/2014 12:07 PM, Valerio Pachera wrote:

2014-06-17 16:23 GMT+02:00 Liu Yuan <namei.unix at gmail.com>:

- if disks are unplugged by io error, we should reweight automatically.

Yes, this has to be done.

- if disks are plugged/unplugged by users, we don't do auto-reweight.

I also have another idea about it, but I'll discuss it on a new topic.

Try passing the '--strict' option to the sheep daemon. It tells the cluster to stop the
service if the number of nodes is less than the required redundancy policy.

I may try, but I don't think it's going to work because, as you say,
"...if node number is less..."
not
"...if the number of lost devices is equal to or greater than the redundancy policy".

Take my cluster with -c 2 as an example.
If 2 hosts had gone down at the same time, what would have happened? (Option --strict was not used)
If 2 devices go down at the same time (as it happened), does sheepdog react in the same way?
In both cases we don't have enough objects.

Generalizing, the problem is not solved by using -c 3; it's just more unlikely to happen.

Obviously, it doesn't matter if I lose 1, 2, 3 or all devices on a single host at the same time.
It's very different if it happens on multiple hosts.

(Sorry if I repeat the same concept in different ways).
