[sheepdog-users] Cluster recovery after losing 2 devices

Tue Jun 17 09:57:27 CEST 2014

2014-06-17 9:55 GMT+02:00 Valerio Pachera <sirio81 at gmail.com>:

> Hi, I'll try to summarize what happened on a cluster I manage.
> It may be interesting for other administrators.
> (It's a bit long, sorry).
>
> This cluster has 4 nodes and is running Sheepdog daemon version
> 0.7.0_131_g88f0024.
>
> I'm going to focus on node id0 and node id1.
> These are the mount points used by sheepdog.
>
> node id0
> /dev/mapper/vg00-sheepdsk01    217G  211G    5,2G  98% /mnt/sheep/dsk01
> /dev/sdb1    466G  466G     76K 100% /mnt/sheep/dsk02
> /dev/sdc1    1,9T  1,1T    755G  60% /mnt/sheep/dsk03
>
> node id1
> /dev/mapper/vg00-sheepdsk01    167G  103G     64G  62% /mnt/sheep/dsk01
> /dev/sdb1    466G  466G     12K 100% /mnt/sheep/dsk02
> /dev/sdc1    1,9T  1,1T    791G  58% /mnt/sheep/dsk03
>
> As you can see, two disks (500G) are full.
> During the night, the disk sdc1 (2T) of node id0 started giving I/O errors
> because of hardware problems.
> Sheepdog automatically unplugged this device.
>
> One hour later, sdc1 (2T) of node id1 also had hardware problems, started
> giving I/O errors and it got disconnected by sheepdog.
>
> In such a case, losing two big disks on two different nodes at the same
> time is equivalent to losing 2 nodes at the same time.
> My redundancy scheme is -c 2 so, in any case, it wasn't possible to keep
> the cluster consistent.
> But let's continue with the story.
>
> As soon as the first disk failed, sheepdog on node id0 started the
> recovery and filled up sdb1 (500G).
> The same happened one hour later on node id1 (during recovery).
>
> Notice that no node left the cluster!
>
> What did I do?
>
> I stopped the cluster.
> 'dog cluster shutdown' wasn't enough, so I had to kill -9 all sheep
> daemons.
>
> At this point I couldn't simply replace the broken disks because I didn't
> have enough objects to restore the vdi(s).
> I was able to copy the content of the broken sdc1 of node id0 (using
> xfs_copy and xfs_repair).
> Probably some objects got lost there.
>
> In any case, I couldn't run sheep on the same devices because the sdb
> disks were full.
> So I
> - copied all the objects of /mnt/sheep/dsk01 and /mnt/sheep/dsk02 into
> /mnt/sheep/dsk03
> - moved the metadata to /var/lib/sheepdog
> - ran sheepdog with only /var/lib/sheepdog,/mnt/sheep/dsk03
> - as soon as the cluster and the recovery started, I also ran
>   dog cluster reweight
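
The first step of the list above (gathering the objects of several disks
into one directory) can be sketched in Python. This is only an
illustration, not the commands that were actually used: the function name
consolidate_objects is hypothetical, it assumes the objects are plain
files at the top level of each disk directory, and it never overwrites a
file that already exists in the destination.

```python
import shutil
from pathlib import Path


def consolidate_objects(sources, dest):
    """Copy regular files from each source directory into dest.

    Files already present in dest are left untouched (never overwritten).
    Subdirectories are ignored in this sketch.
    Returns two lists of file names: (copied, skipped).
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for src in map(Path, sources):
        for entry in sorted(src.iterdir()):
            if not entry.is_file():
                continue  # only top-level regular files are handled here
            target = dest / entry.name
            if target.exists():
                skipped.append(entry.name)
                continue
            shutil.copy2(entry, target)  # copy2 also preserves timestamps
            copied.append(entry.name)
    return copied, skipped


# Example call with the mount points from this mail (sheep must be stopped):
# consolidate_objects(["/mnt/sheep/dsk01", "/mnt/sheep/dsk02"],
#                     "/mnt/sheep/dsk03")
```

Skipping existing files instead of overwriting them is a deliberate choice
here: if the same name shows up on two disks, a human should decide which
copy to keep.
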
>
> At the end of the recovery I ran 'dog vdi check' on all vdis.
> Only 1 small vdi of 10G was missing objects.
> The other 'vdi check' runs printed only 'fixed replica'.
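
Running the check on every vdi can be scripted. A minimal sketch, assuming
the vdi names are already known (for example taken from 'dog vdi list');
the helper name check_vdis and the vdi names in the example are
hypothetical. It only records the exit code of each 'dog vdi check' run.
Whether that exit code reflects missing objects in a given sheepdog
version is something to verify, so the printed output still has to be
read.

```python
import subprocess


def check_vdis(vdi_names, run=subprocess.run):
    """Run `dog vdi check NAME` for each vdi and collect the exit codes.

    `run` is injectable so the loop can be tried without a cluster.
    Returns a dict mapping vdi name -> exit code.
    """
    results = {}
    for name in vdi_names:
        # The output of dog goes to the terminal, as when run by hand.
        proc = run(["dog", "vdi", "check", name])
        results[name] = proc.returncode
    return results


# Example (hypothetical vdi names):
# for name, code in check_vdis(["guest01", "guest02"]).items():
#     print(name, "exit code", code)
```
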
>
> Later on, I was also able to clone the second broken disk with ddrescue
> and read its content after an intense xfs_repair.
> I found 2 more vdi objects, but others were still missing.
>
> Anyway, I tried to run a guest from the restored cluster but it wouldn't
> start.
> I booted the guest from a live CD (systemrescue) and ran fsck.ext4 on the
> guest's filesystem.
> It fixed ... too much ...
> After that, all the content was in lost+found.
> I tried with a second guest.
> I was able to mount the filesystem before the check, but after the fsck,
> all the content got lost.
> I attach a screenshot in a second mail.
> These guests were running during the crash.
> The filesystems of the vdis that were not "running" during the crash are
> fine!
>
> So the strategy of restoring objects from the disconnected/broken disks
> worked!
> I suspect the filesystem damage happened earlier, when the second (2T)
> disk broke or when the two 500G disks started reporting 'disk full'.
> In such a scenario, the best thing would probably have been for the
> cluster to stop automatically.
> I realize that's more difficult to detect than a node leaving the cluster.
>
> I guess we may have 2 problems to reason about, both related to multi
> disk:
>
> 1) When a disk gets unplugged, the weight of the node doesn't change.
> That may lead to a full disk on that node.
> I don't know if in later sheepdog versions the node gets disconnected
> from the cluster in such a case, or if it leads to an unclear state (the
> node still in the cluster but unable to serve write requests).
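
Until sheepdog handles this by itself, an external check of the mount
points can at least raise a warning before a disk fills up. A minimal
sketch using only the Python standard library; the function name
nearly_full and the 95% threshold are assumptions, and nothing here talks
to sheepdog.

```python
import shutil


def nearly_full(mount_points, threshold=0.95, disk_usage=shutil.disk_usage):
    """Return {path: used_fraction} for paths at or above the threshold.

    `disk_usage` is injectable for testing; by default it queries the
    real filesystem through shutil.disk_usage.
    """
    flagged = {}
    for path in mount_points:
        usage = disk_usage(path)
        if usage.total == 0:
            continue  # nothing meaningful to report for this path
        fraction = usage.used / usage.total
        if fraction >= threshold:
            flagged[path] = fraction
    return flagged


# Example with the mount points from this mail:
# for path, frac in nearly_full(["/mnt/sheep/dsk01", "/mnt/sheep/dsk02",
#                                "/mnt/sheep/dsk03"]).items():
#     print("WARNING: %s is %.0f%% full" % (path, frac * 100))
```
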
>
> 2) The unlucky case of several devices breaking down in the same period
> on different hosts.
> With redundancy -c 2 I may lose a single host or a single disk (on a
> single node).
> With -c 2:2 I may lose 2 hosts or 2 disks on 2 different hosts.
> If I lose 3 hosts, the cluster halts itself, waiting for the missing
> nodes to be back up.
> If 3 disks break down in the same time period (on different hosts), the
> cluster should also halt itself, or do something to keep the cluster
> consistent (waiting for a manual operation).
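
The rule of thumb in point 2 can be written down as a small function. It
is only a sketch of the reasoning above, not sheepdog code: the names
tolerated_failures and survives are hypothetical, and it assumes that a
plain copy count N tolerates N-1 simultaneous failures while an X:Y
policy tolerates Y of them.

```python
def tolerated_failures(copies_option):
    """Number of simultaneous host/disk failures a -c policy can absorb.

    "2"   -> 1  (2 replicas: one may be lost)
    "2:2" -> 2  (2 data + 2 parity strips: two may be lost)
    Raises ValueError on malformed or non-positive values.
    """
    if ":" in copies_option:
        data, parity = (int(part) for part in copies_option.split(":"))
        if data < 1 or parity < 1:
            raise ValueError("data and parity counts must be positive")
        return parity
    copies = int(copies_option)
    if copies < 1:
        raise ValueError("copy count must be positive")
    return copies - 1


def survives(copies_option, simultaneous_failures):
    """True if the policy still has at least one intact copy of all data."""
    return simultaneous_failures <= tolerated_failures(copies_option)


# The incident in this mail: -c 2 with two 2T disks lost on different nodes.
# survives("2", 2) is False, which matches the lost objects described above.
```
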
>
> Thank you.
>
> Valerio.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fsck.jpeg
Type: image/jpeg
Size: 55400 bytes
Desc: not available
URL: <http://lists.wpkg.org/pipermail/sheepdog-users/attachments/20140617/c1863675/attachment-0005.jpeg>