[sheepdog-users] Cluster recover after loosing 2 devices

Tue Jun 17 16:23:12 CEST 2014

On Tue, Jun 17, 2014 at 09:55:04AM +0200, Valerio Pachera wrote:
> Hi, I'll try to summarize what happened on a cluster I manage.
> It may be interesting for other administrators.
> (It's a bit long, sorry).
> 
> This cluster has 4 nodes and is running Sheepdog daemon version
> 0.7.0_131_g88f0024.
> 
> I'm going to focus on node id0 and node id1.
> These are the mount points used by sheepdog.
> 
> node id0
> /dev/mapper/vg00-sheepdsk01    217G  211G    5,2G  98% /mnt/sheep/dsk01
> /dev/sdb1    466G  466G     76K 100% /mnt/sheep/dsk02
> /dev/sdc1    1,9T  1,1T    755G  60% /mnt/sheep/dsk03
> 
> node id1
> /dev/mapper/vg00-sheepdsk01    167G  103G     64G  62% /mnt/sheep/dsk01
> /dev/sdb1    466G  466G     12K 100% /mnt/sheep/dsk02
> /dev/sdc1    1,9T  1,1T    791G  58% /mnt/sheep/dsk03
> 
> As you can see, two disks (500G) are full.
> During the night, the disk sdc1 (2T) of node id0 started giving I/O errors
> because of hardware problems.
> Sheepdog automatically unplugged this device.
> 
> One hour later, sdc1 (2T) of node id1 also had hardware problems, started
> giving I/O errors and it got disconnected by sheepdog.
> 
> In such a case, losing two big disks on two different nodes at the same
> time is equivalent to losing 2 nodes at the same time.
> My redundancy scheme is -c 2 so, in any case, it wasn't possible to keep
> cluster consistency.
> But let's continue with the story.
> 
> As soon as the first disk failed, sheepdog on node id0 started the
> recovery and filled up sdb1 (500G).
> The same happened one hour later on node id1 (during recovery).
> 
> Notice that no node has left the cluster!
> 
> What did I do?
> 
> I stopped the cluster.
> 'dog cluster shutdown' wasn't enough, so I had to kill -9 all sheep daemons.
> 
> At this point I couldn't simply change the broken disks because I didn't
> have enough objects to restore the vdi(s).
> I was able to copy the content of the broken sdc1 of node id0 (by xfs_copy
> and xfs_repair).
> Probably some objects got lost there.
> 
> I couldn't run sheep on the same devices anyway, because the sdb disks were
> full.
> So I
> - copied all the objects of /mnt/sheep/dsk01 and /mnt/sheep/dsk02 to
> /mnt/sheep/dsk03
> - moved the metadata to /var/lib/sheepdog
> - ran sheepdog with only /var/lib/sheepdog,/mnt/sheep/dsk03
> - as soon as the cluster and the recovery started, I also ran
>   dog cluster reweight
> 
> At the end of the recovery I ran 'dog vdi check' on all vdis.
> Only 1 small vdi of 10G was missing objects.
> The other 'vdi check' printed only 'fixed replica'.
> 
> Later on, I was also able to clone the second broken disk with ddrescue
> and read its content after an intense xfs_repair.
> I found objects of 2 more vdis, but others were still missing.
> 
> Anyway, I tried to run a guest from the restored cluster but it wasn't
> starting.
> I ran the guest from a live cd (systemrescue) and ran fsck.ext4 on the
> guest's filesystem.
> It fixed ... too much ...
> After it, all the content was in lost+found.
> I tried with a second guest.
> I was able to mount the filesystem before the check, but after the fsck,
> all the content got lost.
> I attach a screenshot in a second mail.
> These guests were running during the crash.
> The filesystems of vdis that were not "running" during the crash are fine!
> 
> So the strategy of restoring objects from the disconnected/broken disks
> worked!
> I suspect the filesystem issue happened before, when the second (2T) disk
> got broken or when the two disks of 500G started giving 'disk full'.
> In such a scenario, the best thing would probably have been for the cluster
> to stop automatically.
> I see that's more difficult to detect than a node leaving the cluster.

Very nice real story, thanks!

Yes, we should do better. Currently we act as follows:

- for create request, we should return NO_SPACE err to VM (VM will translate it
  as IO ERROR)
- for write and read, we simply succeed.

We can't simply stop the cluster, because there is dirty data; we should try
our best to flush those dirty bits to the backend.

> I guess we may have 2 problems to reason about both related to multi disk:
> 
> 1) when a disk gets unplugged, the weight of the node doesn't change.
> That may lead to disk full on that node.
> I don't know if in later sheepdog versions the dog gets disconnected from the
> cluster in such a case, or if it leads to an unclear state (the node still in
> the cluster but unable to issue write requests)

Yes, I think we should

- if disks are unplugged by io error, we should reweight automatically.
- if disks are plugged/unplugged by users, we don't do auto-reweight.
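
To put rough numbers on problem 1, here is a small sketch in Python using the
df output quoted for node id0. It is only an illustration, not sheepdog's
placement logic: it assumes that an unchanged node weight means the node is
still expected to hold the same amount of data, and it rounds 1,9T and 1,1T to
1900G and 1100G. The function name is made up for this example.

```python
# Approximate sizes in GB taken from the df output for node id0
# (1,9T and 1,1T rounded to 1900 and 1100; an assumption for illustration).
NODE_ID0 = {
    "dsk01": {"size": 217, "used": 211},
    "dsk02": {"size": 466, "used": 466},
    "dsk03": {"size": 1900, "used": 1100},
}


def shortfall_after_unplug(disks, failed):
    """Return how many GB no longer fit if the node keeps its old weight.

    Assumption: with an unchanged weight the node is still assigned the
    same amount of data it held before the disk was unplugged.
    """
    remaining_capacity = sum(
        d["size"] for name, d in disks.items() if name != failed
    )
    assigned = sum(d["used"] for d in disks.values())
    return max(0, assigned - remaining_capacity)


if __name__ == "__main__":
    # Losing the 2T disk leaves 683G of capacity for about 1777G of data.
    print(shortfall_after_unplug(NODE_ID0, "dsk03"))
```

With these figures the remaining two disks are short by roughly 1T, which is
consistent with sdb1 filling up during recovery.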

> 2) The unlucky case of having several devices break down in the same period
> on different hosts.
> With redundancy -c 2 I may lose a single host or a single disk (on a
> single node).
> With -c 2:2 I may lose 2 hosts or 2 disks on 2 different hosts.
> If I lose 3 hosts, the cluster halts itself, waiting for the missing node to
> be back up.
> If 3 disks break down in the same time period (on different hosts), the
> cluster should also halt itself, or do something to keep the cluster
> consistent (waiting for a manual operation).

Try passing the '--strict' option to the sheep daemon. It tells the cluster to
stop the service if the number of nodes is less than the redundancy policy
requires.
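
The failure counts discussed above can be written down as a small check. This
is a sketch in Python, not sheepdog code: the function names are invented, and
it assumes that a plain '-c N' policy survives N-1 simultaneous losses while an
'x:y' policy survives y, which matches the -c 2 and -c 2:2 figures in this
thread.

```python
def tolerated_failures(policy):
    """Simultaneous host (or disk-on-distinct-host) losses a policy survives.

    Assumption: "N" means N full copies, so N-1 losses are survivable;
    "x:y" survives y losses, as in the 2:2 example from this thread.
    """
    if ":" in policy:
        _, second = policy.split(":")
        return int(second)
    return int(policy) - 1


def should_halt(policy, failures):
    """True when more hosts or disks failed than the policy can absorb."""
    return failures > tolerated_failures(policy)


if __name__ == "__main__":
    # The incident in this thread: -c 2 with two disks lost on two nodes.
    print(should_halt("2", 2))
    # -c 2:2 with two losses is still within the policy.
    print(should_halt("2:2", 2))
```

Treating an unplugged disk like a lost node in such a check is the behaviour
asked for in point 2.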

Thanks
Yuan