[sheepdog-users] Cluster recovery after losing 2 devices
Liu Yuan
namei.unix at gmail.com
Tue Jun 17 16:34:14 CEST 2014
On Tue, Jun 17, 2014 at 10:23:12PM +0800, Liu Yuan wrote:
> On Tue, Jun 17, 2014 at 09:55:04AM +0200, Valerio Pachera wrote:
> > Hi, I'll try to summarize what happened on a cluster I manage.
> > It may be interesting for other administrators.
> > (It's a bit long, sorry).
> >
> > This cluster has 4 nodes and is running Sheepdog daemon version
> > 0.7.0_131_g88f0024.
> >
> > I'm going to focus on node id0 and node id1.
> > These are the mount points used by sheepdog.
> >
> > node id0
> > /dev/mapper/vg00-sheepdsk01 217G 211G 5,2G 98% /mnt/sheep/dsk01
> > /dev/sdb1 466G 466G 76K 100% /mnt/sheep/dsk02
> > /dev/sdc1 1,9T 1,1T 755G 60% /mnt/sheep/dsk03
> >
> > node id1
> > /dev/mapper/vg00-sheepdsk01 167G 103G 64G 62% /mnt/sheep/dsk01
> > /dev/sdb1 466G 466G 12K 100% /mnt/sheep/dsk02
> > /dev/sdc1 1,9T 1,1T 791G 58% /mnt/sheep/dsk03
> >
> > As you can see, the two 500G disks (sdb1 on each node) are full.
> > During the night, the disk sdc1 (2T) of node id0 started giving I/O errors
> > because of hardware problems.
> > Sheepdog automatically unplugged this device.
> >
> > One hour later, sdc1 (2T) of node id1 also developed hardware problems,
> > started giving I/O errors, and was disconnected by sheepdog.
> >
> > In such a case, losing two big disks on two different nodes at the same
> > time is equivalent to losing 2 nodes at the same time.
> > My redundancy scheme is -c 2, so in any case it wasn't possible to keep
> > the cluster consistent.
> > But let's continue with the story.
> >
> > As soon as the first disk failed, sheepdog on node id0 started the
> > recovery and filled up sdb1 (500G).
> > The same happened one hour later on node id1 (during its recovery).
> >
> > Notice that no node has left the cluster!
> >
> > What did I do?
> >
> > I stopped the cluster.
> > 'dog cluster shutdown' wasn't enough, so I had to kill -9 all sheep daemons.
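> >
> > In practice that came down to something like this ('pkill -9 sheep' is my
> > shorthand for killing every sheep process):
> >
> >   dog cluster shutdown
> >   # the daemons didn't all exit, so force-kill them
> >   pkill -9 sheep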
> >
> > At this point I couldn't simply replace the broken disks because I didn't
> > have enough objects to restore the vdis.
> > I was able to copy the content of the broken sdc1 of node id0 (by xfs_copy
> > and xfs_repair).
> > Probably some objects got lost there.
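> >
> > Roughly like this, assuming a spare disk at least as large (the target
> > device name is illustrative, not from my setup):
> >
> >   # duplicate the XFS filesystem from the failing disk onto a spare
> >   xfs_copy /dev/sdc1 /dev/sdX1
> >   # repair the copy, never the failing original
> >   xfs_repair /dev/sdX1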
> >
> > Anyway, I couldn't run sheep on the same devices because sdb was full.
> > So I
> > - copied all the objects of /mnt/sheep/dsk01 and /mnt/sheep/dsk02 into
> > /mnt/sheep/dsk03
> > - moved the metadata into /var/lib/sheepdog
> > - ran sheep with only /var/lib/sheepdog,/mnt/sheep/dsk03
> > - as soon as the cluster and the recovery started, I also ran
> > dog cluster reweight (sketched below)
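> >
> > In shell terms it was roughly the following (the copy commands are my
> > reconstruction; the on-disk layout may differ between sheepdog versions):
> >
> >   # with all sheep daemons stopped, consolidate the object files
> >   cp -a /mnt/sheep/dsk01/. /mnt/sheep/dsk03/
> >   cp -a /mnt/sheep/dsk02/. /mnt/sheep/dsk03/
> >   # restart with the metadata dir plus the single surviving data disk
> >   sheep /var/lib/sheepdog,/mnt/sheep/dsk03
> >   # once recovery is under way, rebalance data across the new weights
> >   dog cluster reweight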
> >
> > At the end of the recovery I ran 'dog vdi check' on every vdi.
> > Only 1 small vdi of 10G was missing objects.
> > The other 'vdi check' runs printed only 'fixed replica'.
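> >
> > For example, to check them all in one go (assuming the second column of
> > the raw listing is the vdi name):
> >
> >   for v in $(dog vdi list -r | awk '{print $2}'); do
> >           dog vdi check "$v"
> >   done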
> >
> > Later on, I was also able to clone the second broken disk with ddrescue
> > and read its content after an intensive xfs_repair.
> > I found objects of 2 more vdis, but others were still missing.
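> >
> > Again a sketch with illustrative device names (the map file lets ddrescue
> > resume an interrupted rescue):
> >
> >   # copy whatever is still readable, retrying the bad areas
> >   ddrescue -d /dev/sdc1 /dev/sdX1 sdc1.map
> >   xfs_repair /dev/sdX1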
> >
> > Anyway, I tried to run a guest from the restored cluster but it wasn't
> > starting.
> > I ran the guest from a live CD (SystemRescue) and ran fsck.ext4 on the
> > guest's filesystem.
> > It fixed ... too much ...
> > After it, all the content was in lost+found.
> > I tried with a second guest.
> > I was able to mount the filesystem before the check, but after the fsck
> > all the content got lost.
> > I attach a screenshot in a second mail.
> > These guests were running during the crash.
> > The filesystems of the vdis that were not "running" during the crash are fine!
> >
> > So the strategy of restoring objects from the disconnected/broken disks
> > worked!
> > I suspect the filesystem damage happened earlier, when the second (2T) disk
> > broke, or when the two 500G disks started reporting 'disk full'.
> > In such a scenario, the best thing would probably have been for the cluster
> > to stop automatically.
> > I see that's harder to detect than a node leaving the cluster.
>
> Very nice real-world story, thanks!
>
> Yes, we should do better. Currently we act as follows:
>
> - for a create request, we return a NO_SPACE error to the VM (the VM will
> translate it into an IO ERROR)
> - for write and read requests, we simply succeed.
>
> We can't simply stop the cluster, because there is dirty data; we should try
> our best to flush those dirty bits to the backend.
I was wrong. For the strict case, we only allow read requests from the VM if
there are not enough nodes. The strict option means:
1. For write requests from the guest (which may map to write or create sheep
requests), we return HALT to the VM if there are not enough nodes, to make sure
a write always gets full redundancy.
If --strict is not set, sheep will try its best to make both reads and writes
succeed, in the hope that we can restore the redundancy later.
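A minimal usage sketch, assuming --strict is passed at sheep start-up (the
disk paths are just Valerio's final layout):

  sheep --strict /var/lib/sheepdog,/mnt/sheep/dsk03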
Thanks
Yuan