[sheepdog-users] Cluster recovery after losing 2 devices
Liu Yuan
namei.unix at gmail.com
Tue Jun 17 16:34:14 CEST 2014
On Tue, Jun 17, 2014 at 10:23:12PM +0800, Liu Yuan wrote:
> On Tue, Jun 17, 2014 at 09:55:04AM +0200, Valerio Pachera wrote:
> > Hi, I'll try to summarize what happened on a cluster I manage.
> > It may be interesting for other administrators.
> > (It's a bit long, sorry).
> >
> > This cluster has 4 nodes and is running Sheepdog daemon version
> > 0.7.0_131_g88f0024.
> >
> > I'm going to focus on node id0 and node id1.
> > These are the mount points used by sheepdog.
> >
> > node id0
> > /dev/mapper/vg00-sheepdsk01 217G 211G 5,2G 98% /mnt/sheep/dsk01
> > /dev/sdb1 466G 466G 76K 100% /mnt/sheep/dsk02
> > /dev/sdc1 1,9T 1,1T 755G 60% /mnt/sheep/dsk03
> >
> > node id1
> > /dev/mapper/vg00-sheepdsk01 167G 103G 64G 62% /mnt/sheep/dsk01
> > /dev/sdb1 466G 466G 12K 100% /mnt/sheep/dsk02
> > /dev/sdc1 1,9T 1,1T 791G 58% /mnt/sheep/dsk03
> >
> > As you can see, the two 500G disks (sdb1 on each node) are full.
> > During the night, the disk sdc1 (2T) of node id0 started giving I/O errors
> > because of hardware problems.
> > Sheepdog automatically unplugged this device.
> >
> > One hour later, sdc1 (2T) of node id1 also developed hardware problems,
> > started giving I/O errors, and was disconnected by sheepdog.
> >
> > In such a case, losing two big disks on two different nodes at the same
> > time is equivalent to losing 2 nodes at the same time.
> > My redundancy scheme is -c 2, so in any case it wasn't possible to keep
> > the cluster consistent.
> > But let's continue with the story.
> >
> > As soon as the first disk failed, sheepdog on node id0 started the
> > recovery and filled up sdb1 (500G).
> > The same happened one hour later on node id1 (during its recovery).
> >
> > Notice that no node has left the cluster!
> >
> > What did I do?
> >
> > I stopped the cluster.
> > 'dog cluster shutdown' wasn't enough, so I had to kill -9 all sheep daemons.
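> >
> > In practice that came down to something like this ('pkill -9 sheep' is my
> > shorthand for killing every sheep process):
> >
> >   dog cluster shutdown
> >   # the daemons didn't all exit, so force-kill them
> >   pkill -9 sheep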
> >
> > At this point I couldn't simply replace the broken disks because I didn't
> > have enough objects to restore the vdis.
> > I was able to copy the content of the broken sdc1 of node id0 (by xfs_copy
> > and xfs_repair).
> > Probably some objects got lost there.
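> >
> > Roughly like this, assuming a spare disk at least as large (the target
> > device name is illustrative, not from my setup):
> >
> >   # duplicate the XFS filesystem from the failing disk onto a spare
> >   xfs_copy /dev/sdc1 /dev/sdX1
> >   # repair the copy, never the failing original
> >   xfs_repair /dev/sdX1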
> >
> > Anyway, I couldn't run sheep on the same devices because sdb was full.
> > So I
> > - copied all the objects of /mnt/sheep/dsk01 and /mnt/sheep/dsk02 into
> > /mnt/sheep/dsk03
> > - moved the metadata into /var/lib/sheepdog
> > - ran sheep with only /var/lib/sheepdog,/mnt/sheep/dsk03
> > - as soon as the cluster and the recovery started, I also ran
> > dog cluster reweight (sketched below)
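> >
> > In shell terms it was roughly the following (the copy commands are my
> > reconstruction; the on-disk layout may differ between sheepdog versions):
> >
> >   # with all sheep daemons stopped, consolidate the object files
> >   cp -a /mnt/sheep/dsk01/. /mnt/sheep/dsk03/
> >   cp -a /mnt/sheep/dsk02/. /mnt/sheep/dsk03/
> >   # restart with the metadata dir plus the single surviving data disk
> >   sheep /var/lib/sheepdog,/mnt/sheep/dsk03
> >   # once recovery is under way, rebalance data across the new weights
> >   dog cluster reweight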
> >
> > At the end of the recovery I ran 'dog vdi check' on every vdi.
> > Only 1 small vdi of 10G was missing objects.
> > The other 'vdi check' runs printed only 'fixed replica'.
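> >
> > For example, to check them all in one go (assuming the second column of
> > the raw listing is the vdi name):
> >
> >   for v in $(dog vdi list -r | awk '{print $2}'); do
> >           dog vdi check "$v"
> >   done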
> >
> > Later on, I was also able to clone the second broken disk with ddrescue
> > and read its content after an intensive xfs_repair.
> > I found objects of 2 more vdis, but others were still missing.
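> >
> > Again a sketch with illustrative device names (the map file lets ddrescue
> > resume an interrupted rescue):
> >
> >   # copy whatever is still readable, retrying the bad areas
> >   ddrescue -d /dev/sdc1 /dev/sdX1 sdc1.map
> >   xfs_repair /dev/sdX1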
> >
> > Anyway, I tried to run a guest from the restored cluster but it wasn't
> > starting.
> > I ran the guest from a live CD (SystemRescue) and ran fsck.ext4 on the
> > guest's filesystem.
> > It fixed ... too much ...
> > After it, all the content was in lost+found.
> > I tried with a second guest.
> > I was able to mount the filesystem before the check, but after the fsck
> > all the content got lost.
> > I attach a screenshot in a second mail.
> > These guests were running during the crash.
> > The filesystems of the vdis that were not "running" during the crash are fine!
> >
> > So the strategy of restoring objects from the disconnected/broken disks
> > worked!
> > I suspect the filesystem damage happened earlier, when the second (2T) disk
> > broke, or when the two 500G disks started reporting 'disk full'.
> > In such a scenario, the best thing would probably have been for the cluster
> > to stop automatically.
> > I see that's harder to detect than a node leaving the cluster.
>
> Very nice real-world story, thanks!
>
> Yes, we should do better. Currently we act as follows:
>
> - for a create request, we return a NO_SPACE error to the VM (the VM will
> translate it into an IO ERROR)
> - for write and read requests, we simply succeed.
>
> We can't simply stop the cluster, because there is dirty data; we should try
> our best to flush those dirty bits to the backend.
I was wrong. For the strict case, we only allow read requests from the VM if
there are not enough nodes. The strict option means:
1. For write requests from the guest (which may map to write or create sheep
requests), we return HALT to the VM if there are not enough nodes, to make sure
a write always gets full redundancy.
If --strict is not set, sheep will try its best to make both reads and writes
succeed, in the hope that we can restore the redundancy later.
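A minimal usage sketch, assuming --strict is passed at sheep start-up (the
disk paths are just Valerio's final layout):

  sheep --strict /var/lib/sheepdog,/mnt/sheep/dsk03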
Thanks
Yuan