[sheepdog-users] Cluster recover after loosing 2 devices

Tue Jun 17 09:55:04 CEST 2014

Hi, I'll try to summarize what happened on a cluster I manage.
It may be interesting for other administrators.
(It's a bit long, sorry).

This cluster has 4 nodes and is running Sheepdog daemon version
0.7.0_131_g88f0024.

I'm going to focus on node id0 and node id1.
These are the mount points used by sheepdog.

node id0
/dev/mapper/vg00-sheepdsk01    217G  211G    5,2G  98% /mnt/sheep/dsk01
/dev/sdb1    466G  466G     76K 100% /mnt/sheep/dsk02
/dev/sdc1    1,9T  1,1T    755G  60% /mnt/sheep/dsk03

node id1
/dev/mapper/vg00-sheepdsk01    167G  103G     64G  62% /mnt/sheep/dsk01
/dev/sdb1    466G  466G     12K 100% /mnt/sheep/dsk02
/dev/sdc1    1,9T  1,1T    791G  58% /mnt/sheep/dsk03

As you can see, two disks (500G) are full.
During the night, the disk sdc1 (2T) of node id0 started giving I/O errors
because of hardware problems.
Sheepdog automatically unplugged this device.

One hour later, sdc1 (2T) of node id1 also had hardware problems, started
giving I/O errors, and was disconnected by sheepdog.

In such a case, losing two big disks on two different nodes at the same
time is equivalent to losing 2 nodes at the same time.
My redundancy scheme is -c 2, so in any case it wasn't possible to keep
the cluster consistent.
But let's continue with the story.

As soon as the first disk failed, sheepdog on node id0 started the
recovery and filled up sdb1 (500G).
The same happened one hour later on node id1 (during recovery).
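
A rough sketch of why the 500G disks filled up, using sizes rounded from the
df output above (in GB). It assumes, as a simplification, that the node keeps
its weight and therefore tries to place the data of the lost disk on its
remaining disks. The function name and the model are illustrative only, not
sheepdog's actual placement logic.

```python
def can_absorb(lost_used_gb, surviving_capacities_gb, surviving_used_gb=0.0):
    """Return True if the surviving disks have room for the lost disk's data.

    Simplified model (an assumption): everything stored on the lost disk
    must be re-created on the other disks of the same node.
    """
    free_gb = sum(surviving_capacities_gb) - surviving_used_gb
    return lost_used_gb <= free_gb


# node id0: dsk01 (217G) and dsk02 (466G) survive, dsk03 held about 1,1T.
# Even if the surviving disks had been completely empty, the data of the
# lost disk would not fit.
print(can_absorb(1100, [217, 466]))
```

Under this model the answer is the same for node id1 (167G + 466G against
about 1,1T).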

Notice that no node left the cluster!

What did I do?

I stopped the cluster.
'dog cluster shutdown' wasn't enough, so I had to kill -9 all sheep daemons.

At this point I couldn't simply replace the broken disks because I didn't
have enough objects to restore the vdi(s).
I was able to copy the content of the broken sdc1 of node id0 (with xfs_copy
and xfs_repair).
Probably some objects got lost there.

Anyway, I couldn't run sheep on the same devices because the sdb disks
were full.
So I
- copied all the objects of /mnt/sheep/dsk01 and /mnt/sheep/dsk02 into
/mnt/sheep/dsk03
- moved the metadata to /var/lib/sheepdog
- ran sheepdog with only /var/lib/sheepdog,/mnt/sheep/dsk03
- as soon as the cluster and the recovery started, I also ran
  dog cluster reweight

At the end of the recovery I ran 'dog vdi check' on all vdis.
Only 1 small vdi of 10G was missing objects.
The other 'vdi check' runs printed only 'fixed replica'.
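
For anyone repeating this, a minimal sketch of running the check over a list
of vdis and keeping only the output lines that are not 'fixed replica'
messages. The vdi names are hypothetical, and the filtering is an assumption
based only on the message quoted above; whatever is left still has to be
read by a human.

```python
import subprocess


def run_vdi_check(name):
    """Run 'dog vdi check <name>' and return its combined output as text."""
    proc = subprocess.run(
        ["dog", "vdi", "check", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout


def check_all(vdi_names, runner=run_vdi_check):
    """Map each vdi name to the non-empty output lines of its check that
    do not mention 'fixed replica'."""
    report = {}
    for name in vdi_names:
        lines = runner(name).splitlines()
        report[name] = [
            line for line in lines
            if line.strip() and "fixed replica" not in line
        ]
    return report


# Usage (hypothetical vdi names, needs a running cluster):
# for name, leftover in check_all(["guest01", "guest02"]).items():
#     print(name, "needs a closer look" if leftover else "only fixed replicas")
```

The runner is a parameter so the loop can be tried without a cluster by
passing a function that returns canned output.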

Later on, I was also able to clone the second broken disk with ddrescue
and read its content after an intense xfs_repair.
I found objects of 2 more vdis, but others were still missing.

Anyway, I tried to run a guest from the restored cluster but it wouldn't
start.
I booted the guest from a live CD (systemrescue) and ran fsck.ext4 on the
guest's filesystem.
It fixed ... too much ...
After it, all the content was in lost+found.
I tried with a second guest.
I was able to mount the filesystem before the check, but after the fsck,
all the content was lost.
I'll attach a screenshot in a second mail.
These guests were running during the crash.
The filesystems of the vdis that were not "running" during the crash are fine!

So the strategy of restoring objects from the disconnected/broken disks
worked!
I suspect the filesystem issue happened before, when the second (2T) disk
broke or when the two 500G disks started giving 'disk full'.
In such a scenario, the best thing would probably have been for the cluster
to stop automatically.
I understand that this is more difficult to detect than a node leaving the
cluster.

I guess we have 2 problems to reason about, both related to multi-disk:

1) When a disk gets unplugged, the weight of the node doesn't change.
That may lead to disk full on that node.
I don't know whether in later sheepdog versions the node gets disconnected
from the cluster in such a case, or whether it leads to an unclear state
(the node is still in the cluster but unable to issue write requests).

2) The unlucky case of several devices breaking down in the same period
on different hosts.
With redundancy -c 2 I may lose a single host or a single disk (on a
single node).
With -c 2:2 I may lose 2 hosts or 2 disks on 2 different hosts.
If I lose 3 hosts, the cluster halts itself, waiting for the missing nodes
to be back up.
If 3 disks break down in the same time period (on different hosts), the
cluster should also halt itself, or do something to keep the cluster
consistent (waiting for a manual operation).
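
The reasoning above can be written down as a small sketch. It assumes that
-c N means N full replicas and that -c X:Y means X data strips plus Y parity
strips, and it counts a failed host or a failed disk on a distinct host as
one failure. The function names are made up for illustration.

```python
def tolerated_failures(copies):
    """Number of simultaneous failures (on distinct hosts) that a
    redundancy setting can survive without losing data.

    Assumptions: "N" is N full replicas, "X:Y" is X data + Y parity strips.
    """
    if ":" in copies:
        _data, parity = copies.split(":")
        return int(parity)
    return int(copies) - 1


def survives(copies, failed):
    """True if 'failed' simultaneous failures stay within the tolerance."""
    return failed <= tolerated_failures(copies)


print(survives("2", 2))    # my case: -c 2 with two big disks lost
print(survives("2:2", 2))
print(survives("2:2", 3))
```

The point of problem 2 is that a failed disk should count in this sum just
like a failed host, so the cluster could halt when the tolerance is exceeded.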

Thank you.

Valerio.