[sheepdog-users] md remove has to trigger recovery in some cases

Liu Yuan namei.unix at gmail.com
Fri May 10 14:40:08 CEST 2013


On 05/10/2013 05:24 PM, Valerio Pachera wrote:
> In short: when we unplug a disk, if its data can't be redistributed/rebuilt
> onto the other nodes' disks, we have to trigger a cluster recovery.
> 
> Here is the reason:
> 
> 3 nodes, each with a single disk of 50G.
>   collie cluster format -c 2
> 
> Created 2 pre-allocated disks just to use up space
>   collie vdi create -P garbage 10G
>   collie vdi create -P garbagebid 10G
> 
> Added a small disk
> root@test006:~# collie node md plug /mnt/sdb1/
> 
> root@test006:~# collie node info
> Id      Size    Used    Use%
>  0      50 GB   12 GB    24%
>  1      50 GB   12 GB    24%
>  2      55 GB   16 GB    28%
> Total   154 GB  40 GB    25%
> Total virtual image size        20 GB
> 
> root@test004:~# collie node md info --all
> Id      Used    Free    Path
> Node 0:
>  0      12 GB   38 GB   /mnt/sheepdog/obj
> Node 1:
>  0      12 GB   38 GB   /mnt/sheepdog/obj
> Node 2:
>  0      15 GB   35 GB   /mnt/sheepdog/obj
>  1      1.1 GB  3.5 GB  /mnt/sdb1
> 
> The funny part :-) is removing the big disk:
> 
> root@test006:~# collie node md unplug /mnt/sheepdog/obj
> 
> I saw 'node recovery' working for some time on 192.168.2.46 (id 2):
> sheepdog kept writing data to the small disk until it was completely
> full. After that, it looks like nothing has changed (as we said,
> unplugging a disk does not trigger a cluster recovery).
> 
> root@test004:~# collie node md info --all
> Id      Used    Free    Path
> Node 0:
>  0      12 GB   38 GB   /mnt/sheepdog/obj
> Node 1:
>  0      12 GB   38 GB   /mnt/sheepdog/obj
> Node 2:
>  0      4.6 GB  2.8 MB  /mnt/sdb1
> 
> !!!Vdi disks are not usable now!!!
> 
> root@test004:~# collie vdi check garbage
>   FATAL: failed to write, Server has no space for new objects
> 
> I have no guests running.
> I have no idea what could happen (data corruption, I/O errors...).
> 
> It's not possible to plug the disk back in (I cleared its data first):
> collie node md plug /mnt/sdb1/
>   Failed to execute request, look for sheep.log for more information
> 

Unplugging a disk will indeed trigger a node-wide recovery.

No, you just plugged the wrong disk back. You should run:

$ collie node md plug /mnt/sheepdog/obj

And everything will be fine.
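
For completeness, the whole sequence on the affected node would look
something like this (paths as in your setup; this is only a sketch, and
the recovery may need some time before 'vdi check' passes):

$ collie node md plug /mnt/sheepdog/obj   # re-add the big disk
$ collie node md info --all               # both disks should be listed again
$ collie node info                        # free space should reappear
$ collie vdi check garbage                # verify the vdi is usable again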

For the disk-full condition, sheepdog can't treat it the way it treats
EIO, where the failed disk is removed and its data rebuilt elsewhere:
removing a full disk would push its objects onto the remaining disks and
could set off a chain of disk-full failures, one full disk filling the
next, until the whole cluster is dead. So in your case, the cluster is
considered full. Yes: once one disk or one node is full, the whole
cluster is full.
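
If you want to catch this earlier, something like the following could
watch per-node usage before any disk fills up. This is only a sketch of
mine, not part of sheepdog; it assumes the 'collie node info' column
layout shown above, and the 90% threshold is arbitrary:

#!/bin/sh
# Warn before any single node fills up, since one full disk/node makes
# the whole cluster refuse writes of new objects.
THRESHOLD=90
collie node info | awk -v t="$THRESHOLD" '
    $1 ~ /^[0-9]+$/ && $NF ~ /%$/ {     # per-node rows end in "NN%"
        use = $NF; sub(/%/, "", use)
        if (use + 0 > t)
            printf "WARNING: node %s is at %s usage\n", $1, $NF
    }'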

Thanks,
Yuan
