[sheepdog-users] Failing disk tests: disk not responding

Thu Oct 17 18:44:42 CEST 2013

On Wed, Oct 16, 2013 at 05:21:57PM +0200, Valerio Pachera wrote:
> Here I simulate the situation of a disk that physically breaks down
> and the OS can't contact it anymore.
> So we are not able to cleanly unplug and unmount the device.
> 
> If it were software RAID, it would mark the disk as "failed" and
> continue working on the good one.
> 
> How is sheepdog going to behave?
> 
> # dog node md info
> Id      Size    Used    Avail   Use%    Path
>  0      220 GB  24 GB   196 GB   10%    /mnt/sheep/dsk01/obj
>  1      149 GB  16 GB   133 GB   10%    /mnt/sheep/dsk03
> 
> # df -h | grep sheep
> /dev/mapper/vg00-sheepdog  220G   24G    196G  11% /mnt/sheep/dsk01
> /dev/sdc1                  149G   17G    133G  11% /mnt/sheep/dsk03
> 
> # echo 1 > /sys/block/sdc/device/delete
> 
> # ls /mnt/sheep/dsk03
> ls: cannot access /mnt/sheep/dsk03: Input/output error
> 
> # less /var/log/sheep.log
> Oct 16 16:50:51  ERROR [io 4151] for_each_object_in_path(175) failed
> to open /mnt/sheep/dsk03, Input/output error
> Oct 16 16:58:31  ERROR [io 4151] for_each_object_in_path(175) failed
> to open /mnt/sheep/dsk03, Input/output error
> Oct 16 16:58:43  ERROR [io 4151] md_access(457) failed to check
> /mnt/sheep/dsk03/007ab62200000020, Input/output error
> Oct 16 16:58:43  ERROR [io 4151] md_access(457) failed to check
> /mnt/sheep/dsk03/007ab62200000020, Input/output error
> Oct 16 16:58:43  ERROR [io 4151] md_access(457) failed to check
> /mnt/sheep/dsk03/.stale/007ab62200000020.1, Input/output error
> Oct 16 16:58:43  ERROR [io 4151] md_access(457) failed to check
> /mnt/sheep/dsk03/.stale/007ab62200000020.1, Input/output error
> Oct 16 16:58:43  ERROR [main] modify_event(156) event info for fd 25 not found
> 
> 
> # dog node md info
> Id      Size    Used    Avail   Use%    Path
>  0      220 GB  24 GB   196 GB   10%    /mnt/sheep/dsk01/obj
>  1      0.0 MB  0.0 MB  0.0 MB  -2147483648%    /mnt/sheep/dsk03
> 
> 
> (on another node)
> # dog vdi check squeeze1
>  22.0 % [=============>                                              ] 2.2 GB / 10 GB
> failed to read 7ab62200000020 from 192.168.2.47:7000, I/O error
> 
> 
> 
> After some time I noticed:
> 
> Oct 16 17:14:07  ERROR [gway 4150] err_to_sderr(95)
> oid=8036657100000000, Input/output error
> Oct 16 17:14:07  ERROR [gway 4150] gateway_replication_read(268) local
> read 8036657100000000 failed, Network error between sheep
> Oct 16 17:14:07   INFO [main] md_remove_disk(316) /mnt/sheep/dsk03
> from multi-disk array
> 
> # dog node md info
> Id      Size    Used    Avail   Use%    Path
>  0      220 GB  29 GB   191 GB   13%    /mnt/sheep/dsk01/obj
> 
> 
> My guests were not running, so I can't tell you if they were going to freeze.
> I might repeat the test tomorrow.
> 
> I would like to know if there's a fixed timeout before sheep is going
> to unplug the device, or what else triggers it.

There is no fixed timeout. Only one event triggers auto unplug: an EIO returned
by the broken disk when a client accesses it.
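
To make the trigger concrete, here is a minimal Python sketch of the idea. It is
not sheepdog's code (sheep is written in C); the class MultiDiskArray, the
method access() and the helper simulated_eio() are hypothetical names made up
for illustration. It only shows the rule stated above: a disk leaves the array
when an access to it fails with EIO, and not because of a timer.

```python
import errno
import os


class MultiDiskArray:
    """Toy model of a multi-disk array (hypothetical, not sheepdog's API)."""

    def __init__(self, paths):
        self.paths = list(paths)

    def access(self, path, probe=os.stat):
        """Access a disk path and unplug it only if the access fails with EIO.

        Returns True if the access succeeded, False if the disk was unplugged.
        Any other error is re-raised and leaves the array unchanged.
        """
        try:
            probe(path)
            return True
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise
            # In this sketch, EIO on access is the only unplug trigger;
            # there is no timer and no background health check.
            if path in self.paths:
                self.paths.remove(path)
            return False


def simulated_eio(path):
    """Stand-in for a read on a dead disk (assumption: it fails with EIO)."""
    raise OSError(errno.EIO, "Input/output error", path)


if __name__ == "__main__":
    md = MultiDiskArray(["/mnt/sheep/dsk01/obj", "/mnt/sheep/dsk03"])
    md.access("/mnt/sheep/dsk03", probe=simulated_eio)
    print(md.paths)
```

In this model a dead disk that nobody reads or writes stays in the array
indefinitely, which would be consistent with the disk above disappearing from
"dog node md info" only after a read on it failed.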

Guests will not freeze when disks managed by sheepdog break; this is the bare
minimum a distributed system should provide.

Thanks
Yuan