[sheepdog-users] cache crash test

Liu Yuan namei.unix at gmail.com
Tue Jun 25 11:26:52 CEST 2013


On Tue, Jun 25, 2013 at 10:44:22AM +0200, Valerio Pachera wrote:
> Hi, the subject line sounds like a joke, doesn't it? :-)
> 
> Here we go, this is my testing cluster.
> 
> # collie node info
> Id      Size    Used    Avail   Use%
>  0      931 GB  5.0 GB  926 GB    0%
>  1      518 GB  3.0 GB  515 GB    0%
>  2      518 GB  2.1 GB  516 GB    0%
> Total   1.9 TB  10 GB   1.9 TB    0%
> Total virtual image size        108 GB
> 
> # collie node md info --all
> Id      Size    Used    Avail   Use%    Path
> Node 0:
>  0      931 GB  5.0 GB  926 GB    0%    /mnt/sheep/dsk02
> Node 1:
>  0      220 GB  1.3 GB  218 GB    0%    /mnt/sheep/dsk01/obj
>  1      298 GB  1.7 GB  296 GB    0%    /mnt/sheep/dsk02
> Node 2:
>  0      220 GB  972 MB  219 GB    0%    /mnt/sheep/dsk01/obj
>  1      298 GB  1.1 GB  297 GB    0%    /mnt/sheep/dsk02
> 
> I created a few VDIs, and I have a guest running on node 0 that
> writes 1M every 3 seconds (with oflag=direct).
> The guest is using cache=writeback.
> I killed node 2 to trigger a recovery. It completed without any problem.
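
For reference, a workload like the one described above can be reproduced
with a small loop such as this (the file name and iteration count here are
illustrative; the original ran indefinitely inside the guest):

```shell
# Write 1 MiB every 3 seconds, bypassing the guest page cache
# with O_DIRECT so each write actually hits the virtual disk.
for i in 1 2 3; do
    dd if=/dev/zero of=sheep-test.img bs=1M count=1 \
       oflag=direct conv=notrunc status=none
    sleep 3
done
```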
> 
> I've been checking sheep.log and I noticed that "connect_to ... failed"
> is repeated for a long time after the node died.
> 
> Jun 25 10:30:21 [rw] get_vdi_copy_number(108) No VDI copy entry for 0 found
> Jun 25 10:30:21 [rw] screen_object_list(724) ERROR: can not find copy
> number for object 4f4239

This looks like a lingering bug.

> ...
> Jun 25 10:31:15 [rw] connect_to(254) failed to connect to
> 192.168.2.47:7000: Connection refused
> 
> *Is that normal?*
> 

Partly yes. When a node goes down, the other sheep don't notice until the
cluster driver notifies us. Ongoing operations that try to connect to the
failed node will print this log line until we learn the node is gone. But it
seems we could do better and not flood the log.

Thanks,
Yuan
