[Sheepdog] connection fail and too many open files

Thu Sep 1 18:23:47 CEST 2011

At Wed, 31 Aug 2011 14:56:35 +0900,
Keiichi SHIMA wrote:
> 
> Hello,
> 
> I'm trying to use sheepdog as an iscsi backing store, and I'm facing some issues.
> 
> I'm using 46 PCs as sheepdog storage nodes.  Making a cluster with them went fine (as long as I don't change membership), and I could create a disk image in the cluster.  I set up an iscsi target on one of the sheepdog storage nodes, and set up another PC as an iscsi initiator.  I could mount the sheepdog disk over the iscsi protocol.  I checked whether I could make a filesystem on the mounted iscsi volume.  It all went fine.
> 
> But once I unmounted the volume (causing a sync to the disk), the sheepdog cluster started complaining.  The log file of the storage node, which is also the iscsi target node, started showing the following error messages.
> 
>   Aug 31 02:22:06 forward_write_obj_req(396) failed to connect to 2001:200:d00:101::43:7000
>   Aug 31 02:22:06 store_queue_request(854) failed, 42, 3, 62ee040000001a , 1, 129
> 
> In the above case, the failed node was 2001:200:d00:101::43, but there were many similar errors for different nodes.
> 
> I tried to run collie on the node, but collie didn't respond.  From this point, sheep started generating the following error messages.
> 
>   Aug 31 02:25:56 listen_handler(567) can't accept a new connection, Too many open files
> 
> 
> I uploaded sheep.log files of all the sheepdog storage nodes I was using during the above operation at
> 
>   http://member.wide.ad.jp/~shima/tmp/sheeplog-201108311426.tgz
> 
> 
> The following is the procedure I followed to reproduce the above behavior.
> 
> 1. set up a sheepdog cluster with 46 nodes with 3 copies
> 
>   collie cluster format --copies=3
> 
> 2. created a disk image
> 
>   qemu-img create sheepdog:disk00 -o preallocation=data 1G
> 
> 3. started iscsi target (tgtd) on 2001:200:d00:101::92 (corresponds to 172.16.22.92 in the uploaded log file)
> 
>   tgtd
>   tgtadm --op new --mode target --tid 1 --lld iscsi -T iqn.2011-09.jp.ad.wide.cloud.sheepdog.storage.1
>   tgtadm --op new --mode logicalunit --tid 1 --lun 1 -b disk00 --bstype sheepdog
>   tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
> 
> 4. mounted the volume on another PC which is not a part of the sheepdog cluster
> 
> 5. made a filesystem and performed read/write operations on the mounted volume
> 
> 6. unmounted the volume
>   (sheep started generating error messages as shown above)
> 
> 7. performed a collie operation on 2001:200:d00:101::92 (corresponds to 172.16.22.92 in the uploaded log file)
>   (sheep started generating further error messages as shown above)
> 
> 
> Are there any suggestions?

Thanks for the information.  I sent a patch to fix a "too many open
files" problem just now:

  http://lists.wpkg.org/pipermail/sheepdog/2011-September/001324.html

I think this should solve the problem.
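
To see how close a sheep process is to its descriptor limit when the
"Too many open files" message appears, something like the following
may help.  This is only a minimal sketch, assuming a Linux node with
/proc mounted; the helper names count_open_fds and fd_soft_limit are
made up for illustration and are not part of sheepdog or collie.

```shell
#!/bin/sh
# Hypothetical helpers (not part of sheepdog); Linux /proc is assumed.

# Print the number of file descriptors the given PID currently holds.
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Print the soft "Max open files" limit of the given PID.
# In /proc/PID/limits the soft limit is the 4th whitespace-separated field.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}
```

For example, assuming the daemon runs under the name "sheep":

  pid=$(pidof -s sheep)
  echo "$(count_open_fds "$pid") of $(fd_soft_limit "$pid") descriptors in use"

If the count sits at the soft limit, raising it with "ulimit -n" in
the shell that starts sheep is at best a workaround; it would only
postpone the error if descriptors are being leaked.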

Thanks,

Kazutaka