[Sheepdog] connection fail and too many open files

Keiichi SHIMA shima at wide.ad.jp
Wed Aug 31 07:56:35 CEST 2011


Hello,

I'm trying to use sheepdog as an iSCSI backing store, and I'm facing some issues.

I'm using 46 PCs as sheepdog storage nodes.  Forming a cluster with them went fine (as long as I don't change the membership), and I could create a disk image in the cluster.  I set up an iSCSI target on one of the sheepdog storage nodes and set up another PC as an iSCSI initiator.  I could mount the sheepdog disk over iSCSI, and I confirmed that I could make a filesystem on the mounted iSCSI volume.  It all went fine.

But once I unmounted the volume (which causes a sync to the disk), the sheepdog cluster started complaining.  The log file of the storage node that is also the iSCSI target node started showing the following error messages.

  Aug 31 02:22:06 forward_write_obj_req(396) failed to connect to 2001:200:d00:101::43:7000
  Aug 31 02:22:06 store_queue_request(854) failed, 42, 3, 62ee040000001a , 1, 129

In the above case, the failed node was 2001:200:d00:101::43, but there were many identical errors for other nodes.
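
To see how the connection failures are spread across nodes, the messages can be tallied per destination.  The following is only a minimal sketch: it assumes the "failed to connect to <address>:<port>" message format shown above, and the function name count_connect_failures is hypothetical, not part of sheepdog.

```shell
# Tally "failed to connect" messages per destination in a sheep.log file.
# Assumption: messages look like the excerpt above, i.e.
#   ... failed to connect to <address>:<port>
# The function name is hypothetical and not a sheepdog tool.
count_connect_failures() {
    # $1: path to a sheep.log file
    grep -o 'failed to connect to [^ ]*' "$1" |
        awk '{ print $5 }' |       # 5th word is the <address>:<port> part
        sort | uniq -c | sort -rn  # most frequently failing destination first
}
```

Running "count_connect_failures sheep.log" prints one line per destination, with the number of failures in the first column.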

I tried to run collie on the node, but collie didn't respond.  From this point on, sheep started generating the following error messages.

  Aug 31 02:25:56 listen_handler(567) can't accept a new connection, Too many open files


I uploaded the sheep.log files of all the sheepdog storage nodes I was using during the above operation to

  http://member.wide.ad.jp/~shima/tmp/sheeplog-201108311426.tgz


The following is the procedure I used to reproduce the above behavior.

1. set up a sheepdog cluster of 46 nodes with 3 copies

  collie cluster format --copies=3

2. created a disk image

  qemu-img create sheepdog:disk00 -o preallocation=data 1G

3. started an iSCSI target (tgtd) on 2001:200:d00:101::92 (corresponds to 172.16.22.92 in the uploaded log files)

  tgtd
  tgtadm --op new --mode target --tid 1 --lld iscsi -T iqn.2011-09.jp.ad.wide.cloud.sheepdog.storage.1
  tgtadm --op new --mode logicalunit --tid 1 --lun 1 -b disk00 --bstype sheepdog
  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL

4. mounted the volume from another PC that is not part of the sheepdog cluster

5. made a filesystem and performed read/write operations on the mounted volume

6. unmounted the volume
  (sheep starts generating the "failed to connect" error messages shown above)

7. ran a collie operation on 2001:200:d00:101::92 (corresponds to 172.16.22.92 in the uploaded log files)
  (sheep starts generating the "Too many open files" error messages shown above)


Are there any suggestions?

---
Keiichi SHIMA  <shima at wide.ad.jp>
WIDE Project http://www.wide.ad.jp/