[sheepdog-users] Simultaneous startup of sheep daemon may fail

Andrew J. Hobbs ajhobbs at desu.edu
Wed Nov 13 15:49:31 CET 2013


It might be worth trying to reproduce this using zookeeper.  In our cluster (we have nodes in several buildings now), corosync simply proved not to be reliable enough for our purposes.  The only reason I ask is that it seems plausible that during a mass startup (assuming the cluster was shut down properly) a race condition or network congestion could cause lost packets.
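
If you want to test the race/congestion theory, one option is to start the nodes one at a time with a pause in between and see whether the failure still shows up. Below is a minimal sketch of that, not a fix. The function name `start_staggered`, the 2-second default delay, and the assumption that the start script returns once sheep has daemonized are all mine; adjust to taste.

```python
import subprocess
import time


def start_staggered(hosts, script, delay=2.0, run=subprocess.run):
    """Run `script` on each host over ssh, one host at a time.

    Hypothetical helper: `delay` is an arbitrary pause (seconds) between
    hosts, and `run` is injectable so the loop can be exercised without ssh.
    Assumes `script` returns once the daemon has been launched.
    """
    results = []
    for index, host in enumerate(hosts):
        if index > 0:
            time.sleep(delay)  # give the previous node time to join
        results.append(run(["ssh", host, script], check=True))
    return results


# Example call (not executed here):
# start_staggered(["test004", "test005", "test006", "test007"],
#                 "/root/script/run_sheep.sh")
```

If the split membership never appears with a staggered start but does with a simultaneous one, that would at least point towards a timing problem rather than a configuration one.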

On 11/13/2013 09:39 AM, Valerio Pachera wrote:
On my testing cluster I noticed that starting all sheep daemons at the "same time" may lead to some nodes failing to join the cluster.

parallel-ssh -H 'test004 test005 test006 test007' /root/script/run_sheep.sh

root@test004:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.44:7000        128  738371776

root@test005:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.45:7000        119  755148992
   1   192.168.2.47:7000        137  788703424

root@test006:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.46:7000        128  771926208

root@test007:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.45:7000        119  755148992
   1   192.168.2.47:7000        137  788703424

It's not repeatable though.
I shut down the cluster and re-ran parallel-ssh, and all nodes then showed the right 'node list' (4 nodes total).
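
Since the problem is intermittent, it may help to check automatically after each mass start whether every node reports the same membership. The following is a rough sketch that parses the text of 'dog node list' as collected from each host and groups hosts by the set of members they see. The function names and the regular expression are my own assumptions based on the output format shown in this thread, not anything provided by sheepdog.

```python
import re
from collections import defaultdict

# One data row of `dog node list`: Id, Host:Port, V-Nodes, Zone.
# Pattern is an assumption derived from the output quoted in this thread.
ROW = re.compile(r"^\s*\d+\s+(\d{1,3}(?:\.\d{1,3}){3}:\d+)\s+\d+\s+\d+\s*$")


def parse_node_list(output):
    """Return the set of Host:Port members found in one `dog node list` output."""
    members = set()
    for line in output.splitlines():
        match = ROW.match(line)
        if match:  # the header line does not match and is skipped
            members.add(match.group(1))
    return frozenset(members)


def group_by_view(outputs):
    """Map each distinct membership set to the hosts that reported it.

    `outputs` is {hostname: raw text of `dog node list` on that host}.
    """
    views = defaultdict(list)
    for host, text in outputs.items():
        views[parse_node_list(text)].append(host)
    return dict(views)


HEADER = "  Id   Host:Port         V-Nodes       Zone\n"
sample = {
    "test004": HEADER + "   0   192.168.2.44:7000        128  738371776\n",
    "test005": HEADER + "   0   192.168.2.45:7000        119  755148992\n"
                        "   1   192.168.2.47:7000        137  788703424\n",
    "test006": HEADER + "   0   192.168.2.46:7000        128  771926208\n",
    "test007": HEADER + "   0   192.168.2.45:7000        119  755148992\n"
                        "   1   192.168.2.47:7000        137  788703424\n",
}

for view, hosts in group_by_view(sample).items():
    print(sorted(hosts), "see", sorted(view))
```

More than one group means the nodes disagree about who is in the cluster. With the outputs from the failed start above, test005 and test007 end up in one group while test004 and test006 each sit alone.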

It's not a problem for me, but I was wondering if anybody else has noticed the same behavior.
I also wonder whether it depends on corosync or on sheepdog.

I'm running sheep -v
and corosync 1.4.6.

I don't see anything useful in sheep.log

Nov 13 13:01:51   INFO [main] main(845) shutdown
Nov 13 15:11:19   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:11:19   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:11:19   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:11:19   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:11:19   INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:13:59   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:13:59   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:13:59   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:13:59   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:13:59   INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:14:41   INFO [main] main(845) shutdown
Nov 13 15:14:53   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:14:53   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:14:53   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:14:53   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 61440



