[sheepdog-users] Simultaneous startup of sheep daemon may fail
Andrew J. Hobbs
ajhobbs at desu.edu
Wed Nov 13 15:49:31 CET 2013
It might be worth trying to reproduce this using zookeeper. In our cluster (we have nodes in several buildings now), corosync simply proved unreliable for our purposes. The reason I ask is that during a mass startup (assuming the cluster was shut down properly), it seems plausible that a race condition or network congestion causes lost packets.
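One way to probe the race-condition guess is to start the daemons one after another instead of all at once and see whether the split still shows up. The following is only a minimal sketch under assumptions: the host names and script path are the ones from the quoted message below, plain `ssh` access to each host is assumed, and the function names and the delay value are made up for illustration.

```python
import subprocess
import time

# Host names and script path are the ones used in the quoted message below.
HOSTS = ["test004", "test005", "test006", "test007"]
START_CMD = "/root/script/run_sheep.sh"


def run_over_ssh(host, command):
    """Run `command` on `host` with the ssh client and return its exit status."""
    return subprocess.call(["ssh", host, command])


def staggered_start(hosts, delay_s, run=run_over_ssh, sleep=time.sleep):
    """Start sheep on each host in turn, pausing `delay_s` seconds in between.

    Returns a dict mapping each host to the exit status of its start command.
    `run` and `sleep` are injectable (an assumption of this sketch) so the
    loop can be exercised without real ssh access.
    """
    results = {}
    for index, host in enumerate(hosts):
        if index > 0:
            # Pause before every host except the first one.
            sleep(delay_s)
        results[host] = run(host, START_CMD)
    return results
```

Calling `staggered_start(HOSTS, delay_s=5)` would start the four test nodes five seconds apart; the five-second value is arbitrary. If the split never appears with a delay but does appear with a simultaneous start, that would point towards a startup race rather than a persistent network problem.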
On 11/13/2013 09:39 AM, Valerio Pachera wrote:
On my testing cluster I noticed that starting all the sheep daemons at the "same time" may lead to some of them failing to join the cluster.
parallel-ssh -H 'test004 test005 test006 test007' /root/script/run_sheep.sh
root@test004:~# dog node list
  Id   Host:Port            V-Nodes   Zone
   0   192.168.2.44:7000    128       738371776
root@test005:~# dog node list
  Id   Host:Port            V-Nodes   Zone
   0   192.168.2.45:7000    119       755148992
   1   192.168.2.47:7000    137       788703424
root@test006:~# dog node list
  Id   Host:Port            V-Nodes   Zone
   0   192.168.2.46:7000    128       771926208
root@test007:~# dog node list
  Id   Host:Port            V-Nodes   Zone
   0   192.168.2.45:7000    119       755148992
   1   192.168.2.47:7000    137       788703424
It's not repeatable though.
I tried to shut down the cluster and re-run parallel-ssh, and all nodes showed the right 'node list' (4 nodes total).
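To spot this kind of split without reading each listing by eye, the per-host outputs can be compared mechanically. This is a rough sketch, not part of sheepdog: it assumes the `dog node list` text from each host has already been captured into a string, and that the columns are laid out as in the listings above (Id, Host:Port, V-Nodes, Zone). The function names are hypothetical.

```python
import re

# Matches a data row such as "0 192.168.2.44:7000 128 738371776".
# A trailing "<...>" autolink after Host:Port (as added by some mail
# clients) is tolerated and ignored.
ROW = re.compile(r"^\s*\d+\s+(\S+?:\d+)(?:<[^>]*>)?\s+\d+\s+\d+\s*$")


def parse_node_list(text):
    """Return the set of Host:Port members found in one `dog node list` output."""
    members = set()
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            members.add(match.group(1))
    return members


def find_partitions(views):
    """Group hosts by the cluster membership each of them reports.

    `views` maps a host name to the raw `dog node list` text captured on it.
    The result maps each distinct membership (a frozenset of Host:Port
    strings) to the list of hosts reporting it. More than one key means
    the hosts disagree about who is in the cluster.
    """
    groups = {}
    for host, text in views.items():
        groups.setdefault(frozenset(parse_node_list(text)), []).append(host)
    return groups
```

Fed the four listings above, `find_partitions` returns three groups: test004 alone, test006 alone, and test005 together with test007. After a clean restart it should return a single group containing all four members.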
It's not a problem for me but I was wondering if anybody else noticed the same behavior.
I also wonder whether it depends on corosync or on sheepdog.
I'm running sheep -v
and corosync 1.4.6.
I don't see anything useful in sheep.log:
Nov 13 13:01:51 INFO [main] main(845) shutdown
Nov 13 15:11:19 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:11:19 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:11:19 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:11:19 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:11:19 INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:13:59 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:13:59 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:13:59 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:13:59 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:13:59 INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:14:41 INFO [main] main(845) shutdown
Nov 13 15:14:53 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:14:53 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:14:53 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:14:53 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 61440