[sheepdog-users] Simultaneous startup of sheep daemon may fail
Andrew J. Hobbs
ajhobbs at desu.edu
Wed Nov 13 15:58:55 CET 2013
Followup to my followup after a closer look at dog node list. I've seen this exact behavior before. Here's what happened.
Node 1 and 2 were in one building, node 3 was in another building over a 10G backbone. Nodes 1 and 2 were listed on each other. Node 3 was only listed to itself. In our case, it came down to multicast not being supported through our campus network core, which had to be traversed to get from 1/2 to 3. This was the deciding factor that made me switch to zookeeper. That, and the eventual goal to scale beyond the node count corosync can support.
This might not be the case if your machines are all on a single switch or virtualized, and frankly, it may not have been the case when we diagnosed it (it was a shotgun fix that logically makes sense after discussing the situation with networking staff). However, I can say that we haven't re-experienced this issue since adopting zookeeper.
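For anyone wanting to try the same switch, here is a minimal sketch of building the cluster option for sheep's zookeeper driver. The helper name `zk_cluster_opt`, the host names, and the store path are placeholders of mine and not taken from our setup; 2181 is ZooKeeper's default client port.

```shell
# zk_cluster_opt HOST...: print the value for sheep's -c option, given a
# list of ZooKeeper hosts. Hypothetical helper; ZK_PORT overrides the
# default client port 2181.
zk_cluster_opt() {
    port=${ZK_PORT:-2181}
    opt=
    for host in "$@"; do
        # Append "host:port", comma-separated after the first entry.
        opt="${opt:+$opt,}$host:$port"
    done
    printf 'zookeeper:%s\n' "$opt"
}

# Example (host names and store path are placeholders):
#   sheep -c "$(zk_cluster_opt zk1 zk2 zk3)" /path/to/store
```

Every sheep in the cluster would need to be started with the same driver and the same ZooKeeper host list.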
On 11/13/2013 09:49 AM, Andrew J. Hobbs wrote:
It might be worth trying to reproduce this using zookeeper. In our cluster (we have nodes in several buildings now), corosync simply proved not to be reliable for our purposes. The only reason I'm wondering about this is that, during a mass startup (assuming the cluster was shut down properly), a race condition or congestion causing lost packets seems plausible.
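One way to test the race/congestion guess would be to start the daemons one host at a time instead of in parallel. This is only a sketch: `staggered_start` and the `RUN_REMOTE` hook are names I made up, the delay is arbitrary, and the script path and host names are the ones from the original report.

```shell
# staggered_start DELAY SCRIPT HOST...: run SCRIPT on each host in turn,
# sleeping DELAY seconds between hosts. RUN_REMOTE is a hypothetical hook
# (default: ssh) so the loop can be dry-run with e.g. RUN_REMOTE=echo.
staggered_start() {
    delay=$1
    script=$2
    shift 2
    for host in "$@"; do
        ${RUN_REMOTE:-ssh} "$host" "$script" || return 1
        sleep "$delay"
    done
}

# Example:
#   staggered_start 5 /root/script/run_sheep.sh test004 test005 test006 test007
```

If the partial membership never shows up with a staggered start but does with parallel-ssh, that would point toward a startup race rather than a network problem.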
On 11/13/2013 09:39 AM, Valerio Pachera wrote:
On my testing cluster I noticed that starting all sheep daemons at the "same time" may lead to a failure to join the cluster.
parallel-ssh -H 'test004 test005 test006 test007' /root/script/run_sheep.sh
root@test004:~# dog node list
Id Host:Port V-Nodes Zone
0 192.168.2.44:7000 128 738371776
root@test005:~# dog node list
Id Host:Port V-Nodes Zone
0 192.168.2.45:7000 119 755148992
1 192.168.2.47:7000 137 788703424
root@test006:~# dog node list
Id Host:Port V-Nodes Zone
0 192.168.2.46:7000 128 771926208
root@test007:~# dog node list
Id Host:Port V-Nodes Zone
0 192.168.2.45:7000 119 755148992
1 192.168.2.47:7000 137 788703424
It's not repeatable, though.
I tried to shut down the cluster and re-run parallel-ssh, and all nodes were showing the right 'node list' (4 nodes total).
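The per-host comparison of `dog node list` can be scripted so that a partial join is spotted right after startup. This is a rough sketch, assuming passwordless ssh to each node; the function names `count_nodes` and `check_cluster` are hypothetical, and the parsing only relies on member rows starting with a numeric Id, as in the output above.

```shell
# count_nodes: read `dog node list` output on stdin and print the number
# of member rows (lines whose first field is a numeric Id).
count_nodes() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# check_cluster EXPECTED HOST...: ask every host for its node list and
# report each host whose member count differs from EXPECTED.
# Returns non-zero if any host disagrees.
check_cluster() {
    expected=$1
    shift
    rc=0
    for host in "$@"; do
        seen=$(ssh -n "$host" dog node list | count_nodes)
        if [ "$seen" -ne "$expected" ]; then
            echo "$host sees $seen node(s), expected $expected"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_cluster 4 test004 test005 test006 test007
```

Counting rows does not prove that all hosts list the same members, but it would have flagged the 1/2/1/2 split shown above.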
It's not a problem for me, but I was wondering if anybody else has noticed the same behavior.
I also wonder whether it depends on corosync or sheepdog.
I'm running sheep -v
and corosync 1.4.6.
I don't see anything useful in sheep.log
Nov 13 13:01:51 INFO [main] main(845) shutdown
Nov 13 15:11:19 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:11:19 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:11:19 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:11:19 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:11:19 INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:13:59 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:13:59 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:13:59 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:13:59 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:13:59 INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:14:41 INFO [main] main(845) shutdown
Nov 13 15:14:53 INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:14:53 INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:14:53 INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:14:53 INFO [main] check_host_env(424) Allowed open files 1024000, suggested 61440