[sheepdog-users] Simultaneous startup of sheep daemon may fail

Andrew J. Hobbs ajhobbs at desu.edu
Wed Nov 13 15:58:55 CET 2013


Follow-up to my follow-up, after a closer look at the dog node list output.  I've seen this exact behavior before.  Here's what happened.

Nodes 1 and 2 were in one building; node 3 was in another building, reached over a 10G backbone.  Nodes 1 and 2 each listed the other.  Node 3 listed only itself.  In our case, it came down to multicast not being supported through our campus network core, which had to be traversed to get from 1/2 to 3.  This was the deciding factor that made me switch to zookeeper.  That, and the eventual goal to scale beyond the node count corosync can support.

This might not be the case if your machines are all on a single switch or virtualized, and frankly, it may not have been the case when we diagnosed it (it was a shotgun fix that logically makes sense after discussing the situation with networking staff).  However, I can say that we haven't re-experienced this issue since adopting zookeeper.
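
For anyone who wants to check the multicast theory on their own network, here is a minimal sketch of a generic UDP multicast probe in Python.  The group address and port are placeholders I made up for illustration; they are not corosync's settings, so substitute the mcastaddr and mcastport from your own corosync.conf.  The probe only tells you whether plain UDP multicast gets from one host to another.  It does not exercise corosync's own protocol, so treat a positive result as a hint rather than proof.

```python
import socket

# Placeholder values for illustration only (assumptions, not taken from
# this thread): replace with the group and port your cluster really uses.
GROUP = "239.255.42.99"
PORT = 50405


def build_mreq(group, iface_ip="0.0.0.0"):
    """Pack the 8-byte ip_mreq structure expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def listen(group=GROUP, port=PORT, timeout=10.0):
    """Join the group and wait for one datagram.

    Returns (payload, sender_address), or None if nothing arrives
    before the timeout expires.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        build_mreq(group))
        sock.settimeout(timeout)
        return sock.recvfrom(1024)
    except socket.timeout:
        return None
    finally:
        sock.close()


def send(message, group=GROUP, port=PORT, ttl=8):
    """Send one datagram to the group.

    The TTL must be larger than the number of routers between the hosts,
    otherwise the packet is dropped before it arrives.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(message, (group, port))
    finally:
        sock.close()
```

To use it, call listen() on one node, then call send(b"probe") from a node in the other building while the listener is still waiting.  If listen() returns None across buildings but returns the probe between two nodes on the same switch, that points in the same direction as what we saw.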

On 11/13/2013 09:49 AM, Andrew J. Hobbs wrote:

Might be worth trying to repeat using zookeeper.  In our cluster (we have nodes in several buildings now), corosync proved to simply not be reliable for our purposes.  The only reason I'm wondering about this is that it makes sense that during a mass startup (assuming the cluster was shut down properly), there might be a race condition or congestion causing lost packets.

On 11/13/2013 09:39 AM, Valerio Pachera wrote:
On my testing cluster I noticed that starting all sheep daemons at the "same time" may lead to failures in joining the cluster.

parallel-ssh -H 'test004 test005 test006 test007' /root/script/run_sheep.sh

root@test004:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.44:7000     128  738371776

root@test005:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.45:7000     119  755148992
   1   192.168.2.47:7000     137  788703424

root@test006:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.46:7000     128  771926208

root@test007:~# dog node list
  Id   Host:Port         V-Nodes       Zone
   0   192.168.2.45:7000     119  755148992
   1   192.168.2.47:7000     137  788703424
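
To make the split easier to spot, here is a rough Python sketch that groups hosts by the membership view each one reports.  It assumes you have already collected the text of dog node list from every host (for example via parallel-ssh) and that the output has the Id / Host:Port / V-Nodes / Zone layout shown above.  The function names and the SAMPLE dictionary are my own, made up for illustration; they are not part of sheepdog.

```python
def parse_node_list(output):
    """Return the set of Host:Port entries found in `dog node list` output."""
    members = set()
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric Id; the header row starts with "Id".
        if len(fields) >= 2 and fields[0].isdigit():
            members.add(fields[1])
    return frozenset(members)


def group_by_view(outputs):
    """Map each distinct membership view to the sorted hosts that report it."""
    views = {}
    for host, text in outputs.items():
        views.setdefault(parse_node_list(text), []).append(host)
    return {view: sorted(hosts) for view, hosts in views.items()}


HEADER = "  Id   Host:Port         V-Nodes       Zone\n"

# The four outputs from this thread, keyed by the host that produced them.
SAMPLE = {
    "test004": HEADER + "   0   192.168.2.44:7000     128  738371776\n",
    "test005": HEADER + "   0   192.168.2.45:7000     119  755148992\n"
                        "   1   192.168.2.47:7000     137  788703424\n",
    "test006": HEADER + "   0   192.168.2.46:7000     128  771926208\n",
    "test007": HEADER + "   0   192.168.2.45:7000     119  755148992\n"
                        "   1   192.168.2.47:7000     137  788703424\n",
}

views = group_by_view(SAMPLE)
for view, hosts in views.items():
    print(", ".join(hosts), "see", ", ".join(sorted(view)))

if len(views) > 1:
    print("membership is split into", len(views), "views")
```

With the outputs above, this reports three separate views: test004 alone, test006 alone, and test005 together with test007.  A healthy cluster would produce a single view containing all four nodes.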

It's not repeatable, though.
I shut down the cluster and re-ran parallel-ssh, and all nodes showed the right 'node list' (4 nodes total).

It's not a problem for me but I was wondering if anybody else noticed the same behavior.
I also wonder whether it depends on corosync or sheepdog.

I'm running sheep -v
and corosync 1.4.6.

I don't see anything useful in sheep.log

Nov 13 13:01:51   INFO [main] main(845) shutdown
Nov 13 15:11:19   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:11:19   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:11:19   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:11:19   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:11:19   INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:13:59   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:13:59   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:13:59   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:13:59   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 6144000
Nov 13 15:13:59   INFO [main] main(838) sheepdog daemon (version 0.7.0_197_g9f718d2) started
Nov 13 15:14:41   INFO [main] main(845) shutdown
Nov 13 15:14:53   INFO [main] md_add_disk(310) /mnt/sheep/dsk01, vdisk nr 217, total disk 1
Nov 13 15:14:53   INFO [main] md_add_disk(310) /mnt/sheep/dsk02, vdisk nr 233, total disk 2
Nov 13 15:14:53   INFO [main] send_join_request(777) IPv4 ip:192.168.2.44 port:7000
Nov 13 15:14:53   INFO [main] check_host_env(424) Allowed open files 1024000, suggested 61440







