[sheepdog-users] some nodes not join to sheepdog cluster when using corosync.

Saeki Masaki saeki.masaki at po.ntts.co.jp
Fri May 23 07:06:19 CEST 2014


Hi,

While building a 12-node Sheepdog cluster with Corosync, we found some strange behavior.
After launching a sheep process on each server, some nodes did not join the cluster (more information below).
Which nodes failed to join was not consistent.

I'm not familiar with corosync, but corosync logged "enabling flow control"
because the send message buffer was full.

In our environment, the problem is unlikely to occur when the number of nodes is small.
We could not reproduce it with sheepdog v0.7.8,
and we could not reproduce it with corosync v2.3.3 either.

I have some questions:
1. Did the size of the message sent when joining the cluster increase between sheepdog v0.7.8 and v0.8.1?
2. Which corosync version is mainly used?

Details of the problem
---
The environment in which the problem occurred:
  CentOS6.5 ( 2.6.32-431.el6.x86_64)
  sheepdog v0.8.1
  corosync-1.4.1-17.el6_5.1

---
Preparation
 Delete the Sheepdog cluster files and data completely.

---
Steps to Reproduce
[root@sds01 ~]# ssh sds01 "sheep -p 7000 -b 192.168.2.11 -i host=192.168.2.11,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"
[root@sds01 ~]# ssh sds02 "sheep -p 7000 -b 192.168.2.12 -i host=192.168.2.12,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"
(Snip 9 nodes)
[root@sds01 ~]# ssh sds12 "sheep -p 7000 -b 192.168.2.22 -i host=192.168.2.22,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"

---
Confirmation of the results
 Some nodes showed a different node list and different logs.

[root@sds01 ~]# dog node list -a 192.168.2.11
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 80 201500864
   2 192.168.2.13:7000 140 218278080
   3 192.168.2.14:7000 145 235055296
   4 192.168.2.15:7000 147 251832512
   5 192.168.2.16:7000 145 268609728
   6 192.168.2.17:7000 147 285386944
   7 192.168.2.18:7000 145 302164160
   8 192.168.2.19:7000 146 318941376
   9 192.168.2.20:7000 144 335718592
  10 192.168.2.21:7000 146 352495808
  11 192.168.2.22:7000 119 369273024
[root@sds01 ~]# dog node list -a 192.168.2.16
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 82 201500864
   2 192.168.2.13:7000 143 218278080
   3 192.168.2.14:7000 148 235055296
   4 192.168.2.15:7000 150 251832512
   5 192.168.2.16:7000 148 268609728
   6 192.168.2.17:7000 150 285386944
   7 192.168.2.18:7000 149 302164160
   8 192.168.2.19:7000 149 318941376
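
The difference between the two views can be checked mechanically. The following is only a minimal sketch, assuming Python 3 and that the `dog node list` output has already been captured as text; the helper names `parse_node_list` and `missing_members` are hypothetical, and the sample rows are copied from the two listings above.

```python
def parse_node_list(text):
    """Return {"host:port": vnodes} parsed from captured `dog node list` output."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four columns: Id, Host:Port, V-Nodes, Zone.
        # The header row is skipped because its first field is not a number.
        if len(fields) == 4 and fields[0].isdigit():
            nodes[fields[1]] = int(fields[2])
    return nodes


def missing_members(reference, other):
    """Hosts present in the reference view but absent from the other view."""
    return sorted(set(reference) - set(other))


# View reported by 192.168.2.11 (copied from the listing above).
VIEW_FROM_SDS01 = """\
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 80 201500864
   2 192.168.2.13:7000 140 218278080
   3 192.168.2.14:7000 145 235055296
   4 192.168.2.15:7000 147 251832512
   5 192.168.2.16:7000 145 268609728
   6 192.168.2.17:7000 147 285386944
   7 192.168.2.18:7000 145 302164160
   8 192.168.2.19:7000 146 318941376
   9 192.168.2.20:7000 144 335718592
  10 192.168.2.21:7000 146 352495808
  11 192.168.2.22:7000 119 369273024
"""

# View reported by 192.168.2.16 (copied from the listing above).
VIEW_FROM_SDS06 = """\
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 82 201500864
   2 192.168.2.13:7000 143 218278080
   3 192.168.2.14:7000 148 235055296
   4 192.168.2.15:7000 150 251832512
   5 192.168.2.16:7000 148 268609728
   6 192.168.2.17:7000 150 285386944
   7 192.168.2.18:7000 149 302164160
   8 192.168.2.19:7000 149 318941376
"""

full_view = parse_node_list(VIEW_FROM_SDS01)
partial_view = parse_node_list(VIEW_FROM_SDS06)
missing = missing_members(full_view, partial_view)
print("missing from the 192.168.2.16 view:", missing)
```

With the two listings above, the view from 192.168.2.16 lacks 192.168.2.20, 192.168.2.21 and 192.168.2.22.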

---
sds01 sheepdog log
May 12 15:33:25 DEBUG [main] tx_main(832) 37, 192.168.2.21:58765
May 12 15:33:25 DEBUG [block] sockfd_cache_put_long(372) 192.168.2.21:7001 idx 0
May 12 15:33:28 DEBUG [main] cdrv_cpg_confchg(555) mem:12, joined:1, left:0
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 0
May 12 15:33:28 DEBUG [main] sd_join_handler(765) check IPv4 ip:192.168.2.22 port:7000, 2
May 12 15:33:28 DEBUG [main] sd_join_handler(778) 192.168.2.22:7000: cluster_status = 0x2
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] sd_accept_handler(907) join IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.11 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.12 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.13 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.14 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.15 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.16 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.17 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.18 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.19 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.20 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.21 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [main] update_cluster_info(646) status = 2, epoch = 0
May 12 15:33:28 DEBUG [main] sockfd_cache_add(239) 192.168.2.22:7000, count 12
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.11 port:7000 has 33 vnodes, free space 41482960896
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.12 port:7000 has 80 vnodes, free space 101949628416
(Snip)
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.22 port:7000 has 119 vnodes, free space 150564032512
May 12 15:33:28 DEBUG [block] do_get_vdis(495) try to get vdi bitmap from IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [block] sockfd_cache_get_long(344) create cache connection 192.168.2.22:7001 idx 0
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] __corosync_dispatch(373) wait for a next dispatch event
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [block] connect_to(209) 38, 192.168.2.22:7001
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [block] sockfd_cache_put_long(372) 192.168.2.22:7001 idx 0
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 39
May 12 15:33:28 DEBUG [main] client_handler(916) 1, 0
May 12 15:33:28 DEBUG [main] rx_main(780) 39, 192.168.2.22:43156
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
May 12 15:33:28 DEBUG [io 12749] do_process_work(1428) ab, 0, 0
May 12 15:33:28 DEBUG [main] client_handler(916) 4, 0
May 12 15:33:28 DEBUG [main] tx_main(832) 39, 192.168.2.22:43156

sds06 sheepdog log
May 12 15:33:25 DEBUG [main] tx_main(832) 35, 192.168.2.21:60702
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 36
May 12 15:33:28 DEBUG [main] client_handler(916) 1, 0
May 12 15:33:28 DEBUG [main] rx_main(780) 36, 192.168.2.22:38787
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
May 12 15:33:28 DEBUG [io 15732] do_process_work(1428) ab, 0, 0
May 12 15:33:28 DEBUG [main] client_handler(916) 4, 0
May 12 15:33:28 DEBUG [main] tx_main(832) 36, 192.168.2.22:38787
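
To see which events one node logged and the other did not, the two logs can be compared by function name. This is only a sketch, assuming Python 3 and that the line layout is the one shown in the excerpts above (date, time, level, thread, function(line), message); `LOG_RE` and `count_functions` are hypothetical names, and the sample lines are a short subset copied from the two logs.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above:
#   May 12 15:33:28 DEBUG [main] cdrv_cpg_confchg(555) mem:12, joined:1, left:0
LOG_RE = re.compile(
    r"^\w{3} +\d+ [\d:]+ +(\w+) \[([^\]]+)\] (\w+)\((\d+)\) (.*)$"
)


def count_functions(log_text):
    """Count how often each logging function name appears in a sheep log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


# Subset of the sds01 log above.
SDS01_LOG = """\
May 12 15:33:28 DEBUG [main] cdrv_cpg_confchg(555) mem:12, joined:1, left:0
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 0
May 12 15:33:28 DEBUG [main] sd_join_handler(765) check IPv4 ip:192.168.2.22 port:7000, 2
May 12 15:33:28 DEBUG [main] sd_accept_handler(907) join IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 39
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
"""

# Subset of the sds06 log above.
SDS06_LOG = """\
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 36
May 12 15:33:28 DEBUG [main] client_handler(916) 1, 0
May 12 15:33:28 DEBUG [main] rx_main(780) 36, 192.168.2.22:38787
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
"""

sds01_counts = count_functions(SDS01_LOG)
sds06_counts = count_functions(SDS06_LOG)
only_on_sds01 = sorted(set(sds01_counts) - set(sds06_counts))
print("logged on sds01 but not on sds06:", only_on_sds01)
```

On these excerpts, the corosync-related events (cdrv_cpg_confchg, cdrv_cpg_deliver) and the join handling (sd_join_handler, sd_accept_handler) for 192.168.2.22 appear only on sds01.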

Regards.



