Hello,

I'm having trouble adding a new node to an existing cluster. My current cluster consists of 4 servers, each running a gateway on port 7000 and a data node on port 7001, on sheepdog version 0.5.6.

The current 'cluster info' and 'node list' output is:

Cluster status: running

Cluster created at Fri Feb 8 21:19:59 2013

Epoch Time           Version
2013-02-13 17:27:22     11 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 17:26:15     10 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 16:19:14      9 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 16:19:14      8 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 12:50:48      7 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 03:43:46      6 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000]
2013-02-13 00:07:55      5 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001]
2013-02-13 00:06:45      4 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000]
2013-02-08 21:21:32      3 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-08 21:21:07      2 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-08 21:20:00      1 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]

M   Id   Host:Port           V-Nodes   Zone
-    0   172.16.0.91:7000          0   1526730924
-    1   172.16.0.91:7001         73   1526730924
-    2   172.16.0.92:7000          0   1543508140
-    3   172.16.0.92:7001         22   1543508140
-    4   172.16.0.93:7000          0   1560285356
-    5   172.16.0.93:7001         74   1560285356
-    6   172.16.0.94:7000          0   1577062572
-    7   172.16.0.94:7001         87   1577062572

==============================

Now I want to add a new server to the cluster. I first start the gateway with '/opt/sheep/sbin/sheep -g /vz/sheep-gw'. This one joins the cluster without any problem. The cluster epoch is then incremented to version 12:

2013-02-14 21:47:53     12 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001, 172.16.0.95:7000]

But it is not the same for the data node. I use the command '/opt/sheep/sbin/sheep -s 900000 -p 7001 /vz/sheep-data' to start the sheepdog daemon. This process cannot join the cluster. Below is the log after I tried starting it 3 times. The cluster epoch has increased by 6 (node join + node leave, 3 times).

Feb 14 21:57:55 [main] jrnl_recover(230) opening the directory /vz/sheep-data/journal/
Feb 14 21:57:55 [main] jrnl_recover(235) starting journal recovery
Feb 14 21:57:55 [main] jrnl_recover(291) journal recovery complete
Feb 14 21:57:55 [main] init_signal(171) register signal_handler for 10
Feb 14 21:57:55 [main] init_disk_space(371) disk free space is 943718400000
Feb 14 21:57:55 [main] create_cluster(1134) use corosync cluster driver as default
Feb 14 21:57:55 [main] create_cluster(1163) zone id = 1593839788
Feb 14 21:57:55 [main] send_join_request(998) IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] check_host_env(419) Allowed core file size 0, suggested unlimited
Feb 14 21:57:55 [main] main(690) sheepdog daemon (version 0.5.6) started
Feb 14 21:57:55 [main] cdrv_cpg_confchg(579) mem:10, joined:1, left:0
Feb 14 21:57:55 [main] cdrv_cpg_confchg(656) Not promoting because member is not in our event list.
Feb 14 21:57:55 [main] cdrv_cpg_deliver(472) 0
Feb 14 21:57:55 [main] cdrv_cpg_deliver(472) 1
Feb 14 21:57:55 [main] sd_join_handler(1028) join IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [0] IPv4 ip:172.16.0.91 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [1] IPv4 ip:172.16.0.93 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [2] IPv4 ip:172.16.0.94 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [3] IPv4 ip:172.16.0.91 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [4] IPv4 ip:172.16.0.93 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [5] IPv4 ip:172.16.0.94 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [6] IPv4 ip:172.16.0.92 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [7] IPv4 ip:172.16.0.92 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [8] IPv4 ip:172.16.0.95 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [9] IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] update_cluster_info(783) status = 1, epoch = 18, finished: 0
Feb 14 21:57:55 [main] crash_handler(322) sheep pid 6965 exited unexpectedly.

Here is the strace output for the command 'strace /opt/sheep/sbin/sheep -s 900000 -f -d -p 7001 /vz/sheep-data' (foreground, debug):

http://pastebin.com/ej7787XC

By the way, there is no data loss in the cluster; the only problem is that I can't join the new node.

-- 
Personal hosting by icez network
http://www.thzhost.com
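P.S. In case it helps anyone reading the epoch history in this message: here is a minimal Python sketch for listing which nodes joined or left between consecutive epochs. It is not part of sheepdog. The function names (parse_epochs, membership_changes) are my own, and it assumes each epoch line of the 'cluster info' output looks exactly like the ones pasted here (date, time, epoch number, bracketed member list on one line). The sample text is three of those lines.

```python
import re

# One epoch line as pasted in this message (format is an assumption):
#   <date> <time> <epoch> [<ip:port>, <ip:port>, ...]
EPOCH_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[(.*)\]\s*$"
)


def parse_epochs(text):
    """Return {epoch number: set of 'ip:port'} for each epoch line in text."""
    epochs = {}
    for line in text.splitlines():
        m = EPOCH_LINE.match(line.strip())
        if m:
            members = {n.strip() for n in m.group(3).split(",") if n.strip()}
            epochs[int(m.group(2))] = members
    return epochs


def membership_changes(epochs):
    """Yield (epoch, joined, left) for each pair of consecutive known epochs."""
    ordered = sorted(epochs)
    for prev, cur in zip(ordered, ordered[1:]):
        joined = sorted(epochs[cur] - epochs[prev])
        left = sorted(epochs[prev] - epochs[cur])
        yield cur, joined, left


# Epochs 10 to 12, copied from the output in this message.
SAMPLE = """\
2013-02-14 21:47:53 12 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001, 172.16.0.95:7000]
2013-02-13 17:27:22 11 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 17:26:15 10 [172.16.0.91:7000, 172.16.0.91:7001, 172.16.0.92:7000, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
"""

if __name__ == "__main__":
    for epoch, joined, left in membership_changes(parse_epochs(SAMPLE)):
        print(f"epoch {epoch}: joined={joined} left={left}")
```

Running the same function over the full history after the three failed start attempts should show 172.16.0.95:7001 alternately joining and leaving, which is where the 6 extra epochs come from.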