[sheepdog-users] cannot add new node into cluster
icez network
icez at icez.net
Thu Feb 14 16:24:17 CET 2013
Hello,
I'm having trouble adding a new node to an existing cluster. My current
cluster consists of 4 servers, each running a gateway on port 7000 and a data
node on port 7001, with sheepdog version 0.5.6.
The current 'cluster info' and 'node list' output is:
Cluster status: running
Cluster created at Fri Feb 8 21:19:59 2013
Epoch Time Version
2013-02-13 17:27:22 11 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 17:26:15 10 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000,
172.16.0.94:7001]
2013-02-13 16:19:14 9 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 16:19:14 8 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000,
172.16.0.94:7001]
2013-02-13 12:50:48 7 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000, 172.16.0.94:7001]
2013-02-13 03:43:46 6 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000]
2013-02-13 00:07:55 5 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001]
2013-02-13 00:06:45 4 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000]
2013-02-08 21:21:32 3 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000, 172.16.0.94:7001]
2013-02-08 21:21:07 2 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.93:7000, 172.16.0.93:7001, 172.16.0.94:7000,
172.16.0.94:7001]
2013-02-08 21:20:00 1 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000, 172.16.0.94:7001]
M Id Host:Port V-Nodes Zone
- 0 172.16.0.91:7000 0 1526730924
- 1 172.16.0.91:7001 73 1526730924
- 2 172.16.0.92:7000 0 1543508140
- 3 172.16.0.92:7001 22 1543508140
- 4 172.16.0.93:7000 0 1560285356
- 5 172.16.0.93:7001 74 1560285356
- 6 172.16.0.94:7000 0 1577062572
- 7 172.16.0.94:7001 87 1577062572
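To make the epoch history above easier to read, here is a minimal sketch that compares the member lists of two epochs and reports which nodes joined or left. The helper name diff_epochs is only an illustration and is not part of sheepdog; the two lists are copied by hand from epochs 10 and 11 above.

```python
def diff_epochs(older, newer):
    """Return (joined, left) between two epoch member lists."""
    old, new = set(older), set(newer)
    return sorted(new - old), sorted(old - new)


# Member lists copied from the 'cluster info' output above.
epoch_10 = [
    "172.16.0.91:7000", "172.16.0.91:7001",
    "172.16.0.92:7000",
    "172.16.0.93:7000", "172.16.0.93:7001",
    "172.16.0.94:7000", "172.16.0.94:7001",
]
epoch_11 = [
    "172.16.0.91:7000", "172.16.0.91:7001",
    "172.16.0.92:7000", "172.16.0.92:7001",
    "172.16.0.93:7000", "172.16.0.93:7001",
    "172.16.0.94:7000", "172.16.0.94:7001",
]

joined, left = diff_epochs(epoch_10, epoch_11)
print("joined:", joined)
print("left:", left)
```

For these two epochs the only change is that the data node 172.16.0.92:7001 rejoined.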
==============================
Now I want to add a new server to it. I first start the gateway with
'/opt/sheep/sbin/sheep -g /vz/sheep-gw'. This one joins the cluster
without any problem, and the cluster epoch is then incremented to version 12:
2013-02-14 21:47:53 12 [172.16.0.91:7000, 172.16.0.91:7001,
172.16.0.92:7000, 172.16.0.92:7001, 172.16.0.93:7000, 172.16.0.93:7001,
172.16.0.94:7000, 172.16.0.94:7001, 172.16.0.95:7000]
But the data node does not behave the same way. I use the command
'/opt/sheep/sbin/sheep -s 900000 -p 7001 /vz/sheep-data' to start the
sheepdog daemon, and this process cannot join the cluster. The log below is
from when I tried starting it 3 times. The cluster epoch increased by 6
(a node join plus a node leave, 3 times).
Feb 14 21:57:55 [main] jrnl_recover(230) opening the directory /vz/sheep-data/journal/
Feb 14 21:57:55 [main] jrnl_recover(235) starting journal recovery
Feb 14 21:57:55 [main] jrnl_recover(291) journal recovery complete
Feb 14 21:57:55 [main] init_signal(171) register signal_handler for 10
Feb 14 21:57:55 [main] init_disk_space(371) disk free space is 943718400000
Feb 14 21:57:55 [main] create_cluster(1134) use corosync cluster driver as default
Feb 14 21:57:55 [main] create_cluster(1163) zone id = 1593839788
Feb 14 21:57:55 [main] send_join_request(998) IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] check_host_env(419) Allowed core file size 0, suggested unlimited
Feb 14 21:57:55 [main] main(690) sheepdog daemon (version 0.5.6) started
Feb 14 21:57:55 [main] cdrv_cpg_confchg(579) mem:10, joined:1, left:0
Feb 14 21:57:55 [main] cdrv_cpg_confchg(656) Not promoting because member is not in our event list.
Feb 14 21:57:55 [main] cdrv_cpg_deliver(472) 0
Feb 14 21:57:55 [main] cdrv_cpg_deliver(472) 1
Feb 14 21:57:55 [main] sd_join_handler(1028) join IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [0] IPv4 ip:172.16.0.91 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [1] IPv4 ip:172.16.0.93 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [2] IPv4 ip:172.16.0.94 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [3] IPv4 ip:172.16.0.91 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [4] IPv4 ip:172.16.0.93 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [5] IPv4 ip:172.16.0.94 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [6] IPv4 ip:172.16.0.92 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [7] IPv4 ip:172.16.0.92 port:7001
Feb 14 21:57:55 [main] sd_join_handler(1030) [8] IPv4 ip:172.16.0.95 port:7000
Feb 14 21:57:55 [main] sd_join_handler(1030) [9] IPv4 ip:172.16.0.95 port:7001
Feb 14 21:57:55 [main] update_cluster_info(783) status = 1, epoch = 18, finished: 0
Feb 14 21:57:55 [main] crash_handler(322) sheep pid 6965 exited unexpectedly.
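As a quick check of the epoch numbers, here is a small sketch of the arithmetic. It assumes that every join and every leave bumps the epoch by exactly one, which is only what the numbers above suggest; the function name expected_epoch is made up for this illustration.

```python
def expected_epoch(start_epoch, attempts, changes_per_attempt=2):
    """Epoch after a number of failed starts.

    Assumption: each failed start causes one join and one leave,
    and each membership change increments the epoch by one.
    """
    return start_epoch + attempts * changes_per_attempt


# Epoch 12 after the gateway joined, then 3 failed data node starts.
print(expected_epoch(12, 3))
```

Under that assumption the result is 18, the same value that update_cluster_info reports in the log ("epoch = 18").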
Here's the strace output for the command 'strace /opt/sheep/sbin/sheep -s 900000 -f
-d -p 7001 /vz/sheep-data' (foreground, debug):
http://pastebin.com/ej7787XC
By the way, there is no data loss in the cluster; the only problem is that
I can't join the new node.
--
Personal hosting by icez network
http://www.thzhost.com