[Sheepdog] Segmentation faults and cluster failure

Shawn Moore smmoore at gmail.com
Mon Sep 19 17:21:34 CEST 2011


> I sent a patch to show a correct output of 'collie cluster info'
> without segfault.  Can you try it out?

I went ahead and pulled commit "77f26b4"; my earlier testing had used "3a2801b".


> From your log messages, it looks like node174 stores a higher epoch.
> I think if you run a sheep daemon on node174 first, Sheepdog would
> work again.

I had already tried starting node174 first, but with the new code, at
least "collie cluster info" doesn't segfault anymore:
[root@node174 ~]# collie cluster info
Cluster status: Waiting for other nodes joining

Creation time        Epoch Nodes
2011-09-15 20:21:18     17 [192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     16 [192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     15 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     14 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     13 [192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     12 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     11 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     10 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]


But I still can't get the other nodes to join.  Here is the sheep.log
from node174:
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000001
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000002
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000003
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000004
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000005
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000006
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000007
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000008
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000009
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000010
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000011
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000012
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000013
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000014
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000015
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000016
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
/node/sheepdog/obj//00000017
Sep 19 11:08:43 jrnl_recover(2238) Openning the directory
/node/sheepdog/journal/00000017/.
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 3
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 1
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 0
Sep 19 11:08:43 worker_routine(206) started this thread 1
Sep 19 11:08:43 worker_routine(206) started this thread 2
Sep 19 11:08:43 worker_routine(206) started this thread 2
Sep 19 11:08:43 worker_routine(206) started this thread 3
Sep 19 11:08:43 set_addr(1723) addr = 192.168.0.174, port = 7000
Sep 19 11:08:43 create_cluster(1778) zone id = 1
Sep 19 11:08:43 main(167) Sheepdog daemon (version 0.2.3) started
Sep 19 11:08:43 sd_confchg(1621) confchg nodeid aed92998
Sep 19 11:08:43 sd_confchg(1623) 1 0 1
Sep 19 11:08:43 sd_confchg(1627) [0] node_id: aed92998, pid: 8646, reason: 0
Sep 19 11:08:43 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:08:43 start_cpg_event_work(1465) 0 0
Sep 19 11:08:43 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:08:43 cpg_event_done(1315) 0x254e020
Sep 19 11:08:43 __sd_confchg_done(1206) 8646 aed92998
Sep 19 11:08:43 update_cluster_info(683) l nodeid: aed92998, pid:
8646, ip: 192.168.0.174:7000
Sep 19 11:08:43 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:38 sd_confchg(1621) confchg nodeid add92998
Sep 19 11:09:38 sd_confchg(1623) 2 0 1
Sep 19 11:09:38 sd_confchg(1627) [0] node_id: add92998, pid: 8097,
reason: 1940777327
Sep 19 11:09:38 sd_confchg(1627) [1] node_id: aed92998, pid: 8646,
reason: 6485728
Sep 19 11:09:38 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:09:38 start_cpg_event_work(1465) 0 0
Sep 19 11:09:38 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:09:38 cpg_event_done(1315) 0x254e020
Sep 19 11:09:38 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:09:38 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:38 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.173:7000, nodeid: add92998, pid: 8097
Sep 19 11:09:38 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:38 start_cpg_event_work(1465) 0 1
Sep 19 11:09:38 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:38 cpg_event_fn(1293) 1
Sep 19 11:09:38 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.173:7000, pid: 8097
Sep 19 11:09:38 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:38 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.173:7000
Sep 19 11:09:38 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.173:7000
Sep 19 11:09:38 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:39 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.173:7000, nodeid: aed92998, pid: 8646
Sep 19 11:09:39 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:39 start_cpg_event_work(1465) 0 1
Sep 19 11:09:39 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:39 cpg_event_fn(1293) 3
Sep 19 11:09:39 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.173:7000, pid: 8097
Sep 19 11:09:39 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:39 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.173:7000
Sep 19 11:09:39 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:58 sd_confchg(1621) confchg nodeid 9cd92998
Sep 19 11:09:58 sd_confchg(1623) 3 0 1
Sep 19 11:09:58 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
Sep 19 11:09:58 sd_confchg(1627) [1] node_id: add92998, pid: 8097, reason: 0
Sep 19 11:09:58 sd_confchg(1627) [2] node_id: aed92998, pid: 8646, reason: 0
Sep 19 11:09:58 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:09:58 start_cpg_event_work(1465) 0 0
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:09:58 cpg_event_done(1315) 0x254e020
Sep 19 11:09:58 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e020
Sep 19 11:09:58 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.156:7000, nodeid: 9cd92998, pid: 14918
Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:58 cpg_event_fn(1293) 1
Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.156:7000, pid: 14918
Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.156:7000
Sep 19 11:09:58 get_cluster_status(440) sheepdog is waiting with newer
epoch, 15 17 192.168.0.156:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:09:58 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.156:7000, nodeid: aed92998, pid: 8646
Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:09:58 cpg_event_fn(1293) 3
Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.156:7000, pid: 14918
Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.156:7000
Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:04 sd_confchg(1621) confchg nodeid 9cd92998
Sep 19 11:10:04 sd_confchg(1623) 4 0 1
Sep 19 11:10:04 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
Sep 19 11:10:04 sd_confchg(1627) [1] node_id: 9dd92998, pid: 8515, reason: 0
Sep 19 11:10:04 sd_confchg(1627) [2] node_id: add92998, pid: 8097,
reason: 1940777327
Sep 19 11:10:04 sd_confchg(1627) [3] node_id: aed92998, pid: 8646,
reason: 6485728
Sep 19 11:10:04 sd_confchg(1641) allow new confchg, 0x254e020
Sep 19 11:10:04 start_cpg_event_work(1465) 0 0
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e020, 0 2
Sep 19 11:10:04 cpg_event_done(1315) 0x254e020
Sep 19 11:10:04 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
ip: 192.168.0.174:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e020
Sep 19 11:10:04 sd_deliver(987) op: 1, state: 1, size: 32840, from:
192.168.0.157:7000, nodeid: 9dd92998, pid: 8515
Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:10:04 cpg_event_fn(1293) 1
Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
192.168.0.157:7000, pid: 8515
Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
from: 192.168.0.157:7000
Sep 19 11:10:04 get_cluster_status(440) sheepdog is waiting with newer
epoch, 16 17 192.168.0.157:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:04 sd_deliver(987) op: 1, state: 3, size: 32840, from:
192.168.0.157:7000, nodeid: aed92998, pid: 8646
Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
Sep 19 11:10:04 cpg_event_fn(1293) 3
Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
192.168.0.157:7000, pid: 8515
Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
from: 192.168.0.157:7000
Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
Sep 19 11:10:10 listen_handler(613) accepted a new connection, 11
Sep 19 11:10:10 queue_request(211) 82
Sep 19 11:10:10 start_cpg_event_work(1465) 0 2
Sep 19 11:10:10 cluster_queue_request(261) 0x7f92a13fb010 82
Sep 19 11:10:10 client_handler(563) closed a connection, 11
Sep 19 11:10:13 listen_handler(613) accepted a new connection, 11
Sep 19 11:10:13 queue_request(211) 87
Sep 19 11:10:13 start_cpg_event_work(1465) 0 2
Sep 19 11:10:13 cluster_queue_request(261) 0x254e340 87
Sep 19 11:10:13 client_handler(563) closed a connection, 11
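
The lines that look relevant in the log above are the get_cluster_status
ones ("sheepdog is waiting with newer epoch"). As a rough sketch only, this
is how they could be pulled out of a sheep.log that has been wrapped by a
mail client like this one. The function names are made up for illustration,
and I am assuming (not confirmed from the sheepdog source) that the first
number is the joining node's epoch and the second is the local epoch on
node174.

```python
import re

# A log entry starts with a syslog-style timestamp such as "Sep 19 11:09:38 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

# ASSUMPTION: first number = joining node's epoch, second = local epoch.
WAITING = re.compile(
    r"get_cluster_status\(\d+\) sheepdog is waiting with newer epoch, "
    r"(\d+) (\d+) (\S+)"
)


def unwrap(lines):
    """Rejoin entries that were wrapped onto a following line."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            # No timestamp: treat as the continuation of the previous entry.
            entries[-1] += " " + line.strip()
    return entries


def epoch_mismatches(lines):
    """Return (address, joining_epoch, local_epoch) for each waiting line."""
    found = []
    for entry in unwrap(lines):
        match = WAITING.search(entry)
        if match:
            joining, local, addr = match.groups()
            found.append((addr, int(joining), int(local)))
    return found


def report(path):
    """Print one summary line per mismatch found in the log file at `path`."""
    with open(path) as log:
        for addr, joining, local in epoch_mismatches(log):
            print(f"{addr}: epoch {joining}, local epoch {local}")
```

Run against the log above, report() should list 192.168.0.173:7000 with
16, 192.168.0.156:7000 with 15 and 192.168.0.157:7000 with 16, each
against 17 on node174.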


Thanks for your assistance with this.


