[Sheepdog] Segmentation faults and cluster failure

Liu Yuan namei.unix at gmail.com
Tue Sep 20 03:55:19 CEST 2011


On 09/19/2011 11:21 PM, Shawn Moore wrote:
>> I sent a patch to show a correct output of 'collie cluster info'
>> without segfault.  Can you try it out?
> I went ahead and pulled down "77f26b4" as I was using "3a2801b" for my testing.
>
>
>>  From your log messages, it looks like node174 stores a higher epoch.
>> I think if you run a sheep daemon on node174 first, Sheepdog would
>> work again.
> I had already tried starting node174 first, but with the new code, at
> least "collie cluster info" doesn't segfault anymore:
> [root@node174 ~]# collie cluster info
> Cluster status: Waiting for other nodes joining
>
> Creation time        Epoch Nodes
> 2011-09-15 20:21:18     17 [192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     16 [192.168.0.157:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     15 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     14 [192.168.0.156:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     13 [192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     12 [192.168.0.156:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     11 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     10 [192.168.0.156:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
>
>
> But I still can't get the other nodes to join.  Here is the sheep.log
> from node174:
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000001
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000002
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000003
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000004
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000005
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000006
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000007
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000008
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000009
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000010
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000011
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000012
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000013
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000014
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000015
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000016
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969400000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969500000000
> Sep 19 11:08:43 init_epoch_path(1932) found the vdi obj, 80f5969600000000
> Sep 19 11:08:43 init_epoch_path(1911) found the obj dir,
> /node/sheepdog/obj//00000017
> Sep 19 11:08:43 jrnl_recover(2238) Openning the directory
> /node/sheepdog/journal/00000017/.
> Sep 19 11:08:43 worker_routine(206) started this thread 0
> Sep 19 11:08:43 worker_routine(206) started this thread 0
> Sep 19 11:08:43 worker_routine(206) started this thread 3
> Sep 19 11:08:43 worker_routine(206) started this thread 0
> Sep 19 11:08:43 worker_routine(206) started this thread 1
> Sep 19 11:08:43 worker_routine(206) started this thread 0
> Sep 19 11:08:43 worker_routine(206) started this thread 0
> Sep 19 11:08:43 worker_routine(206) started this thread 1
> Sep 19 11:08:43 worker_routine(206) started this thread 2
> Sep 19 11:08:43 worker_routine(206) started this thread 2
> Sep 19 11:08:43 worker_routine(206) started this thread 3
> Sep 19 11:08:43 set_addr(1723) addr = 192.168.0.174, port = 7000
> Sep 19 11:08:43 create_cluster(1778) zone id = 1
> Sep 19 11:08:43 main(167) Sheepdog daemon (version 0.2.3) started
> Sep 19 11:08:43 sd_confchg(1621) confchg nodeid aed92998
> Sep 19 11:08:43 sd_confchg(1623) 1 0 1
> Sep 19 11:08:43 sd_confchg(1627) [0] node_id: aed92998, pid: 8646, reason: 0
> Sep 19 11:08:43 sd_confchg(1641) allow new confchg, 0x254e020
> Sep 19 11:08:43 start_cpg_event_work(1465) 0 0
> Sep 19 11:08:43 cpg_event_fn(1279) 0x254e020, 0 2
> Sep 19 11:08:43 cpg_event_done(1315) 0x254e020
> Sep 19 11:08:43 __sd_confchg_done(1206) 8646 aed92998
> Sep 19 11:08:43 update_cluster_info(683) l nodeid: aed92998, pid:
> 8646, ip: 192.168.0.174:7000
> Sep 19 11:08:43 cpg_event_done(1373) free 0x254e020
> Sep 19 11:09:38 sd_confchg(1621) confchg nodeid add92998
> Sep 19 11:09:38 sd_confchg(1623) 2 0 1
> Sep 19 11:09:38 sd_confchg(1627) [0] node_id: add92998, pid: 8097,
> reason: 1940777327
> Sep 19 11:09:38 sd_confchg(1627) [1] node_id: aed92998, pid: 8646,
> reason: 6485728
> Sep 19 11:09:38 sd_confchg(1641) allow new confchg, 0x254e020
> Sep 19 11:09:38 start_cpg_event_work(1465) 0 0
> Sep 19 11:09:38 cpg_event_fn(1279) 0x254e020, 0 2
> Sep 19 11:09:38 cpg_event_done(1315) 0x254e020
> Sep 19 11:09:38 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
> ip: 192.168.0.174:7000
> Sep 19 11:09:38 cpg_event_done(1373) free 0x254e020
> Sep 19 11:09:38 sd_deliver(987) op: 1, state: 1, size: 32840, from:
> 192.168.0.173:7000, nodeid: add92998, pid: 8097
> Sep 19 11:09:38 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:09:38 start_cpg_event_work(1465) 0 1
> Sep 19 11:09:38 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:09:38 cpg_event_fn(1293) 1
> Sep 19 11:09:38 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
> 192.168.0.173:7000, pid: 8097
> Sep 19 11:09:38 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:09:38 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
> from: 192.168.0.173:7000
> Sep 19 11:09:38 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 16 17 192.168.0.173:7000
> Sep 19 11:09:38 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:09:39 sd_deliver(987) op: 1, state: 3, size: 32840, from:
> 192.168.0.173:7000, nodeid: aed92998, pid: 8646
> Sep 19 11:09:39 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:09:39 start_cpg_event_work(1465) 0 1
> Sep 19 11:09:39 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:09:39 cpg_event_fn(1293) 3
> Sep 19 11:09:39 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
> 192.168.0.173:7000, pid: 8097
> Sep 19 11:09:39 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:09:39 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
> from: 192.168.0.173:7000
> Sep 19 11:09:39 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:09:58 sd_confchg(1621) confchg nodeid 9cd92998
> Sep 19 11:09:58 sd_confchg(1623) 3 0 1
> Sep 19 11:09:58 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
> Sep 19 11:09:58 sd_confchg(1627) [1] node_id: add92998, pid: 8097, reason: 0
> Sep 19 11:09:58 sd_confchg(1627) [2] node_id: aed92998, pid: 8646, reason: 0
> Sep 19 11:09:58 sd_confchg(1641) allow new confchg, 0x254e020
> Sep 19 11:09:58 start_cpg_event_work(1465) 0 0
> Sep 19 11:09:58 cpg_event_fn(1279) 0x254e020, 0 2
> Sep 19 11:09:58 cpg_event_done(1315) 0x254e020
> Sep 19 11:09:58 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
> ip: 192.168.0.174:7000
> Sep 19 11:09:58 cpg_event_done(1373) free 0x254e020
> Sep 19 11:09:58 sd_deliver(987) op: 1, state: 1, size: 32840, from:
> 192.168.0.156:7000, nodeid: 9cd92998, pid: 14918
> Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
> Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:09:58 cpg_event_fn(1293) 1
> Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
> 192.168.0.156:7000, pid: 14918
> Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
> from: 192.168.0.156:7000
> Sep 19 11:09:58 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 15 17 192.168.0.156:7000
> Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:09:58 sd_deliver(987) op: 1, state: 3, size: 32840, from:
> 192.168.0.156:7000, nodeid: aed92998, pid: 8646
> Sep 19 11:09:58 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:09:58 start_cpg_event_work(1465) 0 1
> Sep 19 11:09:58 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:09:58 cpg_event_fn(1293) 3
> Sep 19 11:09:58 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
> 192.168.0.156:7000, pid: 14918
> Sep 19 11:09:58 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:09:58 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
> from: 192.168.0.156:7000
> Sep 19 11:09:58 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:10:04 sd_confchg(1621) confchg nodeid 9cd92998
> Sep 19 11:10:04 sd_confchg(1623) 4 0 1
> Sep 19 11:10:04 sd_confchg(1627) [0] node_id: 9cd92998, pid: 14918, reason: 0
> Sep 19 11:10:04 sd_confchg(1627) [1] node_id: 9dd92998, pid: 8515, reason: 0
> Sep 19 11:10:04 sd_confchg(1627) [2] node_id: add92998, pid: 8097,
> reason: 1940777327
> Sep 19 11:10:04 sd_confchg(1627) [3] node_id: aed92998, pid: 8646,
> reason: 6485728
> Sep 19 11:10:04 sd_confchg(1641) allow new confchg, 0x254e020
> Sep 19 11:10:04 start_cpg_event_work(1465) 0 0
> Sep 19 11:10:04 cpg_event_fn(1279) 0x254e020, 0 2
> Sep 19 11:10:04 cpg_event_done(1315) 0x254e020
> Sep 19 11:10:04 __sd_confchg_done(1232) l nodeid: aed92998, pid: 8646,
> ip: 192.168.0.174:7000
> Sep 19 11:10:04 cpg_event_done(1373) free 0x254e020
> Sep 19 11:10:04 sd_deliver(987) op: 1, state: 1, size: 32840, from:
> 192.168.0.157:7000, nodeid: 9dd92998, pid: 8515
> Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
> Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:10:04 cpg_event_fn(1293) 1
> Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 1, size: 32840, from:
> 192.168.0.157:7000, pid: 8515
> Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 1, size: 32840,
> from: 192.168.0.157:7000
> Sep 19 11:10:04 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 16 17 192.168.0.157:7000
> Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:10:04 sd_deliver(987) op: 1, state: 3, size: 32840, from:
> 192.168.0.157:7000, nodeid: aed92998, pid: 8646
> Sep 19 11:10:04 sd_deliver(996) allow new deliver, 0x254e1a0
> Sep 19 11:10:04 start_cpg_event_work(1465) 0 1
> Sep 19 11:10:04 cpg_event_fn(1279) 0x254e1a0, 1 2
> Sep 19 11:10:04 cpg_event_fn(1293) 3
> Sep 19 11:10:04 __sd_deliver(839) op: 1, state: 3, size: 32840, from:
> 192.168.0.157:7000, pid: 8515
> Sep 19 11:10:04 cpg_event_done(1315) 0x254e1a0
> Sep 19 11:10:04 __sd_deliver_done(955) op: 1, state: 3, size: 32840,
> from: 192.168.0.157:7000
> Sep 19 11:10:04 cpg_event_done(1373) free 0x254e1a0
> Sep 19 11:10:10 listen_handler(613) accepted a new connection, 11
> Sep 19 11:10:10 queue_request(211) 82
> Sep 19 11:10:10 start_cpg_event_work(1465) 0 2
> Sep 19 11:10:10 cluster_queue_request(261) 0x7f92a13fb010 82
> Sep 19 11:10:10 client_handler(563) closed a connection, 11
> Sep 19 11:10:13 listen_handler(613) accepted a new connection, 11
> Sep 19 11:10:13 queue_request(211) 87
> Sep 19 11:10:13 start_cpg_event_work(1465) 0 2
> Sep 19 11:10:13 cluster_queue_request(261) 0x254e340 87
> Sep 19 11:10:13 client_handler(563) closed a connection, 11
>
>
> Thanks for your assistance with this

So I guess you shut down the cluster with the 'collie cluster shutdown'
command, right? Could you please attach the logs from the nodes that
would not join?
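
When comparing logs from several nodes, the "waiting with newer epoch"
lines are a quick thing to check first. The following is only a rough
sketch for pulling them out of a sheep.log. It assumes the message
format seen in the quoted log, and it assumes that the first number is
the joining node's epoch and the second is the local node's epoch (that
reading is inferred from the output above, not confirmed from the
source). The helper name epochs_from_log is made up for illustration.

```python
import re

# Assumed format, inferred from the quoted log (not from the sheep source):
#   "... sheepdog is waiting with newer epoch, <joining> <local> <ip:port>"
PATTERN = re.compile(r"waiting with newer\s+epoch,\s+(\d+)\s+(\d+)\s+(\S+)")


def epochs_from_log(text):
    """Map each joining node address to (joining epoch, local epoch)."""
    # Drop mail quote markers so lines wrapped by the mailer still match.
    unquoted = re.sub(r"^> ?", "", text, flags=re.MULTILINE)
    result = {}
    for joining, local, addr in PATTERN.findall(unquoted):
        result[addr] = (int(joining), int(local))
    return result


# Sample lines copied from the node174 log quoted above.
sample = """\
> Sep 19 11:09:38 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 16 17 192.168.0.173:7000
> Sep 19 11:09:58 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 15 17 192.168.0.156:7000
> Sep 19 11:10:04 get_cluster_status(440) sheepdog is waiting with newer
> epoch, 16 17 192.168.0.157:7000
"""

for addr, (joining, local) in sorted(epochs_from_log(sample).items()):
    print(f"{addr}: epoch {joining}, {local - joining} behind local epoch {local}")
```

Running the same function over the sheep.log of each node that will not
join should show at a glance which nodes are behind and by how much.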

I think the patch set 'sheep: teach sheepdog to better recovery the
shut-down cluster' might solve your problem if you are having trouble
recovering the cluster from the shutdown state. Kazutaka may still be
reviewing it, though, so please wait for it to be merged.

Thanks,
Yuan


