I apologize in advance for this really long post, but I like to provide as much data as I can up front.

I have been experimenting with sheepdog for about a day, and one of our largest open issues was how to handle replication across different data centers. After browsing the listserv and the git repo, it looks like zones can handle this. So last night I tore down the cluster and set it up again with zones 1 and 2 and copies 2. That worked just like I wanted it to.

DC 1       DC 2
node173    node156
node174    node157

Today I went to do some testing on failing node(s). When I killed one of them, everything seemed OK, and when I brought it back it appeared to re-sync just fine. I waited until I saw "recovery complete", and then about 5 minutes later I killed an entire DC (or zone). In this example I killed DC 2, which had nodes 156 and 157. Listing which nodes had the vdi object, I could see that one node had the object (there should be 2) and one didn't, which I understand because the "mirror" side was down.

So then I brought zone 2 back up, and I could see recovery starting to take place. Once I saw "recovery complete" I ran:

[node156 ~]# collie node info
Id      Size    Used    Use%
 0      386 GB  21 GB     5%
 1      381 GB  17 GB     4%
 2      398 GB  21 GB     5%
 3      394 GB  17 GB     4%
Total   1.5 TB  76 GB     4%, total virtual VDI Size 100 GB

Looks good, I think. Then I ran:

[node156 ~]# collie node list
   Idx - Host:Port              Vnodes  Zone
---------------------------------------------
     0 - 192.168.0.156:7000       64     2
     1 - 192.168.0.157:7000       64     2
*    2 - 192.168.0.173:7000       64     1
     3 - 192.168.0.174:7000       64     1

Still looks good.
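Since the whole point of the zone setup is that each DC holds one copy, a quick check before killing a zone is to group the "collie node list" output by zone. Below is a minimal Python sketch of that check. It assumes the column layout shown in the listing above; the regex, the function name nodes_by_zone and the SAMPLE text are my own illustration and not part of collie.

```python
import re
from collections import defaultdict

# One data row of `collie node list`, e.g. "*    2 - 192.168.0.173:7000   64   1".
# The optional leading "*" is the marker collie printed in front of one row.
ROW_RE = re.compile(r"^\*?\s*(\d+)\s+-\s+(\S+):(\d+)\s+(\d+)\s+(\d+)\s*$")


def nodes_by_zone(node_list_output):
    """Group host:port entries by zone id; header and separator rows are skipped."""
    zones = defaultdict(list)
    for line in node_list_output.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            _idx, host, port, _vnodes, zone = match.groups()
            zones[int(zone)].append(f"{host}:{port}")
    return dict(zones)


# Sample input copied from the listing in this post.
SAMPLE = """\
   Idx - Host:Port              Vnodes  Zone
---------------------------------------------
     0 - 192.168.0.156:7000       64     2
     1 - 192.168.0.157:7000       64     2
*    2 - 192.168.0.173:7000       64     1
     3 - 192.168.0.174:7000       64     1
"""

zones = nodes_by_zone(SAMPLE)
copies = 2  # the value the cluster in this post was formatted with

for zone_id in sorted(zones):
    print(f"zone {zone_id}: {', '.join(zones[zone_id])}")

# Assumption: for every copy to land in a different zone, there must be
# at least as many distinct zones as copies.
print("enough zones for the configured copies:", len(zones) >= copies)
```

This only inspects the membership view of one node; it says nothing about whether the objects themselves were actually placed across zones.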
Then I ran:

[node156 ~]# collie cluster info
Cluster status: running

Creation time        Epoch Nodes
2011-09-15 20:21:18     15 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     14 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     13 [192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     12 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     11 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     10 [192.168.0.156:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18      9 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18      8 [192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
1996-09-04 21:47:32 825112369 [3030:3000:7d7f:0:4682:c294:7d7f:0, 80f4:e394:7d7f:0:90e0:3ae2:ff7f:0:57504, a8e0:3ae2:ff7f:0:100:0:100:0:21887, a50b:4000::d0de:e394:7d7f:0:21418, 3f00:0:7d7f:0:300:::14641, 3034:2032:313a:3437:3a33:3200:ff7f:0, 5a97:4000::5c78:9694:7d7f:0:46, c0ea:6000::f877:8a94:7d7f:0:50776, 403c:8a94:7d7f:0:ffff:ffff:::58232, 100::28c2:6000:0:0, 2000:0:2f00:0:1500:0:400:0:8, 300:0:f700:0:100:0:7d7f:0:51136, 9022:b801::bf00:0:3b8a:0:34560, 0:0:f0a0:500::, :::12596, ::8000:300:3:1c7f:1504:1:58232, 300::b0e1:3ae2:ff7f:0:57948, ::5a97:4000:0:0:9011, 100::78e3:3ae2:ff7f:0:58232, ca00::2095:4000:0:0:50624, c04f:4000::dba1:4000:0:0:51328, ::13a3:4000:0:0:51520, c06a:4000::, ::4fe2:3ae2:ff7f:0:1, ::88a9:9294:7d7f:0, c085:4000:::5779, 98e3:3ae2:ff7f:0:586:4000:::64360, :::7088, 70e3:3ae2:ff7f::, 5dec:8b94:7d7f:::58232, ::300:0:201e:4000:0:0, 4b16:7372:df8d:7014:b01b:4000:::58224, :::5707, 4b16:43aa:c8a4:8bea::ff7f:0, ::c085:4000:0:0:58232, 300::, b01b:4000::70e3:3ae2:ff7f:0, d91b:4000::68e3:3ae2:ff7f:0:28, 300::16f9:3ae2:ff7f:0:63773, 25f9:3ae2:ff7f:::63786, 47f9:3ae2:ff7f:0:52f9:3ae2:ff7f:0:63842, 70f9:3ae2:ff7f:0:93f9:3ae2:ff7f:0:63910,
b0f9:3ae2:ff7f:0:affe:3ae2:ff7f:0:65225, 15ff:3ae2:ff7f:0:1fff:3ae2:ff7f:0:65328, 47ff:3ae2:ff7f:0:4fff:3ae2:ff7f:0:65370, 67ff:3ae2:ff7f:0:9dff:3ae2:ff7f:0:65471, d4ff:3ae2:ff7f:::33, f0:3fe2:ff7f:0:1000:::62463, 600::10:0:0:0:17, 6400::300:0:0:0:64, 400::3800:0:0:0:5, 800::700:0:0:0:61440, 800:::9, b01b:4000:0:0:b00::, c00:::13, ::e00:0:0:0, 1700:::25, 79e5:3ae2:ff7f:0:1f00:::65511, f00::89e5:3ae2:ff7f:0, :::63232, 5439:b9ef:4638:8a25:8b78:3836:5f36:3400, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, ::, 6c6c:6965:63:6c75:7374:6572:69:6e66:111, 4d45:3d6e:6f64:6531:3536:2e63:6174:6177:24930, 4552:4d3d:7874:6572:6d00:5348:454c:4c3d:25135, 6800:4849:5354:5349:5a45:3d31:3030:3000:21331, 4e54:3d31:3532:2e34:312e:3231:312e:3131:8249, 3232:53:5348:5f54:5459:3d2f:6465:762f:29808, 4552:3d72:6f6f:7400:4c53:5f43:4f4c:4f52:15699, 693d:3031:3b33:343a:6c6e:3d30:313b:3336:27962, 693d:3430:3b33:333a:736f:3d30:313b:3335:25658, 353a:6264:3d34:303b:3333:3b30:313a:6364:13373, 313a:6f72:3d34:303b:3331:3b30:313a:6d69:12349, 373b:3431:3a73:753d:3337:3b34:313a:7367:13117, 613d:3330:3b34:313a:7477:3d33:303b:3432:28474, 323a:7374:3d33:373b:3434:3a65:783d:3031:13115, 723d:3031:3b33:313a:2a2e:7467:7a3d:3031:13115, 
6a3d:3031:3b33:313a:2a2e:7461:7a3d:3031:13115, 683d:3031:3b33:313a:2a2e:6c7a:6d61:3d30:15153, 6c7a:3d30:313b:3331:3a2a:2e74:787a:3d30:15153, 6970:3d30:313b:3331:3a2a:2e7a:3d30:313b:12595, 313b:3331:3a2a:2e64:7a3d:3031:3b33:313a:11818, 3331:3a2a:2e6c:7a3d:3031:3b33:313a:2a2e:31352, 3a2a:2e62:7a32:3d30:313b:3331:3a2a:2e74:31330, 3a2a:2e74:627a:323d:3031:3b33:313a:2a2e:31330, 3a2a:2e74:7a3d:3031:3b33:313a:2a2e:6465:15714, 2a2e:7270:6d3d:3031:3b33:313a:2a2e:6a61:15730, 2a2e:7261:723d:3031:3b33:313a:2a2e:6163:15717, 2a2e:7a6f:6f3d:3031:3b33:313a:2a2e:6370:28521, 3a2a:2e37:7a3d:3031:3b33:313a:2a2e:727a:12349, 2e6a:7067:3d30:313b:3335:3a2a:2e6a:7065:15719, 2a2e:6769:663d:3031:3b33:353a:2a2e:626d:15728, 2a2e:7062:6d3d:3031:3b33:353a:2a2e:7067:15725, 2a2e:7070:6d3d:3031:3b33:353a:2a2e:7467:15713, 2a2e:7862:6d3d:3031:3b33:353a:2a2e:7870:15725, 2a2e:7469:663d:3031:3b33:353a:2a2e:7469:26214, 3a2a:2e70:6e67:3d30:313b:3335:3a2a:2e73:26486, 3a2a:2e73:7667:7a3d:3031:3b33:353a:2a2e:28269, 353a:2a2e:7063:783d:3031:3b33:353a:2a2e:28525, 353a:2a2e:6d70:673d:3031:3b33:353a:2a2e:28781, 3335:3a2a:2e6d:3276:3d30:313b:3335:3a2a:27950, 3335:3a2a:2e6f:676d:3d30:313b:3335:3a2a:27950, 3335:3a2a:2e6d:3476:3d30:313b:3335:3a2a:27950, 3b33:353a:2a2e:766f:623d:3031:3b33:353a:11818, 3335:3a2a:2e6e:7576:3d30:313b:3335:3a2a:30510, 3335:3a2a:2e61:7366:3d30:313b:3335:3a2a:29230, 353a:2a2e:726d:7662:3d30:313b:3335:3a2a:26158, 3335:3a2a:2e61:7669:3d30:313b:3335:3a2a:26158, 3335:3a2a:2e66:6c76:3d30:313b:3335:3a2a:26414, 353a:2a2e:646c:3d30:313b:3335:3a2a:2e78:26211, 3a2a:2e78:7764:3d30:313b:3335:3a2a:2e79:30325, 3a2a:2e63:676d:3d30:313b:3335:3a2a:2e65:26221, 3a2a:2e61:7876:3d30:313b:3335:3a2a:2e61:30830, 3a2a:2e6f:6776:3d30:313b:3335:3a2a:2e6f:30823, 3a2a:2e61:6163:3d30:313b:3336:3a2a:2e61:15733, 2a2e:666c:6163:3d30:313b:3336:3a2a:2e6d:25705, 3a2a:2e6d:6964:693d:3031:3b33:363a:2a2e:27501, 363a:2a2e:6d70:333d:3031:3b33:363a:2a2e:28781, 363a:2a2e:6f67:673d:3031:3b33:363a:2aSegmentation fault So doesn't look 
good. Then I ran the same command, "collie cluster info", on the other zone 2 node, and it did the exact same thing: "Segmentation fault". Then I went to the zone 1 nodes and ran it there, and it segfaulted on them as well. So now the cluster is completely down. I have tried to find the node with the highest epoch and start it back up first, but no matter what I do I can't get the cluster up; it always segfaults.

On node173 I get:

Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000001
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000002
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000003
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000004
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000005
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000006
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000007
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000008
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000009
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000010
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000011
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000012
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000013
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000014
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000015
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000016
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969300000000
Sep 16 09:41:01 init_epoch_path(1934) found the vdi obj, 80f5969200000000
Sep 16 09:41:01 jrnl_recover(2240) Openning the directory /node/sheepdog/journal/00000016/.
Sep 16 09:41:01 worker_routine(206) started this thread 0
Sep 16 09:41:01 worker_routine(206) started this thread 0
Sep 16 09:41:01 worker_routine(206) started this thread 3
Sep 16 09:41:01 worker_routine(206) started this thread 1
Sep 16 09:41:01 worker_routine(206) started this thread 0
Sep 16 09:41:01 worker_routine(206) started this thread 0
Sep 16 09:41:01 worker_routine(206) started this thread 0
Sep 16 09:41:01 worker_routine(206) started this thread 1
Sep 16 09:41:01 worker_routine(206) started this thread 2
Sep 16 09:41:01 worker_routine(206) started this thread 2
Sep 16 09:41:01 worker_routine(206) started this thread 3
Sep 16 09:41:01 set_addr(1723) addr = 192.168.0.173, port = 7000
Sep 16 09:41:01 create_cluster(1778) zone id = 1
Sep 16 09:41:01 main(167) Sheepdog daemon (version 0.2.3) started
Sep 16 09:41:01 sd_confchg(1621) confchg nodeid add92998
Sep 16 09:41:01 sd_confchg(1623) 1 0 1
Sep 16 09:41:01 sd_confchg(1627) [0] node_id: -1378276968, pid: 19921, reason: 0
Sep 16 09:41:01 sd_confchg(1641) allow new confchg, 0x24e5020
Sep 16 09:41:01 start_cpg_event_work(1465) 0 0
Sep 16 09:41:01 cpg_event_fn(1279) 0x24e5020, 0 2
Sep 16 09:41:01 cpg_event_done(1315) 0x24e5020
Sep 16 09:41:01 __sd_confchg_done(1206) 19921 add92998
Sep 16 09:41:01 update_cluster_info(683) l nodeid: add92998, pid: 19921, ip: 192.168.0.173:7000
Sep 16 09:41:01 cpg_event_done(1373) free 0x24e5020

After this I tried to bring up node174.
node174 says:

Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000001
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000002
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000003
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000004
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000005
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000006
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000007
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000008
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000009
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000010
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000011
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000012
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000013
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000014
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000015
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000016
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969400000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969500000000
Sep 16 09:41:18 init_epoch_path(1934) found the vdi obj, 80f5969600000000
Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000017
Sep 16 09:41:18 jrnl_recover(2240) Openning the directory /node/sheepdog/journal/00000017/.
Sep 16 09:41:18 worker_routine(206) started this thread 0
Sep 16 09:41:18 worker_routine(206) started this thread 0
Sep 16 09:41:18 worker_routine(206) started this thread 1
Sep 16 09:41:18 worker_routine(206) started this thread 0
Sep 16 09:41:18 worker_routine(206) started this thread 0
Sep 16 09:41:18 worker_routine(206) started this thread 1
Sep 16 09:41:18 worker_routine(206) started this thread 0
Sep 16 09:41:18 worker_routine(206) started this thread 2
Sep 16 09:41:18 worker_routine(206) started this thread 2
Sep 16 09:41:18 worker_routine(206) started this thread 3
Sep 16 09:41:18 worker_routine(206) started this thread 3
Sep 16 09:41:18 set_addr(1723) addr = 192.168.0.174, port = 7000
Sep 16 09:41:18 create_cluster(1778) zone id = 1
Sep 16 09:41:18 main(167) Sheepdog daemon (version 0.2.3) started
Sep 16 09:41:18 sd_confchg(1621) confchg nodeid add92998
Sep 16 09:41:18 sd_confchg(1623) 2 0 1
Sep 16 09:41:18 sd_confchg(1627) [0] node_id: -1378276968, pid: 19921, reason: -1308593377
Sep 16 09:41:18 sd_confchg(1627) [1] node_id: -1361499752, pid: 24781, reason: 6485728
Sep 16 09:41:18 sd_confchg(1641) allow new confchg, 0x269a020
Sep 16 09:41:18 start_cpg_event_work(1465) 0 0
Sep 16 09:41:18 cpg_event_fn(1279) 0x269a020, 0 2
Sep 16 09:41:18 cpg_event_done(1315) 0x269a020
Sep 16 09:41:18 send_join_request(1168) 2933467544 24781
Sep 16 09:41:18 cpg_event_done(1373) free 0x269a020
Sep 16 09:41:18 sd_deliver(987) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000, nodeid: 2933467544, pid: 24781
Sep 16 09:41:18 sd_deliver(996) allow new deliver, 0x269a160
Sep 16 09:41:18 start_cpg_event_work(1465) 0 1
Sep 16 09:41:18 cpg_event_fn(1279) 0x269a160, 1 2
Sep 16 09:41:18 cpg_event_fn(1293) 1
Sep 16 09:41:18 __sd_deliver(839) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000, pid: 24781
Sep 16 09:41:18 cpg_event_done(1315) 0x269a160
Sep 16 09:41:18 __sd_deliver_done(955) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000
Sep 16 09:41:18 cpg_event_done(1373) free 0x269a160
Sep 16 09:41:18 sd_deliver(987) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000, nodeid: 2916690328, pid: 19921
Sep 16 09:41:18 sd_deliver(996) allow new deliver, 0x269a160
Sep 16 09:41:18 start_cpg_event_work(1465) 0 1
Sep 16 09:41:18 cpg_event_fn(1279) 0x269a160, 1 2
Sep 16 09:41:18 cpg_event_fn(1293) 3
Sep 16 09:41:18 __sd_deliver(839) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000, pid: 24781
Sep 16 09:41:18 cpg_event_done(1315) 0x269a160
Sep 16 09:41:18 update_cluster_info(611) failed to join sheepdog, 65
Sep 16 09:41:18 __sd_deliver_done(955) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000
Sep 16 09:41:18 cpg_event_done(1373) free 0x269a160

During this same time, node173 says:

Sep 16 09:41:18 sd_confchg(1621) confchg nodeid add92998
Sep 16 09:41:18 sd_confchg(1623) 2 0 1
Sep 16 09:41:18 sd_confchg(1627) [0] node_id: -1378276968, pid: 19921, reason: 1404584655
Sep 16 09:41:18 sd_confchg(1627) [1] node_id: -1361499752, pid: 24781, reason: 6485728
Sep 16 09:41:18 sd_confchg(1641) allow new confchg, 0x24e5020
Sep 16 09:41:18 start_cpg_event_work(1465) 0 0
Sep 16 09:41:18 cpg_event_fn(1279) 0x24e5020, 0 2
Sep 16 09:41:18 cpg_event_done(1315) 0x24e5020
Sep 16 09:41:18 __sd_confchg_done(1232) l nodeid: add92998, pid: 19921, ip: 192.168.0.173:7000
Sep 16 09:41:18 cpg_event_done(1373) free 0x24e5020
Sep 16 09:41:18 sd_deliver(987) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000, nodeid: 2933467544, pid: 24781
Sep 16 09:41:18 sd_deliver(996) allow new deliver, 0x24e51a0
Sep 16 09:41:18 start_cpg_event_work(1465) 0 1
Sep 16 09:41:18 cpg_event_fn(1279) 0x24e51a0, 1 2
Sep 16 09:41:18 cpg_event_fn(1293) 1
Sep 16 09:41:18 __sd_deliver(839) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000, pid: 24781
Sep 16 09:41:18 cpg_event_done(1315) 0x24e51a0
Sep 16 09:41:18 __sd_deliver_done(955) op: 1, state: 1, size: 32840, from: 192.168.0.174:7000
Sep 16 09:41:18 get_cluster_status(435) sheepdog is waiting with older epoch, 17 16 192.168.0.174:7000
Sep 16 09:41:18 cpg_event_done(1373) free 0x24e51a0
Sep 16 09:41:18 sd_deliver(987) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000, nodeid: 2916690328, pid: 19921
Sep 16 09:41:18 sd_deliver(996) allow new deliver, 0x24e51a0
Sep 16 09:41:18 start_cpg_event_work(1465) 0 1
Sep 16 09:41:18 cpg_event_fn(1279) 0x24e51a0, 1 2
Sep 16 09:41:18 cpg_event_fn(1293) 3
Sep 16 09:41:18 __sd_deliver(839) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000, pid: 24781
Sep 16 09:41:18 cpg_event_done(1315) 0x24e51a0
Sep 16 09:41:18 __sd_deliver_done(955) op: 1, state: 3, size: 32840, from: 192.168.0.174:7000
Sep 16 09:41:18 cpg_event_done(1373) free 0x24e51a0

Currently nothing on here is important; it is only test data. But I would like to know how to recover from this situation if it were for real.
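For anyone who wants to compare nodes the same way I did when looking for the one with the highest epoch, here is a minimal Python sketch that pulls the newest on-disk epoch out of a sheep log. It assumes the "found the obj dir" line format shown in the logs above, and that the eight-digit directory name is the epoch in decimal (the directories go from 00000009 to 00000010, so decimal seems right). The function name highest_epoch and the shortened sample logs are my own illustration, not part of sheepdog.

```python
import re

# Matches startup lines such as:
#   Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000017
OBJ_DIR_RE = re.compile(r"found the obj dir, \S*/obj/+(\d{8})\b")


def highest_epoch(log_text):
    """Return the largest epoch directory number seen in a sheep log, or None."""
    epochs = [int(match.group(1), 10) for match in OBJ_DIR_RE.finditer(log_text)]
    return max(epochs) if epochs else None


# Shortened samples: only the last two obj dir lines from each node's log above.
SAMPLE_LOGS = {
    "node173": (
        "Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000015\n"
        "Sep 16 09:41:01 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000016\n"
    ),
    "node174": (
        "Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000016\n"
        "Sep 16 09:41:18 init_epoch_path(1913) found the obj dir, /node/sheepdog/obj//00000017\n"
    ),
}

per_node = {node: highest_epoch(text) for node, text in SAMPLE_LOGS.items()}
for node, epoch in sorted(per_node.items()):
    print(f"{node}: highest epoch directory {epoch}")

# The node holding the newest epoch directory, i.e. the one I tried to start first.
newest = max(per_node, key=per_node.get)
print("newest epoch is on", newest)
```

On the samples this reports 16 for node173 and 17 for node174, which matches the "waiting with older epoch, 17 16" message in the node173 log. Whether starting the highest-epoch node first is actually the right recovery procedure is exactly what I am asking about.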