> BTW, would you please try 'collie cluster info' to check if the outputs are
> consistent on each node.

In my testing last night, I went from 4 nodes (2 zones of 2 nodes) to 6 nodes (3 zones of 2 nodes):

ZONE 1: node173, node174
ZONE 2: node156, node157
ZONE 3: blade161, blade162

These two nodes (blade161 and blade162) were added to the already running cluster, so the number of copies is still 2. Is there any way to change the number of copies after cluster creation and re-distribute?

Nodes 173, 174, 156 and 157 all said the same thing:

2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 28 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 27 [192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]

When I get to blade161, I see:

2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 28 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 27 [192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
1969-12-31 19:00:21 19 [b00:0:4500:0:300:0:6c01:0, b0b9:ffff:ffff:ffff:60e0:7402:::63516, 87:0:2200::e040:500, ::, 1500::8000:300:0:0:14641, 3331:2031:393a:3030:3a32:3100:ff7f:0:49056, 20ea:3532:ff7f:0:300:::26800, 10c5:b1e5:ec7f:0:400:::44618, 100::a8f6:b1e5:ec7f:0:65535, 24eb:3532:ff7f:0:1400:0:ec7f:0:60048, 3234:6562:3a33:3533:323a:6666:3766:3a30:12602, 3a36:3636:363a:3337:3636:3a33:6133:303a:12849, 100:::65535, 5:0:500:0:bf00:0:3b8a:0:768, 11:131a:12:f17:1600::, cc76:66e5:ec7f:0:5:0:500:0:31744, 3:1c7f:1504:1:90ea:3532:ff7f:0:60568, 300:::63782, 400::3823:4000:0:0:39898, 300::98ec:3532:ff7f:0:34835, 7699:4000::20d4:6000:0:0:15472, 5ba6:4000::e0d6:6000:0:0, 93a7:4000::a0d7:6000:0:0:27072, ::, ::48f4:b1e5:ec7f:0, :::3629, 822:3a32:ff7f:0:80f6:b1e5:ec7f:0, ::, 35b0:700::308a:4000:0:0, 5318:4000::b8ec:3532:ff7f:0:31744, 100::]
Segmentation fault

Then I go to blade162 and get:

[root@blade162 ~]# collie cluster info
failed to send a req, Success
failed to get a rsp, Success

I then run the command again on 173, 174, 156, and 157 and they all report:

2011-09-15 20:21:18 36 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 35 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]

I know I built all the nodes off of git version:

collie-sheepdog-v0.2.3-75-g066d753.tar.gz

It seems the logs always report "version 0.2.3" no matter what git version I've used.

Also, it seems sheep.log uses non-24hr time while collie uses 24hr time, which is preferred, and the command "collie cluster info" shows the same date/time for all epochs. Shouldn't it show the date/time when each epoch was created?

These logs were collected after 161 and 162 died and before they were brought back. You can find the logs from all nodes below (~13MB compressed, over 200MB uncompressed).

http://www.stormpoint.com/files/sd_2011-09-23.tgz

I was able to bring 161 and 162 back into the cluster as zone 3 OK.

2011-09-15 20:21:18 38 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 37 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 36 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 35 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.174:7000]
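
For anyone who wants to compare the epoch lists between nodes without reading them by eye, here is a minimal Python sketch. It assumes the output of 'collie cluster info' has been saved to text with each epoch on a single line in the "date time epoch [node, node, ...]" form quoted above. The names parse_epochs and diff_epochs are my own illustration and not part of collie or sheepdog.

```python
import re

# One epoch line as quoted in this mail:
# "<date> <time> <epoch> [<node>, <node>, ...]"
EPOCH_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\s+(\d+)\s+\[(.*)\]\s*$"
)


def parse_epochs(text):
    """Return {epoch: [node, ...]} for every epoch line found in text.

    Lines that do not look like an epoch entry (headers, error
    messages such as "Segmentation fault") are ignored.
    """
    epochs = {}
    for line in text.splitlines():
        m = EPOCH_LINE.match(line.strip())
        if m:
            nodes = [n.strip() for n in m.group(4).split(",") if n.strip()]
            epochs[int(m.group(3))] = nodes
    return epochs


def diff_epochs(a, b):
    """Return the sorted epochs whose node lists differ between a and b.

    An epoch present in only one of the two outputs counts as a difference.
    """
    return sorted(e for e in set(a) | set(b) if a.get(e) != b.get(e))
```

Feeding it the saved output from two nodes, e.g. diff_epochs(parse_epochs(out_173), parse_epochs(out_161)), returns the epoch numbers on which the two nodes disagree; an empty list means the two views match.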