[Sheepdog] Segmentation faults and cluster failure

Shawn Moore smmoore at gmail.com
Fri Sep 23 17:03:36 CEST 2011


> BTW, would you please try 'collie cluster info' to check if the outputs are
> consistent on each node.

In my testing last night, I went from 4 nodes (2 zones of 2 nodes) to
6 nodes (3 zones of 2 nodes):
ZONE 1: node173, node174
ZONE 2: node156, node157
ZONE 3: blade161, blade162

These two nodes (blade161 and blade162) were added to the already
running cluster, so the number of copies is still 2.  Is there any way
to change the number of copies after cluster creation and redistribute
the data?

Nodes 173, 174, 156 and 157 all said the same thing:
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     28 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     27 [192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
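
One way to compare these listings between nodes is to parse each one
into an epoch-to-members mapping and diff the mappings.  The following
is only a minimal Python sketch.  It assumes the output of 'collie
cluster info' has been captured to a string per node, that each epoch
line has the shape shown above (date, time, epoch number, bracketed
member list), and that wrapped lines are continuations.  The names
parse_epochs and diff_epochs are made up for illustration and are not
part of sheepdog.

```python
import re

# One epoch entry: "YYYY-MM-DD HH:MM:SS  <epoch> [member, member, ...]"
EPOCH_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[([^\]]*)\]"
)

def parse_epochs(text):
    """Return {epoch: (timestamp, sorted member list)} from captured output."""
    # Join wrapped lines so a member list split across lines is matched whole.
    flat = " ".join(line.strip() for line in text.splitlines())
    epochs = {}
    for stamp, epoch, members in EPOCH_RE.findall(flat):
        nodes = sorted(m.strip() for m in members.split(",") if m.strip())
        epochs[int(epoch)] = (stamp, nodes)
    return epochs

def diff_epochs(a, b):
    """Return epochs missing on one side or whose member lists differ."""
    return sorted(
        e for e in set(a) | set(b)
        if e not in a or e not in b or a[e][1] != b[e][1]
    )
```

Calling diff_epochs(parse_epochs(out_a), parse_epochs(out_b)) on the
captured output of two nodes returns an empty list when they agree on
every epoch.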


When I got to blade161, I saw:
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     28 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     27 [192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
1969-12-31 19:00:21     19 [b00:0:4500:0:300:0:6c01:0,
b0b9:ffff:ffff:ffff:60e0:7402:::63516, 87:0:2200::e040:500, ::,
1500::8000:300:0:0:14641, 3331:2031:393a:3030:3a32:3100:ff7f:0:49056,
20ea:3532:ff7f:0:300:::26800, 10c5:b1e5:ec7f:0:400:::44618,
100::a8f6:b1e5:ec7f:0:65535, 24eb:3532:ff7f:0:1400:0:ec7f:0:60048,
3234:6562:3a33:3533:323a:6666:3766:3a30:12602,
3a36:3636:363a:3337:3636:3a33:6133:303a:12849, 100:::65535,
5:0:500:0:bf00:0:3b8a:0:768, 11:131a:12:f17:1600::,
cc76:66e5:ec7f:0:5:0:500:0:31744,
3:1c7f:1504:1:90ea:3532:ff7f:0:60568, 300:::63782,
400::3823:4000:0:0:39898, 300::98ec:3532:ff7f:0:34835,
7699:4000::20d4:6000:0:0:15472, 5ba6:4000::e0d6:6000:0:0,
93a7:4000::a0d7:6000:0:0:27072, ::, ::48f4:b1e5:ec7f:0, :::3629,
822:3a32:ff7f:0:80f6:b1e5:ec7f:0, ::, 35b0:700::308a:4000:0:0,
5318:4000::b8ec:3532:ff7f:0:31744, 100::]
Segmentation fault
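
The timestamp on the epoch 19 line may be a hint about what went
wrong.  As a small sketch, assuming this node's clock is in a UTC-5
time zone (an assumption, not something the output states), the
displayed time converts back to a time_t of only a few seconds.  That
would be consistent with the entry being read from uninitialized or
corrupted data rather than from a valid epoch record, though it does
not prove it.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the node's local time zone is UTC-5.
local = timezone(timedelta(hours=-5))

# Convert the displayed local time back to seconds since the Unix epoch.
shown = datetime(1969, 12, 31, 19, 0, 21, tzinfo=local)
seconds = int(shown.timestamp())
print(seconds)  # 21
```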


Then I went to blade162 and got:
[root@blade162 ~]# collie cluster info
failed to send a req, Success
failed to get a rsp, Success


I then ran the command again on 173, 174, 156, and 157, and they all reported:
2011-09-15 20:21:18     36 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     35 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]


I know I built all the nodes from the same git version:
collie-sheepdog-v0.2.3-75-g066d753.tar.gz
The logs always seem to report "version 0.2.3" no matter which git
version I've used.


Also, sheep.log seems to use 12-hour time while collie uses 24-hour
time, which is preferable.  And the command "collie cluster info"
shows the same date/time for every epoch.  Shouldn't it show the
date/time when each epoch was created?
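
To illustrate why the mixed formats are awkward when correlating logs,
here is a small Python sketch that prints the same instant both ways.
The exact format string sheep.log uses is not shown here, so the
12-hour pattern below is only a stand-in.

```python
from datetime import datetime

stamp = datetime(2011, 9, 15, 20, 21, 18)

# 24-hour style, matching the 'collie cluster info' lines above.
twenty_four = stamp.strftime("%Y-%m-%d %H:%M:%S")
# 12-hour style without an AM/PM marker (stand-in for a non-24hr log).
twelve = stamp.strftime("%Y-%m-%d %I:%M:%S")

print(twenty_four)  # 2011-09-15 20:21:18
print(twelve)       # 2011-09-15 08:21:18
```

Without an AM/PM marker the second form is ambiguous, which makes it
hard to line up sheep.log entries with collie output.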


These logs were collected after 161 and 162 died and before they were
brought back. The logs from all nodes are at the link below (~13MB
compressed, over 200MB uncompressed).
http://www.stormpoint.com/files/sd_2011-09-23.tgz


I was able to bring 161 and 162 back into the cluster as zone 3 without
problems:
2011-09-15 20:21:18     38 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     37 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     36 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     35 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
192.168.0.174:7000]


