[Sheepdog] Segmentation faults and cluster failure
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Sat Sep 24 09:28:45 CEST 2011
At Fri, 23 Sep 2011 11:03:36 -0400,
Shawn Moore wrote:
>
> > BTW, would you please try 'collie cluster info' to check if the outputs are
> > consistent on each node.
>
> In my testing last night, I went from 4 nodes (2 zones of 2 nodes) to
> 6 nodes (3 zones of 2 nodes):
> ZONE 1: node173, node174
> ZONE 2: node156, node157
> ZONE 3: blade161, blade162
>
> These two nodes (blade161 and blade162) were added to the already
> running cluster, so the number of copies is still 2. Is there any way
> to change the copies after cluster creation and redistribute?
No, unfortunately. I think it would be a bit difficult to support, but
I would like to in the future.
>
> Nodes 173, 174, 156 and 157 all said the same thing:
> 2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 28 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 27 [192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
>
>
> When I got to blade161, I see:
> 2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 28 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 27 [192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 1969-12-31 19:00:21 19 [b00:0:4500:0:300:0:6c01:0,
> b0b9:ffff:ffff:ffff:60e0:7402:::63516, 87:0:2200::e040:500, ::,
> 1500::8000:300:0:0:14641, 3331:2031:393a:3030:3a32:3100:ff7f:0:49056,
> 20ea:3532:ff7f:0:300:::26800, 10c5:b1e5:ec7f:0:400:::44618,
> 100::a8f6:b1e5:ec7f:0:65535, 24eb:3532:ff7f:0:1400:0:ec7f:0:60048,
> 3234:6562:3a33:3533:323a:6666:3766:3a30:12602,
> 3a36:3636:363a:3337:3636:3a33:6133:303a:12849, 100:::65535,
> 5:0:500:0:bf00:0:3b8a:0:768, 11:131a:12:f17:1600::,
> cc76:66e5:ec7f:0:5:0:500:0:31744,
> 3:1c7f:1504:1:90ea:3532:ff7f:0:60568, 300:::63782,
> 400::3823:4000:0:0:39898, 300::98ec:3532:ff7f:0:34835,
> 7699:4000::20d4:6000:0:0:15472, 5ba6:4000::e0d6:6000:0:0,
> 93a7:4000::a0d7:6000:0:0:27072, ::, ::48f4:b1e5:ec7f:0, :::3629,
> 822:3a32:ff7f:0:80f6:b1e5:ec7f:0, ::, 35b0:700::308a:4000:0:0,
> 5318:4000::b8ec:3532:ff7f:0:31744, 100::]
> Segmentation fault
>
>
> Then I go to blade162 and get:
> [root@blade162 ~]# collie cluster info
> failed to send a req, Success
> failed to get a rsp, Success
>
>
> I then look at the commands again on 173, 174, 156, and 157 and they all report:
> 2011-09-15 20:21:18 36 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 35 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18 31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18 30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18 29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
>
>
> I know I built all the nodes off of git version:
> collie-sheepdog-v0.2.3-75-g066d753.tar.gz
> It seems the logs just always report as "version 0.2.3" no matter what
> git version I've used.
I'll add a git revision to the version number. :)
>
>
> Also it seems the sheep.log uses NON 24hr time but collie uses 24hr
> time which is preferred, and the command "collie cluster info" shows
> the same date/time for all epochs. Shouldn't it show the date/time of
> when that epoch was created?
Good point. The epoch creation time is not used internally by Sheepdog,
but it would be useful for users.
>
>
> These logs were collected after 161 and 162 died and before they were
> brought back. You can find the node logs from all nodes below, ~13MB
> compressed over 200MB un-compressed.
> http://www.stormpoint.com/files/sd_2011-09-23.tgz
Thanks, this will be helpful for us.
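
For comparing the 'collie cluster info' epoch lists between nodes, a
small script may be easier than reading them by eye. The following is
only a sketch, not part of Sheepdog: it assumes the output of each node
has been saved as plain text (without the "> " mail quoting) and that
every record looks like "date time epoch [members]" as in the listings
above. The names parse_epochs and inconsistent_epochs are hypothetical.

```python
import re

# One record: "YYYY-MM-DD HH:MM:SS <epoch> [addr, addr, ...]".
# The bracketed member list may wrap across lines, as in the mail above.
EPOCH_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[([^\]]*)\]"
)


def parse_epochs(text):
    """Return {epoch: (timestamp, [members])} parsed from saved output."""
    epochs = {}
    for stamp, epoch, members in EPOCH_RE.findall(text):
        nodes = [m.strip() for m in members.split(",") if m.strip()]
        epochs[int(epoch)] = (stamp, nodes)
    return epochs


def inconsistent_epochs(view_a, view_b):
    """Return sorted epochs that are missing from one view or differ."""
    return sorted(
        epoch
        for epoch in set(view_a) | set(view_b)
        if view_a.get(epoch) != view_b.get(epoch)
    )


if __name__ == "__main__":
    # Shortened sample data based on the listings in this thread.
    node173 = (
        "2011-09-15 20:21:18 34 [192.168.0.156:7000, 192.168.0.157:7000,\n"
        "192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,\n"
        "192.168.0.174:7000]\n"
    )
    # blade161 printed one extra, corrupted record (abbreviated here).
    blade161 = node173 + "1969-12-31 19:00:21 19 [::]\n"
    print(inconsistent_epochs(parse_epochs(node173), parse_epochs(blade161)))
```

With the shortened sample, only epoch 19 is reported, which is the extra
record that blade161 printed before the segmentation fault.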
Kazutaka