[Sheepdog] Segmentation faults and cluster failure

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Sat Sep 24 09:28:45 CEST 2011


At Fri, 23 Sep 2011 11:03:36 -0400,
Shawn Moore wrote:
> 
> > BTW, would you please try 'collie cluster info' to check if the outputs are
> > consistent on each node.
> 
> In my testing last night, I went from 4 nodes (2 zones of 2 nodes) to
> 6 nodes (3 zones of 2 nodes):
> ZONE 1: node173, node174
> ZONE 2: node156, node157
> ZONE 3: blade161, blade162
> 
> These two nodes (blade161 and blade162) were added to the already
> running cluster, so the number of copies is still 2.  Is there any way
> to change the number of copies after cluster creation and
> redistribute the data?

No, unfortunately.  I think it would be a bit difficult to support,
but I would like to in the future.

> 
> Nodes 173, 174, 156 and 157 all said the same thing:
> 2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     28 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     27 [192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 
> 
> When I got to blade161, I saw:
> 2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     28 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     27 [192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 1969-12-31 19:00:21     19 [b00:0:4500:0:300:0:6c01:0,
> b0b9:ffff:ffff:ffff:60e0:7402:::63516, 87:0:2200::e040:500, ::,
> 1500::8000:300:0:0:14641, 3331:2031:393a:3030:3a32:3100:ff7f:0:49056,
> 20ea:3532:ff7f:0:300:::26800, 10c5:b1e5:ec7f:0:400:::44618,
> 100::a8f6:b1e5:ec7f:0:65535, 24eb:3532:ff7f:0:1400:0:ec7f:0:60048,
> 3234:6562:3a33:3533:323a:6666:3766:3a30:12602,
> 3a36:3636:363a:3337:3636:3a33:6133:303a:12849, 100:::65535,
> 5:0:500:0:bf00:0:3b8a:0:768, 11:131a:12:f17:1600::,
> cc76:66e5:ec7f:0:5:0:500:0:31744,
> 3:1c7f:1504:1:90ea:3532:ff7f:0:60568, 300:::63782,
> 400::3823:4000:0:0:39898, 300::98ec:3532:ff7f:0:34835,
> 7699:4000::20d4:6000:0:0:15472, 5ba6:4000::e0d6:6000:0:0,
> 93a7:4000::a0d7:6000:0:0:27072, ::, ::48f4:b1e5:ec7f:0, :::3629,
> 822:3a32:ff7f:0:80f6:b1e5:ec7f:0, ::, 35b0:700::308a:4000:0:0,
> 5318:4000::b8ec:3532:ff7f:0:31744, 100::]
> Segmentation fault
> 
> 
> Then I go to blade162 and get:
> [root@blade162 ~]# collie cluster info
> failed to send a req, Success
> failed to get a rsp, Success
> 
> 
> I then look at the commands again on 173, 174, 156, and 157 and they all report:
> 2011-09-15 20:21:18     36 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     35 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
> 192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
> 192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     32 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.173:7000, 192.168.0.174:7000]
> 2011-09-15 20:21:18     31 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 2011-09-15 20:21:18     30 [192.168.0.161:7000, 192.168.0.162:7000]
> 2011-09-15 20:21:18     29 [192.168.0.161:7000, 192.168.0.162:7000,
> 192.168.0.174:7000]
> 
> 
> I know I built all the nodes off of git version:
> collie-sheepdog-v0.2.3-75-g066d753.tar.gz
> It seems the logs just always report as "version 0.2.3" no matter what
> git version I've used.

I'll add a git revision to the version number. :)
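
As a small illustration of what such a version string could carry: the
tarball name quoted above already follows the usual `git describe`
shape (tag, number of commits since the tag, abbreviated hash prefixed
with "g").  A minimal Python sketch that splits it apart follows.  The
function name is hypothetical, and this is not how Sheepdog itself
builds its version string.

```python
import re

# Usual `git describe` shape: <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE = re.compile(r"v?(\d+(?:\.\d+)*)-(\d+)-g([0-9a-f]+)")

def split_describe(name):
    """Return (release, commits_since_tag, short_hash), or None if absent."""
    match = DESCRIBE.search(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)

print(split_describe("collie-sheepdog-v0.2.3-75-g066d753.tar.gz"))
```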

> 
> 
> Also, it seems that sheep.log uses non-24-hour time while collie uses
> 24-hour time, which I prefer.  The command "collie cluster info" also
> shows the same date/time for all epochs.  Shouldn't it show the
> date/time when each epoch was created?

Good point.  The epoch creation time is not used internally by
Sheepdog, but it would be useful for users.
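
On the timestamps: the odd "1969-12-31 19:00:21" entry in the quoted
blade161 output can be converted back to seconds since the Unix epoch.
The sketch below assumes the node's local time zone was US Eastern
Standard Time (UTC-5), which is a guess and is not stated in this
thread.  Under that assumption the value is only a few seconds after
zero, which would be consistent with an unset or garbage time field
rather than a real creation time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the node reported local time in US Eastern Standard Time.
EST = timezone(timedelta(hours=-5))

def unix_seconds(local_text, tz):
    """Read 'YYYY-MM-DD HH:MM:SS' as local time in tz; return Unix seconds."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    return int(local.replace(tzinfo=tz).timestamp())

print(unix_seconds("1969-12-31 19:00:21", EST))  # prints 21
```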

> 
> 
> These logs were collected after 161 and 162 died and before they were
> brought back.  You can find the logs from all nodes below: about 13MB
> compressed, over 200MB uncompressed.
> http://www.stormpoint.com/files/sd_2011-09-23.tgz

Thanks, this would be helpful for us.
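
For anyone comparing 'collie cluster info' output across nodes by hand,
as requested earlier in this thread, the check can be sketched in
Python.  The line shape (timestamp, epoch number, bracketed node list)
is taken from the output quoted above.  The function names are
hypothetical, and real output may contain other lines, which the
pattern simply skips.

```python
import re

# Assumed entry shape: "<date> <time>  <epoch> [<node>, <node>, ...]"
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[([^\]]*)\]"
)

def parse_epochs(text):
    """Map each epoch number to a sorted tuple of its node addresses."""
    flat = " ".join(text.split())  # undo line wrapping inside node lists
    epochs = {}
    for match in ENTRY.finditer(flat):
        nodes = [n.strip() for n in match.group(3).split(",") if n.strip()]
        epochs[int(match.group(2))] = tuple(sorted(nodes))
    return epochs

def compare_epochs(a, b):
    """Return (epochs whose node lists differ, only in a, only in b)."""
    shared = a.keys() & b.keys()
    mismatched = sorted(e for e in shared if a[e] != b[e])
    return mismatched, sorted(a.keys() - b.keys()), sorted(b.keys() - a.keys())

# Two excerpts copied from the outputs quoted in this thread.
earlier = """
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
2011-09-15 20:21:18     33 [192.168.0.156:7000, 192.168.0.161:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
"""
later = """
2011-09-15 20:21:18     35 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.162:7000, 192.168.0.173:7000, 192.168.0.174:7000]
2011-09-15 20:21:18     34 [192.168.0.156:7000, 192.168.0.157:7000,
192.168.0.161:7000, 192.168.0.162:7000, 192.168.0.173:7000,
192.168.0.174:7000]
"""

print(compare_epochs(parse_epochs(earlier), parse_epochs(later)))
```

An empty first list means every epoch present in both outputs has the
same membership; the other two lists show epochs seen in only one of
the outputs.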

Kazutaka


