[Sheepdog] Cluster doesn't come up correctly after reboot

Wido den Hollander wido at pcextreme.nl
Sun Apr 18 20:47:12 CEST 2010


Hi,

My Sheepdog cluster isn't in production use, so it gets rebooted a few
times a week.

I'm using the cluster for testing Ceph and Sheepdog, and this week I was
playing more with Ceph than with Sheepdog.

I just checked my cluster and it seems that my nodes can't find
each other anymore.

I have 5 nodes:

osd1: 192.168.6.211
osd2: 192.168.6.212
osd3: 192.168.6.213
osd4: 192.168.6.214
osd5: 192.168.6.215

Some output I get when checking my cluster status:

root@osd1:~# shepherd info -t dog
  Idx	Node id (FNV-1a) - Host:Port
------------------------------------------------
* 0	4f5de28d9ad07d49 - 192.168.6.211:7000
  1	d3d995c9a4f4336a - 192.168.6.212:7000
root@osd1:~# 

root@osd2:~# shepherd info -t dog
  Idx	Node id (FNV-1a) - Host:Port
------------------------------------------------
* 0	4f5de28d9ad07d49 - 192.168.6.211:7000
  1	d3d995c9a4f4336a - 192.168.6.212:7000
root@osd2:~# 

root@osd3:~# shepherd info -t dog
  Idx	Node id (FNV-1a) - Host:Port
------------------------------------------------
  0	27ca81e942cd0eef - 192.168.6.213:7000
* 1	27ca81e942cd0eef - 192.168.6.213:7000
root@osd3:~# 

root@osd4:~# shepherd info -t dog
  Idx	Node id (FNV-1a) - Host:Port
------------------------------------------------
* 0	4f5de28d9ad07d49 - 192.168.6.211:7000
  1	d3d995c9a4f4336a - 192.168.6.212:7000
root@osd4:~# 

root@osd5:~# shepherd info -t dog
  Idx	Node id (FNV-1a) - Host:Port
------------------------------------------------
* 0	13e9d7233684c11d - 192.168.6.215:7000
  1	27ca81e942cd0eef - 192.168.6.213:7000
  2	27ca81e942cd0eef - 192.168.6.213:7000
root@osd5:~#

As you can see, they don't seem to find each other anymore.
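
To compare the five views mechanically rather than by eye, something like
the following Python sketch could be used. It assumes the node list format
shown above (optional "*", index, 16-digit hex id, " - ", host:port); the
function names parse_node_list and group_views are made up for this
illustration and are not part of Sheepdog.

```python
import re
from collections import defaultdict

# One row of the node table, e.g. "* 0<TAB>4f5de28d9ad07d49 - 192.168.6.211:7000".
# The pattern is an assumption based only on the output pasted in this mail.
ROW = re.compile(r"^(\*?)\s*(\d+)\s+([0-9a-f]{16})\s+-\s+(\S+:\d+)\s*$")


def parse_node_list(text):
    """Return (set of host:port entries, set of entries listed more than once)."""
    seen, dupes = set(), set()
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator line, or shell prompt
        addr = m.group(4)
        if addr in seen:
            dupes.add(addr)
        seen.add(addr)
    return seen, dupes


def group_views(outputs):
    """Group hosts by the membership they report.

    outputs maps a host name to the raw text of its node list.
    Hosts that end up in different groups disagree about the cluster.
    """
    groups = defaultdict(list)
    for host, text in outputs.items():
        members, _ = parse_node_list(text)
        groups[frozenset(members)].append(host)
    return dict(groups)
```

Fed with the five outputs above, this would give three groups: osd1, osd2
and osd4 agree on a two-node view (.211 and .212), osd3 only lists itself
(twice), and osd5 lists itself plus osd3 (twice).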

I double-checked: collie is running on all 5 nodes and the sheepdog
directory is mounted on all 5.

Please note that this cluster was running fine a few days ago; nothing
changed in the mount points, the corosync configuration, or anything else
regarding Sheepdog.

What I did notice is:

root@osd1:~# shepherd info -t cluster
there is inconsistency between epochs

Ctime              Epoch Nodes
10-04-15 17:24:00      4 [192.168.6.215:7000, 192.168.6.215:7000,
192.168.6.213:7000, 192.168.6.211:7000, 192.168.6.211:7000,
192.168.6.214:7000]
root@osd1:~#

Creating a new image also fails:

root@osd1:~# /usr/local/bin/qemu-img create -f sheepdog johndoe 10G
Formatting 'johndoe', fmt=sheepdog size=10737418240 
do_sd_create 1143: Invalid error code, johndoe
qemu-img: Error while formatting
root@osd1:~# 

I got the cluster running again after clearing all the sheepdog
directories and running mkfs again, but this shouldn't happen. A cluster
should survive several reboots, shouldn't it?

After rebooting my machines, the Sheepdog cluster was unstable again.
Same result: the nodes couldn't find each other.

In my syslog I see:

Apr 18 20:41:29 osd1 corosync[814]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:29 osd1 corosync[814]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 18 20:41:30 osd1 corosync[814]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:30 osd1 corosync[814]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 18 20:41:32 osd1 corosync[814]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:32 osd1 corosync[814]:   [MAIN  ] Completed service synchronization, ready to provide service.
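
To see how often the corosync membership gets re-formed, the TOTEM lines
can be counted per timestamp. This is a rough Python sketch that matches
only the message wording and syslog layout seen in the excerpt above; the
function name membership_changes is made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a TOTEM membership-change line as it appears in the
# excerpt above. Only the first part of the message is matched, so a line
# that the mail client wrapped after "joined or" is still counted.
TOTEM = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+corosync\[\d+\]:"
    r"\s+\[TOTEM\s*\]\s+A processor joined or"
)


def membership_changes(log_text):
    """Count TOTEM membership-change messages per (host, timestamp)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = TOTEM.match(line)
        if m:
            counts[(m.group(2), m.group(1))] += 1
    return counts
```

In the excerpt above there are three membership changes on osd1 within
four seconds (20:41:29, 20:41:30 and 20:41:32).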

I attached the collie.log of "osd1"; I hope this helps.

Any ideas?

-- 
Kind regards,

Wido den Hollander
Head of System Administration / CSO
Phone Support Netherlands: 0900 9633 (45 cpm)
Phone Support Belgium: 0900 70312 (45 cpm)
Phone Direct: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support at pcextreme.nl
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl


-------------- next part --------------
A non-text attachment was scrubbed...
Name: collie.log
Type: text/x-log
Size: 34335 bytes
Desc: not available
URL: <http://lists.wpkg.org/pipermail/sheepdog/attachments/20100418/1d74ebab/attachment-0002.bin>