[Sheepdog] Cluster doesn't come up correctly after reboot
Wido den Hollander
wido at pcextreme.nl
Sun Apr 18 20:47:12 CEST 2010
Hi,
My Sheepdog cluster isn't online, so it gets rebooted a few times a week.
I'm using the cluster for testing Ceph and Sheepdog, and this week I was
playing more with Ceph than Sheepdog.
Now I just checked my cluster and it seems that my nodes can't find
each other anymore.
I have 5 nodes:
osd1: 192.168.6.211
osd2: 192.168.6.212
osd3: 192.168.6.213
osd4: 192.168.6.214
osd5: 192.168.6.215
Some output I get when checking my cluster status:
root@osd1:~# shepherd info -t dog
   Idx  Node id (FNV-1a) - Host:Port
------------------------------------------------
*    0  4f5de28d9ad07d49 - 192.168.6.211:7000
     1  d3d995c9a4f4336a - 192.168.6.212:7000
root@osd1:~#
root@osd2:~# shepherd info -t dog
   Idx  Node id (FNV-1a) - Host:Port
------------------------------------------------
*    0  4f5de28d9ad07d49 - 192.168.6.211:7000
     1  d3d995c9a4f4336a - 192.168.6.212:7000
root@osd2:~#
root@osd3:~# shepherd info -t dog
   Idx  Node id (FNV-1a) - Host:Port
------------------------------------------------
     0  27ca81e942cd0eef - 192.168.6.213:7000
*    1  27ca81e942cd0eef - 192.168.6.213:7000
root@osd3:~#
root@osd4:~# shepherd info -t dog
   Idx  Node id (FNV-1a) - Host:Port
------------------------------------------------
*    0  4f5de28d9ad07d49 - 192.168.6.211:7000
     1  d3d995c9a4f4336a - 192.168.6.212:7000
root@osd4:~#
root@osd5:~# shepherd info -t dog
   Idx  Node id (FNV-1a) - Host:Port
------------------------------------------------
*    0  13e9d7233684c11d - 192.168.6.215:7000
     1  27ca81e942cd0eef - 192.168.6.213:7000
     2  27ca81e942cd0eef - 192.168.6.213:7000
root@osd5:~#
As you can see, they don't seem to find each other anymore.
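To make the split easier to see, here is a minimal Python sketch that groups the nodes by the membership list each one printed above. It is not part of Sheepdog; the host lists are copied by hand from the output, and the helper names (group_by_view, duplicates) are made up for illustration.

```python
from collections import Counter, defaultdict

# Membership as printed by `shepherd info -t dog` on each node,
# copied by hand from the listings above.
views = {
    "osd1": ["192.168.6.211:7000", "192.168.6.212:7000"],
    "osd2": ["192.168.6.211:7000", "192.168.6.212:7000"],
    "osd3": ["192.168.6.213:7000", "192.168.6.213:7000"],
    "osd4": ["192.168.6.211:7000", "192.168.6.212:7000"],
    "osd5": ["192.168.6.215:7000", "192.168.6.213:7000", "192.168.6.213:7000"],
}


def group_by_view(views):
    """Group the reporting nodes by the set of members they see."""
    groups = defaultdict(list)
    for node, members in views.items():
        groups[frozenset(members)].append(node)
    return dict(groups)


def duplicates(members):
    """Return the members that appear more than once in a single listing."""
    return sorted(m for m, n in Counter(members).items() if n > 1)


for members, nodes in group_by_view(views).items():
    print(sorted(nodes), "see", sorted(members))

for node, members in views.items():
    dups = duplicates(members)
    if dups:
        print(node, "lists twice:", dups)
```

With the data above this yields three different views instead of one, osd3 and osd5 both list 192.168.6.213:7000 twice, and 192.168.6.214 does not show up in any listing, not even in the one printed on osd4 itself.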
I double-checked: collie is running on all 5 nodes and the Sheepdog
directory is mounted on all 5.
Please note that this cluster was running fine a few days ago; nothing
changed in the mount points, the corosync configuration, or anything else
regarding Sheepdog.
What I did notice is:
root@osd1:~# shepherd info -t cluster
there is inconsistency between epochs
Ctime              Epoch  Nodes
10-04-15 17:24:00      4  [192.168.6.215:7000, 192.168.6.215:7000,
                           192.168.6.213:7000, 192.168.6.211:7000,
                           192.168.6.211:7000, 192.168.6.214:7000]
root@osd1:~#
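The node list stored for epoch 4 looks wrong in itself. A small Python sketch (again only an illustration, with the addresses copied by hand from the output above and a made-up helper name, check_epoch) compares it with the five nodes I expect:

```python
from collections import Counter

# The five nodes of the cluster, osd1 through osd5.
expected = [f"192.168.6.{n}:7000" for n in range(211, 216)]

# Node list shown for epoch 4 by `shepherd info -t cluster` on osd1.
epoch_nodes = [
    "192.168.6.215:7000", "192.168.6.215:7000",
    "192.168.6.213:7000", "192.168.6.211:7000",
    "192.168.6.211:7000", "192.168.6.214:7000",
]


def check_epoch(expected, listed):
    """Return (missing, repeated) entries of an epoch's node list."""
    counts = Counter(listed)
    missing = [n for n in expected if n not in counts]
    repeated = sorted(n for n, c in counts.items() if c > 1)
    return missing, repeated


missing, repeated = check_epoch(expected, epoch_nodes)
print("missing: ", missing)
print("repeated:", repeated)
```

So the epoch has six entries for five machines: 192.168.6.211 and 192.168.6.215 are listed twice, and 192.168.6.212 is missing.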
Creating a new image also fails:
root@osd1:~# /usr/local/bin/qemu-img create -f sheepdog johndoe 10G
Formatting 'johndoe', fmt=sheepdog size=10737418240
do_sd_create 1143: Invalid error code, johndoe
qemu-img: Error while formatting
root@osd1:~#
I got the cluster running again after clearing all the Sheepdog
directories and running mkfs again, but this shouldn't happen. A cluster
should survive several reboots, shouldn't it?
After rebooting my machines, the Sheepdog cluster was unstable again.
Same result: the nodes couldn't find each other.
In my syslog I see:
Apr 18 20:41:29 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:29 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 18 20:41:30 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:30 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 18 20:41:32 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:32 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
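To see how often the corosync membership changes, the TOTEM messages can be pulled out of the syslog with a few lines of Python. This is only a sketch: the regular expression is written against the lines quoted above (joined onto one line each), and the function name membership_changes is made up for illustration.

```python
import re

# The syslog excerpt from above, one message per line.
LOG = """\
Apr 18 20:41:29 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:29 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 18 20:41:30 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:30 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 18 20:41:32 osd1 corosync[814]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 18 20:41:32 osd1 corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
"""

# Matches only the TOTEM "membership was formed" messages and
# captures the syslog timestamp at the start of the line.
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ corosync\[\d+\]: +\[TOTEM *\] "
    r"A processor joined or left the membership"
)


def membership_changes(text):
    """Return the timestamps of all corosync membership changes in text."""
    stamps = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            stamps.append(match.group(1))
    return stamps


changes = membership_changes(LOG)
print(len(changes), "membership changes:", changes)
```

For the excerpt above that is three new memberships within about three seconds on osd1.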
I attached the collie.log of "osd1"; I hope this helps.
Any ideas?
--
Kind regards,
Wido den Hollander
Head of System Administration / CSO
Support phone Netherlands: 0900 9633 (45 cpm)
Support phone Belgium: 0900 70312 (45 cpm)
Direct phone: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support at pcextreme.nl
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: collie.log
Type: text/x-log
Size: 34335 bytes
Desc: not available
URL: <http://lists.wpkg.org/pipermail/sheepdog/attachments/20100418/1d74ebab/attachment-0002.bin>