[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan
Chris Webb
chris at arachsys.com
Fri Nov 25 00:00:12 CET 2011
Hi. I pulled the current head of devel, 075306fb23, and when I failed a node
by taking eth1 down, collie vdi list worked correctly on one of the
remaining nodes:
0026# collie vdi list
name                                  id    size    used  shared  creation time     vdi id
------------------------------------------------------------------
0334cd4a-820d-41fb-b8ff-e31ce5f43143   1  515 MB   48 MB  0.0 MB  2011-11-24 22:47  85a93d
29118ca3-08aa-43df-83e7-5bf1d65142a5   1  515 MB  516 MB  0.0 MB  2011-11-24 22:38  aa3feb
No 'failed to read object' error messages this time, so it looks like the
cluster survives a node failing now.
However, on the failed node, the sheep didn't seem to detect the partition:
it was still running, and collie node list there still showed all the nodes:
[host in the cluster]
0026# collie node list
Idx - Host:Port            Vnodes       Zone
---------------------------------------------
  0 - 172.16.101.7:7000        64  124063916
  1 - 172.16.101.7:7001        64  124063916
  2 - 172.16.101.7:7002        64  124063916
  3 - 172.16.101.9:7000        64  157618348
  4 - 172.16.101.9:7001        64  157618348
  5 - 172.16.101.9:7002        64  157618348
[host partitioned from network]
0028# collie node list
Idx - Host:Port            Vnodes       Zone
---------------------------------------------
  0 - 172.16.101.7:7000        64  124063916
  1 - 172.16.101.7:7001        64  124063916
  2 - 172.16.101.7:7002        64  124063916
  3 - 172.16.101.9:7000        64  157618348
  4 - 172.16.101.9:7001        64  157618348
  5 - 172.16.101.9:7002        64  157618348
  6 - 172.16.101.11:7000       64  191172780
  7 - 172.16.101.11:7001       64  191172780
  8 - 172.16.101.11:7002       64  191172780
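The disagreement between the two views can be checked mechanically. The following is only a rough sketch, not part of sheepdog or collie: it assumes the collie node list output has the row layout pasted above (index, dash, host:port, vnodes, zone), and the names parse_node_list and compare_views are made up for illustration.

```python
import re

# Matches data rows such as "3 - 172.16.101.9:7000 64 157618348".
# The row layout is an assumption taken from the output pasted in this message.
ROW = re.compile(r"^\s*(\d+)\s+-\s+(\S+):(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_node_list(text):
    """Return the set of (host, port) pairs found in node list output."""
    nodes = set()
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            nodes.add((match.group(2), int(match.group(3))))
    return nodes


def compare_views(view_a, view_b):
    """Return (nodes only in view_a, nodes only in view_b), each sorted."""
    a, b = parse_node_list(view_a), parse_node_list(view_b)
    return sorted(a - b), sorted(b - a)


# Output seen on a host still in the cluster (copied from above).
CLUSTER_VIEW = """\
Idx - Host:Port Vnodes Zone
---------------------------------------------
0 - 172.16.101.7:7000 64 124063916
1 - 172.16.101.7:7001 64 124063916
2 - 172.16.101.7:7002 64 124063916
3 - 172.16.101.9:7000 64 157618348
4 - 172.16.101.9:7001 64 157618348
5 - 172.16.101.9:7002 64 157618348
"""

# Output seen on the partitioned host: the same rows plus its own three sheep.
PARTITIONED_VIEW = CLUSTER_VIEW + """\
6 - 172.16.101.11:7000 64 191172780
7 - 172.16.101.11:7001 64 191172780
8 - 172.16.101.11:7002 64 191172780
"""

if __name__ == "__main__":
    only_cluster, only_partitioned = compare_views(CLUSTER_VIEW, PARTITIONED_VIEW)
    print("only in cluster view:", only_cluster)
    print("only in partitioned view:", only_partitioned)
```

Any node that appears in only one of the two views points at a membership disagreement like the one shown here, where the three sheep on 172.16.101.11 are known only to the partitioned host.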
Sure enough, when I brought back the network connection to the failed node,
things broke in the cluster:
0026# collie vdi list
name                                  id    size    used  shared  creation time     vdi id
------------------------------------------------------------------
0334cd4a-820d-41fb-b8ff-e31ce5f43143   1  515 MB   48 MB  0.0 MB  2011-11-24 22:47  85a93d
29118ca3-08aa-43df-83e7-5bf1d65142a5   1  515 MB  516 MB  0.0 MB  2011-11-24 22:38  aa3feb
failed to read object, 80eeb4fc00000000 No object found
failed to read a inode header
and on the resurrected host:
0028# collie vdi list
name                                  id    size    used  shared  creation time     vdi id
------------------------------------------------------------------
failed to read object, 8085a93d00000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80aa3feb00000000 Remote node has an old epoch
failed to read a inode header
I can reproduce this with a small test case tomorrow and capture some sheep
logs, if you like.
Best wishes,
Chris.