[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan

Chris Webb chris at arachsys.com
Fri Nov 25 00:00:12 CET 2011


Hi. I pulled the current head of devel, 075306fb23. When I failed a node by
taking its eth1 interface down, collie vdi list worked correctly on one of the
remaining nodes:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
  29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb

No 'failed to read object' error messages this time, so it looks like the
cluster survives a node failing now.

However, on the failed node, the sheep didn't seem to detect the partition:
it was still running, and collie node list there still showed all nine nodes:

[host in the cluster]
0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348

[host partitioned from network]
0028# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780
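
For anyone who wants to check this kind of disagreement mechanically, here is
a minimal sketch that parses two captured listings and reports which nodes
appear in only one of them. It assumes the column layout shown above; the
function names and the VIEW_* constants are made up for illustration and are
not part of collie.

```python
import re

# Matches rows such as "     6 - 172.16.101.11:7000     64  191172780".
# The layout is taken from the listings above and may differ between versions.
ROW = re.compile(r"^\s*\d+\s+-\s+(\S+):(\d+)\s+\d+\s+\d+\s*$")

# Hypothetical captures of `collie node list` from the two hosts.
VIEW_0026 = """\
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
"""

VIEW_0028 = VIEW_0026 + """\
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780
"""


def parse_node_list(text):
    """Return the set of (host, port) pairs found in a node listing."""
    nodes = set()
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            nodes.add((match.group(1), int(match.group(2))))
    return nodes


def membership_diff(view_a, view_b):
    """Return (nodes only in view_a, nodes only in view_b), each sorted."""
    a, b = parse_node_list(view_a), parse_node_list(view_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    only_0026, only_0028 = membership_diff(VIEW_0026, VIEW_0028)
    print("only on 0026:", only_0026)
    print("only on 0028:", only_0028)
```

With the two listings above, the only difference is the three
172.16.101.11 entries, which appear solely in the partitioned host's view.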

Sure enough, when I brought back the network connection to the failed node,
things broke in the cluster:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
  29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb
failed to read object, 80eeb4fc00000000 No object found
failed to read a inode header

and on the resurrected host:

0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 8085a93d00000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80aa3feb00000000 Remote node has an old epoch
failed to read a inode header
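
The object IDs in these errors appear to embed the vdi id: 8085a93d00000000
and 80aa3feb00000000 line up with 85a93d and aa3feb in the listing. A rough
sketch of pulling the vdi id out of each error line, assuming the layout is
"top bit set, vdi id shifted left by 32 bits" (inferred from the output above,
not checked against the source), could look like this. The constants and
function name are illustrative only.

```python
import re

VDI_BIT = 1 << 63   # assumed flag bit marking a vdi (inode) object
VDI_SHIFT = 32      # assumed position of the vdi id inside the object id

# Matches lines such as
# "failed to read object, 80eeb4fc00000000 No object found".
ERR = re.compile(r"^failed to read object, ([0-9a-f]{16}) (.+)$")

# Error lines copied from the two collie vdi list runs above.
CLUSTER_OUTPUT = """\
failed to read object, 80eeb4fc00000000 No object found
failed to read a inode header
"""

RESURRECTED_OUTPUT = """\
failed to read object, 8085a93d00000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80aa3feb00000000 Remote node has an old epoch
failed to read a inode header
"""


def vdi_ids_in_errors(text):
    """Return a list of (vdi id as hex string, reason) for each read error."""
    found = []
    for line in text.splitlines():
        match = ERR.match(line.strip())
        if not match:
            continue
        oid = int(match.group(1), 16)
        vid = (oid & ~VDI_BIT) >> VDI_SHIFT  # clear the flag, drop low bits
        found.append((format(vid, "x"), match.group(2)))
    return found


if __name__ == "__main__":
    print(vdi_ids_in_errors(CLUSTER_OUTPUT))
    print(vdi_ids_in_errors(RESURRECTED_OUTPUT))
```

Under that assumption, the error on 0026 refers to vdi id eeb4fc, which is
not one of the two vdis in the listing, while the errors on 0028 refer to
exactly the two listed vdis.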

If you like, I can reproduce this with a small test case tomorrow and capture
some sheep logs.

Best wishes,

Chris. 


