[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Fri Nov 25 09:57:59 CET 2011
At Thu, 24 Nov 2011 23:00:12 +0000,
Chris Webb wrote:
>
> Hi. I pulled the current head of devel, 075306fb23, and when I failed a node
> by taking the eth1 down, a collie vdi list worked correctly on one of the
> remaining nodes:
>
> 0026# collie vdi list
> name id size used shared creation time vdi id
> ------------------------------------------------------------------
> 0334cd4a-820d-41fb-b8ff-e31ce5f43143 1 515 MB 48 MB 0.0 MB 2011-11-24 22:47 85a93d
> 29118ca3-08aa-43df-83e7-5bf1d65142a5 1 515 MB 516 MB 0.0 MB 2011-11-24 22:38 aa3feb
>
> No 'failed to read object' error messages this time, so it looks like the
> cluster survives a node failing now.
>
> However, on the failed node, the sheep didn't seem to detect the partition:
> it was still running and collie node list showed all the nodes:
>
> [host in the cluster]
> 0026# collie node list
> Idx - Host:Port Vnodes Zone
> ---------------------------------------------
> 0 - 172.16.101.7:7000 64 124063916
> 1 - 172.16.101.7:7001 64 124063916
> 2 - 172.16.101.7:7002 64 124063916
> 3 - 172.16.101.9:7000 64 157618348
> 4 - 172.16.101.9:7001 64 157618348
> 5 - 172.16.101.9:7002 64 157618348
>
> [host partitioned from network]
> 0028# collie node list
> Idx - Host:Port Vnodes Zone
> ---------------------------------------------
> 0 - 172.16.101.7:7000 64 124063916
> 1 - 172.16.101.7:7001 64 124063916
> 2 - 172.16.101.7:7002 64 124063916
> 3 - 172.16.101.9:7000 64 157618348
> 4 - 172.16.101.9:7001 64 157618348
> 5 - 172.16.101.9:7002 64 157618348
> 6 - 172.16.101.11:7000 64 191172780
> 7 - 172.16.101.11:7001 64 191172780
> 8 - 172.16.101.11:7002 64 191172780
I couldn't reproduce this. In my environment, the last 3 nodes
stopped correctly with a network partition error. Could this be a
corosync problem?
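
To make the difference between the two membership views explicit, here is a
minimal sketch that diffs two pasted `collie node list` outputs. It only
assumes the column layout shown in the quoted output above (Idx, Host:Port,
Vnodes, Zone); the names parse_node_list and diff_views are hypothetical
helpers, not part of sheepdog or collie.

```python
import re

# One data row of `collie node list` as quoted above, e.g.
# "0 - 172.16.101.7:7000 64 124063916" (Idx - Host:Port Vnodes Zone).
ROW = re.compile(r"^\s*(\d+)\s+-\s+(\S+):(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_node_list(text):
    """Return the set of (host, port) pairs found in pasted output."""
    nodes = set()
    for line in text.splitlines():
        line = line.lstrip("> ")  # tolerate mail quoting prefixes
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            nodes.add((match.group(2), int(match.group(3))))
    return nodes


def diff_views(view_a, view_b):
    """Return (nodes only in view_a, nodes only in view_b), each sorted."""
    a = parse_node_list(view_a)
    b = parse_node_list(view_b)
    return sorted(a - b), sorted(b - a)
```

Passing the output from 0026 as view_a and the output from 0028 as view_b
lists the nodes that only the partitioned host still believes are members.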
>
> Sure enough, when I brought back the network connection to the failed node,
> things broke in the cluster:
>
> 0026# collie vdi list
> name id size used shared creation time vdi id
> ------------------------------------------------------------------
> 0334cd4a-820d-41fb-b8ff-e31ce5f43143 1 515 MB 48 MB 0.0 MB 2011-11-24 22:47 85a93d
> 29118ca3-08aa-43df-83e7-5bf1d65142a5 1 515 MB 516 MB 0.0 MB 2011-11-24 22:38 aa3feb
> failed to read object, 80eeb4fc00000000 No object found
> failed to read a inode header
>
> and on the resurrected host:
>
> 0028# collie vdi list
> name id size used shared creation time vdi id
> ------------------------------------------------------------------
> failed to read object, 8085a93d00000000 Remote node has an old epoch
> failed to read a inode header
> failed to read object, 80aa3feb00000000 Remote node has an old epoch
> failed to read a inode header
>
> I can reproduce this with a small test case tomorrow if you like, and capture
> some sheep logs?
Yes, I'd like to see the logs. :)
Thanks,
Kazutaka
>
> Best wishes,
>
> Chris.
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog