[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Fri Nov 25 09:57:59 CET 2011


At Thu, 24 Nov 2011 23:00:12 +0000,
Chris Webb wrote:
> 
> Hi. I pulled the current head of devel, 075306fb23, and when I failed a node
> by taking the eth1 down, a collie vdi list worked correctly on one of the
> remaining nodes:
> 
> 0026# collie vdi list
>   name        id    size    used  shared    creation time   vdi id
> ------------------------------------------------------------------
>   0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
>   29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb
> 
> No 'failed to read object' error messages this time, so it looks like the
> cluster survives a node failing now.
> 
> However, on the failed node, the sheep didn't seem to detect the partition:
> it was still running and collie node list showed all the nodes:
> 
> [host in the cluster]
> 0026# collie node list
>    Idx - Host:Port          Vnodes       Zone
> ---------------------------------------------
>      0 - 172.16.101.7:7000      64  124063916
>      1 - 172.16.101.7:7001      64  124063916
>      2 - 172.16.101.7:7002      64  124063916
>      3 - 172.16.101.9:7000      64  157618348
>      4 - 172.16.101.9:7001      64  157618348
>      5 - 172.16.101.9:7002      64  157618348
> 
> [host partitioned from network]
> 0028# collie node list
>    Idx - Host:Port          Vnodes       Zone
> ---------------------------------------------
>      0 - 172.16.101.7:7000      64  124063916
>      1 - 172.16.101.7:7001      64  124063916
>      2 - 172.16.101.7:7002      64  124063916
>      3 - 172.16.101.9:7000      64  157618348
>      4 - 172.16.101.9:7001      64  157618348
>      5 - 172.16.101.9:7002      64  157618348
>      6 - 172.16.101.11:7000     64  191172780
>      7 - 172.16.101.11:7001     64  191172780
>      8 - 172.16.101.11:7002     64  191172780

I couldn't reproduce this.  In my environment, the last 3 nodes
stopped correctly with a network partition error.  Could this be a
corosync problem?
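
If you can still reach the partitioned node, it would also be worth
checking whether corosync itself noticed the membership change there,
for example (assuming the corosync 1.x command line tools are
installed):

0028# corosync-cfgtool -s
0028# corosync-objctl | grep -i member

If corosync still lists all three hosts as members on 0028, the
problem is in corosync (or the network setup) rather than in sheep.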

> 
> Sure enough, when I brought back the network connection to the failed node,
> things broke in the cluster:
> 
> 0026# collie vdi list
>   name        id    size    used  shared    creation time   vdi id
> ------------------------------------------------------------------
>   0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
>   29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb
> failed to read object, 80eeb4fc00000000 No object found
> failed to read a inode header
> 
> and on the resurrected host:
> 
> 0028# collie vdi list
>   name        id    size    used  shared    creation time   vdi id
> ------------------------------------------------------------------
> failed to read object, 8085a93d00000000 Remote node has an old epoch
> failed to read a inode header
> failed to read object, 80aa3feb00000000 Remote node has an old epoch
> failed to read a inode header
> 
> I can reproduce this with a small test case tomorrow if you like, and capture
> some sheep logs?

Yes, I'd like to see the logs. :)
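
If I remember right, each sheep writes its log to sheep.log under the
store directory it was started with, so the logs from the partitioned
node and one of the surviving nodes should be enough.  For example
(the store path here is just a placeholder):

0028# cp /path/to/store/sheep.log sheep-0028.log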

Thanks,

Kazutaka

> 
> Best wishes,
> 
> Chris. 
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog


