At Thu, 24 Nov 2011 23:00:12 +0000,
Chris Webb wrote:
>
> Hi. I pulled the current head of devel, 075306fb23, and when I failed a node
> by taking the eth1 down, a collie vdi list worked correctly on one of the
> remaining nodes:
>
>   0026# collie vdi list
>     name                                  id  size    used    shared  creation time     vdi id
>   ------------------------------------------------------------------
>     0334cd4a-820d-41fb-b8ff-e31ce5f43143  1   515 MB  48 MB   0.0 MB  2011-11-24 22:47  85a93d
>     29118ca3-08aa-43df-83e7-5bf1d65142a5  1   515 MB  516 MB  0.0 MB  2011-11-24 22:38  aa3feb
>
> No 'failed to read object' error messages this time, so it looks like the
> cluster survives a node failing now.
>
> However, on the failed node, the sheep didn't seem to detect the partition:
> it was still running and collie node list showed all the nodes:
>
>   [host in the cluster]
>   0026# collie node list
>      Idx - Host:Port              Vnodes  Zone
>   ---------------------------------------------
>        0 - 172.16.101.7:7000      64      124063916
>        1 - 172.16.101.7:7001      64      124063916
>        2 - 172.16.101.7:7002      64      124063916
>        3 - 172.16.101.9:7000      64      157618348
>        4 - 172.16.101.9:7001      64      157618348
>        5 - 172.16.101.9:7002      64      157618348
>
>   [host partitioned from network]
>   0028# collie node list
>      Idx - Host:Port              Vnodes  Zone
>   ---------------------------------------------
>        0 - 172.16.101.7:7000      64      124063916
>        1 - 172.16.101.7:7001      64      124063916
>        2 - 172.16.101.7:7002      64      124063916
>        3 - 172.16.101.9:7000      64      157618348
>        4 - 172.16.101.9:7001      64      157618348
>        5 - 172.16.101.9:7002      64      157618348
>        6 - 172.16.101.11:7000     64      191172780
>        7 - 172.16.101.11:7001     64      191172780
>        8 - 172.16.101.11:7002     64      191172780

I couldn't reproduce this. In my environment, the last 3 nodes stopped
correctly with a network partition error. Could this be a corosync problem?

>
> Sure enough, when I brought back the network connection to the failed node,
> things broke in the cluster:
>
>   0026# collie vdi list
>     name                                  id  size    used    shared  creation time     vdi id
>   ------------------------------------------------------------------
>     0334cd4a-820d-41fb-b8ff-e31ce5f43143  1   515 MB  48 MB   0.0 MB  2011-11-24 22:47  85a93d
>     29118ca3-08aa-43df-83e7-5bf1d65142a5  1   515 MB  516 MB  0.0 MB  2011-11-24 22:38  aa3feb
>   failed to read object, 80eeb4fc00000000 No object found
>   failed to read a inode header
>
> and on the resurrected host:
>
>   0028# collie vdi list
>     name  id  size  used  shared  creation time  vdi id
>   ------------------------------------------------------------------
>   failed to read object, 8085a93d00000000 Remote node has an old epoch
>   failed to read a inode header
>   failed to read object, 80aa3feb00000000 Remote node has an old epoch
>   failed to read a inode header
>
> I can reproduce this with a small test case tomorrow if you like, and capture
> some sheep logs?

Yes, I'd like to see the logs. :)

Thanks,

Kazutaka

>
> Best wishes,
>
> Chris.
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog