Hi.

I pulled the current head of devel, 075306fb23, and when I failed a node by
taking the eth1 down, a collie vdi list worked correctly on one of the
remaining nodes:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
  29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb

No 'failed to read object' error messages this time, so it looks like the
cluster survives a node failing now.

However, on the failed node, the sheep didn't seem to detect the partition:
it was still running and collie node list showed all the nodes:

[host in the cluster]
0026# collie node list
   Idx - Host:Port              Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64      124063916
     1 - 172.16.101.7:7001      64      124063916
     2 - 172.16.101.7:7002      64      124063916
     3 - 172.16.101.9:7000      64      157618348
     4 - 172.16.101.9:7001      64      157618348
     5 - 172.16.101.9:7002      64      157618348

[host partitioned from network]
0028# collie node list
   Idx - Host:Port              Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64      124063916
     1 - 172.16.101.7:7001      64      124063916
     2 - 172.16.101.7:7002      64      124063916
     3 - 172.16.101.9:7000      64      157618348
     4 - 172.16.101.9:7001      64      157618348
     5 - 172.16.101.9:7002      64      157618348
     6 - 172.16.101.11:7000     64      191172780
     7 - 172.16.101.11:7001     64      191172780
     8 - 172.16.101.11:7002     64      191172780

Sure enough, when I brought back the network connection to the failed node,
things broke in the cluster:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  0334cd4a-820d-41fb-b8ff-e31ce5f43143     1  515 MB   48 MB  0.0 MB 2011-11-24 22:47   85a93d
  29118ca3-08aa-43df-83e7-5bf1d65142a5     1  515 MB  516 MB  0.0 MB 2011-11-24 22:38   aa3feb
failed to read object, 80eeb4fc00000000 No object found
failed to read a inode header

and on the resurrected host:

0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 8085a93d00000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80aa3feb00000000 Remote node has an old epoch
failed to read a inode header

I can reproduce this with a small test case tomorrow if you like, and
capture some sheep logs?

Best wishes,

Chris.