I tested again with the latest stable release of corosync, version 1.4.2.
In this case, the behaviour is different, but still odd!

I start with a completely blank cluster on 002{6,7,8}, three O_DIRECT
sheep daemons per host:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780
0026# collie cluster format --copies=2
0026# collie vdi create test 1G
0026# collie vdi create test2 1G

Now I kill the network on 0028:

0028# ip link set eth1 down
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
[HANG]
^C
0028# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780

Hmm, it hasn't noticed that it's partitioned. Meanwhile, back on 0026:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# sleep 60
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header

However, if I wait a bit longer:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  test         1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   7c2b25
  test2        1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   fd3815

...it's okay again.
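
Incidentally, if it's useful next time round, a rough way to time how long
the eviction takes would be to poll membership from 0026 straight after
taking eth1 down on 0028. Something like this untested sketch should do;
it assumes 172.16.101.11 is 0028's cluster address, as the node lists
above suggest:

    # hypothetical timing loop, run on 0026 right after "ip link set eth1 down"
    start=$(date +%s)
    while collie node list | grep -q 172.16.101.11; do
        sleep 5
    done
    echo "0028's daemons dropped out after $(( $(date +%s) - start ))s"

That would at least tell us whether the delay is roughly what we'd expect
from corosync's failure-detection timeouts or something slower on the
sheepdog side.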
Time to bring back the machine with the missing network:

0028# ip link set eth1 up
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has an old epoch
failed to read a inode header

[wait a bit]

0028# collie vdi list
there is no active sheep daemons [sic]

but they haven't really exited:

0028# ps ax | grep sheep
 1798 ?        Ssl    0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1801 ?        Ss     0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1819 ?        Ssl    0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1822 ?        Ss     0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1840 ?        Ssl    0:00 sheep -D -p 7002 /mnt/sheep-0028-02
 1842 ?        Ss     0:00 sheep -D -p 7002 /mnt/sheep-0028-02

Presumably they're not forwarding properly, though, if they're not
responding to collie vdi list?

I've popped the log files from this test session at:

http://cdw.me.uk/tmp/sheep-0026-00.log
http://cdw.me.uk/tmp/sheep-0026-01.log
http://cdw.me.uk/tmp/sheep-0026-02.log
http://cdw.me.uk/tmp/sheep-0027-00.log
http://cdw.me.uk/tmp/sheep-0027-01.log
http://cdw.me.uk/tmp/sheep-0027-02.log
http://cdw.me.uk/tmp/sheep-0028-00.log
http://cdw.me.uk/tmp/sheep-0028-01.log
http://cdw.me.uk/tmp/sheep-0028-02.log

Unfortunately there doesn't seem to be much that's helpful in there.
I'll try with the latest 1.3.x corosync next to see whether the
behaviour is the same.

Best wishes,

Chris.
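
P.S. Next time I reproduce this I'll also check whether the 0028 daemons
are at least still accepting TCP connections on their listen ports while
collie is claiming there are no active daemons, with a quick loop like the
one below on 0028 (assuming nc is installed there and the daemons listen
on localhost, which is where collie seems to be connecting):

    # hypothetical check: are the sheep daemons on 0028 still accepting
    # TCP connections on ports 7000-7002, as started above?
    for port in 7000 7001 7002; do
        if nc -z 127.0.0.1 $port; then
            echo "port $port: still accepting connections"
        else
            echo "port $port: not accepting connections"
        fi
    done

That should at least distinguish "process alive but request path wedged"
from "not listening at all".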