MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:

> I've sent some fixes related to network failure. Can you try with the
> devel branch again?

Hi. I've just retried with this updated version.

When I ran with corosync-1.4.2, the remaining cluster just hung
(apparently forever) without ever timing out and detecting that the
disconnected node had died. Nothing in the sheep.log after the
disconnection to give any clue as to why, I'm afraid.

However, with corosync-1.3.4, it correctly hung for a minute and then
started responding, with a node listing that looks like

0026# collie node list
M   Id   Host:Port            V-Nodes        Zone
-    0   172.16.101.7:7000         64   124063916
-    1   172.16.101.7:7001         64   124063916
-    2   172.16.101.7:7002         64   124063916
-    3   172.16.101.9:7000         64   157618348
-    4   172.16.101.9:7001         64   157618348
-    5   172.16.101.9:7002         64   157618348

so it has correctly detected that the third machine has vanished.

Bizarrely, collie vdi list is succeeding on the node that I've
disconnected, but presumably that's because it happens to have a copy
of the superblock?

0028# collie vdi list
  Name                                   Id    Size    Used  Shared    Creation time   VDI id
  7c53c905-d279-4dd2-95be-37e20dbfc494    1  515 MB  300 MB  0.0 MB 2011-12-16 10:58   7d7c86

0028# collie node list
M   Id   Host:Port            V-Nodes        Zone
-    0   172.16.101.7:7000         64   124063916
-    1   172.16.101.7:7001         64   124063916
-    2   172.16.101.7:7002         64   124063916
-    3   172.16.101.9:7000         64   157618348
-    4   172.16.101.9:7001         64   157618348
-    5   172.16.101.9:7002         64   157618348
-    6   172.16.101.11:7000        64   191172780
-    7   172.16.101.11:7001        64   191172780
-    8   172.16.101.11:7002        64   191172780

It's still doing this ten minutes after I took eth1 down.
However,

0028# collie node info
Id      Size    Used    Use%
[hang]
failed to connect to 172.16.101.7:7000: Connection timed out

0028# time collie node info
Id      Size    Used    Use%
failed to connect to 172.16.101.7:7000: Connection timed out

real    3m9.419s
user    0m0.001s
sys     0m0.001s

Nothing has appeared in the sheep.log for any of the three sheep on
the host since I took the network interface down.

I'm also a little baffled that no node appears to be the master in the
above node lists. I checked the code. It definitely is supposed to
print "*" for node master_idx, and my recent tidy-up patch hasn't
broken that behaviour. Very odd.

Cheers,

Chris.