At Fri, 16 Dec 2011 11:30:38 +0000,
Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > I've sent some fixes related to network failure. Can you try with the
> > devel branch again?
>
> Hi. I've just retried with this updated version.
>
> When I ran with corosync-1.4.2, the remaining cluster just hung
> (apparently forever) without ever timing out and detecting that the
> disconnected node had died. Nothing in the sheep.log after the
> disconnection to give any clue as to why, I'm afraid.
>
> However, with corosync-1.3.4, it correctly hung for a minute and then
> started responding, with a node listing that looks like
>
> 0026# collie node list
>    M   Id   Host:Port           V-Nodes       Zone
>    -    0   172.16.101.7:7000        64  124063916
>    -    1   172.16.101.7:7001        64  124063916
>    -    2   172.16.101.7:7002        64  124063916
>    -    3   172.16.101.9:7000        64  157618348
>    -    4   172.16.101.9:7001        64  157618348
>    -    5   172.16.101.9:7002        64  157618348
>
> so it has correctly detected that the third machine has vanished.
>
> Bizarrely, collie vdi list is succeeding on the node that I've
> disconnected, but presumably that's because it happens to have a copy of
> the superblock?
>
> 0028# collie vdi list
>   Name                                   Id   Size    Used    Shared   Creation time      VDI id
>   7c53c905-d279-4dd2-95be-37e20dbfc494    1   515 MB  300 MB  0.0 MB   2011-12-16 10:58   7d7c86
> 0028# collie node list
>    M   Id   Host:Port           V-Nodes       Zone
>    -    0   172.16.101.7:7000        64  124063916
>    -    1   172.16.101.7:7001        64  124063916
>    -    2   172.16.101.7:7002        64  124063916
>    -    3   172.16.101.9:7000        64  157618348
>    -    4   172.16.101.9:7001        64  157618348
>    -    5   172.16.101.9:7002        64  157618348
>    -    6   172.16.101.11:7000       64  191172780
>    -    7   172.16.101.11:7001       64  191172780
>    -    8   172.16.101.11:7002       64  191172780
>
> It's still doing this ten minutes after I took eth1 down. However,
>
> 0028# collie node info
> Id   Size   Used   Use%
> [hang]
> failed to connect to 172.16.101.7:7000: Connection timed out
> 0028# time collie node info
> Id   Size   Used   Use%
> failed to connect to 172.16.101.7:7000: Connection timed out
>
> real    3m9.419s
> user    0m0.001s
> sys     0m0.001s
>
> Nothing has appeared in the sheep.log for any of the three sheep on the
> host since I took the network interface down.

Sorry, I couldn't reproduce it. I'll keep this problem in mind, but I'm
thinking of releasing 0.3.0, because it seems that the fatal blocking
problem doesn't happen if you use corosync 1.3.x.

> I'm also a little baffled that no node appears to be the master in the
> above node lists. I checked the code. It definitely is supposed to print
> "*" for node master_idx, and my recent tidy-up patch hasn't broken that
> behaviour. Very odd.

Currently, it is up to the cluster driver in use whether a master node
needs to be elected. In addition, I thought that the administrator
doesn't need to know which node is the master, so I removed that
information from the output. Is that inconvenient for you?

Thanks,

Kazutaka
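[Archive note: the failure test discussed in this thread boils down to something like the sketch below. The interface name, addresses, and ports are taken from Chris's setup; it assumes a running sheepdog cluster with `collie` on the path, so it is a reproduction outline rather than a standalone script.]

```shell
# On the node to isolate (e.g. 172.16.101.11): take the cluster
# interface down to simulate a network failure.
ifconfig eth1 down

# On a surviving node: after corosync's failure detection fires
# (about a minute with corosync 1.3.x in Chris's test), the dead
# node's entries should drop out of the listing.
collie node list

# 'collie node info' contacts the listed nodes for their stats, so if
# a vanished node is still listed it can block for a full TCP connect
# timeout (just over 3 minutes in the transcript above).
time collie node info
```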