[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes
Chris Webb
chris at arachsys.com
Fri Dec 16 12:30:38 CET 2011
MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
> I've sent some fixes related to network failure. Can you try with the
> devel branch again?
Hi. I've just retried with this updated version.
When I ran with corosync-1.4.2, the remaining cluster just hung (apparently
forever) without ever timing out and detecting that the disconnected node
had died. There was nothing in sheep.log after the disconnection to give any
clue as to why, I'm afraid.
However, with corosync-1.3.4, it correctly hung for a minute and then
started responding, with a node listing that looks like
0026# collie node list
M Id Host:Port V-Nodes Zone
- 0 172.16.101.7:7000 64 124063916
- 1 172.16.101.7:7001 64 124063916
- 2 172.16.101.7:7002 64 124063916
- 3 172.16.101.9:7000 64 157618348
- 4 172.16.101.9:7001 64 157618348
- 5 172.16.101.9:7002 64 157618348
so it has correctly detected that the third machine has vanished.
Bizarrely, collie vdi list is succeeding on the node that I've disconnected,
but presumably that's because it happens to have a copy of the superblock?
0028# collie vdi list
Name Id Size Used Shared Creation time VDI id
7c53c905-d279-4dd2-95be-37e20dbfc494 1 515 MB 300 MB 0.0 MB 2011-12-16 10:58 7d7c86
0028# collie node list
M Id Host:Port V-Nodes Zone
- 0 172.16.101.7:7000 64 124063916
- 1 172.16.101.7:7001 64 124063916
- 2 172.16.101.7:7002 64 124063916
- 3 172.16.101.9:7000 64 157618348
- 4 172.16.101.9:7001 64 157618348
- 5 172.16.101.9:7002 64 157618348
- 6 172.16.101.11:7000 64 191172780
- 7 172.16.101.11:7001 64 191172780
- 8 172.16.101.11:7002 64 191172780
It's still doing this ten minutes after I took eth1 down. However,
0028# collie node info
Id Size Used Use%
[hang]
failed to connect to 172.16.101.7:7000: Connection timed out
0028# time collie node info
Id Size Used Use%
failed to connect to 172.16.101.7:7000: Connection timed out
real 3m9.419s
user 0m0.001s
sys 0m0.001s
Nothing has appeared in the sheep.log for any of the three sheep on the host
since I took the network interface down.
I'm also a little baffled that no node appears to be the master in the above
node lists. I checked the code. It definitely is supposed to print "*" for node
master_idx, and my recent tidy-up patch hasn't broken that behaviour. Very odd.
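For anyone wanting to check this mechanically, here is a minimal sketch that parses a node listing like the ones quoted above, reports which rows carry the "*" master marker, and lists the hosts that differ between two listings. It is not part of collie or sheepdog: the function names are made up for illustration, and the regular expression assumes whitespace-separated columns exactly as they appear in this message, which may not match the real column layout.

```python
import re

# One row of the node listing as quoted in this message, e.g.
#   "- 0 172.16.101.7:7000 64 124063916"
# Assumption: columns are separated by whitespace and the first column is a
# single marker character ("*" for the master, "-" otherwise).
ROW = re.compile(
    r"^(?P<mark>\S)\s+(?P<id>\d+)\s+(?P<host>[\d.]+):(?P<port>\d+)"
    r"\s+(?P<vnodes>\d+)\s+(?P<zone>\d+)$"
)

def parse_node_list(text):
    """Return one dict per node row; header and prompt lines are skipped."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            nodes.append({
                "mark": m.group("mark"),
                "id": int(m.group("id")),
                "host": m.group("host"),
                "port": int(m.group("port")),
                "zone": int(m.group("zone")),
            })
    return nodes

def master_ids(nodes):
    """Ids of the rows flagged with '*' (expected to be exactly one)."""
    return [n["id"] for n in nodes if n["mark"] == "*"]

def host_set(nodes):
    """Distinct host addresses present in a listing."""
    return {n["host"] for n in nodes}

# Listing taken from the surviving part of the cluster, as quoted above.
SAMPLE = """\
0026# collie node list
M Id Host:Port V-Nodes Zone
- 0 172.16.101.7:7000 64 124063916
- 1 172.16.101.7:7001 64 124063916
- 2 172.16.101.7:7002 64 124063916
- 3 172.16.101.9:7000 64 157618348
- 4 172.16.101.9:7001 64 157618348
- 5 172.16.101.9:7002 64 157618348
"""

if __name__ == "__main__":
    nodes = parse_node_list(SAMPLE)
    print("nodes:", len(nodes))
    print("masters:", master_ids(nodes))
    print("hosts:", sorted(host_set(nodes)))
```

Running the same parser over the listing from the disconnected node and taking the difference of the two host sets would show which machine each side believes is still a member.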
Cheers,
Chris.