[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed Dec 28 17:42:14 CET 2011
At Fri, 16 Dec 2011 11:30:38 +0000,
Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > I've sent some fixes related to network failure. Can you try with the
> > devel branch again?
>
> Hi. I've just retried with this updated version.
>
> When I ran with corosync-1.4.2, the remaining cluster just hung (apparently
> forever) without ever timing out and detecting that the disconnected node
> had died. Nothing in the sheep.log after the disconnection to give any clue
> as to why, I'm afraid.
>
> However, with corosync-1.3.4, it correctly hung for a minute and then
> started responding, with a node listing that looks like
>
> 0026# collie node list
> M Id Host:Port V-Nodes Zone
> - 0 172.16.101.7:7000 64 124063916
> - 1 172.16.101.7:7001 64 124063916
> - 2 172.16.101.7:7002 64 124063916
> - 3 172.16.101.9:7000 64 157618348
> - 4 172.16.101.9:7001 64 157618348
> - 5 172.16.101.9:7002 64 157618348
>
> so it has correctly detected that the third machine has vanished.
>
> Bizarrely, collie vdi list is succeeding on the node that I've disconnected,
> but presumably that's because it happens to have a copy of the superblock?
>
> 0028# collie vdi list
> Name Id Size Used Shared Creation time VDI id
> 7c53c905-d279-4dd2-95be-37e20dbfc494 1 515 MB 300 MB 0.0 MB 2011-12-16 10:58 7d7c86
> 0028# collie node list
> M Id Host:Port V-Nodes Zone
> - 0 172.16.101.7:7000 64 124063916
> - 1 172.16.101.7:7001 64 124063916
> - 2 172.16.101.7:7002 64 124063916
> - 3 172.16.101.9:7000 64 157618348
> - 4 172.16.101.9:7001 64 157618348
> - 5 172.16.101.9:7002 64 157618348
> - 6 172.16.101.11:7000 64 191172780
> - 7 172.16.101.11:7001 64 191172780
> - 8 172.16.101.11:7002 64 191172780
>
> It's still doing this ten minutes after I took eth1 down. However,
>
> 0028# collie node info
> Id Size Used Use%
> [hang]
> failed to connect to 172.16.101.7:7000: Connection timed out
> 0028# time collie node info
> Id Size Used Use%
> failed to connect to 172.16.101.7:7000: Connection timed out
>
> real 3m9.419s
> user 0m0.001s
> sys 0m0.001s
>
> Nothing has appeared in the sheep.log for any of the three sheep on the host
> since I took the network interface down.
Sorry, I couldn't reproduce it. I'll keep this problem in mind, but I
am thinking of releasing 0.3.0 anyway, because the fatal blocking
problem doesn't seem to happen if you use corosync 1.3.x.
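As an aside on the three-minute wait in the quoted "collie node info"
run: the "Connection timed out" message suggests the client only gave
up once the operating system's own connect timeout expired. The sketch
below is illustrative only and is not collie's code; the function name
can_connect and the 5-second default are assumptions. It shows how a
client can bound the wait explicitly:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; not part of sheepdog or collie.
    """
    try:
        # create_connection applies `timeout` to the connect attempt itself,
        # so an unreachable peer fails after `timeout` seconds at most.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refusal, unreachable networks, and socket.timeout
        # (which is a subclass of OSError).
        return False
```

With a check like this, a tool could skip or report an unreachable
node after a few seconds instead of blocking on each connect attempt.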
>
> I'm also a little baffled that no node appears to be the master in the above
> node lists. I checked the code. It definitely is supposed to print "*" for node
> master_idx, and my recent tidy-up patch hasn't broken that behaviour. Very odd.
Currently, it is up to the cluster driver in use whether a master
node needs to be elected. In addition, I thought that the administrator
doesn't need to know which node is the master, so I removed that info
from the output. Is that inconvenient for you?
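
To make the two behaviours concrete, here is a rough sketch of a node
listing that marks the master with "*" only when a master index is
known, and prints "-" for every node otherwise. It is not the collie
implementation; format_node_list and its arguments are hypothetical
names chosen for illustration, and the column layout simply mimics the
transcripts quoted above:

```python
def format_node_list(nodes, master_idx=None):
    """Format nodes as lines in the style of the quoted `collie node list` output.

    nodes      -- list of (host_port, vnodes, zone) tuples (assumed layout)
    master_idx -- index of the master node, or None if the cluster driver
                  does not elect one (assumption made for this sketch)
    """
    lines = ["M Id Host:Port V-Nodes Zone"]
    for idx, (host_port, vnodes, zone) in enumerate(nodes):
        # "*" marks the master; every other node gets "-".
        marker = "*" if idx == master_idx else "-"
        lines.append(f"{marker} {idx} {host_port} {vnodes} {zone}")
    return lines
```

Calling it without master_idx reproduces the all-"-" listing seen in
the output above.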
Thanks,
Kazutaka