At Fri, 16 Dec 2011 11:30:38 +0000,
Chris Webb wrote:
>
> MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
>
> > I've sent some fixes related to network failure. Can you try with the
> > devel branch again?
>
> Hi. I've just retried with this updated version.
>
> When I ran with corosync-1.4.2, the remaining cluster just hung
> (apparently forever) without ever timing out and detecting that the
> disconnected node had died. Nothing in the sheep.log after the
> disconnection to give any clue as to why, I'm afraid.
>
> However, with corosync-1.3.4, it correctly hung for a minute and then
> started responding, with a node listing that looks like
>
> 0026# collie node list
>    M   Id   Host:Port           V-Nodes       Zone
>    -    0   172.16.101.7:7000        64  124063916
>    -    1   172.16.101.7:7001        64  124063916
>    -    2   172.16.101.7:7002        64  124063916
>    -    3   172.16.101.9:7000        64  157618348
>    -    4   172.16.101.9:7001        64  157618348
>    -    5   172.16.101.9:7002        64  157618348
>
> so it has correctly detected that the third machine has vanished.
>
> Bizarrely, collie vdi list is succeeding on the node that I've
> disconnected, but presumably that's because it happens to have a copy of
> the superblock?
>
> 0028# collie vdi list
>   Name                                   Id   Size    Used    Shared   Creation time      VDI id
>   7c53c905-d279-4dd2-95be-37e20dbfc494    1   515 MB  300 MB  0.0 MB   2011-12-16 10:58   7d7c86
> 0028# collie node list
>    M   Id   Host:Port           V-Nodes       Zone
>    -    0   172.16.101.7:7000        64  124063916
>    -    1   172.16.101.7:7001        64  124063916
>    -    2   172.16.101.7:7002        64  124063916
>    -    3   172.16.101.9:7000        64  157618348
>    -    4   172.16.101.9:7001        64  157618348
>    -    5   172.16.101.9:7002        64  157618348
>    -    6   172.16.101.11:7000       64  191172780
>    -    7   172.16.101.11:7001       64  191172780
>    -    8   172.16.101.11:7002       64  191172780
>
> It's still doing this ten minutes after I took eth1 down. However,
>
> 0028# collie node info
> Id   Size   Used   Use%
> [hang]
> failed to connect to 172.16.101.7:7000: Connection timed out
> 0028# time collie node info
> Id   Size   Used   Use%
> failed to connect to 172.16.101.7:7000: Connection timed out
>
> real    3m9.419s
> user    0m0.001s
> sys     0m0.001s
>
> Nothing has appeared in the sheep.log for any of the three sheep on the
> host since I took the network interface down.

Sorry, I couldn't reproduce it. I'll keep this problem in mind, but I'm
thinking of releasing 0.3.0, because it seems that the fatal blocking
problem doesn't happen if you use corosync 1.3.x.

> I'm also a little baffled that no node appears to be the master in the
> above node lists. I checked the code. It definitely is supposed to print
> "*" for node master_idx, and my recent tidy-up patch hasn't broken that
> behaviour. Very odd.

Currently, it is up to the cluster driver in use whether a master node
needs to be elected. In addition, I thought that the administrator
doesn't need to know which node is the master, so I removed that
information from the output. Is that inconvenient for you?

Thanks,

Kazutaka
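[Archive note: the failure test discussed in this thread boils down to something like the sketch below. The interface name, addresses, and ports are taken from Chris's setup; it assumes a running sheepdog cluster with `collie` on the path, so it is a reproduction outline rather than a standalone script.]

```shell
# On the node to isolate (e.g. 172.16.101.11): take the cluster
# interface down to simulate a network failure.
ifconfig eth1 down

# On a surviving node: after corosync's failure detection fires
# (about a minute with corosync 1.3.x in Chris's test), the dead
# node's entries should drop out of the listing.
collie node list

# 'collie node info' contacts the listed nodes for their stats, so if
# a vanished node is still listed it can block for a full TCP connect
# timeout (just over 3 minutes in the transcript above).
time collie node info
```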