[Sheepdog] [PATCH 0/2] fix collie command errors during node member changes

Fri Dec 16 12:30:38 CET 2011

MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:

> I've sent some fixes related to network failure.  Can you try with the
> devel branch again?

Hi. I've just retried with this updated version.

When I ran with corosync-1.4.2, the remaining cluster just hung (apparently
forever) without ever timing out and detecting that the disconnected node
had died. Nothing in the sheep.log after the disconnection to give any clue
as to why, I'm afraid.

However, with corosync-1.3.4, it correctly hung for a minute and then
started responding, with However, with corosync-1.3.4, it correctly hung for
a minute and then started responding, with However, with corosync-1.3.4, it
correctly hung for a minute and then started responding, with However, with
corosync-1.3.4, it correctly hung for a minute and then started responding,
with a node listing that looks like

0026# collie node list
M   Id   Host:Port         V-Nodes       Zone
-    0   172.16.101.7:7000      64  124063916
-    1   172.16.101.7:7001      64  124063916
-    2   172.16.101.7:7002      64  124063916
-    3   172.16.101.9:7000      64  157618348
-    4   172.16.101.9:7001      64  157618348
-    5   172.16.101.9:7002      64  157618348

so it has correctly detected that the third machine has vanished.

Bizarrely, collie vdi list is succeeding on the node that I've disconnected,
but presumably that's because it happens to have a copy of the superblock?

0028# collie vdi list
  Name        Id    Size    Used  Shared    Creation time   VDI id
  7c53c905-d279-4dd2-95be-37e20dbfc494     1  515 MB  300 MB  0.0 MB 2011-12-16 10:58   7d7c86
0028# collie node list
M   Id   Host:Port         V-Nodes       Zone
-    0   172.16.101.7:7000      64  124063916
-    1   172.16.101.7:7001      64  124063916
-    2   172.16.101.7:7002      64  124063916
-    3   172.16.101.9:7000      64  157618348
-    4   172.16.101.9:7001      64  157618348
-    5   172.16.101.9:7002      64  157618348
-    6   172.16.101.11:7000     64  191172780
-    7   172.16.101.11:7001     64  191172780
-    8   172.16.101.11:7002     64  191172780

It's still doing this ten minutes after I took eth1 down. However,

0028# collie node info 
Id      Size    Used    Use%
[hang]
failed to connect to 172.16.101.7:7000: Connection timed out
0028# time collie node info
Id      Size    Used    Use%
failed to connect to 172.16.101.7:7000: Connection timed out

real    3m9.419s
user    0m0.001s
sys     0m0.001s

Nothing has appeared in the sheep.log for any of the three sheep on the host
since I took the network interface down.

I'm also a little baffled that no node appears to be the master in the above
node lists. I checked the code. It definitely is supposed to print "*" for node
master_idx, and my recent tidy-up patch hasn't broken that behaviour. Very odd.

Cheers,

Chris.