[sheepdog-users] About node cable disconnection

Thu Dec 19 17:52:38 CET 2013

I think there's something wrong with how sheepdog handles a disconnected
cable in a dual-NIC environment.

I formatted the cluster with '-c 2 --strict' but I still get the same
behavior.

On the disconnected node:
root@test004:~# tail -5 /var/sheep/sheep.log
Dec 19 15:59:48  ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48  ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48  ERROR [gway 3096] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48  ERROR [gway 3110] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48  ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch

On the other nodes:
root@test005:/usr/src# tail -5 /var/sheep/sheep.log
Dec 19 15:59:41  ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41  ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41  ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41  ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41  ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
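
To see at a glance which messages are flooding the log, something like the
following could be used. It is only a rough sketch: summarize_log is a
hypothetical helper (not a sheepdog tool), and it assumes that every record in
the real sheep.log sits on a single line and starts with a "Mon DD HH:MM:SS"
timestamp.

```shell
#!/bin/sh
# Hypothetical helper, not part of sheepdog: count identical messages in a
# sheep log so that a flood of one error collapses into one line with a count.
# Assumption: one record per line, starting with "Mon DD HH:MM:SS".
summarize_log() {
    # $1: path to the log file
    # First expression drops the leading timestamp; the second drops the
    # numeric worker id inside brackets, e.g. "[gway 3111]" becomes "[gway]".
    sed -E \
        -e 's/^[A-Za-z]{3} +[0-9]+ +[0-9:]+ +//' \
        -e 's/\[([a-z]+) [0-9]+\]/[\1]/' \
        "$1" | sort | uniq -c | sort -rn
}

# Example call (path taken from the post):
#   summarize_log /var/sheep/sheep.log | head
```

The most frequent message comes first, which makes it easy to compare the
disconnected node with the others.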

- CPU usage of the sheep process rises to 50-80% on all nodes.

- In less than 2 minutes, sheep.log grows 10-48M.

parallel-ssh -i -h pssh.conf 'ls -lh /var/sheep/sheep.log'
[1] 17:14:28 [SUCCESS] test005
-rw-r--r-- 1 root root 15M dic 19 17:12 /var/sheep/sheep.log
[2] 17:14:28 [SUCCESS] test006
-rw-r--r-- 1 root root 9,5M dic 19 17:12 /var/sheep/sheep.log
[3] 17:14:28 [SUCCESS] test007
-rw-r--r-- 1 root root 25M dic 19 17:12 /var/sheep/sheep.log
[4] 17:14:28 [SUCCESS] test004
-rw-r--r-- 1 root root 48M dic 19 17:12 /var/sheep/sheep.log
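
For a more precise number than the ls output, the growth over a fixed interval
could be measured with a small helper like the one below. This is just a
sketch: log_growth is a hypothetical name, and the 10-second default interval
is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print how many bytes a file grew during an interval.
log_growth() {
    # $1: file to watch, $2: seconds to wait (defaults to 10)
    file=$1
    interval=${2:-10}
    before=$(wc -c < "$file")   # size in bytes before waiting
    sleep "$interval"
    after=$(wc -c < "$file")    # size in bytes after waiting
    echo $((after - before))
}

# Example call (path taken from the post):
#   log_growth /var/sheep/sheep.log 60
```

Running it on each node while the guest is writing gives a bytes-per-interval
figure that can be compared across nodes.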

1) Once a node has left the cluster, write requests on that disconnected
node shouldn't affect the other nodes.

How to reproduce:

all 4 nodes online;
unplug the eth0 cable from test004;
after 30 seconds, test004 leaves the cluster;
recovery begins and ends;
up to this point nothing strange happens;
when the guest on the disconnected node tries to write something, it's still
able to communicate with the other nodes through the I/O NIC (eth1), and
sheep starts writing to sheep.log, etc.

If you confirm this is not the right behavior, I'll file a bug on Launchpad.