<div dir="ltr"><div><div><div><div><div><div><div>I think there's something wrong with how sheepdog handles a disconnected cable in a dual-NIC environment.<br><br></div>I formatted the cluster with '-c 2 --strict' but I still get the same behavior.</div>
<div><br></div>On the disconnected node:<br>root@test004:~# tail -5 /var/sheep/sheep.log <br>Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch<br>Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch<br>
Dec 19 15:59:48 ERROR [gway 3096] sheep_exec_req(1008) failed Request has an old epoch<br>Dec 19 15:59:48 ERROR [gway 3110] sheep_exec_req(1008) failed Request has an old epoch<br>Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch<br>
<br><br></div>On the other nodes:<br>root@test005:/usr/src# tail -5 /var/sheep/sheep.log <br>Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)<br>Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)<br>
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)<br>Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)<br>Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)<br>
<br></div>- sheep process CPU usage rises to 50-80% on all nodes.<br><br>- In less than 2 minutes sheep.log grows to 10-48M.<br><br>parallel-ssh -i -h pssh.conf 'ls -lh /var/sheep/sheep.log'<br>[1] 17:14:28 [SUCCESS] test005<br>
-rw-r--r-- 1 root root 15M dic 19 17:12 /var/sheep/sheep.log<br>[2] 17:14:28 [SUCCESS] test006<br>-rw-r--r-- 1 root root 9,5M dic 19 17:12 /var/sheep/sheep.log<br>[3] 17:14:28 [SUCCESS] test007<br>-rw-r--r-- 1 root root 25M dic 19 17:12 /var/sheep/sheep.log<br>
[4] 17:14:28 [SUCCESS] test004<br>-rw-r--r-- 1 root root 48M dic 19 17:12 /var/sheep/sheep.log<br><br><br></div>1) Once the node exits the cluster, any write request on the disconnected node shouldn't affect the other nodes.<br>
<br></div><div>How to reproduce:<br><br></div><div>all 4 nodes online;<br></div><div>unplug the eth0 cable on test004;<br></div><div>after 30 seconds, test004 leaves the cluster;<br>recovery begins and ends;<br></div>
<div>up to this point nothing strange happens;<br></div><div>when the guest on the disconnected node tries to write something, it is still able to communicate with the other nodes over the I/O NIC (eth1), and sheep starts writing to sheep.log, etc.<br>
<br></div>If you confirm this is not the right behavior, I'll file a bug on Launchpad.<br><br></div></div>
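<div>To put numbers on the log flood, here is a rough Python sketch that counts repeated entries in sheep.log by function name and message. It assumes every line has the same layout as the excerpts above (timestamp, level, [thread], function(line), message); the summarize helper is made up for illustration and is not part of sheepdog.</div>

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above, e.g.:
#   Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch
# Group 1 is the function name, group 2 is the message text.
LINE_RE = re.compile(
    r"^\w{3} +\d+ [\d:]{8} +[A-Z]+ +\[[^\]]+\] +(\w+)\(\d+\) +(.*)$"
)


def summarize(lines):
    """Count log entries per (function, message); lines that do not match are skipped."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

<div>Running summarize(open('/var/sheep/sheep.log')) on each node and printing the result of most_common(5) should show which message dominates the growth and how many times it was logged.</div>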