<div dir="ltr"><div>2 node using 2 nics each (eth0, eth1).<br></div>sheep 0.7.0_197_g9f718d2, corosync 1.4.6, debian wheezy 64 bit.<br><div><br>1) shutdown of the switch (and power back)<br>- all node still in the cluster<br>
- no recovery<br>- nothing in sheep.log<br>- corosync still alive on bot nodes<br></div><div><br>2) remove cable from eth0 of node id 1 (the nic used by corosync):<br>- 'dog node list' on node id 0 was still showing node id 1<br>
- no recovery has started<br>- nothing reported in sheep.log<br>- once the cable was back, node id 1 was still in te cluster and nothing on its sheep.log<br>- after some minutes I noticed it made a check<br> (this is a previous recovery)<br>
 Nov 13 15:50:30 INFO [main] recover_object_main(841) object 53941900001268 is recovered (17609/17610)
 Nov 13 15:50:30 INFO [main] recover_object_main(841) object 539419000005e1 is recovered (17610/17610)
 (this is the new one)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 5394190000089e is recovered (1/17610)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 539419000003d5 is recovered (2/17610)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 687c40000000bf is recovered (3/17610)
...many others...


3) Remove both cables from node id 1 and insert them back after ~10 seconds.

root@test004:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776

root@test005:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776
  1   192.168.2.45:7000       128   755148992

Sheep and corosync are alive on both nodes, but they are not really aware of each other.
Nothing shows up in either sheep.log.
Both nodes show the right 'vdi list'.

I tried to check a small vdi:

root@test004:~# dog vdi check boot_iso
ABORT: Not enough active nodes for consistency-check

(That's expected: test004 can't check the other copies of an object if it sees itself alone in the cluster.)

root@test005:~# dog vdi check boot_iso
100.0 %
finish check&repair boot_iso

I was expecting the check to fail the same way, but it completed successfully.

It's like node id 1 is aware of node id 0 but not vice versa.

This is the same (or almost the same) situation I got with a production cluster.
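
To see the asymmetry from a single host, each sheep daemon can be queried directly. This is only a quick sketch; it assumes dog's usual -a/-p options for addressing a specific daemon and the addresses/port shown above:

 #!/bin/bash
 # Ask each sheep daemon for its own view of the cluster membership.
 for node in 192.168.2.44 192.168.2.45; do
     echo "== membership as seen by ${node} =="
     dog node list -a "${node}" -p 7000
 done

In the state above, the first listing should show node 0 alone and the second should show both nodes, which is exactly the one-way awareness described.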

I ran 'cluster shutdown' on node id 1, but only its sheep daemon died.
I then ran 'cluster shutdown' on node id 0 and its daemon died as well.
(I was expecting to be forced to use kill -9.)

Now I started sheep on both nodes to restart the cluster, but they are in a sort of split brain (each aware only of itself).

root@test004:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776

root@test004:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:07:00      9 [192.168.2.44:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]


root@test005:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.45:7000       128   755148992

root@test005:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:24:09      9 [192.168.2.45:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]
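
Both nodes agree up to epoch 8 and then each one created its own epoch 9, which is where the histories diverge. With the daemons stopped, the same history can be checked from the epoch files on disk. A minimal sketch, assuming sheep was started with the default /var/lib/sheepdog store directory (whose epoch/ subdirectory keeps one file per epoch; adjust the path if sheep was started with a different directory):

 # List the recorded epochs on this node.
 ls -l /var/lib/sheepdog/epoch/

 # Checksum the epoch files on both nodes: the files up to epoch 8 should
 # match, while each node's own epoch 9 should differ, confirming that the
 # two sides formed separate memberships for that epoch.
 for h in test004 test005; do
     echo "== ${h} =="
     ssh "${h}" 'md5sum /var/lib/sheepdog/epoch/*'
 done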

I stopped sheep and corosync on both nodes and went looking through /var/log/syslog.

I see many of:

Nov 13 16:07:01 test004 /USR/SBIN/CRON[5369]: (root) CMD (/root/script/monitor_ram.sh >> /var/log/monitor_ram.log 2>&1)
Nov 13 16:07:04 test004 corosync[2959]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 13 16:07:04 test004 corosync[2959]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.44) ; members(old:1 left:0)
Nov 13 16:07:04 test004 corosync[2959]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 13 16:07:08 test004 corosync[2959]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 13 16:07:08 test004 corosync[2959]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.44) ; members(old:1 left:0)

And that may be correct, because nodes left the cluster.

On node id 1 there are some messages I hadn't seen before:

Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
...

And I also see many of:

Nov 13 15:53:28 test005 corosync[2634]: [TOTEM ] Retransmit List: 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36 23 24 25 26 27 28 29 2a 2b 2c
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 23 24 25 26 27 28 29 2a 2b 2c 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36 23 24 25 26 27 28 29 2a 2b 2c
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 23 24 25 26 27 28 29 2a 2b 2c 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36

and I think that is expected: I had removed both of its cables.

After restarting both corosync and sheep I was able to restart the cluster.

dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776
  1   192.168.2.45:7000       128   755148992

And a recovery started on node id 1.

root@test004:~# dog node recovery
Nodes In Recovery:
 Id   Host:Port           V-Nodes   Zone        Progress
  1   192.168.2.45:7000       128   755148992       4.1%

I was not expecting to see a recovery start.
In this case node id 1 is recovering, so node id 0 is "the good one".
I wonder if this simply depends on the order in which I started the sheep daemons.
If I had started sheep first on test005 and then on test004, would test004 have been recovered instead?


Finally:

root@test005:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:41:28     10 [192.168.2.44:7000, 192.168.2.45:7000]
2013-11-13 16:24:09      9 [192.168.2.45:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]
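
For reference, corosync membership in this setup runs over eth0 only (eth1 carries no ring). A minimal single-ring totem section of the kind assumed above would look roughly like this; it is a sketch, not the exact file in use, with bindnetaddr matching the 192.168.2.0 network of the addresses shown earlier and an arbitrary multicast address:

 totem {
     version: 2
     # Single ring bound to the eth0 subnet: if this link goes down,
     # corosync (and with it sheep's view of the membership) loses the
     # other node even though eth1 is still up.
     interface {
         ringnumber: 0
         bindnetaddr: 192.168.2.0
         mcastaddr: 239.255.1.1
         mcastport: 5405
     }
 }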