<div dir="ltr"><div>2 node using 2 nics each (eth0, eth1).<br></div>sheep 0.7.0_197_g9f718d2, corosync 1.4.6, debian wheezy 64 bit.<br><div><br>1) shutdown of the switch (and power back)<br>- all node still in the cluster<br>
- no recovery<br>- nothing in sheep.log<br>- corosync still alive on bot nodes<br></div><div><br>2) remove cable from eth0 of node id 1 (the nic used by corosync):<br>- 'dog node list' on node id 0 was still showing node id 1<br>
- no recovery has started<br>- nothing reported in sheep.log<br>- once the cable was back, node id 1 was still in te cluster and nothing on its sheep.log<br>- after some minutes I noticed it made a check<br> (this is a previous recovery)<br>
 Nov 13 15:50:30 INFO [main] recover_object_main(841) object 53941900001268 is recovered (17609/17610)
 Nov 13 15:50:30 INFO [main] recover_object_main(841) object 539419000005e1 is recovered (17610/17610)
 (this is the new one)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 5394190000089e is recovered (1/17610)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 539419000003d5 is recovered (2/17610)
 Nov 13 16:07:00 INFO [main] recover_object_main(841) object 687c40000000bf is recovered (3/17610)
...many others...


3) Remove both cables from node id 1 and insert them back after ~10 seconds.

root@test004:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776

root@test005:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776
  1   192.168.2.45:7000       128   755148992

Sheep and corosync are alive on both nodes, but they are not really aware of each other.
Nothing shows up in either sheep.log.
Both nodes show the right 'vdi list'.

I tried to check a small vdi:

root@test004:~# dog vdi check boot_iso
ABORT: Not enough active nodes for consistency-check

(That's expected: test004 can't check the other copies of an object if it sees itself alone in the cluster.)

root@test005:~# dog vdi check boot_iso
100.0 %
finish check&repair boot_iso

I was expecting the check to fail the same way, but it completed successfully.

It's like node id 1 is aware of node id 0 but not vice versa.

This is the same (or almost the same) situation I got with a production cluster.
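
To see the asymmetry from a single host, each sheep daemon can be queried directly. This is only a quick sketch; it assumes dog's usual -a/-p options for addressing a specific daemon and the addresses/port shown above:

 #!/bin/bash
 # Ask each sheep daemon for its own view of the cluster membership.
 for node in 192.168.2.44 192.168.2.45; do
     echo "== membership as seen by ${node} =="
     dog node list -a "${node}" -p 7000
 done

In the state above, the first listing should show node 0 alone and the second should show both nodes, which is exactly the one-way awareness described.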

I ran 'cluster shutdown' on node id 1, but only its sheep daemon died.
I then ran 'cluster shutdown' on node id 0 and its daemon died as well.
(I was expecting to be forced to use kill -9.)

Now I started sheep on both nodes to restart the cluster, but they are in a sort of split brain (each aware only of itself).

root@test004:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776

root@test004:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:07:00      9 [192.168.2.44:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]


root@test005:~# dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.45:7000       128   755148992

root@test005:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:24:09      9 [192.168.2.45:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]
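
Both nodes agree up to epoch 8 and then each one created its own epoch 9, which is where the histories diverge. With the daemons stopped, the same history can be checked from the epoch files on disk. A minimal sketch, assuming sheep was started with the default /var/lib/sheepdog store directory (whose epoch/ subdirectory keeps one file per epoch; adjust the path if sheep was started with a different directory):

 # List the recorded epochs on this node.
 ls -l /var/lib/sheepdog/epoch/

 # Checksum the epoch files on both nodes: the files up to epoch 8 should
 # match, while each node's own epoch 9 should differ, confirming that the
 # two sides formed separate memberships for that epoch.
 for h in test004 test005; do
     echo "== ${h} =="
     ssh "${h}" 'md5sum /var/lib/sheepdog/epoch/*'
 done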

I stopped sheep and corosync on both nodes and went looking through /var/log/syslog.

I see many of:

Nov 13 16:07:01 test004 /USR/SBIN/CRON[5369]: (root) CMD (/root/script/monitor_ram.sh >> /var/log/monitor_ram.log 2>&1)
Nov 13 16:07:04 test004 corosync[2959]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 13 16:07:04 test004 corosync[2959]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.44) ; members(old:1 left:0)
Nov 13 16:07:04 test004 corosync[2959]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 13 16:07:08 test004 corosync[2959]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 13 16:07:08 test004 corosync[2959]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.44) ; members(old:1 left:0)

And that may be correct, because nodes left the cluster.

On node id 1 there are some messages I hadn't seen before:

Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
Nov 13 16:24:10 test005 sheep: enqueue: log area overrun, dropping message
...

And I also see many of:

Nov 13 15:53:28 test005 corosync[2634]: [TOTEM ] Retransmit List: 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36 23 24 25 26 27 28 29 2a 2b 2c
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 23 24 25 26 27 28 29 2a 2b 2c 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36 23 24 25 26 27 28 29 2a 2b 2c
Nov 13 15:53:29 test005 corosync[2634]: [TOTEM ] Retransmit List: 23 24 25 26 27 28 29 2a 2b 2c 19 1a 1b 1c 1d 1e 1f 20 21 22 2d 2e 2f 30 31 32 33 34 35 36

and I think that is expected: I had removed both of its cables.

After restarting both corosync and sheep I was able to restart the cluster.

dog node list
 Id   Host:Port           V-Nodes   Zone
  0   192.168.2.44:7000       128   738371776
  1   192.168.2.45:7000       128   755148992

And a recovery started on node id 1.

root@test004:~# dog node recovery
Nodes In Recovery:
 Id   Host:Port           V-Nodes   Zone        Progress
  1   192.168.2.45:7000       128   755148992       4.1%

I was not expecting to see a recovery start.
In this case node id 1 is recovering, so node id 0 is "the good one".
I wonder if this simply depends on the order in which I started the sheep daemons.
If I had started sheep first on test005 and then on test004, would test004 have been recovered instead?


Finally:

root@test005:~# dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Tue Nov 5 15:35:17 2013
Epoch Time           Version
2013-11-13 16:41:28     10 [192.168.2.44:7000, 192.168.2.45:7000]
2013-11-13 16:24:09      9 [192.168.2.45:7000]
2013-11-13 15:40:18      8 [192.168.2.44:7000, 192.168.2.45:7000]
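
For reference, corosync membership in this setup runs over eth0 only (eth1 carries no ring). A minimal single-ring totem section of the kind assumed above would look roughly like this; it is a sketch, not the exact file in use, with bindnetaddr matching the 192.168.2.0 network of the addresses shown earlier and an arbitrary multicast address:

 totem {
     version: 2
     # Single ring bound to the eth0 subnet: if this link goes down,
     # corosync (and with it sheep's view of the membership) loses the
     # other node even though eth1 is still up.
     interface {
         ringnumber: 0
         bindnetaddr: 192.168.2.0
         mcastaddr: 239.255.1.1
         mcastport: 5405
     }
 }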