<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div>Hi all, I have a production cluster with<br> Sheepdog daemon version 0.7.0_131_g88f0024<br></div> Corosync 1.4.6<br></div><div> Qemu 1.6.0.<br><br>
</div>Today it crashed, probably because of a bridge malfunction (see error below).<br></div>All sheep daemons were dead.<br></div>I killed all qemu processes, then shut down and restarted all 4 servers (and removed the bridge that may have caused the issue).<br>
<br></div>Now node id 0 is recovering.<br><br>dog node list<br> Id Host:Port V-Nodes Zone<br> 0 <a href="http://192.168.6.41:7000">192.168.6.41:7000</a> 126 688302272<br> 1 <a href="http://192.168.6.42:7000">192.168.6.42:7000</a> 124 705079488<br>
2 <a href="http://192.168.6.43:7000">192.168.6.43:7000</a> 147 721856704<br> 3 <a href="http://192.168.6.44:7000">192.168.6.44:7000</a> 115 738633920<br><br>Nodes In Recovery:<br> Id Host:Port V-Nodes Zone Progress<br>
0 <a href="http://192.168.6.41:7000">192.168.6.41:7000</a> 126 688302272 7.8%<br><br></div>I tried to start the 2 guests that I need to be up, but in both cases they stop during boot, complaining about file system inconsistency (press ctrl + d to reboot or give root password for maintenance).<br>
<br></div>One guest's vdi is called 'backup' and is 10G.<br></div>I killed the guest and ran vdi check on it, but at the next boot it behaved the same.<br></div>The other vdi is 100G, so it would take a long time to check.<br>
<br></div>Could it be because node id 0 is still recovering?<br></div><br></div>Any other ideas?<br></div>