[sheepdog-users] vdi stalled with 2 out of 3 node online

Thu Mar 27 08:48:27 CET 2014

On Thu, Mar 27, 2014 at 08:37:04AM +0100, richter at ecos.de wrote:
> > > last Friday one of my three nodes has left the cluster (I don't have
> > > any idea why). A "dog node recovery" shows "Waiting for other nodes to
> > > join cluster".
> > 
> > I guess you are using corosync, which is said to run easily into network
> > partition problems.
> > 
> 
> Yes, I am using corosync. It might be the case that corosync had a problem,
> but at the time the VM stalled, corosync was in sync and sheep wasn't able
> to rejoin the cluster. I had to restart sheep (and only sheep) to rejoin the
> cluster. I would expect that sheep rejoins the cluster as soon as corosync
> is in sync again and as far as I can tell this was the case in the past
> (that was with 0.7.x).
> 

As far as I can tell, the corosync driver in 0.7.x and 0.8.x is practically the
same. Sometimes corosync manages to re-join the failed node automatically, but
sheepdog itself doesn't have code to re-join a failed node with the corosync
driver. This means that the drop-rejoin process is transparent to sheepdog.

We only have code to automatically rejoin a dropped-out node to the cluster
for zookeeper-monitored clusters.

> 
> > >
> > > If I have a cluster with three nodes and the number of copies is set to
> > > three, why does a VM stop working if two nodes are up and running?
> > > Moreover, it is strange that it stopped after three days.
> > >
> > 
> > If you don't add '-t' for 'dog cluster format', the sheepdog cluster will run
> > without any problem even if the number of nodes < the redundancy number.
> > 
> 
> I did not use -t during format. So it should work, but it didn't work.
> To be precise, it worked for 3 days without problems, but then it suddenly
> stopped (there was heavy I/O in the VM during this time). Maybe it's again a
> corner case in the local cache, like I reported it before for other
> situations, but this time even restarting the VM didn't change anything. It
> worked again when the third node rejoined the cluster.
> 

Ummm, I have no idea why this happens.
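
For reference, the '-t' rule quoted above can be sketched roughly as follows.
This is only an illustration in Python under the assumption that the rule is
exactly as stated in this thread; it is not sheepdog's actual code, and the
function name and parameters are made up for the example.

```python
def cluster_accepts_io(alive_nodes: int, copies: int, strict: bool) -> bool:
    """Hypothetical model of the '-t' rule from this thread.

    alive_nodes: nodes currently in the cluster
    copies:      redundancy number chosen at 'dog cluster format'
    strict:      True if the cluster was formatted with '-t'
    """
    if alive_nodes <= 0:
        # No node left, nothing can serve I/O.
        return False
    if strict:
        # With '-t', fewer nodes than copies is treated as a problem.
        return alive_nodes >= copies
    # Without '-t', the cluster is expected to keep running degraded.
    return True


# The situation from this thread: 2 of 3 nodes online, 3 copies.
print(cluster_accepts_io(alive_nodes=2, copies=3, strict=False))
print(cluster_accepts_io(alive_nodes=2, copies=3, strict=True))
```

By this rule a cluster formatted without '-t' should have kept serving I/O
with 2 of 3 nodes, so the stall reported here is not explained by it.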

Thanks
Yuan