[sheepdog-users] vdi stalled with 2 out of 3 node online

richter at ecos.de richter at ecos.de
Thu Mar 27 08:37:04 CET 2014


> > last Friday one of my three nodes has left the cluster (I don't have
> > any idea why). A "dog node recovery" shows "Waiting for other nodes to
> > join cluster".
> 
> I guess you are using corosync, which is said to be easily running into
> network partition problem.
> 

Yes, I am using corosync. Corosync may well have had a problem, but at the
time the VM stalled, corosync was in sync and sheep still wasn't able to
rejoin the cluster. I had to restart sheep (and only sheep) to make it rejoin.
I would expect sheep to rejoin the cluster as soon as corosync is in sync
again, and as far as I can tell this was the case in the past (that was with
0.7.x).


> >
> > If I have a cluster with three nodes and the number of copies is set to
> > three, why does a VM stop working if two nodes are up and running?
> > Moreover, it is strange that it stopped after three days.
> >
> 
> If you don't add '-t' for 'dog cluster format', sheepdog cluster will run
> without any problem even if your nodes number < redundancy number.
> 

I did not use -t during format, so it should work, but it didn't.
To be precise, it worked for 3 days without problems, but then it suddenly
stopped (there was heavy I/O in the VM at that time). Maybe it's again a
corner case in the local cache, like the ones I reported before for other
situations, but this time even restarting the VM didn't change anything. It
worked again once the third node rejoined the cluster.
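
To spell out what I expected, here is a minimal sketch of the rule as I
understand it from the reply quoted above. This is not sheepdog code; the
function name and its parameters are made up purely for illustration, and
"strict" stands for a cluster formatted with -t.

```python
def cluster_accepts_io(alive_nodes: int, copies: int, strict: bool) -> bool:
    """Hypothetical model of the rule described in the quoted reply.

    strict=True  -> cluster was formatted with -t: I/O requires at least
                    as many live nodes as the configured number of copies.
    strict=False -> cluster was formatted without -t: I/O should continue
                    as long as at least one node is alive.
    """
    if alive_nodes < 1:
        return False
    if strict:
        return alive_nodes >= copies
    return True


# My situation: 3 copies, 2 of 3 nodes alive, formatted without -t.
print(cluster_accepts_io(alive_nodes=2, copies=3, strict=False))  # prints True
```

By this reading the VM should have kept running with two nodes, which is
not what I observed.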

Regards

Gerald




