[Sheepdog] [PATCH v2] sheep: tame sheep to recover the

Tue Sep 27 06:52:51 CEST 2011

At Tue, 27 Sep 2011 11:43:27 +0800,
Yibin Shen wrote:
> If the latest epoch is unrecoverable, or is a transient epoch,
> should it fall back to the last available epoch?

Yes.  There is no other way to recover the cluster.
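
The fallback rule can be sketched roughly as below. This is only a toy
model in Python, not Sheepdog code: EpochInfo, its "recoverable" and
"transient" flags, and pick_recovery_epoch are hypothetical names made
up for the illustration, and what exactly counts as a transient epoch
is left as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class EpochInfo:
    # Hypothetical record; the fields are assumptions for this sketch.
    number: int
    recoverable: bool  # assumed: the epoch's data can still be read
    transient: bool    # assumed: the epoch is flagged as transient


def pick_recovery_epoch(epochs: Iterable[EpochInfo]) -> Optional[int]:
    """Return the newest epoch that is recoverable and not transient.

    Returns None when no epoch qualifies, i.e. nothing can be recovered.
    """
    usable = [e.number for e in epochs if e.recoverable and not e.transient]
    return max(usable) if usable else None
```

With epochs 1 (usable), 2 (transient) and 3 (unrecoverable), this model
falls back to epoch 1.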

Note that "collie cluster recover" would be a dangerous operation.
For example,

 $ sheep /store/0 -p 7000
 $ sheep /store/1 -p 7001
 $ collie cluster format
 $ pkill -f "sheep /store/0"
 $ collie vdi create test 4G    # vdi will be created only on the second node
 $ collie cluster shutdown

 $ sheep /store/0 -p 7000
 $ collie cluster recover       # start Sheepdog with only the first node

then Sheepdog starts working again, but the vdi "test" will be
discarded, because the first node never stored it.
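
The same scenario as a toy model in Python, to make the loss explicit.
Nothing here is Sheepdog code: the node names, the epoch numbers (1
after the format, 2 after the first node was killed) and the recover()
helper are assumptions for the illustration.

```python
# Toy model of the scenario above; not Sheepdog code.
# Each node remembers the latest epoch it saw and the vdis it stores.
nodes = {
    "store0": {"epoch": 1, "vdis": set()},     # killed before "test" was created
    "store1": {"epoch": 2, "vdis": {"test"}},  # saw the membership change, holds "test"
}


def recover(started, all_nodes):
    """Model a forced recovery that uses only the nodes in `started`.

    Returns (epoch_used, surviving_vdis, lost_vdis).
    """
    # The recovery can only see epochs known to the nodes that were started.
    epoch_used = max(all_nodes[n]["epoch"] for n in started)
    surviving = set().union(*(all_nodes[n]["vdis"] for n in started))
    everything = set().union(*(node["vdis"] for node in all_nodes.values()))
    return epoch_used, surviving, everything - surviving
```

In this model, recover(["store0"], nodes) uses epoch 1 and reports
"test" as lost, while starting both nodes uses epoch 2 and loses
nothing.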

In the future, I would like a force option for "cluster format" and
"cluster recover".

Thanks,

Kazutaka

> 
> Yibin Shen
> 
> On Tue, Sep 27, 2011 at 11:13 AM, MORITA Kazutaka <
> morita.kazutaka at lab.ntt.co.jp> wrote:
> 
> > At Tue, 27 Sep 2011 09:45:49 +0800,
> > Liu Yuan wrote:
> > >
> > > On 09/27/2011 06:09 AM, MORITA Kazutaka wrote:
> > > > At Mon, 26 Sep 2011 11:43:34 -0700 (PDT),
> > > > Ski Mountain wrote:
> > > > >> What happens if one of the nodes in the cluster is not recoverable
> > > > >> at all, e.g. a fried motherboard?  Can you just start up the VMs that
> > > > >> were on the dead machine on another machine in the cluster?
> > > > If the unrecoverable node doesn't have the latest epoch info, we need
> > > > to do nothing special.  If you start the sheep daemon on all other
> > > > machines, then the cluster will work again.
> > > >
> > > > But if the failed node has the latest epoch, this is the case where
> > > > we need a manual recovery, because there is a risk of data loss in
> > > > this case, though I think this rarely happens.
> > > >
> > > >
> > >
> > > Hi Kazutaka,
> > >      I do have an idea like 'collie cluster recover' floating around in
> > > my head. This kind of brute-force manual recovery would be the last
> > > resort to handle the physical failure of the highest-epoch node in a
> > > crashed cluster, or physical node failures in a shut-down cluster.
> >
> > Good point.
> >
> > >
> > >      The implementation might be rather easy. I am thinking of adding a
> > > new SD_MSG_RECOVERY event and broadcasting it to recover the
> > > cluster with the epoch incremented by 1. What do you think of it?
> >
> > How about adding a new operation SD_OP_CLUSTER_RECOVERY and
> > broadcasting it with SD_MSG_VDI_OP?  I think it should work like a
> > "collie cluster format" command.
> >
> >
> > Thanks,
> >
> > Kazutaka
> > --
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog
> >