[Sheepdog] [PATCH V2 2/2] sheep: teach sheepdog to better recover the cluster
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Sat Sep 24 07:45:08 CEST 2011
At Sat, 24 Sep 2011 12:20:28 +0800,
Liu Yuan wrote:
>
> On 09/23/2011 07:49 PM, MORITA Kazutaka wrote:
> > At Thu, 22 Sep 2011 15:05:27 +0800,
> > Liu Yuan wrote:
> >> On 09/22/2011 02:34 PM, Liu Yuan wrote:
> >>> On 09/22/2011 02:01 PM, MORITA Kazutaka wrote:
> >>>> At Wed, 21 Sep 2011 14:59:26 +0800,
> >>>> Liu Yuan wrote:
> >>>>> Kazutaka,
> >>>>> I guess this patch addresses the inconsistency problem you mentioned.
> >>>>> The other comments are addressed too.
> >>>> Thanks, this solves the inconsistency problem in a nice way! I've
> >>>> applied 3 patches in the v3 patchset.
> >>>>
> >>> Umm, actually, this only resolves a special case, as you mentioned
> >>> (the first node we start up must be the first one to have gone down,
> >>> because its epoch stores the full node information).
> >>>
> >>> Currently, we cannot recover the cluster *correctly* if we start up
> >>> nodes other than the first node that went down, and in my opinion,
> >>> we cannot even handle this situation in software. Sheepdog itself
> >>> cannot determine which node has the epoch with the full node
> >>> information; however, from the outside, the admin can find it by
> >>> hand. So, I'm afraid, sheepdog will have to rely on outside
> >>> knowledge to handle some recovery cases.
> >>>
> >> For example, below we get an inconsistent epoch history, even though
> >> the cluster comes up. As you mentioned, an inconsistent epoch history
> >> will result in data loss.
> >>
> >> root at taobao:/home/dev/sheepdog# for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >> root at taobao:/home/dev/sheepdog# collie/collie cluster format
> >> root at taobao:/home/dev/sheepdog# for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> >> root at taobao:/home/dev/sheepdog# for i in 1 0 0 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >> root at taobao:/home/dev/sheepdog# for i in 0 1 2; do ./collie/collie cluster info -p 700$i; done
> >> Cluster status: running
> >>
> >> Creation time Epoch Nodes
> >> 2011-09-22 15:03:22 4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> >> 2011-09-22 15:03:22 3 [192.168.0.1:7000, 192.168.0.1:7001]
> >> 2011-09-22 15:03:22 2 [192.168.0.1:7001]
> >> 2011-09-22 15:03:22 1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> >> Cluster status: running
> >>
> >> Creation time Epoch Nodes
> >> 2011-09-22 15:03:22 4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> >> 2011-09-22 15:03:22 3 [192.168.0.1:7000, 192.168.0.1:7001]
> >> 2011-09-22 15:03:22 2 [192.168.0.1:7001, 192.168.0.1:7002]
> >> 2011-09-22 15:03:22 1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> >> Cluster status: running
> >>
> >> Creation time Epoch Nodes
> >> 2011-09-22 15:03:22 4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> >> 2011-09-22 15:03:22 3 [192.168.0.1:7000, 192.168.0.1:7001]
> >> 2011-09-22 15:03:22 2 [192.168.0.1:7001, 192.168.0.1:7002]
> >> 2011-09-22 15:03:22 1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
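The divergence in the output above is mechanical to detect: node 7000 records epoch 2 as [192.168.0.1:7001], while the other two nodes record [192.168.0.1:7001, 192.168.0.1:7002]. As an illustrative sketch (not sheepdog code; the histories are transcribed from the output above and keyed by port for brevity), a checker could compare each node's epoch log entry by entry:

```python
# Illustrative sketch: compare per-node epoch histories and report
# any epoch whose membership list differs between nodes.
# Histories transcribed from the `cluster info` output above.
histories = {
    "7000": {1: ["7000", "7001", "7002"], 2: ["7001"],
             3: ["7000", "7001"], 4: ["7000", "7001", "7002"]},
    "7001": {1: ["7000", "7001", "7002"], 2: ["7001", "7002"],
             3: ["7000", "7001"], 4: ["7000", "7001", "7002"]},
    "7002": {1: ["7000", "7001", "7002"], 2: ["7001", "7002"],
             3: ["7000", "7001"], 4: ["7000", "7001", "7002"]},
}

def divergent_epochs(histories):
    """Return the epoch numbers on which the nodes disagree."""
    epochs = set()
    for h in histories.values():
        epochs |= set(h)
    bad = []
    for e in sorted(epochs):
        # Collect each node's view of epoch e; more than one distinct
        # membership list means the history is inconsistent at e.
        views = {tuple(h[e]) for h in histories.values() if e in h}
        if len(views) > 1:
            bad.append(e)
    return bad

print(divergent_epochs(histories))  # -> [2]
```

Any epoch that yields more than one distinct membership view is inconsistent; here that is epoch 2, which is exactly where the data-loss risk comes from.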
> > Hi Yuan,
> >
> > How about the below patch? I guess this would solve all the problem
> > we've discussed.
> Hi Kazutaka,
> Your patch fixes the problem, but I think it is a bit too complex.
> I came up with a much simpler patch, which just adds two checks in
> add_node_to_leave_list(). I also extended the leave-node idea to
> crashed-cluster recovery; it seems the leave-node concept copes with
> a crashed cluster as well. What do you think of it?
>
> I have sent the patch set in a new thread.
Thanks, I like your simpler approach. But how do we deal with the case
where the master node's epoch doesn't contain the node which has the
latest epoch? I think this is the most complicated situation.
For example:
for i in 0 1; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
./collie/collie cluster format
for i in 2 3 4; do
pkill -f "sheep /store/$((i - 2))"
./sheep/sheep /store/$i -z $i -p 700$i
sleep 1
done
for i in 3 4; do pkill -f "sheep /store/$i"; sleep 1; done
for i in 0 1 2 3 4; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
for i in 1 2 3 4; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
for i in 0 1 2 3 4; do ./collie/collie cluster info -p 700$i; done
My patch handles this, but yours doesn't. Is it possible to handle
this with a simple change? Or perhaps we don't need to consider this
case?
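The invariant at stake in this scenario can be sketched in a few lines: the node holding the highest local epoch is the only one whose log describes the true latest membership, so recovery must not start until that node has joined, even when the master's own latest epoch never lists it. This is a hypothetical model of the check, not sheepdog's actual logic; can_start_recovery and the epoch map are made-up names:

```python
# Hypothetical sketch of the condition under discussion: before
# leaving the "waiting for join" state, make sure the node holding
# the highest known epoch has actually joined, even if the master's
# latest epoch does not list it.  These names do not exist in sheepdog.

def can_start_recovery(joined, latest_epoch_of):
    """joined: set of node ids currently in the cluster.
    latest_epoch_of: node id -> highest epoch number that node stored."""
    newest = max(latest_epoch_of, key=latest_epoch_of.get)
    # The cluster may only recover once the holder of the newest
    # epoch is present; otherwise its membership view would be lost.
    return newest in joined

# Nodes 0..4; say node 2 held the highest epoch when the cluster died.
epochs = {0: 1, 1: 3, 2: 5, 3: 4, 4: 4}
print(can_start_recovery({0, 1}, epochs))     # False: node 2 missing
print(can_start_recovery({0, 1, 2}, epochs))  # True: node 2 joined
```

In the reproduction above, the master that forms first has no record of the node with the newest epoch, which is why a check based only on the master's own epoch entries cannot cover this case.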
Thanks,
Kazutaka