On 09/23/2011 07:49 PM, MORITA Kazutaka wrote:
> At Thu, 22 Sep 2011 15:05:27 +0800,
> Liu Yuan wrote:
>> On 09/22/2011 02:34 PM, Liu Yuan wrote:
>>> On 09/22/2011 02:01 PM, MORITA Kazutaka wrote:
>>>> At Wed, 21 Sep 2011 14:59:26 +0800,
>>>> Liu Yuan wrote:
>>>>> Kazutaka,
>>>>> I guess this patch addresses the inconsistency problem you
>>>>> mentioned. The other comments are addressed too.
>>>> Thanks, this solves the inconsistency problem in a nice way! I've
>>>> applied 3 patches in the v3 patchset.
>>>>
>>> Umm, actually, this only resolves the special case you mentioned
>>> (the first node we start up should also be the first one to go down,
>>> because its epoch stores the full node information).
>>>
>>> Currently, we cannot recover the cluster *correctly* if we start up
>>> a node other than the firstly-down node first, and in my opinion we
>>> cannot even handle this situation in software. Sheepdog itself
>>> cannot determine which node has the epoch with the full node
>>> information; from outside, however, the admin can find it by hand.
>>> So I am afraid sheepdog will have to rely on outside knowledge to
>>> handle some recovery cases.
>>>
>> For example, below we get an inconsistent epoch history even though
>> the cluster comes up. As you mentioned, an inconsistent epoch history
>> will result in data loss.
>>
>> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>> root@taobao:/home/dev/sheepdog# collie/collie cluster format
>> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
>> root@taobao:/home/dev/sheepdog# for i in 1 0 0 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do ./collie/collie cluster info -p 700$i; done
>> Cluster status: running
>>
>> Creation time        Epoch Nodes
>> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
>> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
>> 2011-09-22 15:03:22      2 [192.168.0.1:7001]
>> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
>> Cluster status: running
>>
>> Creation time        Epoch Nodes
>> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
>> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
>> 2011-09-22 15:03:22      2 [192.168.0.1:7001, 192.168.0.1:7002]
>> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
>> Cluster status: running
>>
>> Creation time        Epoch Nodes
>> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
>> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
>> 2011-09-22 15:03:22      2 [192.168.0.1:7001, 192.168.0.1:7002]
>> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> Hi Yuan,
>
> How about the below patch? I guess this would solve all the problems
> we've discussed.

Hi Kazutaka,

Your patch fixes the problem, but I think it is a bit too complex. I came
up with a much simpler patch, which just adds two checks in
add_node_to_leave_list() (a rough sketch of the idea is at the end of
this mail). I have also extended the leave-node idea to recovery of a
crashed cluster; it seems the leave-node concept copes with a crashed
cluster as well. What do you think of it? I have sent the patch set in a
new thread.

Thanks,
Yuan
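P.S. For illustration only: the standalone sketch below shows one
possible shape for such checks (drop a leave event that carries a stale
epoch, and drop a node that is already on the leave list). The types,
variables and constants here are invented placeholders and do not match
the real sheepdog sources; please refer to the patch set in the new
thread for the actual code.

/*
 * Standalone sketch of the "two checks" idea in add_node_to_leave_list().
 * All names below (leave_node, leave_list, sys_epoch, ...) are made up
 * for this example.
 */
#include <stdio.h>
#include <string.h>

#define MAX_LEAVE_NODES 64

struct leave_node {
	char addr[32];      /* "ip:port" of the node that left */
	unsigned int epoch; /* epoch the node claims it left in */
};

static struct leave_node leave_list[MAX_LEAVE_NODES];
static int nr_leave_nodes;
static unsigned int sys_epoch = 4; /* current cluster epoch */

static int add_node_to_leave_list(const struct leave_node *node)
{
	int i;

	/* check 1: ignore a leave event carrying a stale epoch; its
	 * membership information is older than what we already hold */
	if (node->epoch < sys_epoch)
		return 0;

	/* check 2: ignore duplicates, so the same node cannot be
	 * counted twice when deciding whether recovery can start */
	for (i = 0; i < nr_leave_nodes; i++)
		if (strcmp(leave_list[i].addr, node->addr) == 0)
			return 0;

	if (nr_leave_nodes >= MAX_LEAVE_NODES)
		return 0; /* list full; sketch only, real code differs */

	leave_list[nr_leave_nodes++] = *node;
	return 1;
}

int main(void)
{
	struct leave_node stale = { "192.168.0.1:7000", 2 };
	struct leave_node fresh = { "192.168.0.1:7002", 4 };

	printf("stale accepted: %d\n", add_node_to_leave_list(&stale)); /* 0 */
	printf("fresh accepted: %d\n", add_node_to_leave_list(&fresh)); /* 1 */
	printf("dup accepted:   %d\n", add_node_to_leave_list(&fresh)); /* 0 */
	return 0;
}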