[sheepdog] [PATCH V2 00/11] INTRODUCE

Yunkai Zhang yunkai.me at gmail.com
Tue Aug 21 05:03:16 CEST 2012


On Tue, Aug 21, 2012 at 10:46 AM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 08/21/2012 02:29 AM, MORITA Kazutaka wrote:
>> I think always delaying recovery for a few seconds is useful for many
>> users.  Under heavy network load, sheep can wrongly detect node
>> failure and node membership can change frequently.  Delaying recovery
>> for a short time makes Sheepdog tolerant of such situations.
>
> I think your example is very vague. What kind of driver do you use? Sheep
> itself doesn't sense membership; it relies on cluster drivers to maintain
> membership. Could you detail exactly how this happens in a real case?
>
> If you are talking about the network partition problem, I don't think
> delayed recovery will help solve it. We hit network partitions when we
> used the corosync driver; with the zookeeper driver we haven't hit one
> yet. (I guess we won't with zookeeper acting as central membership
> control.)
>
> Suppose we have 6 nodes in a cluster, A,B,C,D,E,F, one copy, with epoch =
> 1. At time t1 the network gets partitioned and three partitions show
> up: c1(A,B,C), c2(D,E), c3(F). The epochs of these three partitions are
> then respectively epoch(c1=4, c2=5, c3=6), and all 3 partitions proceed to
> recover and apply updates to their local objects.
>
> In your example above, suppose these 3 partitions automatically merge
> back into one. This means that after merging we have
> 1) epoch(c1=7, c2=9, c3=11)
> 2) no code to handle the differing object versions, where every node
> thinks its own local version is correct.
>
> So I think we have to handle the epoch mismatch and object multi-version
> problems before evaluating delayed recovery for network partitions.
>
> If you are not talking about the network partition problem, I think the
> only case left is stopping/restarting a node for manual maintenance,
> where I think manual recovery could really be helpful.

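The epoch numbers in the quoted example can be reproduced with a small
sketch. It assumes that every node join or leave seen by a partition bumps
that partition's epoch by one; the function and variable names are
hypothetical and are not Sheepdog APIs.

```python
# Assumed model: each node leave or join observed by a partition
# increments that partition's epoch by one.
ALL_NODES = set("ABCDEF")
INITIAL_EPOCH = 1


def epoch_after_split(members, all_nodes=ALL_NODES, epoch=INITIAL_EPOCH):
    """Epoch a partition reaches after seeing every other node leave."""
    return epoch + len(all_nodes - set(members))


def epoch_after_merge(members, all_nodes=ALL_NODES, epoch=INITIAL_EPOCH):
    """Epoch the same partition reaches after every lost node rejoins."""
    lost = len(all_nodes - set(members))
    return epoch_after_split(members, all_nodes, epoch) + lost


partitions = {"c1": "ABC", "c2": "DE", "c3": "F"}
split = {name: epoch_after_split(m) for name, m in partitions.items()}
merged = {name: epoch_after_merge(m) for name, m in partitions.items()}

print(split)   # {'c1': 4, 'c2': 5, 'c3': 6}
print(merged)  # {'c1': 7, 'c2': 9, 'c3': 11}
```

Under this model the three partitions never agree on an epoch again after
the merge, which is the mismatch described above.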

Delayed recovery cannot solve the network partition problem, and as you
mentioned above, if a new sheep version breaks the internal protocol,
delayed recovery cannot help with upgrading sheep either.

But if the upgrade doesn't break the internal protocol, for example when
we just fix a memory leak, add some useful logging, or fix a corner case,
delayed recovery is very useful for us.
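
As a rough illustration of the upgrade case, here is a minimal sketch that
assumes recovery starts only when a membership change is still in effect
after `delay` seconds. The function name, event format and timings are
hypothetical and are not taken from the patch set.

```python
def count_recoveries(events, delay):
    """Count recoveries triggered by a list of (time, members) events.

    A recovery fires only if a membership change is still in effect
    `delay` seconds later and differs from the membership that the
    cluster last recovered to.
    """
    events = sorted(events, key=lambda e: e[0])
    recovered = frozenset(events[0][1])  # membership at the start
    recoveries = 0
    for i, (t, members) in enumerate(events[1:], start=1):
        members = frozenset(members)
        next_t = events[i + 1][0] if i + 1 < len(events) else float("inf")
        stable = next_t - t >= delay     # no newer change within the delay
        if stable and members != recovered:
            recoveries += 1
            recovered = members
    return recoveries


# Node C is stopped at t=10 for an upgrade and rejoins at t=13.
upgrade = [(0, "ABC"), (10, "AB"), (13, "ABC")]

print(count_recoveries(upgrade, delay=0))  # 2: one on leave, one on rejoin
print(count_recoveries(upgrade, delay=5))  # 0: C is back before the delay expires
```

With no delay the short restart costs two recoveries; with a delay longer
than the restart it costs none.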

>
> Thanks,
> Yuan



-- 
Yunkai Zhang
Work at Taobao


