[sheepdog] [PATCH] zookeeper: handle node joining race

Mon May 28 19:07:46 CEST 2012

On Tue, May 29, 2012 at 12:04 AM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 05/28/2012 11:54 PM, Yunkai Zhang wrote:
>
>>> I got a bug report where the nr_sd_nodes == nr_zk_nodes assert in
>>> > build_node_list is trigger by a larger number of sheep joining at the same
>>> > time.
>> We should not start sheeps at the same time. Are you read this commit log from
>> this patch:8567aae281c75502c0a267bf76b771a2af8392f2 ?
>
>
> Does Christoph's patch remove this constraint? We really should remove
No! If we start sheep at the same time, it may cause another problem:
some sheep will get an incomplete member list, which this patch can't
fix.

> this constraint, it is kind of a bug.
I must say again, this is not a *bug*. We should know the *essential*
difference between corosync and zookeeper:

1) Corosync pushes the member list to each sheep when it joins the cluster.
2) But zookeeper-server does not do this; the joining sheep must
fetch the member list on its own.

Because of this difference, when we use the zookeeper driver, we face a
new problem:

How do we package these two steps:
  a. fetch member list from zookeeper-server,
  b. update member list in zookeeper-server (add itself to the member list)
into one transaction?

There is *no* way to fix this problem if we start sheep at the same
time without using a lock.
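To make the two steps concrete, here is a minimal sketch in Python. It is a simplified in-memory model, not the sheepdog zookeeper driver: `FakeZk`, `join_unlocked_interleaved` and `join_locked` are hypothetical names, the `threading.Lock` only stands in for a cluster-wide lock, and the model deliberately ignores the driver's event queue and watches. It only illustrates why steps a and b have to be atomic.

```python
import threading


class FakeZk:
    """Hypothetical in-memory stand-in for the member list on zookeeper-server."""

    def __init__(self):
        self.members = []
        self.lock = threading.Lock()  # stands in for a cluster-wide lock

    def fetch(self):
        # Step a: fetch a snapshot of the member list.
        return list(self.members)

    def register(self, node):
        # Step b: add the node to the member list.
        self.members.append(node)


def join_unlocked_interleaved(zk, a, b):
    """Worst-case interleaving without a lock: both fetch, then both register."""
    view_a = zk.fetch()   # step a for sheep A
    view_b = zk.fetch()   # step a for sheep B
    zk.register(a)        # step b for sheep A
    zk.register(b)        # step b for sheep B
    # Each sheep believes the cluster is its snapshot plus itself.
    return view_a + [a], view_b + [b]


def join_locked(zk, node):
    """Steps a and b run as one unit while the lock is held."""
    with zk.lock:
        view = zk.fetch()
        zk.register(node)
    return view + [node]
```

In the unlocked interleaving both sheep end up with a one-entry view although the stored list holds two members, which is the incomplete member list described above. With the lock held across both steps, the second joiner always sees the first.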

In fact, we *just* need to start _the first_ sheep separately; after
that, we can start the other sheep *concurrently*. That is to say, once
the first sheep is running, this problem does not exist!
99.9% of the time, it will not bother me.

If you want to fix this problem completely, the only way is to
hack zookeeper-server's code, but I don't think that is worth doing.

>
> Also, zookeeper has a very hideous defect that it needs a very long
> window (currently 30 seconds) to detect failed nodes. This would be
> catastrophic if the IO are routed to those failed nodes which sheep
> think of still alive. The fix I can think of is to add a active
> notification to the cluster when any sheep get a confuse fused error,
> plus current passive membership detection.

I plan to fix this problem. Maybe the best way is to turn down the
SESSION_TIMEOUT value.
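One thing to keep in mind when lowering the timeout: the ZooKeeper server does not accept arbitrary values. It clamps the requested session timeout between minSessionTimeout (default 2 * tickTime) and maxSessionTimeout (default 20 * tickTime). The small Python sketch below only shows that arithmetic; the function name is hypothetical, and the 2000 ms tickTime is an assumption taken from the sample zoo.cfg, not from our setup.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Return the session timeout the server would grant, in milliseconds.

    Hypothetical helper: it mirrors the documented clamping rule and does
    not talk to a real zookeeper-server.
    """
    # Defaults follow the server rule: min = 2 * tickTime, max = 20 * tickTime.
    low = 2 * tick_time_ms if min_ms is None else min_ms
    high = 20 * tick_time_ms if max_ms is None else max_ms
    return max(low, min(requested_ms, high))


# The current 30 second window fits inside the default bounds.
current = negotiated_session_timeout(30000)
# Asking for 1 second is raised to the 2 * tickTime floor.
too_low = negotiated_session_timeout(1000)
```

So with those assumed defaults, failure detection cannot go below 4 seconds unless tickTime or minSessionTimeout is also lowered on the server side.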

>
> Thanks,
> Yuan

-- 
Yunkai Zhang
Work at Taobao