[sheepdog] [PATCH] zookeeper: handle node joining race

Liu Yuan namei.unix at gmail.com
Mon May 28 18:04:59 CEST 2012


On 05/28/2012 11:54 PM, Yunkai Zhang wrote:

>> I got a bug report where the nr_sd_nodes == nr_zk_nodes assert in
>> build_node_list is triggered by a large number of sheep joining at the
>> same time.
> We should not start sheep at the same time. Did you read the commit log of
> this patch: 8567aae281c75502c0a267bf76b771a2af8392f2 ?


Does Christoph's patch remove this constraint? We really should remove
it; it is effectively a bug.

Also, zookeeper has a serious defect: it needs a very long window
(currently 30 seconds) to detect failed nodes. This would be
catastrophic if I/O is routed to failed nodes that sheep still believe
to be alive. The fix I can think of is to add active notification to
the cluster whenever a sheep gets a connection refused error, in
addition to the current passive membership detection.
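
To make the idea concrete, here is a rough sketch of the active side only.
It is not taken from any patch: struct suspect_list, report_io_error() and
the node name strings are hypothetical names made up for illustration, and
treating ECONNREFUSED as the only trigger is an assumption. The caller
would broadcast a notification when report_io_error() returns true; the
zookeeper session timeout stays in place as the passive fallback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUSPECTS  64	/* arbitrary limit for this sketch */
#define NODE_NAME_LEN 64

/* Hypothetical list of peers already reported as failed. */
struct suspect_list {
	char nodes[MAX_SUSPECTS][NODE_NAME_LEN];
	size_t nr;
};

/* Assumption: a refused connection means the peer sheep is gone. */
static bool is_peer_dead_errno(int err)
{
	return err == ECONNREFUSED;
}

static bool suspect_list_has(const struct suspect_list *sl, const char *node)
{
	for (size_t i = 0; i < sl->nr; i++)
		if (strcmp(sl->nodes[i], node) == 0)
			return true;
	return false;
}

/*
 * Called after an I/O request to 'node' failed with errno value 'err'.
 * Returns true if the caller should notify the cluster about 'node',
 * false if the error is not conclusive, the node was already reported,
 * or the list is full (passive detection still covers that case).
 */
static bool report_io_error(struct suspect_list *sl, const char *node, int err)
{
	assert(sl != NULL && node != NULL);

	if (!is_peer_dead_errno(err))
		return false;
	if (suspect_list_has(sl, node))
		return false;
	if (sl->nr == MAX_SUSPECTS)
		return false;

	/* Copy with explicit termination; overlong names are truncated. */
	strncpy(sl->nodes[sl->nr], node, NODE_NAME_LEN - 1);
	sl->nodes[sl->nr][NODE_NAME_LEN - 1] = '\0';
	sl->nr++;
	return true;
}
```

Remembering which nodes were already reported keeps a burst of failed
requests to the same dead node from flooding the cluster with duplicate
notifications.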

Thanks,
Yuan


