On 05/28/2012 11:54 PM, Yunkai Zhang wrote:
>> I got a bug report where the nr_sd_nodes == nr_zk_nodes assert in
>> build_node_list is triggered by a larger number of sheep joining at the same
>> time.
> We should not start sheep at the same time. Have you read the commit log of
> this patch: 8567aae281c75502c0a267bf76b771a2af8392f2?

Does Christoph's patch remove this constraint? We really should remove it;
it is kind of a bug.

Also, ZooKeeper has a hideous defect: it needs a very long window (currently
30 seconds) to detect failed nodes. This would be catastrophic if I/O is
routed to failed nodes that sheep still thinks are alive. The fix I can think
of is to add an active notification to the cluster whenever a sheep gets a
connection refused error, in addition to the current passive membership
detection.

Thanks,
Yuan