[sheepdog] [PATCH] zookeeper: fix cluster hang by giving priority to process LEAVE event

Liu Yuan namei.unix at gmail.com
Thu Jul 19 04:45:15 CEST 2012


On 07/16/2012 01:07 PM, Yunkai Zhang wrote:
> From: Yunkai Zhang <qiushu.zyk at taobao.com>
> 
> V2:
> - fix zk_queue_pop() when it's called by zk_unblock():
>   continue to process block event when is_zk_unblock is True
> -------------------------------------------------------- >8
> 
> As cluster request may retry infinitely when some sheeps left, than
> cluster_op_done could not to be called forever, so it will cause cluster
> hang problem.
> 
> By giving priority to process LEAVE event when there is unfinished BLOCK
> event, we can fix this issue, but also comply with the rule which is very
> important for distributed system I think:
> 
> All sheeps should process all events in the same order.
> 

This actually relax the order though not as aggressive as the patch in
corosync does.


> Signed-off-by: Yunkai Zhang <qiushu.zyk at taobao.com>
> ---
>  sheep/cluster/zookeeper.c |   17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/sheep/cluster/zookeeper.c b/sheep/cluster/zookeeper.c
> index 7bd20bd..e03fd22 100644
> --- a/sheep/cluster/zookeeper.c
> +++ b/sheep/cluster/zookeeper.c
> @@ -71,6 +71,7 @@ static struct zk_event zk_levents[SD_MAX_NODES];
>  static int nr_zk_levents;
>  static unsigned zk_levent_head;
>  static unsigned zk_levent_tail;
> +static bool is_zk_unblock;
>  
>  static void *zk_node_btroot;
>  static struct zk_node *zk_master;
> @@ -239,9 +240,11 @@ static int zk_queue_pop(zhandle_t *zh, struct zk_event *ev)
>  	struct zk_event *lev;
>  	eventfd_t value = 1;
>  
> -	/* process leave event */
> -	if (uatomic_read(&zk_notify_blocked) <= 0 &&
> -	     uatomic_read(&nr_zk_levents)) {
> +	/*
> +	 * Continue to process LEAVE event even if
> +	 * we have an unfinished BLOCK event.
> +	 */
> +	if (!is_zk_unblock && uatomic_read(&nr_zk_levents)) {
>  		nr_levents = uatomic_sub_return(&nr_zk_levents, 1) + 1;
>  		dprintf("nr_zk_levents:%d, head:%u\n", nr_levents, zk_levent_head);
>  
> @@ -282,6 +285,9 @@ static int zk_queue_pop(zhandle_t *zh, struct zk_event *ev)
>  		return 0;
>  	}
>  
> +	if (!is_zk_unblock && uatomic_read(&zk_notify_blocked) > 0)
> +		return -1;
> +

both is_zk_unblock and zk_notify_blocked are blocking related, this need
at least a documentation what each variables means, if you can't manage
to fold these two into one. Also, for the variable itself, better to
rename is_zk_unblock as zk_blocked or similar. Having one var
zk_notify_blocked and the other is_zk_unblock will cause a lot of confusion.

Thanks,
Yuan




More information about the sheepdog mailing list