[sheepdog] [PATCH] zookeeper: fix cluster hang by giving priority to process LEAVE event
Liu Yuan
namei.unix at gmail.com
Thu Jul 19 04:45:15 CEST 2012
On 07/16/2012 01:07 PM, Yunkai Zhang wrote:
> From: Yunkai Zhang <qiushu.zyk at taobao.com>
>
> V2:
> - fix zk_queue_pop() when it's called by zk_unblock():
> continue to process block event when is_zk_unblock is True
> -------------------------------------------------------- >8
>
> As cluster request may retry infinitely when some sheeps left, than
> cluster_op_done could not to be called forever, so it will cause cluster
> hang problem.
>
> By giving priority to process LEAVE event when there is unfinished BLOCK
> event, we can fix this issue, but also comply with the rule which is very
> important for distributed system I think:
>
> All sheeps should process all events in the same order.
>
This actually relax the order though not as aggressive as the patch in
corosync does.
> Signed-off-by: Yunkai Zhang <qiushu.zyk at taobao.com>
> ---
> sheep/cluster/zookeeper.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/sheep/cluster/zookeeper.c b/sheep/cluster/zookeeper.c
> index 7bd20bd..e03fd22 100644
> --- a/sheep/cluster/zookeeper.c
> +++ b/sheep/cluster/zookeeper.c
> @@ -71,6 +71,7 @@ static struct zk_event zk_levents[SD_MAX_NODES];
> static int nr_zk_levents;
> static unsigned zk_levent_head;
> static unsigned zk_levent_tail;
> +static bool is_zk_unblock;
>
> static void *zk_node_btroot;
> static struct zk_node *zk_master;
> @@ -239,9 +240,11 @@ static int zk_queue_pop(zhandle_t *zh, struct zk_event *ev)
> struct zk_event *lev;
> eventfd_t value = 1;
>
> - /* process leave event */
> - if (uatomic_read(&zk_notify_blocked) <= 0 &&
> - uatomic_read(&nr_zk_levents)) {
> + /*
> + * Continue to process LEAVE event even if
> + * we have an unfinished BLOCK event.
> + */
> + if (!is_zk_unblock && uatomic_read(&nr_zk_levents)) {
> nr_levents = uatomic_sub_return(&nr_zk_levents, 1) + 1;
> dprintf("nr_zk_levents:%d, head:%u\n", nr_levents, zk_levent_head);
>
> @@ -282,6 +285,9 @@ static int zk_queue_pop(zhandle_t *zh, struct zk_event *ev)
> return 0;
> }
>
> + if (!is_zk_unblock && uatomic_read(&zk_notify_blocked) > 0)
> + return -1;
> +
both is_zk_unblock and zk_notify_blocked are blocking related, this need
at least a documentation what each variables means, if you can't manage
to fold these two into one. Also, for the variable itself, better to
rename is_zk_unblock as zk_blocked or similar. Having one var
zk_notify_blocked and the other is_zk_unblock will cause a lot of confusion.
Thanks,
Yuan
More information about the sheepdog
mailing list