[Sheepdog] [PATCH] sheep: handle CPG_EVENT_REQUEST even if CPG_EVENT_CONCHG exists

Yibin Shen zituan at taobao.com
Thu Sep 15 05:03:46 CEST 2011

You can reproduce this hang-up with the following steps:
1) Generate a heavy load of CPG_EVENT_DELIVER events, for example with vdi lookup/add/del operations.
2) Then stop corosync on some of the nodes, one after another.

On Thu, Sep 15, 2011 at 10:30 AM, MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> wrote:
At Wed, 14 Sep 2011 11:24:14 +0800,
zituan at taobao.com wrote:
> From: Yibin Shen <zituan at taobao.com>
> This patch prevents a CPG_EVENT_CONCHG event from blocking VM I/Os.
> In more detail, if a CPG_EVENT_CONCHG event occurs between a paired
> CPG_EVENT_DELIVER and CPG_EVENT_REQUEST event (for example, a vdi
> lookup operation followed by a meta object read operation), the
> whole cluster hangs forever because the meta object operation is
> blocked. This patch delays handling of the CPG_EVENT_CONCHG event.
> Signed-off-by: Yibin Shen <zituan at taobao.com>
> ---
>  sheep/group.c |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> diff --git a/sheep/group.c b/sheep/group.c
> index eb0c4e2..b9dd9d7 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -1487,10 +1487,8 @@ do_retry:
>       list_for_each_entry_safe(cevent, n, &sys->cpg_event_siblings, cpg_event_list) {
>               struct request *req = container_of(cevent, struct request, cev);
> -             if (cevent->ctype == CPG_EVENT_DELIVER)
> +             if (cevent->ctype == CPG_EVENT_DELIVER || cevent->ctype == CPG_EVENT_CONCHG)
>                       continue;
> -             if (cevent->ctype == CPG_EVENT_CONCHG)
> -                     break;

The intention of this code is to flush all outstanding I/Os before
processing CPG_EVENT_CONCHG.  CPG_EVENT_CONCHG causes an epoch update,
and we want to avoid it while processing I/O requests to ensure
strong data consistency.

The pending CPG_EVENT_CONCHG will be resumed after all outstanding I/Os
are finished, so I think this code isn't a problem.  If the event
isn't resumed properly, there should be a bug in another area.  Are
there steps to reproduce the hang-up?
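
To make the difference between the two behaviours concrete, here is a
minimal, self-contained sketch.  It is not the code in sheep/group.c:
the enum, the flat array standing in for sys->cpg_event_siblings, and
the function count_dispatched() are all hypothetical simplifications.
It only models which queued requests one pass of the loop would reach
when a CPG_EVENT_CONCHG stops the scan (the current code) versus when
it is skipped (the proposed patch).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified event types named after those in the patch. */
enum ev_type { EV_DELIVER, EV_REQUEST, EV_CONCHG };

/*
 * Count how many queued requests a single pass over the queue reaches.
 *
 * stop_at_conchg != 0 models the current code: the scan breaks at the
 * first CONCHG, so requests queued behind it wait for the epoch update.
 * stop_at_conchg == 0 models the proposed patch: CONCHG is skipped like
 * DELIVER, so later requests are still reached in this pass.
 */
size_t count_dispatched(const enum ev_type *q, size_t n, int stop_at_conchg)
{
	size_t dispatched = 0;

	assert(q != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (q[i] == EV_DELIVER)
			continue;	/* not an I/O request, skip it */
		if (q[i] == EV_CONCHG) {
			if (stop_at_conchg)
				break;	/* hold back everything behind it */
			continue;	/* patched behaviour: skip it */
		}
		dispatched++;		/* an EV_REQUEST that gets handled */
	}
	return dispatched;
}
```

With a queue of REQUEST, CONCHG, REQUEST, DELIVER, REQUEST, the first
mode reaches one request and the second reaches three.  The sketch says
nothing about when the held-back CONCHG is resumed, which is the part
in question here.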

Anyway, start_cpg_event_work() should be refactored to be more
readable, I think.

