<div>This hang-up can be reproduced with the following steps:</div><div>1) Generate intensive CPG_EVENT_DELIVER event traffic, for example with vdi lookup/add/del operations.</div><div>   You can run several sheepdog-backed VMs simultaneously.</div>

<div>2) Then stop corosync on some of the nodes, one after another.</div><br><div class="gmail_quote">On Thu, Sep 15, 2011 at 10:30 AM, MORITA Kazutaka <span dir="ltr"><<a href="mailto:morita.kazutaka@lab.ntt.co.jp">morita.kazutaka@lab.ntt.co.jp</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">At Wed, 14 Sep 2011 11:24:14 +0800,<br>

<div><div></div><div class="h5"><a href="mailto:zituan@taobao.com">zituan@taobao.com</a> wrote:<br>

><br>

> From: Yibin Shen <<a href="mailto:zituan@taobao.com">zituan@taobao.com</a>><br>

><br>

> This patch prevents a CPG_EVENT_CONCHG event from blocking VM I/Os.<br>

><br>

> In more detail: if a CPG_EVENT_CONCHG event occurs between a<br>

> CPG_EVENT_DELIVER and CPG_EVENT_REQUEST event pair (for example,<br>

> a vdi lookup operation followed by a meta object read operation),<br>

> the whole cluster hangs forever because the meta object operation<br>

> is blocked. This patch delays handling of the CPG_EVENT_CONCHG event.<br>

><br>

> Signed-off-by: Yibin Shen <<a href="mailto:zituan@taobao.com">zituan@taobao.com</a>><br>

> ---<br>

>  sheep/group.c |    4 +---<br>

>  1 files changed, 1 insertions(+), 3 deletions(-)<br>

><br>

> diff --git a/sheep/group.c b/sheep/group.c<br>

> index eb0c4e2..b9dd9d7 100644<br>

> --- a/sheep/group.c<br>

> +++ b/sheep/group.c<br>

> @@ -1487,10 +1487,8 @@ do_retry:<br>

>       list_for_each_entry_safe(cevent, n, &sys->cpg_event_siblings, cpg_event_list) {<br>

>               struct request *req = container_of(cevent, struct request, cev);<br>

><br>

> -             if (cevent->ctype == CPG_EVENT_DELIVER)<br>

> +             if (cevent->ctype == CPG_EVENT_DELIVER || cevent->ctype == CPG_EVENT_CONCHG)<br>

>                       continue;<br>

> -             if (cevent->ctype == CPG_EVENT_CONCHG)<br>

> -                     break;<br>

<br>

</div></div>The intention of this code is to flush all outstanding I/Os before<br>

processing CPG_EVENT_CONCHG.  CPG_EVENT_CONCHG causes an epoch update,<br>

and we want to avoid that while processing I/O requests to ensure<br>

strong data consistency.<br>

<br>

The pending CPG_EVENT_CONCHG will be resumed after all outstanding I/Os<br>

are finished, so I think this code isn't a problem.  If the event<br>

isn't resumed properly, there is probably a bug in another area.  Are<br>

there steps to reproduce the hang-up?<br>

<br>

Anyway, start_cpg_event_work() should be refactored to be more<br>

readable, I think.<br>

<br>

<br>

Thanks,<br>

<br>

Kazutaka<br>

<font color="#888888">--<br>

sheepdog mailing list<br>

<a href="mailto:sheepdog@lists.wpkg.org">sheepdog@lists.wpkg.org</a><br>

<a href="http://lists.wpkg.org/mailman/listinfo/sheepdog" target="_blank">http://lists.wpkg.org/mailman/listinfo/sheepdog</a><br>

</font></blockquote></div><br>
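<div>For reference, here is a minimal sketch of the scan behaviour discussed above. It assumes a simplified array-based queue instead of the real list in sheep/group.c; the names ev_type, EV_* and count_dispatched are hypothetical and only illustrate the difference between stopping at a CPG_EVENT_CONCHG (original loop) and skipping over it (patched loop).</div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the event types named in this thread. */
enum ev_type { EV_DELIVER, EV_REQUEST, EV_CONCHG };

/*
 * Count the I/O requests that one scan of the pending queue would dispatch.
 * stop_at_conchg != 0 models the original loop: break at a membership change,
 * so requests queued behind it wait until outstanding I/O is flushed.
 * stop_at_conchg == 0 models the patched loop: skip the membership change
 * and keep scanning.
 */
size_t count_dispatched(const enum ev_type *queue, size_t len, int stop_at_conchg)
{
    size_t dispatched = 0;

    assert(queue != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (queue[i] == EV_DELIVER)
            continue;           /* deliver events are not I/O requests */
        if (queue[i] == EV_CONCHG) {
            if (stop_at_conchg)
                break;          /* original: nothing behind it is scanned */
            continue;           /* patched: keep looking for requests */
        }
        dispatched++;           /* an I/O request that would be started */
    }
    return dispatched;
}
```

<div>With a queue of DELIVER, REQUEST, CONCHG, REQUEST, the original behaviour dispatches only the first request, while the patched behaviour dispatches both. This model says nothing about whether the skipped CPG_EVENT_CONCHG is resumed correctly afterwards, which is the open question in this thread.</div>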