[Sheepdog] [PATCH v2] sheep: fix a network partition issue

Mon Oct 31 11:29:45 CET 2011

On Mon, Oct 31, 2011 at 6:15 PM, Liu Yuan <namei.unix at gmail.com> wrote:
>
> On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:
>
> > At Tue, 25 Oct 2011 15:06:05 +0800,
> > Liu Yuan wrote:
> >>
> >> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
> >>
> >>> From: Yibin Shen <zituan at taobao.com>
> >>>
> >>> In some situations, sheep may be disconnected from corosync momentarily
> >>> while both sheep and corosync keep running and neither of them exits.
> >>> The disconnected sheep may then receive a confchg message from corosync
> >>> notifying it that this sheep itself has left.
> >>> That leads to a network partition; this patch fixes it.
> >>>
> >>> Signed-off-by: Yibin Shen <zituan at taobao.com>
> >>> ---
> >>>  sheep/group.c |    3 +++
> >>>  1 files changed, 3 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/sheep/group.c b/sheep/group.c
> >>> index e22dabc..ab5a9f0 100644
> >>> --- a/sheep/group.c
> >>> +++ b/sheep/group.c
> >>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
> >>>     struct work_leave *w = NULL;
> >>>     int i, size;
> >>>
> >>> +   if (node_cmp(left, &sys->this_node) == 0)
> >>> +           panic("BUG: this node can't be on the left list\n");
> >>> +
> >>
> >>
> >> Hmm, the panic output looks confusing. How about "Network Partition Bug:
> >> I should have exited.\n"? The output will be seen by administrators,
> >> not only programmers.
> >
> > Applied after modifying output text, thanks!
> >
> > Kazutaka
>
>
> Kazutaka,
>        Maybe we should not panic when it becomes a single-node cluster.
> The node will change into the HALT state, which doesn't do any harm to its data.
>
>        This is introduced by a corosync bug which Yunkai fixed in the latest
> corosync. While working on the fix, Yunkai found that corosync will likely
> re-join the old configuration soon after a short break from the other
> corosync nodes.
They are two different bugs, and this one is somewhat more complicated than
the bug YunKai fixed
(http://lists.corosync.org/pipermail/discuss/2011-October/000150.html).

Anyway, until we know exactly what occurred inside corosync,
I think letting sheep die in such a situation is safe.
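
To make the check concrete, here is a minimal, self-contained sketch of the
test the patch adds. It is not the sheepdog code: struct node,
node_cmp_sketch() and self_in_left_list() are hypothetical stand-ins for
struct sheepdog_node_list_entry, node_cmp() and the new test in
sd_leave_handler(), and the caller would decide whether to panic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a sheepdog node entry (the real struct differs). */
struct node {
	uint8_t addr[16];
	uint16_t port;
};

/* Returns 0 when both entries name the same node, memcmp-style otherwise. */
static int node_cmp_sketch(const struct node *a, const struct node *b)
{
	int c = memcmp(a->addr, b->addr, sizeof(a->addr));

	if (c != 0)
		return c;
	return (a->port > b->port) - (a->port < b->port);
}

/* Returns 1 if `self` appears among the nodes reported as having left. */
static int self_in_left_list(const struct node *left, int nr_left,
			     const struct node *self)
{
	int i;

	assert(nr_left >= 0);
	for (i = 0; i < nr_left; i++)
		if (node_cmp_sketch(&left[i], self) == 0)
			return 1;
	return 0;
}
```

A leave handler would call self_in_left_list() with its own node entry and
exit (or panic, as in the patch) when it returns 1.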

>
> For example, (a,b,c) is a corosync ring.
> For some time n(c) breaks out and becomes a single-node ring by itself. Later,
> n(c) rejoins:
>
> (a,b,c) -> (a,b), (c) -> (a,b,c)
>         |             |
>     confchg1      confchg2
>
> So the question is: do we have to panic n(c) at confchg1 in this
> case? Since n(c) does no harm to the data, I think, IIUC, that after confchg2
> n(c) will see the view as (a,b,c). No?
>
> Thanks,
> Yuan
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
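
The (a,b,c) -> (a,b), (c) -> (a,b,c) sequence quoted above can be followed
with a toy model. This is only a sketch under assumptions: membership is a
bitmask, apply_confchg() and self_reported_left() are hypothetical helpers,
and nothing here reflects how corosync or sheep track views internally.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per node in the example ring. */
enum { NODE_A = 1u << 0, NODE_B = 1u << 1, NODE_C = 1u << 2 };

/* Toy confchg: drop the members that left, then add the ones that joined. */
static uint32_t apply_confchg(uint32_t view, uint32_t left, uint32_t joined)
{
	/* In this toy model a node cannot leave and join in the same event. */
	assert((left & joined) == 0);
	return (view & ~left) | joined;
}

/* The condition the patch panics on: this node is itself reported as left. */
static int self_reported_left(uint32_t self, uint32_t left)
{
	return (left & self) != 0;
}
```

In this model, if n(c) sees a and b leave at confchg1, its view shrinks to
(c) and self_reported_left() is false; applying confchg2 with a and b joining
brings the view back to (a,b,c). The check only fires when n(c) is told that
n(c) itself has left, which is the case the patch targets.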
