[Sheepdog] [PATCH] sheep: fix a network partition issue

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Mon Oct 24 13:46:58 CEST 2011


At Mon, 24 Oct 2011 17:59:07 +0800,
Yibin Shen wrote:
> 
> On Mon, Oct 24, 2011 at 5:43 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> > On 10/24/2011 05:29 PM, zituan at taobao.com wrote:
> >
> >> From: Yibin Shen <zituan at taobao.com>
> >>
> >> In some situations, the node that left can itself receive the confchg
> >> event (possibly caused by corosync), which leads to a network partition.
> >> This patch fixes it.
> >>
> >
> >
> > > So what kind of situation is this, and why can the left node still
> > > process requests when it has already 'left'?
> Actually, none of them have left; both corosync and sheep kept running,
> and I think sheep was temporarily disconnected from corosync.
> On node '10.232.134.8', we found this log:
> Oct 19 17:35:49 sd_leave_handler(1852) leave ip: 10.232.134.8, pid: 25813
> Oct 19 17:35:49 sd_leave_handler(1854) [0] ip: 10.232.134.1, pid: 2882
> Oct 19 17:35:49 sd_leave_handler(1854) [1] ip: 10.232.134.2, pid: 27161
> Oct 19 17:35:49 sd_leave_handler(1854) [2] ip: 10.232.134.3, pid: 25154
> Oct 19 17:35:49 sd_leave_handler(1854) [3] ip: 10.232.134.4, pid: 24918
> Oct 19 17:35:49 sd_leave_handler(1854) [4] ip: 10.232.134.5, pid: 7697
> Oct 19 17:35:49 sd_leave_handler(1854) [5] ip: 10.232.134.6, pid: 7533
> Oct 19 17:35:49 sd_leave_handler(1854) [6] ip: 10.232.134.7, pid: 25044
> Oct 19 17:35:49 sd_leave_handler(1854) [7] ip: 10.232.134.9, pid: 25224
> Oct 19 17:35:49 sd_leave_handler(1854) [8] ip: 10.232.134.10, pid: 24326
> Oct 19 17:35:49 sd_leave_handler(1854) [9] ip: 10.232.134.11, pid: 24071
> Oct 19 17:35:49 sd_leave_handler(1854) [a] ip: 10.232.134.12, pid: 24390
> Oct 19 17:35:49 sd_leave_handler(1854) [b] ip: 10.232.134.13, pid: 25168
> Oct 19 17:35:49 sd_leave_handler(1854) [c] ip: 10.232.134.14, pid: 20454
> Oct 19 17:35:49 sd_leave_handler(1854) [d] ip: 10.232.134.15, pid: 19971
> Oct 19 17:35:49 sd_leave_handler(1854) [e] ip: 10.232.134.17, pid: 20015
> Oct 19 17:35:49 sd_leave_handler(1854) [f] ip: 10.232.134.19, pid: 20610
> Oct 19 17:35:49 sd_leave_handler(1867) allow new confchg, 0xff8ba0
> Oct 19 17:35:49 start_cpg_event_work(1655) 0 1
> Oct 19 17:35:49 cpg_event_fn(1463) 0xff8ba0, 1 2
> Oct 19 17:35:49 check_majority(1307) majority nodes are alive
> Oct 19 17:35:49 cpg_event_done(1501) 0xff8ba0
> 
> CC Yunkai Zhang
> He can give us more detailed information about what happens inside corosync.
> >
> >> Signed-off-by: Yibin Shen <zituan at taobao.com>
> >> ---
> >>  sheep/group.c |    4 ++++
> >>  1 files changed, 4 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/sheep/group.c b/sheep/group.c
> >> index e22dabc..155247d 100644
> >> --- a/sheep/group.c
> >> +++ b/sheep/group.c
> >> @@ -1467,6 +1467,10 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
> >>       struct work_leave *w = NULL;
> >>       int i, size;
> >>
> >> +     if (!memcmp(left, &sys->this_node, sizeof(struct sheepdog_node_list_entry))) {
> >> +             eprintf("BUG: this node can't be on the left list\n");
> >> +             abort();
> >> +     }

Use panic() for unrecoverable errors.
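
For illustration only, here is a minimal, self-contained sketch of the check with both review comments applied: compare nodes with a dedicated comparison function instead of memcmp() over the whole struct, and treat a match as an unrecoverable error. Everything named here (struct node_entry, node_entry_cmp(), fatal(), check_not_self()) is hypothetical and only stands in for sheepdog's struct sheepdog_node_list_entry, node_cmp() and panic(), whose real definitions are not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node entry (not sheepdog's real layout). */
struct node_entry {
	uint8_t addr[16];    /* address bytes identifying the node */
	uint16_t port;       /* listening port, also part of the identity */
	uint16_t nr_vnodes;  /* assumed NOT to be part of the identity */
};

/* Stand-in for node_cmp(): compares only the identity fields. */
int node_entry_cmp(const struct node_entry *a, const struct node_entry *b)
{
	int cmp = memcmp(a->addr, b->addr, sizeof(a->addr));

	if (cmp != 0)
		return cmp;
	if (a->port != b->port)
		return a->port < b->port ? -1 : 1;
	return 0;
}

/* Stand-in for panic(): report the error and terminate the process. */
void fatal(const char *msg)
{
	fprintf(stderr, "PANIC: %s\n", msg);
	abort();
}

/* The check from the patch, rewritten on top of the helpers above. */
void check_not_self(const struct node_entry *left,
		    const struct node_entry *self)
{
	assert(left != NULL && self != NULL);

	if (node_entry_cmp(left, self) == 0)
		fatal("this node can't be on the left list");
}
```

In this sketch, two entries with the same address and port compare equal even when nr_vnodes differs, whereas memcmp() over the whole struct would report a difference. That is one possible reason to prefer a field-wise comparison; the thread itself does not state the rationale.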

Thanks,

Kazutaka

> >
> >
> > Use node_cmp() to check node.
> OK, I will send a V2 patch later
> >
> > Thanks,
> > Yuan
> > --
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog
> >
> 


