[Sheepdog] [PATCH] sheep: fix a network partition issue

Yibin Shen zituan at taobao.com
Mon Oct 24 11:59:07 CEST 2011


On Mon, Oct 24, 2011 at 5:43 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 10/24/2011 05:29 PM, zituan at taobao.com wrote:
>
>> From: Yibin Shen <zituan at taobao.com>
>>
>> In some situations, the node reported as left can itself receive the
>> confchg event, which may be caused by corosync. That leads to a network
>> partition; this patch fixes it.
>>
>
>
> So what kind of situation is this, and why can the left node still process
> requests when it has already 'left'?
Actually, none of them have left. Both corosync and sheep kept running,
and I think sheep was temporarily disconnected from corosync.
On node '10.232.134.8', we found the following log:
Oct 19 17:35:49 sd_leave_handler(1852) leave ip: 10.232.134.8, pid: 25813
Oct 19 17:35:49 sd_leave_handler(1854) [0] ip: 10.232.134.1, pid: 2882
Oct 19 17:35:49 sd_leave_handler(1854) [1] ip: 10.232.134.2, pid: 27161
Oct 19 17:35:49 sd_leave_handler(1854) [2] ip: 10.232.134.3, pid: 25154
Oct 19 17:35:49 sd_leave_handler(1854) [3] ip: 10.232.134.4, pid: 24918
Oct 19 17:35:49 sd_leave_handler(1854) [4] ip: 10.232.134.5, pid: 7697
Oct 19 17:35:49 sd_leave_handler(1854) [5] ip: 10.232.134.6, pid: 7533
Oct 19 17:35:49 sd_leave_handler(1854) [6] ip: 10.232.134.7, pid: 25044
Oct 19 17:35:49 sd_leave_handler(1854) [7] ip: 10.232.134.9, pid: 25224
Oct 19 17:35:49 sd_leave_handler(1854) [8] ip: 10.232.134.10, pid: 24326
Oct 19 17:35:49 sd_leave_handler(1854) [9] ip: 10.232.134.11, pid: 24071
Oct 19 17:35:49 sd_leave_handler(1854) [a] ip: 10.232.134.12, pid: 24390
Oct 19 17:35:49 sd_leave_handler(1854) [b] ip: 10.232.134.13, pid: 25168
Oct 19 17:35:49 sd_leave_handler(1854) [c] ip: 10.232.134.14, pid: 20454
Oct 19 17:35:49 sd_leave_handler(1854) [d] ip: 10.232.134.15, pid: 19971
Oct 19 17:35:49 sd_leave_handler(1854) [e] ip: 10.232.134.17, pid: 20015
Oct 19 17:35:49 sd_leave_handler(1854) [f] ip: 10.232.134.19, pid: 20610
Oct 19 17:35:49 sd_leave_handler(1867) allow new confchg, 0xff8ba0
Oct 19 17:35:49 start_cpg_event_work(1655) 0 1
Oct 19 17:35:49 cpg_event_fn(1463) 0xff8ba0, 1 2
Oct 19 17:35:49 check_majority(1307) majority nodes are alive
Oct 19 17:35:49 cpg_event_done(1501) 0xff8ba0

CC Yunkai Zhang.
He can give us more detailed information about what happened inside corosync.
>
>> Signed-off-by: Yibin Shen <zituan at taobao.com>
>> ---
>>  sheep/group.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/sheep/group.c b/sheep/group.c
>> index e22dabc..155247d 100644
>> --- a/sheep/group.c
>> +++ b/sheep/group.c
>> @@ -1467,6 +1467,10 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
>>       struct work_leave *w = NULL;
>>       int i, size;
>>
>> +     if (!memcmp(left, &sys->this_node, sizeof(struct sheepdog_node_list_entry))) {
>> +             eprintf("BUG: this node can't be on the left list\n");
>> +             abort();
>> +     }
>
>
> Use node_cmp() to check node.
OK, I will send a V2 patch later
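
For reference, a rough sketch of what the V2 check could look like. It is
standalone, not the real sheep code: struct node_entry is a hypothetical
stand-in for struct sheepdog_node_list_entry, and node_cmp_sketch() only
assumes that node_cmp() identifies a node by address and port. Presumably
that is why it is preferred over memcmp() on the whole struct, which would
also compare fields that are not part of a node's identity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sheepdog_node_list_entry; the real
 * layout is not shown in this thread. */
struct node_entry {
	uint8_t  addr[16];
	uint16_t port;
	uint16_t nr_vnodes;	/* example of a field that is not identity */
};

/* Assumed semantics of node_cmp(): order by address, then by port. */
static int node_cmp_sketch(const struct node_entry *a,
			   const struct node_entry *b)
{
	int cmp = memcmp(a->addr, b->addr, sizeof(a->addr));

	if (cmp != 0)
		return cmp;
	if (a->port != b->port)
		return a->port < b->port ? -1 : 1;
	return 0;
}

/* Returns 1 if 'self' appears among the 'nr_left' entries of 'left'. */
static int self_in_left_list(const struct node_entry *left, size_t nr_left,
			     const struct node_entry *self)
{
	size_t i;

	for (i = 0; i < nr_left; i++)
		if (node_cmp_sketch(&left[i], self) == 0)
			return 1;
	return 0;
}
```

In sd_leave_handler() the idea would be to call something like
self_in_left_list() on the left list with this node's entry, then log the
BUG message and abort() if it returns 1.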
>
> Thanks,
> Yuan
