[Sheepdog] [PATCH v2] sheep: fix a network partition issue

Mon Oct 31 11:36:23 CET 2011

On 10/31/2011 06:29 PM, Yibin Shen wrote:

> On Mon, Oct 31, 2011 at 6:15 PM, Liu Yuan <namei.unix at gmail.com> wrote:
>>
>> On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:
>>
>>> At Tue, 25 Oct 2011 15:06:05 +0800,
>>> Liu Yuan wrote:
>>>>
>>>> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
>>>>
>>>>> From: Yibin Shen <zituan at taobao.com>
>>>>>
>>>>> In some situations, sheep may be momentarily disconnected from corosync
>>>>> while both sheep and corosync keep running and neither of them exits.
>>>>> The disconnected sheep may then receive a confchg message from corosync
>>>>> notifying it that this very sheep has left.
>>>>> That leads to a network partition; this patch fixes it.
>>>>>
>>>>> Signed-off-by: Yibin Shen <zituan at taobao.com>
>>>>> ---
>>>>>  sheep/group.c |    3 +++
>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/sheep/group.c b/sheep/group.c
>>>>> index e22dabc..ab5a9f0 100644
>>>>> --- a/sheep/group.c
>>>>> +++ b/sheep/group.c
>>>>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
>>>>>     struct work_leave *w = NULL;
>>>>>     int i, size;
>>>>>
>>>>> +   if (node_cmp(left, &sys->this_node) == 0)
>>>>> +           panic("BUG: this node can't be on the left list\n");
>>>>> +
>>>>
>>>>
>>>> Hmm, the panic output looks confusing. How about "Network Partition Bug:
>>>> I should have exited.\n"? The output will be seen by administrators,
>>>> not only by programmers.
>>>
>>> Applied after modifying output text, thanks!
>>>
>>> Kazutaka
>>
>>
>> Kazutaka,
>>        Maybe we should not panic when it becomes a single-node cluster.
>> The node will change into the HALT state, which doesn't do any harm to its data.
>>
>>        This is introduced by a corosync bug which is fixed by Yunkai in the latest
>> corosync. While working on the fix, Yunkai found that corosync will likely
>> re-join the old configuration soon after a short break from the other
>> corosync nodes.
> They are two different bugs,
> and this one is somewhat more complicated than the bug YunKai fixed
> (http://lists.corosync.org/pipermail/discuss/2011-October/000150.html).
> 
> Anyway, until we know exactly what occurred inside corosync,
> I think letting sheep die in such a situation is safe.

Yes, panicking right now is sufficient. If the corosync side gets sorted
out later, we should consider not panicking.
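
To make the two options discussed in this thread concrete, here is a minimal
sketch of the decision the leave handler has to make. It is not sheepdog
code: struct node_entry, node_entry_cmp(), classify_leave() and the
prefer_halt flag are hypothetical, simplified stand-ins for
sheepdog_node_list_entry, node_cmp() and the check added to
sd_leave_handler(). The LEAVE_HALT branch only illustrates the "go to HALT
instead of panicking" idea; nothing like it exists in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sheepdog's node list entry. */
struct node_entry {
	uint8_t addr[16];
	uint16_t port;
};

/* What the leave handler should do with a "node left" notification. */
enum leave_action {
	LEAVE_PROCESS,	/* some other node left: handle it as usual */
	LEAVE_PANIC,	/* we are on the left list: die (what the patch does) */
	LEAVE_HALT,	/* we are on the left list: stop serving, keep data */
};

/* Returns 0 when both entries describe the same node (assumed to mirror
 * how node_cmp() is used in the patch). */
static int node_entry_cmp(const struct node_entry *a,
			  const struct node_entry *b)
{
	int ret = memcmp(a->addr, b->addr, sizeof(a->addr));

	if (ret)
		return ret;
	return (int)a->port - (int)b->port;
}

/* prefer_halt is an assumption used only to show the alternative policy. */
static enum leave_action classify_leave(const struct node_entry *left,
					const struct node_entry *self,
					int prefer_halt)
{
	if (node_entry_cmp(left, self) != 0)
		return LEAVE_PROCESS;

	/* The cluster thinks this node is gone although it is still running. */
	return prefer_halt ? LEAVE_HALT : LEAVE_PANIC;
}
```

With prefer_halt set to 0 this behaves like the applied patch: a node that
finds itself on the left list is told to die. Setting it to 1 models the
possible later change once the corosync behaviour is understood.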

Thanks,
Yuan