[Sheepdog] [PATCH v2] sheep: fix a network partition issue

Mon Oct 31 13:36:25 CET 2011

On 10/31/2011 07:49 PM, MORITA Kazutaka wrote:

> At Mon, 31 Oct 2011 19:37:51 +0800,
> Liu Yuan wrote:
>>
>> On 10/31/2011 07:10 PM, MORITA Kazutaka wrote:
>>
>>> At Mon, 31 Oct 2011 18:15:06 +0800,
>>> Liu Yuan wrote:
>>>>
>>>> On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:
>>>>
>>>>> At Tue, 25 Oct 2011 15:06:05 +0800,
>>>>> Liu Yuan wrote:
>>>>>>
>>>>>> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
>>>>>>
>>>>>>> From: Yibin Shen <zituan at taobao.com>
>>>>>>>
>>>>>>> In some situations, sheep may be momentarily disconnected from corosync
>>>>>>> while both sheep and corosync keep running and neither of them exits.
>>>>>>> The disconnected sheep may then receive a confchg message from corosync
>>>>>>> notifying it that this sheep itself has left. That leads to a network
>>>>>>> partition; this patch fixes it.
>>>>>>>
>>>>>>> Signed-off-by: Yibin Shen <zituan at taobao.com>
>>>>>>> ---
>>>>>>>  sheep/group.c |    3 +++
>>>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>> diff --git a/sheep/group.c b/sheep/group.c
>>>>>>> index e22dabc..ab5a9f0 100644
>>>>>>> --- a/sheep/group.c
>>>>>>> +++ b/sheep/group.c
>>>>>>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
>>>>>>>  	struct work_leave *w = NULL;
>>>>>>>  	int i, size;
>>>>>>>  
>>>>>>> +	if (node_cmp(left, &sys->this_node) == 0)
>>>>>>> +		panic("BUG: this node can't be on the left list\n");
>>>>>>> +
>>>>>>
>>>>>>
>>>>>> Hmm, the panic output looks confusing. How about "Network Partition Bug:
>>>>>> I should have exited.\n"? The output will be seen by administrators,
>>>>>> not only by programmers.
>>>>>
>>>>> Applied after modifying output text, thanks!
>>>>>
>>>>> Kazutaka
>>>>
>>>>
>>>> Kazutaka,
>>>> 	Maybe we should not panic when it becomes a single-node cluster.
>>>> The node will change into the HALT state, which doesn't do any harm to its data.
>>>
>>> That is much better.  Currently, Sheepdog kills a minority cluster in
>>> __sd_leave() when a network partition occurs because it is the simplest
>>> way to keep data consistent.
>>>
>>> But this looks like a different issue from this patch.  Does corosync
>>> include the local node in the left list when a network partition occurs?
>>> If so, we should handle it in the corosync cluster driver because it
>>> looks like a corosync-specific issue to me.
>>>
>>
>> I am not sure, but if corosync includes the local node in the left list,
>> that should be a bug in corosync.
>>
>> Let's assume three nodes (a,b,c).
>> I suspect that the left message is meant for n(b,c), but after n(a)
>> rejoins, for whatever reason, the message is still being broadcast, and
>> n(a) gets it by mistake.
> 
> IIUC, n(a) should receive the left message of n(b,c).

Yes, but n(a) should not receive the leave_message(a) which is intended
for n(b,c).

So the correct sequence should be (lm = leave message, jm = join message):

1. A network partition happens.
2. lm(a) -> n(b,c), lm(b,c) -> n(a).
3. Then n(a) rejoins.
4. jm(a) -> n(a,b,c)
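
To make the expected sequence concrete, here is a minimal sketch in C of the
check it implies: a node should only see other nodes on a left list, never
itself. The types and names (struct node_id, check_left_list, the verdict
enum) are hypothetical and are not the Sheepdog or corosync API; what to do
on LEAVE_SELF_LISTED (panic, halt, or filter in the cluster driver) is
exactly the open question in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node identity; NOT struct sheepdog_node_list_entry. */
struct node_id {
	char name[16];
};

/* Hypothetical result of inspecting a left list. */
enum leave_verdict {
	LEAVE_OK,          /* only remote nodes left: handle normally */
	LEAVE_SELF_LISTED  /* local node is on the left list: unexpected */
};

/* Compare two node identities by name. */
static bool same_node(const struct node_id *a, const struct node_id *b)
{
	return strncmp(a->name, b->name, sizeof(a->name)) == 0;
}

/*
 * Scan the left list delivered to 'self'.  In the sequence above,
 * n(a) should only ever see lm(b,c), so finding 'self' in the list
 * means the message was meant for the other side of the partition.
 */
enum leave_verdict check_left_list(const struct node_id *self,
				   const struct node_id *left,
				   size_t nr_left)
{
	size_t i;

	for (i = 0; i < nr_left; i++) {
		if (same_node(self, &left[i]))
			return LEAVE_SELF_LISTED;
	}
	return LEAVE_OK;
}
```

With nodes (a,b,c), n(a) receiving lm(b,c) yields LEAVE_OK, while n(a)
receiving lm(a) yields LEAVE_SELF_LISTED, which is the case the patch
panics on.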

Anyway, I am not sure, because I didn't look at the log.

Thanks,
Yuan