[Sheepdog] [PATCH v2] sheep: fix a network partition issue

Liu Yuan namei.unix at gmail.com
Mon Oct 31 11:15:06 CET 2011


On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:

> At Tue, 25 Oct 2011 15:06:05 +0800,
> Liu Yuan wrote:
>>
>> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
>>
>>> From: Yibin Shen <zituan at taobao.com>
>>>
>>> In some situations, sheep may be momentarily disconnected from corosync
>>> while both sheep and corosync keep running and neither of them exits.
>>> The disconnected sheep may then receive a confchg message from corosync
>>> notifying it that this sheep has left. That leads to a network
>>> partition; this patch fixes it.
>>>
>>> Signed-off-by: Yibin Shen <zituan at taobao.com>
>>> ---
>>>  sheep/group.c |    3 +++
>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/sheep/group.c b/sheep/group.c
>>> index e22dabc..ab5a9f0 100644
>>> --- a/sheep/group.c
>>> +++ b/sheep/group.c
>>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
>>>  	struct work_leave *w = NULL;
>>>  	int i, size;
>>>  
>>> +	if (node_cmp(left, &sys->this_node) == 0)
>>> +		panic("BUG: this node can't be on the left list\n");
>>> +
>>
>>
>> Hmm, the panic output looks confusing. How about "Network Partition Bug:
>> I should have exited.\n"? The output will be seen by
>> administrators, not only programmers.
> 
> Applied after modifying output text, thanks!
> 
> Kazutaka


Kazutaka,
	Maybe we should not panic when the node becomes a single-node cluster.
The node will switch to the HALT state, which does no harm to its data.

	The problem is caused by a corosync bug that Yunkai has fixed in the latest
corosync. While working on the fix, Yunkai found that a corosync node is likely
to rejoin the old configuration soon after a short break from the other
corosync nodes.

For example, (a,b,c) is a corosync ring. For some time n(c) breaks away and
forms a single-node ring by itself. Later, n(c) rejoins:

(a,b,c) -> (a,b), (c) -> (a,b,c)
         |             |
     confchg1      confchg2

So the question is: do we have to panic n(c) at confchg1 in this case?
n(c) does no harm to the data, and IIUC, after confchg2 n(c) will see the
view as (a,b,c) again. No?
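
To make the question concrete, here is a rough sketch of the behaviour I have
in mind. It is not sheep code: struct cluster_view, NODE_HALT, on_leave() and
on_join() are all made-up names for illustration, nodes are single letters,
and "halt when we find ourselves on the left list" is only my assumption
about what we might want instead of panic().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical stand-ins, not the real sheep structures. */
enum node_status { NODE_OK, NODE_HALT };

struct cluster_view {
	char members[MAX_NODES];	/* one letter per node, e.g. 'a' */
	size_t nr;			/* number of valid entries in members */
	enum node_status status;
};

static int view_contains(const struct cluster_view *v, char id)
{
	size_t i;

	for (i = 0; i < v->nr; i++)
		if (v->members[i] == id)
			return 1;
	return 0;
}

/* confchg1 as seen by node 'self': node 'left' is reported as gone. */
static void on_leave(struct cluster_view *v, char self, char left)
{
	size_t i;

	if (left == self) {
		/*
		 * The applied patch panics here. Assumption in this sketch:
		 * shrink the view to ourselves and halt instead.
		 */
		v->members[0] = self;
		v->nr = 1;
		v->status = NODE_HALT;
		return;
	}

	if (!view_contains(v, left))
		return;		/* unknown node, nothing to do */

	for (i = 0; i < v->nr; i++) {
		if (v->members[i] == left) {
			/* order does not matter here, swap in the last one */
			v->members[i] = v->members[v->nr - 1];
			v->nr--;
			break;
		}
	}
}

/* confchg2: the ring re-forms and the full member list is delivered. */
static void on_join(struct cluster_view *v, const char *members, size_t nr)
{
	assert(nr <= MAX_NODES);
	memcpy(v->members, members, nr);
	v->nr = nr;
	v->status = NODE_OK;
}
```

With this, n(c) calling on_leave(&v, 'c', 'c') at confchg1 ends up halted with
the view (c), and on_join(&v, "abc", 3) at confchg2 brings it back to
(a,b,c). Whether a halted node can really resume that simply in sheep is
exactly what I am asking.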

Thanks,
Yuan


