[Sheepdog] [PATCH] sheep: fix a network partition issue

Mon Oct 24 14:16:19 CEST 2011

On 10/24/2011 07:46 PM, MORITA Kazutaka wrote:

> At Mon, 24 Oct 2011 17:59:07 +0800,
> Yibin Shen wrote:
>>
>> On Mon, Oct 24, 2011 at 5:43 PM, Liu Yuan <namei.unix at gmail.com> wrote:
>>> On 10/24/2011 05:29 PM, zituan at taobao.com wrote:
>>>
>>>> From: Yibin Shen <zituan at taobao.com>
>>>>
>>>> In some situations, a node that has been marked as left can still receive
>>>> the confchg event (possibly because of corosync), which leads to a network
>>>> partition. This patch fixes it.
>>>>
>>>
>>>
>>> So what kind of situation is it, and why can the left node still process
>>> requests when it has already 'left'?
>> Actually, none of them have left; both corosync and sheep kept running,
>> and I think sheep was temporarily disconnected from corosync.
>> On node '10.232.134.8', we found the following log:
>> Oct 19 17:35:49 sd_leave_handler(1852) leave ip: 10.232.134.8, pid: 25813
>> Oct 19 17:35:49 sd_leave_handler(1854) [0] ip: 10.232.134.1, pid: 2882
>> Oct 19 17:35:49 sd_leave_handler(1854) [1] ip: 10.232.134.2, pid: 27161
>> Oct 19 17:35:49 sd_leave_handler(1854) [2] ip: 10.232.134.3, pid: 25154
>> Oct 19 17:35:49 sd_leave_handler(1854) [3] ip: 10.232.134.4, pid: 24918
>> Oct 19 17:35:49 sd_leave_handler(1854) [4] ip: 10.232.134.5, pid: 7697
>> Oct 19 17:35:49 sd_leave_handler(1854) [5] ip: 10.232.134.6, pid: 7533
>> Oct 19 17:35:49 sd_leave_handler(1854) [6] ip: 10.232.134.7, pid: 25044
>> Oct 19 17:35:49 sd_leave_handler(1854) [7] ip: 10.232.134.9, pid: 25224
>> Oct 19 17:35:49 sd_leave_handler(1854) [8] ip: 10.232.134.10, pid: 24326
>> Oct 19 17:35:49 sd_leave_handler(1854) [9] ip: 10.232.134.11, pid: 24071
>> Oct 19 17:35:49 sd_leave_handler(1854) [a] ip: 10.232.134.12, pid: 24390
>> Oct 19 17:35:49 sd_leave_handler(1854) [b] ip: 10.232.134.13, pid: 25168
>> Oct 19 17:35:49 sd_leave_handler(1854) [c] ip: 10.232.134.14, pid: 20454
>> Oct 19 17:35:49 sd_leave_handler(1854) [d] ip: 10.232.134.15, pid: 19971
>> Oct 19 17:35:49 sd_leave_handler(1854) [e] ip: 10.232.134.17, pid: 20015
>> Oct 19 17:35:49 sd_leave_handler(1854) [f] ip: 10.232.134.19, pid: 20610
>> Oct 19 17:35:49 sd_leave_handler(1867) allow new confchg, 0xff8ba0
>> Oct 19 17:35:49 start_cpg_event_work(1655) 0 1
>> Oct 19 17:35:49 cpg_event_fn(1463) 0xff8ba0, 1 2
>> Oct 19 17:35:49 check_majority(1307) majority nodes are alive
>> Oct 19 17:35:49 cpg_event_done(1501) 0xff8ba0
>>
>> CC Yunkai Zhang
>> He can give us more detailed information about what happens inside corosync.

So can a healthy node that has accidentally left (for whatever reason)
rejoin the cluster instead of panicking?
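
As a rough illustration of the case in the log above, where node
10.232.134.8 sees its own address in the leave notification: a minimal
sketch of how a leave handler might tell "a peer left" apart from "I was
reported as left". This is not the sheepdog code or the proposed patch;
struct node_id, enum leave_action and classify_leave() are hypothetical
names, and comparing ip plus pid is an assumption based on the fields the
log prints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node identity: IPv4 address plus daemon pid, as in the log. */
struct node_id {
	uint32_t ip;	/* IPv4 address in host byte order */
	uint32_t pid;	/* pid of the sheep daemon */
};

/* Hypothetical outcome of inspecting a leave notification. */
enum leave_action {
	LEAVE_REMOVE_PEER,	/* only other nodes left: drop them from the member list */
	LEAVE_SELF_REJOIN,	/* we are in the left list: try to rejoin, do not abort */
};

bool node_id_equal(const struct node_id *a, const struct node_id *b)
{
	return a->ip == b->ip && a->pid == b->pid;
}

/*
 * Scan the list of nodes reported as left. If the local node is among
 * them, the notification is about ourselves even though we are still
 * running, so the caller can choose to rejoin.
 */
enum leave_action classify_leave(const struct node_id *self,
				 const struct node_id *left, size_t nr_left)
{
	assert(self != NULL);
	assert(left != NULL || nr_left == 0);

	for (size_t i = 0; i < nr_left; i++) {
		if (node_id_equal(self, &left[i]))
			return LEAVE_SELF_REJOIN;
	}
	return LEAVE_REMOVE_PEER;
}
```

What "rejoin" would actually involve (reconnecting to corosync, joining the
group again, recovering state) is left out here on purpose; the sketch only
covers the detection step.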

Thanks,
Yuan