[sheepdog] [PATCH v3] corosync: fix cluster hang by cluster requests blocking confchg

Yunkai Zhang yunkai.me at gmail.com
Thu Jul 5 17:27:06 CEST 2012


On Thu, Jul 5, 2012 at 11:13 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 07/05/2012 11:08 PM, Yunkai Zhang wrote:
>> Yes, leave events are delivered by corosync to sheep one by one.
>>
>> But the order in which sheep processes them depends on when it reads
>> them from corosync_event_list, if you add each leave event to the head
>> of the list. The time spent processing a confchg event may differ from
>> one sheep to another, so the order may be broken.
>>
>> If we need to give priority to leave events and keep the same
>> processing order in each sheep, we can add each leave event in front
>> of all other events but keep the leave events in their delivered order
>> in corosync_event_list.
>
> This is what my v2 does. It seems that even with this method, we can't
> keep the order between join and leave events in some corner cases.
>
> I am considering using separate lists for notification and confchg events.

I think the zookeeper driver will have the same problem, but there are
small differences between corosync and zookeeper. I'll try to fix this
issue in the zookeeper driver later.
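
To make the ordering scheme quoted above concrete, here is a rough sketch in
Python. It is not sheepdog code: the list of (kind, node) tuples only stands in
for corosync_event_list, and the enqueue() helper and the event kinds are
hypothetical names chosen for illustration.

```python
LEAVE = "leave"  # hypothetical event kind, for illustration only


def enqueue(events, kind, node):
    """Queue an event, giving leave events priority.

    A leave event is placed in front of every non-leave event, but
    behind the leave events that are already queued, so leave events
    keep the order in which they were delivered.
    """
    event = (kind, node)
    if kind != LEAVE:
        # Non-leave events are simply appended at the tail.
        events.append(event)
        return
    # Skip the run of leave events at the head of the list.
    pos = 0
    while pos < len(events) and events[pos][0] == LEAVE:
        pos += 1
    events.insert(pos, event)


# Stand-in for corosync_event_list on one sheep.
events = []
enqueue(events, "notify", "A")
enqueue(events, LEAVE, "B")
enqueue(events, "join", "C")
enqueue(events, LEAVE, "D")

for kind, node in events:
    print(kind, node)
```

Both leave events end up ahead of the notify and join events, with B still
before D. The leave of D also moves ahead of the join of C, which was delivered
before it, so this scheme alone does not preserve the order between join and
leave events.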

>
> Thanks,
> Yuan
>



-- 
Yunkai Zhang
Work at Taobao


