[sheepdog] Segfault for 0.4.0 branch

Yunkai Zhang yunkai.me at gmail.com
Tue Jul 10 03:59:51 CEST 2012


On Tue, Jul 10, 2012 at 9:04 AM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 07/09/2012 08:43 PM, Yunkai Zhang wrote:
>> This patch:
>>
>> <982d5ab> corosync: fix cluster hang by cluster requests blocking confchg
>>
>> will cause a new problem: in my testing, 99% of the crashes were
>> triggered by this patch when starting/stopping sheep concurrently.
>>
>
> No, why are you so sure of this when you haven't pinned down the exact
> lines of code? I have to remind you that the op=0x0 case existed long
> before the 982d5ab patch. Actually, the 982d5ab patch reduces the
> chance of the segfault. I wrote 982d5ab while debugging the op=0x0
> bug, and I found there might be multiple bugs involved.
>
>> Splitting the cluster event list into separate notify/confchg lists
>> and giving priority to processing confchg events will break the event
>> order, which is the most important thing in a distributed system.
>>
>> For example:
>>
>> One sheep sends a notify message to the cluster, but at the same time
>> other sheep are joining/leaving the cluster, so the notify message is
>> pushed back behind these join/leave events. The notify handler is
>> then not executed in a timely manner, which can leave some variables
>> incorrectly initialized.
>>
>
> Please specify what the problem is. 'Not executed in a timely manner'
> tells us nothing.
>
> For now, the total order of confchg events is kept, and prioritizing
> the handling of confchg doesn't break things unless we find real proof
> that it does.
>
> Again, please don't come to any conclusion before you have real evidence.
>
> Thanks,
> Yuan
>
>> In my testing, if we start/stop sheep concurrently, the following
>> segmentation fault is almost always triggered:
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
>> 981             return !!op->process_main;
>> (gdb) where
>> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
>> #1  0x00000000004057e7 in prepare_cluster_msg (req=0xb03ca0,
>> sizep=0x7fff129c3640) at group.c:275
>> #2  0x000000000040585c in cluster_op_done (work=0xb03d60) at group.c:290
>> #3  0x000000000040ebaf in bs_thread_request_done (fd=12, events=1,
>>
>>
>> In fact, I have completed a zookeeper patch which also splits the
>> event list into confchg/notify lists, but it has similar problems.
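
The immediate crash site is the unguarded dereference in
has_process_main(): cluster_op_done() -> prepare_cluster_msg() passes a
request whose op is NULL. A minimal sketch of a defensive guard is
below, only to turn the segfault into something observable while the
real cause of op == NULL is tracked down; the field name comes from the
quoted backtrace, the struct name here is illustrative, and this is not
a proposed fix:

/* Sketch only: mirrors has_process_main() from ops.c as shown in the
 * backtrace; the NULL guard is an assumption, not the real fix. */
int has_process_main(struct sd_op_template *op)
{
	if (!op)
		return 0;	/* request reached cluster_op_done() with no op */
	return !!op->process_main;
}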

Let me dig more.
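
To make the ordering concern concrete while I dig, here is a toy model
(not sheep code; all names below are made up) of how draining a
separate confchg list before the notify list reorders handlers with
respect to the totally ordered stream the cluster driver delivers:

#include <stdio.h>

enum ev_type { EV_NOTIFY, EV_CONFCHG };

struct event { enum ev_type type; const char *name; };

int main(void)
{
	/* total order as delivered: the notify arrives first */
	struct event arrival[] = {
		{ EV_NOTIFY,  "N1: notify from sheep A" },
		{ EV_CONFCHG, "C1: sheep B joins" },
		{ EV_CONFCHG, "C2: sheep C leaves" },
	};
	struct event queue[2][8];	/* one list per event type */
	int n[2] = { 0, 0 };
	int i;

	for (i = 0; i < 3; i++)
		queue[arrival[i].type][n[arrival[i].type]++] = arrival[i];

	/* drain the confchg list first, as the patch does ... */
	for (i = 0; i < n[EV_CONFCHG]; i++)
		printf("handle %s\n", queue[EV_CONFCHG][i].name);
	/* ... so N1 is handled only after membership already changed */
	for (i = 0; i < n[EV_NOTIFY]; i++)
		printf("handle %s\n", queue[EV_NOTIFY][i].name);
	return 0;
}

A FIFO dispatch would handle N1, C1, C2; the split lists handle C1, C2,
N1, which is the reordering I am worried about.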

-- 
Yunkai Zhang
Work at Taobao


