[sheepdog] Segfault for 0.4.0 branch

Tue Jul 10 03:04:22 CEST 2012

On 07/09/2012 08:43 PM, Yunkai Zhang wrote:
> This patch:
> 
> <982d5ab> corosync: fix cluster hang by cluster requests blocking confchg
> 
> introduces a new problem: in my testing, 99% of the crashes when
> starting/stopping sheep concurrently were caused by this patch.
> 

No, why are you so sure of this when you haven't identified the exact
lines of code? I have to remind you that the op=0x0 case existed long
before the 982d5ab patch. Actually, the 982d5ab patch reduces the chance
of a segfault. I came up with the 982d5ab patch while debugging the
op=0x0 bug, and found there might be multiple bugs involved.

> Splitting cluster events into notify/cfgchg event lists and giving
> priority to processing cfgchg events will break the event order, which
> is the most important thing in a distributed system.
> 
> For example:
> 
> One sheep sends a notify message to the cluster, but at the same time
> other sheep are joining/leaving the cluster. The notify message is
> then pushed back behind these join/leave events, so the notify
> handler cannot be executed in time, which causes some variables not
> to be initialized correctly.
> 

Please specify what the problem is. 'Not executed opportunely' tells us
nothing.

For now, the total order of confchg events is kept, and giving priority
to handling confchg doesn't break anything unless we find real proof
otherwise.
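
To make the ordering question concrete, here is a minimal sketch of two
event lists with confchg priority. It is not sheepdog code: every name
in it (ev_type, fifo, enqueue, next_event) is hypothetical and only
illustrates the scheme being discussed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; these names are illustrative, not sheepdog's. */
enum ev_type { EV_NOTIFY, EV_CONFCHG };

struct event {
	enum ev_type type;
	int seq;	/* arrival order as delivered by the cluster driver */
};

#define MAX_EVENTS 16

/* A tiny fixed-size FIFO; one instance per event class. */
struct fifo {
	struct event items[MAX_EVENTS];
	size_t head, tail;
};

static void fifo_push(struct fifo *q, struct event ev)
{
	assert(q->tail < MAX_EVENTS);
	q->items[q->tail++] = ev;
}

/* Returns 1 and fills *out if an event was pending, 0 otherwise. */
static int fifo_pop(struct fifo *q, struct event *out)
{
	if (q->head == q->tail)
		return 0;
	*out = q->items[q->head++];
	return 1;
}

/* Route each arriving event to the list for its class. */
static void enqueue(struct fifo *confchg, struct fifo *notify,
		    struct event ev)
{
	fifo_push(ev.type == EV_CONFCHG ? confchg : notify, ev);
}

/* Pick the next event to process: any pending confchg goes first. */
static int next_event(struct fifo *confchg, struct fifo *notify,
		      struct event *out)
{
	if (fifo_pop(confchg, out))
		return 1;
	return fifo_pop(notify, out);
}
```

With arrivals notify(1), confchg(2), confchg(3), this sketch processes
them as 2, 3, 1. The confchg events keep their relative order and so do
the notify events, but a notify can be processed after a confchg that
arrived later. Whether that matters depends on whether some notify
handler relies on the membership state at the time the message was sent.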

Again, please don't come to any conclusion before you have real evidence.

Thanks,
Yuan

> In my testing, if we start/stop sheep concurrently, this segmentation
> fault almost always occurs:
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
> 981             return !!op->process_main;
> (gdb) where
> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
> #1  0x00000000004057e7 in prepare_cluster_msg (req=0xb03ca0, sizep=0x7fff129c3640) at group.c:275
> #2  0x000000000040585c in cluster_op_done (work=0xb03d60) at group.c:290
> #3  0x000000000040ebaf in bs_thread_request_done (fd=12, events=1,
> 
> 
> In fact, I have completed a zookeeper patch which also splits the event
> list into cfgchg/notify lists, but it has similar problems.
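
For reference, the faulting line in the backtrace dereferences op
unconditionally, so any caller that passes op=0x0 crashes there. A
minimal sketch of a defensive variant follows. The struct below is a
hypothetical stand-in that models only the process_main field seen in
the backtrace, and has_process_main_checked and noop_main are made-up
names for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the op descriptor; only the field that
 * appears in the backtrace (process_main) is modeled here. */
struct op_template {
	int (*process_main)(void *req);
};

/* Placeholder handler so the sketch has something to point at. */
static int noop_main(void *req)
{
	(void)req;
	return 0;
}

/* Same test as `return !!op->process_main;` but tolerates op == NULL
 * instead of faulting on the dereference. */
static int has_process_main_checked(const struct op_template *op)
{
	return op != NULL && op->process_main != NULL;
}
```

A check like this would only hide the symptom. It does not explain why
the caller ends up with a NULL op in the first place, which is the
question that still needs real evidence.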