[sheepdog] Segfault for 0.4.0 branch

Yunkai Zhang yunkai.me at gmail.com
Mon Jul 9 14:43:27 CEST 2012


On Mon, Jul 9, 2012 at 5:49 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 07/09/2012 05:45 PM, Liu Yuan wrote:
>> On 07/09/2012 09:58 AM, Liu Yuan wrote:
>>> Got a weird segfault,
>>>
>>> (gdb) where
>>> #0  0x0000000000411936 in do_process_work (work=0xd13c70) at ops.c:992
>>> #1  0x000000000040ed05 in worker_routine (arg=0xd12a20) at work.c:171
>>> #2  0x00007f43f992c971 in start_thread (arg=<value optimized out>) at
>>> pthread_create.c:304
>>> #3  0x00007f43f8eeef3d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>> #4  0x0000000000000000 in ?? ()
>>>
>>> sheep.log:
>>> ...
>>> Jul 09 09:47:23 [main] client_handler(764) connection seems to be dead
>>> Jul 09 09:47:23 [main] clear_client(703) refcnt:0, fd:14, ::1:43328
>>> Jul 09 09:47:23 [main] destroy_client(672) connection from: ::1:43328
>>> Jul 09 09:47:23 [main] cdrv_cpg_deliver(448) 5
>>> Jul 09 09:47:23 [main] sd_notify_handler(851) size: 96, from: IPv4
>>> ip:127.0.0.1 port:7000
>>> Jul 09 09:47:23 [main] client_tx_handler(663) connection from: 13, ::1:43330
>>> Jul 09 09:47:23 [main] client_handler(764) connection seems to be dead
>>> Jul 09 09:47:23 [main] clear_client(703) refcnt:0, fd:13, ::1:43330
>>> Jul 09 09:47:23 [main] destroy_client(672) connection from: ::1:43330
>>> Jul 09 09:47:23 [main] listen_handler(819) accepted a new connection: 13
>>> Jul 09 09:47:23 [main] listen_handler(819) accepted a new connection: 14
>>> Jul 09 09:47:23 [block] do_process_work(990) 80, 0 , 32579 <--- XXX
>>> Jul 09 09:47:23 [main] client_rx_handler(577) connection from: 14, ::1:43337
>>> Jul 09 09:47:23 [main] queue_request(323) 2
>>> Jul 09 09:47:23 [main] crash_handler(408) sheep pid 5326 exited
>>> unexpectedly.
>>>
>>> Thanks,
>>> Yuan
>>>
>>
>> This segmentation fault is suspected to be caused by the following
>> patch set:
>>
>> * <cc458b9> 2012-07-06 sheep: free all requests when connection is dead
>> * <7ce7048> 2012-07-06 sheep: simplify client_decref() and move it into
>> free_request() and add a helper function
>>
>
> And another patch set too. I tried
>
> <982d5ab> corosync: fix cluster hang by cluster requests blocking confchg
>
> and it works well.

This patch:

<982d5ab> corosync: fix cluster hang by cluster requests blocking confchg

will cause a new problem: in my testing, 99% of the crashes were caused
by this patch when starting/stopping sheep concurrently.

Splitting cluster events into notify/cfgchg event lists and giving
priority to processing cfgchg events will break the event order, which
is the most important thing in a distributed system.

For example:

One sheep sends a notify message to the cluster, but at the same time
other sheep are joining/leaving the cluster. The notify message is then
pushed back behind these join/leave events, so the notify handler
cannot be executed in time, which leaves some variables not initialized
correctly.
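
A minimal sketch of that reordering (plain C, not sheepdog code; the
event names and the dispatch loop are made up for illustration):
cfgchg events are always drained before notify events, so the delivery
order no longer matches the arrival order.

/*
 * Illustration only: split arriving events into two lists and always
 * dispatch the confchg list first, as the patch does.
 */
#include <stdio.h>

enum ev_type { EV_NOTIFY, EV_CONFCHG };

struct event {
    enum ev_type type;
    const char *desc;
};

int main(void)
{
    /* Arrival order on the wire: notify first, then two confchg events. */
    struct event arrived[] = {
        { EV_NOTIFY,  "notify: vdi created"  },
        { EV_CONFCHG, "confchg: node joined" },
        { EV_CONFCHG, "confchg: node left"   },
    };
    int n = sizeof(arrived) / sizeof(arrived[0]);

    struct event *notify_list[8], *confchg_list[8];
    int nn = 0, nc = 0, i;

    /* Split into two lists. */
    for (i = 0; i < n; i++) {
        if (arrived[i].type == EV_CONFCHG)
            confchg_list[nc++] = &arrived[i];
        else
            notify_list[nn++] = &arrived[i];
    }

    /* Dispatch confchg first, notify afterwards: delivery order becomes
     * confchg, confchg, notify -- no longer the arrival order. */
    printf("delivery order:\n");
    for (i = 0; i < nc; i++)
        printf("  %s\n", confchg_list[i]->desc);
    for (i = 0; i < nn; i++)
        printf("  %s\n", notify_list[i]->desc);

    return 0;
}

Any notify handler that assumes it runs before the later membership
changes will then see stale or uninitialized state.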

In my testing, if we start/stop sheep concurrently, this segmentation
fault is almost always triggered:
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
981             return !!op->process_main;
(gdb) where
#0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
#1  0x00000000004057e7 in prepare_cluster_msg (req=0xb03ca0,
sizep=0x7fff129c3640) at group.c:275
#2  0x000000000040585c in cluster_op_done (work=0xb03d60) at group.c:290
#3  0x000000000040ebaf in bs_thread_request_done (fd=12, events=1,
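
For reference, here is a reduced illustration of that crash site (my
own sketch, not the sheepdog sources; the struct layout and the idea
that the delayed notify handler leaves req->op unset are assumptions).
A NULL op reaching has_process_main() gives exactly the fault shown
above:

#include <stdbool.h>
#include <stddef.h>

struct sd_op_template {
    int (*process_main)(void);
};

struct request {
    struct sd_op_template *op;   /* assumed to be set by the delayed handler */
};

/* Same shape as frame #0: dereferences op without a NULL check. */
static bool has_process_main(const struct sd_op_template *op)
{
    return !!op->process_main;   /* SIGSEGV when op == NULL */
}

static void prepare_cluster_msg(struct request *req)
{
    if (has_process_main(req->op)) {
        /* build the message for the main thread ... */
    }
}

int main(void)
{
    struct request req = { .op = NULL };  /* handler never ran: op unset */
    prepare_cluster_msg(&req);            /* crashes like frames #1/#2 */
    return 0;
}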


In fact, I have completed a zookeeper patch which also splits the event
list into cfgchg/notify lists, but it has similar problems.

>
> Thanks,
> Yuan
>
>



-- 
Yunkai Zhang
Work at Taobao


