[sheepdog] [PATCH v2] sheep/recovery: multi-threading recovery process

Liu Yuan namei.unix at gmail.com
Thu Jan 30 03:09:54 CET 2014


On Thu, Jan 30, 2014 at 10:11:43AM +0900, Hitoshi Mitake wrote:
> At Thu, 30 Jan 2014 02:53:27 +0800,
> Liu Yuan wrote:
> > 
> > On Wed, Jan 29, 2014 at 06:04:15PM +0900, Hitoshi Mitake wrote:
> > > At Wed, 29 Jan 2014 17:38:34 +0900,
> > > Hitoshi Mitake wrote:
> > > > 
> > > > At Wed, 29 Jan 2014 16:29:03 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Wed, Jan 29, 2014 at 05:19:09PM +0900, Hitoshi Mitake wrote:
> > > > > > At Wed, 29 Jan 2014 16:14:56 +0800,
> > > > > > Liu Yuan wrote:
> > > > > > > 
> > > > > > > On Wed, Jan 29, 2014 at 05:01:52PM +0900, Hitoshi Mitake wrote:
> > > > > > > > At Wed, 29 Jan 2014 15:53:57 +0800,
> > > > > > > > Liu Yuan wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, Jan 29, 2014 at 03:32:34PM +0800, Liu Yuan wrote:
> > > > > > > > > > On Wed, Jan 29, 2014 at 04:28:35PM +0900, Hitoshi Mitake wrote:
> > > > > > > > > > > At Tue, 28 Jan 2014 18:01:42 +0800,
> > > > > > > > > > > Liu Yuan wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > Rationale for multi-threaded recovery:
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. If one node is added, we find that all the VMs on other nodes will get
> > > > > > > > > > > >    noticeably affected until 50% data is transferred to the new node.
> > > > > > > > > > > > 
> > > > > > > > > > > > 2. For node failure, we might not have problems of running VM but the
> > > > > > > > > > > >    recovery process boost will benefit IO operation of VM with less
> > > > > > > > > > > >    chances to be blocked for write and also improve reliability.
> > > > > > > > > > > > 
> > > > > > > > > > > > 3. For disk failure in node, this is similar to adding a node. All
> > > > > > > > > > > >    the data on the broken disk will be recovered on other disks in
> > > > > > > > > > > > >    this node. Speedy recovery not only improves data reliability but
> > > > > > > > > > > > >    also causes less write blocking on the lost data.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Our oid scheduling algorithm is left intact; we simply add multi-threading on top
> > > > > > > > > > > > > of the current recovery algorithm with minimal changes.
> > > > > > > > > > > > 
> > > > > > > > > > > > - we still have ->oids array to denote oids to be recovered
> > > > > > > > > > > > - we start up 2 * nr_disks threads for recovery
> > > > > > > > > > > > > - the tricky part is that we need to wait for all the running threads to
> > > > > > > > > > > > >   complete before starting the next recovery event when multiple node/disk events occur
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch passes "./check -g md -md" on my local box
> > > > > > > > > > > 
> > > > > > > > > > > On my box, at least 32 and 33 failed. I'm seeking the root cause now
> > > > > > > > > > > but this patch seems to be a little bit dangerous.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Yes, this shouldn't go to stable-0.8, but master is okay and at least we need to
> > > > > > > > > > > pass all the tests before it can go to master.
> > > > > > > > > 
> > > > > > > > > > 32 and 33 aren't md-ready tests. Please use 
> > > > > > > > > 
> > > > > > > > > ./check -g md -md
> > > > > > > > > 
> > > > > > > > > to test this patch.
> > > > > > > > 
> > > > > > > > 32 and 33 are tests for recovery. So I think we shouldn't exclude them
> > > > > > > > for testing your patch, no?
> > > > > > > 
> > > > > > > I run 'sudo tests/functional/check 32 33' several times and no failure.
> > > > > > 
> > > > > > On my environment, 32 and 33 fail every time I run (sudo DRIVER=local
> > > > > > ./check 32 33).
> > > > > > 
> > > > > > Below is 32.out.bad and 33.out.bad:
> > > > > > 
> > > > > > 32.out.bad:
> > > > > > QA output created by 032
> > > > > > using backend plain store
> > > > > > 9c7766570b3be3aff2724f587c2f4107  -
> > > > > > STORE/1/obj/807c2b2500000000
> > > > > > STORE/2/obj/807c2b2500000000
> > > > > > STORE/4/obj/807c2b2500000000
> > > > > > STORE/5/obj/807c2b2500000000
> > > > > > STORE/6/obj/807c2b2500000000
> > > > > > STORE/7/obj/807c2b2500000000
> > > > > > STORE/0/obj/007c2b2500000000
> > > > > > STORE/1/obj/007c2b2500000000
> > > > > > STORE/4/obj/007c2b2500000000
> > > > > > STORE/5/obj/007c2b2500000000
> > > > > > STORE/6/obj/007c2b2500000000
> > > > > > STORE/7/obj/007c2b2500000000
> > > > > > STORE/0/obj/007c2b2500000001
> > > > > > STORE/1/obj/007c2b2500000001
> > > > > > STORE/2/obj/007c2b2500000001
> > > > > > STORE/5/obj/007c2b2500000001
> > > > > > STORE/6/obj/007c2b2500000001
> > > > > > STORE/7/obj/007c2b2500000001
> > > > > > STORE/0/obj/007c2b2500000002
> > > > > > STORE/2/obj/007c2b2500000002
> > > > > > STORE/4/obj/007c2b2500000002
> > > > > > STORE/5/obj/007c2b2500000002
> > > > > > STORE/6/obj/007c2b2500000002
> > > > > > STORE/7/obj/007c2b2500000002
> > > > > > STORE/0/obj/007c2b2500000003
> > > > > > STORE/1/obj/007c2b2500000003
> > > > > > STORE/2/obj/007c2b2500000003
> > > > > > STORE/5/obj/007c2b2500000003
> > > > > > STORE/6/obj/007c2b2500000003
> > > > > > STORE/7/obj/007c2b2500000003
> > > > > > STORE/0/obj/007c2b2500000004
> > > > > > STORE/1/obj/007c2b2500000004
> > > > > > STORE/2/obj/007c2b2500000004
> > > > > > STORE/4/obj/007c2b2500000004
> > > > > > STORE/5/obj/007c2b2500000004
> > > > > > STORE/6/obj/007c2b2500000004
> > > > > > STORE/1/obj/007c2b2500000005
> > > > > > STORE/2/obj/007c2b2500000005
> > > > > > STORE/4/obj/007c2b2500000005
> > > > > > STORE/5/obj/007c2b2500000005
> > > > > > STORE/6/obj/007c2b2500000005
> > > > > > STORE/7/obj/007c2b2500000005
> > > > > > STORE/0/obj/007c2b2500000006
> > > > > > STORE/1/obj/007c2b2500000006
> > > > > > STORE/4/obj/007c2b2500000006
> > > > > > STORE/5/obj/007c2b2500000006
> > > > > > STORE/6/obj/007c2b2500000006
> > > > > > STORE/7/obj/007c2b2500000006
> > > > > > STORE/0/obj/007c2b2500000007
> > > > > > STORE/1/obj/007c2b2500000007
> > > > > > STORE/2/obj/007c2b2500000007
> > > > > > STORE/4/obj/007c2b2500000007
> > > > > > STORE/5/obj/007c2b2500000007
> > > > > > STORE/6/obj/007c2b2500000007
> > > > > > STORE/0/obj/007c2b2500000008
> > > > > > STORE/1/obj/007c2b2500000008
> > > > > > STORE/2/obj/007c2b2500000008
> > > > > > STORE/4/obj/007c2b2500000008
> > > > > > STORE/5/obj/007c2b2500000008
> > > > > > STORE/6/obj/007c2b2500000008
> > > > > > STORE/1/obj/007c2b2500000009
> > > > > > STORE/2/obj/007c2b2500000009
> > > > > > STORE/4/obj/007c2b2500000009
> > > > > > STORE/5/obj/007c2b2500000009
> > > > > > STORE/6/obj/007c2b2500000009
> > > > > > STORE/7/obj/007c2b2500000009
> > > > > > STORE/0/obj/007c2b250000000a
> > > > > > STORE/1/obj/007c2b250000000a
> > > > > > STORE/2/obj/007c2b250000000a
> > > > > > STORE/4/obj/007c2b250000000a
> > > > > > STORE/5/obj/007c2b250000000a
> > > > > > STORE/6/obj/007c2b250000000a
> > > > > > STORE/0/obj/007c2b250000000b
> > > > > > STORE/1/obj/007c2b250000000b
> > > > > > STORE/2/obj/007c2b250000000b
> > > > > > STORE/4/obj/007c2b250000000b
> > > > > > STORE/5/obj/007c2b250000000b
> > > > > > STORE/6/obj/007c2b250000000b
> > > > > > STORE/0/obj/007c2b250000000c
> > > > > > STORE/1/obj/007c2b250000000c
> > > > > > STORE/4/obj/007c2b250000000c
> > > > > > STORE/5/obj/007c2b250000000c
> > > > > > STORE/6/obj/007c2b250000000c
> > > > > > STORE/7/obj/007c2b250000000c
> > > > > > STORE/1/obj/007c2b250000000d
> > > > > > STORE/2/obj/007c2b250000000d
> > > > > > STORE/3/obj/007c2b250000000d
> > > > > > STORE/4/obj/007c2b250000000d
> > > > > > STORE/5/obj/007c2b250000000d
> > > > > > STORE/6/obj/007c2b250000000d
> > > > > > STORE/7/obj/007c2b250000000d
> > > > > > STORE/1/obj/007c2b250000000e
> > > > > > STORE/2/obj/007c2b250000000e
> > > > > > STORE/4/obj/007c2b250000000e
> > > > > > STORE/5/obj/007c2b250000000e
> > > > > > STORE/6/obj/007c2b250000000e
> > > > > > STORE/7/obj/007c2b250000000e
> > > > > > STORE/1/obj/007c2b250000000f
> > > > > > STORE/2/obj/007c2b250000000f
> > > > > > STORE/4/obj/007c2b250000000f
> > > > > > STORE/5/obj/007c2b250000000f
> > > > > > STORE/6/obj/007c2b250000000f
> > > > > > STORE/7/obj/007c2b250000000f
> > > > > > STORE/0/obj/007c2b2500000010
> > > > > > STORE/2/obj/007c2b2500000010
> > > > > > STORE/4/obj/007c2b2500000010
> > > > > > STORE/5/obj/007c2b2500000010
> > > > > > STORE/6/obj/007c2b2500000010
> > > > > > STORE/7/obj/007c2b2500000010
> > > > > > STORE/1/obj/007c2b2500000011
> > > > > > STORE/2/obj/007c2b2500000011
> > > > > > STORE/4/obj/007c2b2500000011
> > > > > > STORE/5/obj/007c2b2500000011
> > > > > > STORE/6/obj/007c2b2500000011
> > > > > > STORE/7/obj/007c2b2500000011
> > > > > > STORE/1/obj/007c2b2500000012
> > > > > > STORE/2/obj/007c2b2500000012
> > > > > > STORE/4/obj/007c2b2500000012
> > > > > > STORE/5/obj/007c2b2500000012
> > > > > > STORE/6/obj/007c2b2500000012
> > > > > > STORE/7/obj/007c2b2500000012
> > > > > > STORE/0/obj/007c2b2500000013
> > > > > > STORE/1/obj/007c2b2500000013
> > > > > > STORE/2/obj/007c2b2500000013
> > > > > > STORE/4/obj/007c2b2500000013
> > > > > > STORE/5/obj/007c2b2500000013
> > > > > > STORE/7/obj/007c2b2500000013
> > > > > > STORE/0/obj/007c2b2500000014
> > > > > > STORE/1/obj/007c2b2500000014
> > > > > > STORE/2/obj/007c2b2500000014
> > > > > > STORE/4/obj/007c2b2500000014
> > > > > > STORE/5/obj/007c2b2500000014
> > > > > > STORE/6/obj/007c2b2500000014
> > > > > > STORE/0/obj/007c2b2500000015
> > > > > > STORE/1/obj/007c2b2500000015
> > > > > > STORE/2/obj/007c2b2500000015
> > > > > > STORE/4/obj/007c2b2500000015
> > > > > > STORE/6/obj/007c2b2500000015
> > > > > > STORE/7/obj/007c2b2500000015
> > > > > > STORE/0/obj/007c2b2500000016
> > > > > > STORE/1/obj/007c2b2500000016
> > > > > > STORE/3/obj/007c2b2500000016
> > > > > > STORE/4/obj/007c2b2500000016
> > > > > > STORE/5/obj/007c2b2500000016
> > > > > > STORE/6/obj/007c2b2500000016
> > > > > > STORE/7/obj/007c2b2500000016
> > > > > > STORE/0/obj/007c2b2500000017
> > > > > > STORE/1/obj/007c2b2500000017
> > > > > > STORE/2/obj/007c2b2500000017
> > > > > > STORE/4/obj/007c2b2500000017
> > > > > > STORE/5/obj/007c2b2500000017
> > > > > > STORE/6/obj/007c2b2500000017
> > > > > > STORE/1/obj/007c2b2500000018
> > > > > > STORE/2/obj/007c2b2500000018
> > > > > > STORE/4/obj/007c2b2500000018
> > > > > > STORE/5/obj/007c2b2500000018
> > > > > > STORE/6/obj/007c2b2500000018
> > > > > > STORE/7/obj/007c2b2500000018
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 007c2b2500000000.5
> > > > > > 007c2b2500000001.5
> > > > > > 007c2b2500000002.5
> > > > > > 007c2b2500000003.5
> > > > > > 007c2b2500000004.5
> > > > > > 007c2b2500000005.5
> > > > > > 007c2b2500000006.5
> > > > > > 007c2b2500000007.5
> > > > > > 007c2b2500000008.5
> > > > > > 007c2b2500000009.5
> > > > > > 007c2b250000000a.5
> > > > > > 007c2b250000000b.5
> > > > > > 007c2b250000000c.5
> > > > > > 007c2b250000000d.5
> > > > > > 007c2b250000000e.5
> > > > > > 007c2b250000000f.5
> > > > > > 007c2b2500000010.5
> > > > > > 007c2b2500000011.5
> > > > > > 007c2b2500000012.5
> > > > > > 007c2b2500000013.5
> > > > > > 007c2b2500000014.5
> > > > > > 007c2b2500000015.5
> > > > > > 007c2b2500000016.5
> > > > > > 007c2b2500000017.5
> > > > > > 007c2b2500000018.5
> > > > > > 807c2b2500000000.5
> > > > > > STORE/0/obj/.stale:
> > > > > > STORE/1/obj/.stale:
> > > > > > STORE/2/obj/.stale:
> > > > > > STORE/3/obj/.stale:
> > > > > > STORE/4/obj/.stale:
> > > > > > STORE/5/obj/.stale:
> > > > > > STORE/6/obj/.stale:
> > > > > > STORE/7/obj/.stale:
> > > > > > 9c7766570b3be3aff2724f587c2f4107  -
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 33.out.bad:
> > > > > > QA output created by 033
> > > > > > using backend plain store
> > > > > > 9c7766570b3be3aff2724f587c2f4107  -
> > > > > > should have 2, but have 1 sheep
> > > > > > 
> > > > > 
> > > > > I am also using local driver and haven't met a single failure yet. Could you try
> > > > > make clean; then test?
> > > > > 
> > > > > Seems that 33 has core? if yes, it would be easy to debug.
> > > > 
> > > > Ah yes, I could find the core file. I'll look at it later.
> > > > 
> > > > BTW, below is a tail of the dead sheep's log:
> > > > 
> > > > Jan 29 17:33:15  DEBUG [rw 18759] fetch_object_list(971) 14
> > > > Jan 29 17:33:15  DEBUG [rw 18759] prepare_object_list(1039) go to the next recovery
> > > > Jan 29 17:33:15  DEBUG [main] run_next_rw(680) running threads nr 0
> > > > Jan 29 17:33:15  EMERG [rw 18759] crash_handler(267) sheep exits unexpectedly (Segmentation fault).
> > > > Jan 29 17:33:15  DEBUG [io 18806] do_process_work(1393) a1, 0, 7
> > > > Jan 29 17:33:15  EMERG [rw 18759] sd_backtrace(817) sheep.c:269: crash_handler
> > > > Jan 29 17:33:15  EMERG [rw 18759] sd_backtrace(831) /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7f5dcd64302f]
> > > > Jan 29 17:33:15  EMERG [rw 18759] sd_backtrace(817) work.c:336: worker_routine
> > > > Jan 29 17:33:15  EMERG [rw 18759] sd_backtrace(831) /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b4f) [0x7f5dcd63ab4f]
> > > > Jan 29 17:33:15  EMERG [rw 18759] sd_backtrace(831) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7f5dcccf7a7c]
> > > 
> > > I forgot to mention. I could reproduce the problem and above log even
> > > after cleaning and rebuild.
> > 
> > I think besides core file, there is gdb.txt, can you paste it here?
> 
> 
> OK, this is gdb.txt produced on my environment.
> 
> Thanks,
> Hitoshi
> 
> ==
> == Jan 30 10:07:26 
> == program: /home/mitake/github/sheepdog.git/sheep/sheep
> == command: thread apply all where full
> ==
> 
> Thread 17 (Thread 0x7f8b52fce700 (LWP 6717)):
> #0  0x00007f8b62f076bd in sendmsg () at ../sysdeps/unix/syscall-template.S:82
> No locals.
> #1  0x0000000000422a15 in do_write (max_count=4294967295, epoch=0, need_retry=0, len=4194352, msg=0x7f8b52fcc730, sockfd=22) at net.c:267
>         ret = <optimized out>
>         repeat = -1
> #2  send_req (sockfd=22, hdr=hdr at entry=0x7f8b52fcc7c0, data=<optimized out>, wlen=4194304, need_retry=need_retry at entry=0, epoch=epoch at entry=0, max_count=max_count at entry=4294967295) at net.c:316
>         ret = <optimized out>
>         msg = {msg_name = 0x0, msg_namelen = 0, msg_iov = 0x7f8b52fcc710, msg_iovlen = 2, msg_control = 0x0, msg_controllen = 0, msg_flags = 0}
>         iov = {{iov_base = 0x7f8b52fcc7c0, iov_len = 48}, {iov_base = 0x7f8b4c3c0000, iov_len = 4194304}}
>         __func__ = "send_req"
> #3  0x0000000000408e87 in tx_work (work=0x167cc70) at request.c:792
>         ci = 0x167cc00
>         ret = <optimized out>
>         conn = 0x167cc00
>         rsp = {proto_ver = 0 '\000', opcode = 164 '\244', flags = 0, epoch = 4, id = 0, data_length = 4194304, {result = 0, obj = {__pad = 0, copies = 0 '\000', reserved = "\000\000", offset = 0}, vdi = {__pad = 0, rsvd = 0, vdi_id = 0, attr_id = 0, copies = 0 '\000', reserved = "\000\000"}, node = {__pad = 0, nr_nodes = 0, __reserved = {0, 0}, store_size = 0, store_free = 0}, hash = {__pad1 = 0, __pad2 = 0, digest = '\000' <repeats 19 times>}, __pad = {0, 0, 0, 0, 0, 0, 0, 0}}}
>         req = <optimized out>
>         data = <optimized out>
>         __func__ = "tx_work"
> #4  0x00000000004273eb in worker_routine (arg=0x1665340) at work.c:348
>         wi = 0x1665340
>         work = 0x167cc70
>         tid = 6717
>         __func__ = "worker_routine"
> #5  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b52fce700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236369487616, 2993067626498392730, 140236637118880, 140236369488320, 140236641488960, 7, -2981621700169792870, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #6  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #7  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 16 (Thread 0x7f8b527cd700 (LWP 6718)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x16655f8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665580) at work.c:337
>         wi = 0x1665580
>         work = <optimized out>
>         tid = 6718
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b527cd700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236361094912, 2993067626498392730, 140236637118880, 140236361095616, 140236641488960, 7, -2981620600121294182, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 15 (Thread 0x7f8b51fcc700 (LWP 6719)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665838) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x16657c0) at work.c:337
>         wi = 0x16657c0
>         work = <optimized out>
>         tid = 6719
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b51fcc700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236352702208, 2993067626498392730, 140236637118880, 140236352702912, 140236641488960, 7, -2981628296165817702, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 14 (Thread 0x7f8b517cb700 (LWP 6720)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665a78) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665a00) at work.c:337
>         wi = 0x1665a00
>         work = <optimized out>
>         tid = 6720
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b517cb700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236344309504, 2993067626498392730, 140236637118880, 140236344310208, 140236641488960, 7, -2981627196117319014, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 13 (Thread 0x7f8b50fca700 (LWP 6721)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665cb8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665c40) at work.c:337
>         wi = 0x1665c40
>         work = <optimized out>
>         tid = 6721
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b50fca700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236335916800, 2993067626498392730, 140236637118880, 140236335917504, 140236641488960, 7, -2981626096068820326, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 12 (Thread 0x7f8b507c9700 (LWP 6722)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665ef8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665e80) at work.c:337
>         wi = 0x1665e80
>         work = <optimized out>
>         tid = 6722
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b507c9700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236327524096, 2993067626498392730, 140236637118880, 140236327524800, 140236641488960, 7, -2981624996020321638, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 11 (Thread 0x7f8b4ffc8700 (LWP 6723)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1666138) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x16660c0) at work.c:337
>         wi = 0x16660c0
>         work = <optimized out>
>         tid = 6723
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b4ffc8700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236319131392, 2993067626498392730, 140236637118880, 140236319132096, 140236641488960, 7, -2981632692064845158, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 10 (Thread 0x7f8b4f7c7700 (LWP 6724)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1672718) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x16726a0) at work.c:337
>         wi = 0x16726a0
>         work = <optimized out>
>         tid = 6724
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b4f7c7700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236310738688, 2993067626498392730, 140236637118880, 140236310739392, 140236641488960, 7, -2981631592016346470, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 9 (Thread 0x7f8b4efc6700 (LWP 6725)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1672958) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x16728e0) at work.c:337
>         wi = 0x16728e0
>         work = <optimized out>
>         tid = 6725
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b4efc6700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236302345984, 2993067626498392730, 140236637118880, 140236302346688, 140236641488960, 7, -2981630491967847782, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 8 (Thread 0x7f8b4e3c4700 (LWP 7024)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665a78) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665a00) at work.c:337
>         wi = 0x1665a00
>         work = <optimized out>
>         tid = 7024
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b4e3c4700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236289754880, 2993067626498392730, 140236637118880, 140236289755584, 140236641488960, 7, -2981629941138292070, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 7 (Thread 0x7f8b4d7c2700 (LWP 7034)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x16653b8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665340) at work.c:337
>         wi = 0x1665340
>         work = <optimized out>
>         tid = 7034
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b4d7c2700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236277163776, 2993067626498392730, 140236637118880, 140236277164480, 140236641488960, 7, -2981635987378503014, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 6 (Thread 0x7f8b47fff700 (LWP 7035)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x1665838) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x16657c0) at work.c:337
>         wi = 0x16657c0
>         work = <optimized out>
>         tid = 7035
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b47fff700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236185138944, 2993067626498392730, 140236637118880, 140236185139648, 140236641488960, 7, -2981650296598920550, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 5 (Thread 0x7f8b477fe700 (LWP 7036)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x16653b8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665340) at work.c:337
>         wi = 0x1665340
>         work = <optimized out>
>         tid = 7036
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b477fe700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236176746240, 2993067626498392730, 140236637118880, 140236176746944, 140236641488960, 7, -2981649196550421862, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 4 (Thread 0x7f8b46ffd700 (LWP 7037)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> No locals.
> #1  0x000000000042739b in sd_cond_wait (mutex=<optimized out>, cond=0x16653b8) at ../include/util.h:371
> No locals.
> #2  worker_routine (arg=0x1665340) at work.c:337
>         wi = 0x1665340
>         work = <optimized out>
>         tid = 7037
>         __func__ = "worker_routine"
> #3  0x00007f8b62effb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
>         __res = <optimized out>
>         pd = 0x7f8b46ffd700
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140236168353536, 2993067626498392730, 140236637118880, 140236168354240, 140236641488960, 7, -2981648096501923174, -2981586388538588518}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #4  0x00007f8b625bca7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> No locals.
> #5  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> Thread 3 (Thread 0x7f8b4cfc1700 (LWP 7045)):
> #0  0x00007f8b6258d7fd in __libc_waitpid (pid=7113, stat_loc=<optimized out>, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:41
>         _a3 = 0
>         _a1 = 7113
>         resultvar = <optimized out>
>         _a4 = 0
>         _a2 = 140236268490924
>         oldtype = 0
>         result = <optimized out>
> #1  0x00007f8b62521c99 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:149
>         __result = -512
>         _buffer = {__routine = 0x7f8b62521ff0 <cancel_handler>, __arg = 0x7f8b4cf7d0a8, __canceltype = 1, __prev = 0x0}
>         _avail = 1
>         status = <optimized out>
>         save = <optimized out>
>         pid = 7113
>         sa = {__sigaction_handler = {sa_handler = 0x1, sa_sigaction = 0x1}, sa_mask = {__val = {65536, 0 <repeats 15 times>}}, sa_flags = 0, sa_restorer = 0x7f8b4cf7d120}
>         omask = {__val = {16896, 6701, 140236268766924, 23484968, 91738, 140236626789651, 206158430256, 140236268491016, 140236268490800, 0, 91738, 23485048, 0, 140236268766924, 6701, 140236268491104}}
> #2  0x00007f8b62521fd0 in __libc_system (line=<optimized out>) at ../sysdeps/posix/system.c:190
>         oldtype = 0
>         result = <optimized out>
> #3  0x000000000042168e in gdb_cmd (cmd=0x436c32 "thread apply all where full") at logger.c:772
>         time_str = "Jan 30 10:07:26 ", '\000' <repeats 239 times>
>         cmd_str = "gdb -nw /home/mitake/github/sheepdog.git/sheep/sheep 6701 -batch >/dev/null 2>&1 -ex 'set logging on' -ex 'echo \\n' -ex 'echo ==\\n' -ex 'echo == Jan 30 10:07:26 \\n' -ex 'echo == program: /home/mitake/"...
>         ti = 1391044046
>         tm = {tm_sec = 26, tm_min = 7, tm_hour = 10, tm_mday = 30, tm_mon = 0, tm_year = 114, tm_wday = 4, tm_yday = 29, tm_isdst = 0, tm_gmtoff = 32400, tm_zone = 0x16663a0 "JST"}
> #4  0x00000000004224fc in dump_stack_frames () at logger.c:786
> No locals.
> #5  sd_backtrace () at logger.c:836
>         addrs = {0x4223cd, 0x4059a8, 0x7f8b62f08030, 0x4273a8, 0x7f8b62effb50, 0x7f8b625bca7d, 0x0 <repeats 1018 times>}
>         i = <optimized out>
>         n = <optimized out>
>         __func__ = "sd_backtrace"
> #6  0x00000000004059a8 in crash_handler (signo=11) at sheep.c:269
>         __func__ = "crash_handler"
> #7  <signal handler called>
> No symbol table info available.
> #8  list_del (entry=0x1665a) at ../include/list.h:105
> No locals.
> #9  worker_routine (arg=0x1665a00) at work.c:344

The trace ends in list_del() called from worker_routine() (work.c:344), which
indicates that my patch merely reveals a possible bug in the worker code that
makes sheep dump core. There is nothing wrong with the recovery patch itself.
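
For reference, the threading scheme from the patch description (start
2 * nr_disks threads over the ->oids array, and wait for all running threads
to finish before the next recovery event starts) can be sketched roughly as
below. This is only a minimal illustration under my own assumptions, not the
patch code: struct recovery_batch, recover_worker() and run_recovery_event()
are hypothetical names, plain pthreads stand in for sheep's work queue, and
the per-object recovery is a placeholder.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* All names here are hypothetical; this is not sheepdog's recovery code. */
struct recovery_batch {
	pthread_mutex_t lock;
	const uint64_t *oids;	/* stands in for the ->oids array */
	size_t nr_oids;
	size_t next;		/* index of the next oid to hand out */
	size_t done;		/* number of oids processed so far */
};

/* Each thread repeatedly claims the next oid until the array is exhausted. */
static void *recover_worker(void *arg)
{
	struct recovery_batch *b = arg;

	for (;;) {
		pthread_mutex_lock(&b->lock);
		if (b->next >= b->nr_oids) {
			pthread_mutex_unlock(&b->lock);
			return NULL;
		}
		uint64_t oid = b->oids[b->next++];
		pthread_mutex_unlock(&b->lock);

		(void)oid;	/* placeholder: real code would recover this object */

		pthread_mutex_lock(&b->lock);
		b->done++;
		pthread_mutex_unlock(&b->lock);
	}
}

/*
 * Run one recovery event with 2 * nr_disks threads and return only after
 * all of them have finished, so the next event never overlaps this one.
 * Returns the number of oids processed.
 */
size_t run_recovery_event(const uint64_t *oids, size_t nr_oids, size_t nr_disks)
{
	struct recovery_batch b = { .oids = oids, .nr_oids = nr_oids };
	size_t nr_threads = 2 * nr_disks, started = 0;
	pthread_t *tids;

	if (nr_threads == 0)
		return 0;
	tids = calloc(nr_threads, sizeof(*tids));
	if (!tids)
		return 0;

	pthread_mutex_init(&b.lock, NULL);
	while (started < nr_threads &&
	       pthread_create(&tids[started], NULL, recover_worker, &b) == 0)
		started++;

	/* Wait for every running thread before declaring the event finished. */
	for (size_t i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	assert(started == 0 || b.next == b.nr_oids);
	pthread_mutex_destroy(&b.lock);
	free(tids);
	return b.done;
}
```

A caller would invoke run_recovery_event() once per node/disk event; because
it joins every thread before returning, two events cannot run concurrently in
this sketch.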

Thanks
Yuan


