[sheepdog-users] [0.7-stable] SIGABRT when doing: dog vdi check

Hitoshi Mitake mitake.hitoshi at gmail.com
Tue Jan 7 15:46:53 CET 2014


At Tue, 07 Jan 2014 23:42:39 +0900,
Hitoshi Mitake wrote:
> 
> At Tue, 07 Jan 2014 15:40:09 +0100,
> Marcin Mirosław wrote:
> > 
> > On 07.01.2014 15:32, Hitoshi Mitake wrote:
> > > At Tue, 07 Jan 2014 15:19:05 +0100,
> > > Marcin Mirosław wrote:
> > >>
> > >> On 07.01.2014 14:19, Hitoshi Mitake wrote:
> > >>> At Tue, 07 Jan 2014 10:45:33 +0100,
> > >>> Marcin Mirosław wrote:
> > >>>>
> > >>>> On 07.01.2014 02:57, Hitoshi Mitake wrote:
> > >>>>> At Mon, 06 Jan 2014 11:41:16 +0900,
> > >>>>> Hitoshi Mitake wrote:
> > >>>>>>
> > >>>>>> At Sat, 4 Jan 2014 13:28:27 +0800,
> > >>>>>> Liu Yuan wrote:
> > >>>>>>>
> > >>>>>>> On Fri, Jan 03, 2014 at 10:51:26PM +0100, Marcin Mirosław wrote:
> > >>>>>>>> Hi!
> > >>>>>>>> I'm new to the "sheep-run" ;) I'm just starting to try sheepdog, so
> > >>>>>>>> I'm probably doing many things wrong.
> > >>>>>>>> I'm playing with sheepdog-0.7.6.
> > >>>>>>>>
> > >>>>>>>> First problem (SIGABRT):
> > >>>>>>>> I started multiple sheep daemons on localhost:
> > >>>>>>>> # for x in 0 1 2 3 4; do sheep -c local -j size=128M -p 700$x
> > >>>>>>>> /mnt/sheep/metadata/$x,/mnt/sheep/storage/$x; done
> > >>>>>>>>
> > >>>>>>>> Next:
> > >>>>>>>> # dog cluster info
> > >>>>>>>> Cluster status: Waiting for cluster to be formatted
> > >>>>>>>>
> > >>>>>>>> # dog cluster format -c 2:1
> > >>>>>>>
> > >>>>>>> 0.7.6 doesn't support erasure code. Try latest master branch
> > >>>>>>
> > >>>>>> Current stable-0.7 doesn't treat copy policies like "2:1" as an
> > >>>>>> error. It would be confusing for users. I'll write a patch only for
> > >>>>>> stable-0.7 for treating these options as errors.
> > >>>>>>
> > >>>>>>>
> > >>>>>>>> using backend plain store
> > >>>>>>>> # dog cluster info
> > >>>>>>>> Cluster status: running, auto-recovery enabled
> > >>>>>>>>
> > >>>>>>>> Cluster created at Fri Jan  3 20:33:43 2014
> > >>>>>>>>
> > >>>>>>>> Epoch Time           Version
> > >>>>>>>> 2014-01-03 20:33:43      1 [127.0.0.1:7000, 127.0.0.1:7001,
> > >>>>>>>> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> > >>>>>>>> # dog vdi create testowy 5G
> > >>>>>>>> # gdb -q dog
> > >>>>>>>> Reading symbols from /usr/sbin/dog...Reading symbols from
> > >>>>>>>> /usr/lib64/debug/usr/sbin/dog.debug...done.
> > >>>>>>>> done.
> > >>>>>>>> (gdb)  set args  vdi check testowy
> > >>>>>>>> (gdb) run
> > >>>>>>>> Starting program: /usr/sbin/dog vdi check testowy
> > >>>>>>>> warning: Could not load shared library symbols for linux-vdso.so.1.
> > >>>>>>>> Do you need "set solib-search-path" or "set sysroot"?
> > >>>>>>>> warning: File "/lib64/libthread_db-1.0.so" auto-loading has been
> > >>>>>>>> declined by your `auto-load safe-path' set to
> > >>>>>>>> "$debugdir:$datadir/auto-load".
> > >>>>>>>> To enable execution of this file add
> > >>>>>>>>         add-auto-load-safe-path /lib64/libthread_db-1.0.so
> > >>>>>>>> line to your configuration file "/root/.gdbinit".
> > >>>>>>>> To completely disable this security protection add
> > >>>>>>>>         set auto-load safe-path /
> > >>>>>>>> line to your configuration file "/root/.gdbinit".
> > >>>>>>>> For more information about this security protection see the
> > >>>>>>>> "Auto-loading safe path" section in the GDB manual.  E.g., run from the
> > >>>>>>>> shell:
> > >>>>>>>>         info "(gdb)Auto-loading safe path"
> > >>>>>>>> warning: Unable to find libthread_db matching inferior's thread library,
> > >>>>>>>> thread debugging will not be available.
> > >>>>>>>> PANIC: can't find next new idx
> > >>>>>>>
> > >>>>>>> It seems the 0.7.x series is flaky here. Hitoshi, can you verify
> > >>>>>>> this?
> > >>>>>>
> > >>>>>> OK, I'll dig in the problem soon.
> > >>>>>
> > >>>>> Hi Marcin,
> > >>>>>
> > >>>>> I've fixed a serious bug that might cause the crash you hit.
> > >>>>> The commit is:
> > >>>>> https://github.com/sheepdog/sheepdog/commit/b82632b47978315a50e9ba0bbad59f56453f63f5
> > >>>>>
> > >>>>> The patch is already backported to stable-0.7. Could you try the
> > >>>>> latest stable-0.7? You can obtain the source code with the steps below:
> > >>>>>
> > >>>>> git clone https://github.com/sheepdog/sheepdog.git
> > >>>>> cd sheepdog
> > >>>>> git checkout -b 0.7 origin/stable-0.7
> > >>>>>
> > >>>>> But the problem might depend heavily on timing, so reproducing it
> > >>>>> will not be easy... If you have time, I'd like you to try
> > >>>>> the latest change.
> > >>>>
> > >>>> Hi all!
> > >>>> Today I'm trying sheepdog on another computer and have no trouble
> > >>>> reproducing the SIGABRT. I tried the latest 0.7 stable (at 9de5329978)
> > >>>> and "dog vdi check testowy" still crashes:
> > >>>> # dog  vdi check testowy
> > >>>> PANIC: can't find next new idx
> > >>>> dog exits unexpectedly (Aborted).
> > >>>> dog() [0x40536a]
> > >>>> /lib64/libpthread.so.0(+0xfd8f) [0x7f90aec31d8f]
> > >>>> /lib64/libc.so.6(gsignal+0x38) [0x7f90ae8b2368]
> > >>>> /lib64/libc.so.6(abort+0x147) [0x7f90ae8b36c7]
> > >>>> dog() [0x40893a]
> > >>>> dog() [0x40a3d9]
> > >>>> dog() [0x40363e]
> > >>>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x7f90ae89ec04]
> > >>>> dog() [0x4038a8]
> > >>>> #7  0x000000000040faa5 in sd_backtrace () at logger.c:914
> > >>>>
> > >>>> addrs = {0x40f95d <sd_backtrace+29>, 0x40536b <crash_handler+43>,
> > >>>> 0x7f90aec31d90 <__restore_rt>, 0x7f90ae8b2369 <__GI_raise+57>,
> > >>>> 0x7f90ae8b36c8 <__GI_abort+328>, 0x40893b, 0x40a3da <vdi_check+186>,
> > >>>> 0x40363f <main+1055>, 0x7f90ae89ec05 <__libc_start_main+245>,
> > >>>> 0x4038a9 <_start+41>, 0x0 <repeats 1014 times>}
> > >>>>
> > >>>> i = <optimized out>
> > >>>>
> > >>>> n = <optimized out>
> > >>>>
> > >>>> __func__ = "sd_backtrace"
> > >>>>
> > >>>> #8  0x000000000040536b in crash_handler (signo=6) at dog.c:330
> > >>>>
> > >>>> __func__ = "crash_handler"
> > >>>>
> > >>>> #9  <signal handler called>
> > >>>>
> > >>>> No locals.
> > >>>>
> > >>>> #10 0x00007f90ae8b2369 in __GI_raise (sig=sig@entry=6) at
> > >>>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> > >>>>
> > >>>> resultvar = 0
> > >>>>
> > >>>> pid = 21938
> > >>>>
> > >>>> selftid = 21938
> > >>>>
> > >>>> #11 0x00007f90ae8b36c8 in __GI_abort () at abort.c:90
> > >>>>
> > >>>> save_stage = 2
> > >>>>
> > >>>> act = {__sigaction_handler = {sa_handler = 0x2, sa_sigaction = 0x2},
> > >>>> sa_mask = {__val = {29029392, 102, 140259386245127, 5, 0, 208,
> > >>>> 140259380182888, 0, 102, 320, 140259386270773, 0, 140259383961520, 0,
> > >>>> 140259383961520, 140259383955552}}, sa_flags = -1358768064,
> > >>>> sa_restorer = 0x7f90ae9a3e80 <__memcpy_ssse3+9600>}
> > >>>>
> > >>>> sigs = {__val = {32, 0 <repeats 15 times>}}
> > >>>>
> > >>>> #12 0x000000000040893b in get_vnode_next_idx (nr_prev_idxs=<optimized
> > >>>> out>, prev_idxs=<optimized out>, nr_entries=<optimized out>,
> > >>>> entries=<optimized out>) at ../include/sheep.h:105
> > >>>>
> > >>>> i = <optimized out>
> > >>>>
> > >>>> idx = <optimized out>
> > >>>>
> > >>>> first_idx = <optimized out>
> > >>>>
> > >>>> found = <optimized out>
> > >>>>
> > >>>> #13 oid_to_vnodes (vnodes=0x7fffaefac8c0, nr_copies=2, oid=<optimized
> > >>>> out>, nr_entries=320, entries=<optimized out>) at ../include/sheep.h:174
> > >>>>
> > >>>> idxs = {102, 0, 29029184, 0, 1, 0, 29029264, 0}
> > >>>>
> > >>>> i = <optimized out>
> > >>>>
> > >>>> vnodes = 0x7fffaefac8c0
> > >>>>
> > >>>> nr_copies = 2
> > >>>>
> > >>>> oid = <optimized out>
> > >>>>
> > >>>> nr_entries = 320
> > >>>>
> > >>>> #14 queue_vdi_check_work (oid=<optimized out>, done=done@entry=0x0,
> > >>>> wq=wq@entry=0x1baf410, inode=0x7f90adc77010, inode=0x7f90adc77010) at
> > >>>> vdi.c:1600
> > >>>>
> > >>>> info = 0x1baf580
> > >>>>
> > >>>> tgt_vnodes = {0xc2fd70 <sd_vnodes+5712>, 0x1baf3b8, 0x7f90adc77010,
> > >>>> 0x414348 <create_worker_threads+104>, 0x7fffaefae207, 0x7f90adc76700,
> > >>>> 0x1baf340, 0x1baf390}
> > >>>>
> > >>>> nr_copies = 2
> > >>>>
> > >>>> #15 0x000000000040a3da in vdi_check (argc=<optimized out>,
> > >>>> argv=<optimized out>) at vdi.c:1634
> > >>>>
> > >>>> vdiname = 0x7fffaefae207 "testowy"
> > >>>>
> > >>>> ret = 0
> > >>>>
> > >>>> max_idx = <optimized out>
> > >>>>
> > >>>> done = 0
> > >>>>
> > >>>> vid = 13289526
> > >>>>
> > >>>> inode = 0x7f90adc77010
> > >>>>
> > >>>> wq = 0x1baf410
> > >>>>
> > >>>> __func__ = "vdi_check"
> > >>>>
> > >>>> #16 0x000000000040363f in main (argc=<optimized out>,
> > >>>> argv=0x7fffaefacad8) at dog.c:494
> > >>>>
> > >>>> ch = <optimized out>
> > >>>>
> > >>>> longindex = 0
> > >>>>
> > >>>> ret = <optimized out>
> > >>>>
> > >>>> flags = 2
> > >>>>
> > >>>> long_options = 0xc2c560 <lopts.2390>
> > >>>>
> > >>>> commands = 0x1ba3010
> > >>>>
> > >>>> short_options = 0xc2e580 <sopts.2381> "s:a:p:h"
> > >>>>
> > >>>> p = 0x41594d
> > >>>> <__libc_csu_init+77> "H\203\303\001H9\353u\352H\203\304\b[]A\\A]A^A_\303ff.\017\037\204"
> > >>>>
> > >>>> __func__ = "main"
> > >>>>
> > >>>>
> > >>>> Additionally I'm getting QA Notice from portage:
> > >>>>  * QA Notice: Package triggers severe warnings which indicate that it
> > >>>>  *            may exhibit random runtime failures.
> > >>>>  * group.c:348:45: warning: argument to ‘sizeof’ in ‘memcpy’ call is the
> > >>>> same expression as the destination; did you mean to dereference it?
> > >>>> [-Wsizeof-pointer-memaccess]
> > >>>>
> > >>>>
> > >>>>
> > >>>> Today I'm using gcc-4.8.2, but I also tested with clang and nothing changed.
> > >>>> I'm configuring sheepdog using:
> > >>>> # configure --prefix=/usr --build=x86_64-pc-linux-gnu
> > >>>> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
> > >>>> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
> > >>>> --localstatedir=/var/lib --libdir=/usr/lib64 --disable-silent-rules
> > >>>> --disable-dependency-tracking --docdir=/usr/share/doc/sheepdog-0.7.9999
> > >>>> --disable-corosync --enable-sheepfs
> > >>>>
> > >>>> The directory "/mnt/sheep" is on a single partition (I know this is
> > >>>> slow, but speed isn't important to me at the moment).
> > >>>> If I can do anything more to help, please let me know.
> > >>>
> > >>> Thanks a lot for your testing and detailed report, Marcin.
> > >>>
> > >>> I backported an important patch for the above problem. Could you try
> > >>> the latest stable-0.7?
> > >>>
> > >>> BTW, the bug depends on sizeof(time_t). Could you run the program
> > >>> below in your environment?
> > >>>
> > >>> #include <stdio.h>
> > >>> #include <time.h>
> > >>>
> > >>> int main(void)
> > >>> {
> > >>> 	printf("sizeof(time_t): %zu\n", sizeof(time_t));
> > >>> 	printf("sizeof(time_t *): %zu\n", sizeof(time_t *));
> > >>> 	return 0;
> > >>> }
> > >>>
> > >>> If the sizes are different, the patch would fix the problem.
> > >>
> > >> Sizes are the same:
> > >>
> > >> sizeof(time_t): 8
> > >> sizeof(time_t *): 8
> > > 
> > > Hmm, thanks.
> > > 
> > >>
> > >>
> > >> I've tested sheepdog at commit 984afc61c16e69 and it's better :)
> > >> # dog vdi check testowy
> > >> failed to connect to 0:0:7f00:1:591b:::0: Network is unreachable
> > >> failed to connect to 0:0:7f00:1:591b:::0: Network is unreachable
> > >>  13.6 % [==================>
> > >>
> > >>  ] 696 MB / 5.0 GB    failed to connect to 5b1b:::0: Network is unreachable
> > >> / # ============================================>
> > >>
> > >> So the SIGABRT didn't appear; only the terminal was polluted in this test ;)
> > >> I ran it once again with an explicit IP address:
> > >> # dog vdi check  -a 127.0.0.1 testowy
> > >>  48.8 %
> > >> [==================================================================>
> > >>                                                                  ] 2.4
> > >> GB / 5.0 GB    failed to connect to 5b1b:::0: Network is unreachable
> > >> failed to connect to 5b1b:::0: Network is unreachable
> > >>  48.8 %
> > >> [==================================================================>
> > >>                                                                  ] 2.4
> > >> GB / 5.0 GB    failed to connect to 0:0:7f00:1:591b:::0: Network is
> > >> unreachable
> > >> failed to connect to 0:0:7f00:1:591b:::0: Network is unreachable
> > >>
> > >> I don't know why dog wants to connect over IPv6.
> > > 
> > > Thanks for your testing!
> > > 
> > > The broken IP addresses can be seen when the latest dog and stable
> > > sheep are communicating. Aren't you using the latest dog?
> > 
> > I've run the test once again and the SIGABRT appears; it looks like I
> > made a mistake in my tests. I'm frequently changing branches, and I
> > probably didn't kill the sheep daemons in the meantime.
> > Sorry.
> 
> No problem at all. BTW, did the invalid IPv6 problem disappear? If so,
> I'd like to conclude that this SIGABRT problem is solved.

Oops, sorry, I misread your email. SIGABRT is still alive...

Thanks,
Hitoshi
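
For reference, the -Wsizeof-pointer-memaccess notice and the
sizeof(time_t) check quoted above describe the same class of bug:
passing sizeof(pointer) to memcpy() where sizeof(*pointer) was meant.
Such a call copies the right number of bytes only when the object and
the pointer happen to have the same size, which is the case on the
reporter's machine (both 8). A minimal sketch follows; copy_ctime() is
a hypothetical helper written for illustration and is not taken from
the sheepdog source:

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not sheepdog code): copy one time_t via pointers. */
static void copy_ctime(time_t *dst, const time_t *src)
{
	assert(dst != NULL && src != NULL);

	/*
	 * The form that -Wsizeof-pointer-memaccess flags would be:
	 *     memcpy(dst, src, sizeof(dst));
	 * which uses the size of the pointer, not of the time_t it points to.
	 * The two sizes differ on platforms where sizeof(time_t) is not
	 * equal to sizeof(time_t *).
	 */
	memcpy(dst, src, sizeof(*dst));	/* size of the pointed-to object */
}
```

Calling copy_ctime(&dst, &src) with two time_t variables leaves dst
equal to src regardless of the platform's pointer width.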


