[sheepdog-users] [master branch] SIGABRT when doing: dog vdi check

Liu Yuan namei.unix at gmail.com
Wed Jan 8 09:53:01 CET 2014


On Wed, Jan 08, 2014 at 09:47:51AM +0100, Marcin Mirosław wrote:
> On 08.01.2014 07:21, Liu Yuan wrote:
> > On Tue, Jan 07, 2014 at 03:40:44PM +0100, Marcin Mirosław wrote:
> >> On 07.01.2014 14:38, Liu Yuan wrote:
> >>> On Tue, Jan 07, 2014 at 01:29:40PM +0100, Marcin Mirosław wrote:
> >>>> On 07.01.2014 12:50, Liu Yuan wrote:
> >>>>> On Tue, Jan 07, 2014 at 11:14:09AM +0100, Marcin Mirosław wrote:
> >>>>>> On 07.01.2014 11:05, Liu Yuan wrote:
> >>>>>>> On Tue, Jan 07, 2014 at 10:51:18AM +0100, Marcin Mirosław wrote:
> >>>>>>>> On 07.01.2014 03:00, Liu Yuan wrote:
> >>>>>>>>> On Mon, Jan 06, 2014 at 05:38:41PM +0100, Marcin Mirosław wrote:
> >>>>>>>>>> On 2014-01-06 08:27, Liu Yuan wrote:
> >>>>>>>>>>> On Sat, Jan 04, 2014 at 04:13:27PM +0100, Marcin Mirosław wrote:
> >>>>>>>>>>>> On 2014-01-04 06:28, Liu Yuan wrote:
> >>>>>>>>>>>>> On Fri, Jan 03, 2014 at 10:51:26PM +0100, Marcin Mirosław wrote:
> >>>>>>>>>>>>>> Hi!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all!
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm new to "sheep-running" ;) I'm just starting out with sheepdog, so I'm
> >>>>>>>>>>>>>> probably doing many things wrong. I'm playing with sheepdog-0.7.6.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> First problem (SIGABRT): I started multiple sheep daemons on localhost:
> >>>>>>>>>>>>>> # for x in 0 1 2 3 4; do sheep -c local -j size=128M -p 700$x /mnt/sheep/metadata/$x,/mnt/sheep/storage/$x; done
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Next:
> >>>>>>>>>>>>>> # dog cluster info
> >>>>>>>>>>>>>> Cluster status: Waiting for cluster to be formatted
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> # dog cluster format -c 2:1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0.7.6 doesn't support erasure code. Try latest master branch
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now I'm on 486ace8ccbb [master]. How should I check the chosen redundancy?
> >>>>>>>>>>>>  # cat /mnt/test/vdi/list
> >>>>>>>>>>>>    Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
> >>>>>>>>>>>>    testowy      0  1.0 GB  0.0 MB  0.0 MB 2014-01-04 15:07   cac836     3
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here I can see 3 copies, but I can't see any info about how many parity strips
> >>>>>>>>>>>> are configured. Probably this isn't implemented yet?
> >>>>>>>>>>>
> >>>>>>>>>>> Not yet. But currently you can run 'dog cluster info -s' to see the global
> >>>>>>>>>>> policy scheme x:y (the one you set with 'dog cluster format -c x:y').
> >>>>>>>>>>>
> >>>>>>>>>>> With erasure coding, 'copies' takes on another meaning: the total number of
> >>>>>>>>>>> data + parity objects. In your case, it is 2+1=3. But as you said, this is
> >>>>>>>>>>> confusing; I'm thinking of adding an extra field to indicate the redundancy
> >>>>>>>>>>> scheme per vdi.
> >>>>>>>>>>>
> >>>>>>>>>>> Well, as for the above issue, I can't reproduce it. Could you give me more
> >>>>>>>>>>> environment information, such as whether your OS is 32- or 64-bit and which
> >>>>>>>>>>> distro you use?
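To make the x:y scheme and 'copies' relationship above concrete, here is a minimal sketch;
it only uses the two dog commands already mentioned in this thread and assumes a freshly
started cluster that has not been formatted yet:

# 2 data strips + 1 parity strip; 'copies' in the vdi list will then read 2+1 = 3
dog cluster format -c 2:1
# read the global redundancy scheme (x:y) back
dog cluster info -s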
> >>>>>>>>>>
> >>>>>>>>>> Hi!
> >>>>>>>>>> I'm using 64-bit Gentoo, gcc version 4.7.3 (Gentoo Hardened 4.7.3-r1
> >>>>>>>>>> p1.4, pie-0.5.5), kernel 3.10 with Gentoo patches.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Does the problem still exist? I can't reproduce the issue yet. So how did you
> >>>>>>>>> reproduce it step by step?
> >>>>>>>>
> >>>>>>>> Hi!
> >>>>>>>> I'm installing sheepdog-0.7.x, next:
> >>>>>>>> # mkdir -p /mnt/sheep/{metadata,storage}
> >>>>>>>> # for x in 0 1 2 3 4; do sheep -c local -j size=128M -p 700$x
> >>>>>>>> /mnt/sheep/metadata/$x,/mnt/sheep/storage/$x; done
> >>>>>>>> # dog cluster format -c 2
> >>>>>>>> using backend plain store
> >>>>>>>> # dog vdi create testowy 5G
> >>>>>>>> # dog  vdi check testowy
> >>>>>>>> PANIC: can't find next new idx
> >>>>>>>> dog exits unexpectedly (Aborted).
> >>>>>>>> dog() [0x4058da]
> >>>>>>>> [...]
> >>>>>>>>
> >>>>>>>> I'm getting SIGABRT on every try.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> On the same machine, with the master branch (not stable-0.7), you mentioned you
> >>>>>>> can't reproduce the problem?
> >>>>>>
> >>>>>> With the master branch (commit a79e69f9ad9c5) I'm getting this message:
> >>>>>> # dog  vdi check testowy
> >>>>>> PANIC: can't find a valid vnode
> >>>>>> dog exits unexpectedly (Aborted).
> >>>>>> dog() [0x4057fa]
> >>>>>> /lib64/libpthread.so.0(+0xfd8f) [0x7f6d43cd0d8f]
> >>>>>> /lib64/libc.so.6(gsignal+0x38) [0x7f6d43951368]
> >>>>>> /lib64/libc.so.6(abort+0x147) [0x7f6d439526c7]
> >>>>>> dog() [0x40336e]
> >>>>>> dog() [0x409d9f]
> >>>>>> dog() [0x40cea5]
> >>>>>> dog() [0x403927]
> >>>>>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x7f6d4393dc04]
> >>>>>> dog() [0x403c6c]
> >>>>>>
> >>>>>> Would a full gdb backtrace be useful?
> >>>>>
> >>>>> Hmm, before you run 'dog vdi check', what is the output of 'dog cluster info',
> >>>>> 'dog node list', and 'dog node md info --all'?
> >>>>
> >>>> Output using master branch:
> >>>> # dog cluster info
> >>>> Cluster status: running, auto-recovery enabled
> >>>>
> >>>> Cluster created at Tue Jan  7 13:21:53 2014
> >>>>
> >>>> Epoch Time           Version
> >>>> 2014-01-07 13:21:54      1 [127.0.0.1:7000, 127.0.0.1:7001,
> >>>> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> >>>>
> >>>> # dog node list
> >>>>   Id   Host:Port         V-Nodes       Zone
> >>>>    0   127.0.0.1:7000           128   16777343
> >>>>    1   127.0.0.1:7001           128   16777343
> >>>>    2   127.0.0.1:7002           128   16777343
> >>>>    3   127.0.0.1:7003           128   16777343
> >>>>    4   127.0.0.1:7004           128   16777343
> >>>>
> >>>> # dog node md info --all
> >>>> Id      Size    Used    Avail   Use%    Path
> >>>> Node 0:
> >>>>  0      4.4 GB  4.0 MB  4.4 GB    0%    /mnt/sheep/storage/0
> >>>> Node 1:
> >>>>  0      4.4 GB  0.0 MB  4.4 GB    0%    /mnt/sheep/storage/1
> >>>> Node 2:
> >>>>  0      4.4 GB  0.0 MB  4.4 GB    0%    /mnt/sheep/storage/2
> >>>> Node 3:
> >>>>  0      4.4 GB  0.0 MB  4.4 GB    0%    /mnt/sheep/storage/3
> >>>> Node 4:
> >>>>  0      4.4 GB  0.0 MB  4.4 GB    0%    /mnt/sheep/storage/4
> >>>>
> >>>
> >>> The very strange thing in your output is that only 1 copy was actually
> >>> written when you executed 'dog vdi create', even though you formatted the
> >>> cluster with two copies specified.
> >>>
> >>> You can verify this by
> >>>
> >>> ls /mnt/sheep/storage/*/
> >>>
> >>> I guess you can only see one object. Dunno why this happened.
> >>
> >> It is as you said:
> >> # ls /mnt/sheep/storage/*/
> >> /mnt/sheep/storage/0/:
> >> 80cac83600000000
> >>
> >> /mnt/sheep/storage/1/:
> >>
> >> /mnt/sheep/storage/2/:
> >>
> >> /mnt/sheep/storage/3/:
> >>
> >> /mnt/sheep/storage/4/:
> >>
> >>
> >> Now I'm on commit a79e69f9ad9c and the problem still exists for me (in
> >> contrast to 0.7-stable). I noticed that the files "sheepdog_shm" and "lock"
> >> appeared in my /tmp. Is that correct?
> >>

'lock' isn't created by the sheep daemon as far as I know. We create sheepdog_locks
for the local driver.
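If you want to double-check which of those /tmp entries actually come from sheep, something
along these lines should show it (just a sketch; fuser comes from psmisc, and the exact file
names the local driver creates may differ between versions):

ls -ld /tmp/sheepdog* /tmp/lock   # list whatever sheep-related entries exist
fuser -v /tmp/lock                # show which process, if any, has 'lock' open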

> > 
> > I suspect there is actually only one node in the cluster, so 'vdi check' panics out.
> > 
> > Before you run 'vdi check':
> >
> > for i in `seq 0 4`; do dog cluster info -p 700$i; done
> >
> > Is the output the same on every node?
> >
> >
> > for i in `seq 0 4`; do dog node list -p 700$i; done
> >
> > Is that the same too?
> 
> Hi!
> The output is shown below:
> 
> # for i in `seq 0 4`;do dog cluster info -p 700$i;done
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Wed Jan  8 09:42:40 2014
> 
> Epoch Time           Version
> 2014-01-08 09:42:41      1 [127.0.0.1:7000, 127.0.0.1:7001,
> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Wed Jan  8 09:42:40 2014
> 
> Epoch Time           Version
> 2014-01-08 09:42:40      1 [127.0.0.1:7000, 127.0.0.1:7001,
> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Wed Jan  8 09:42:40 2014
> 
> Epoch Time           Version
> 2014-01-08 09:42:41      1 [127.0.0.1:7000, 127.0.0.1:7001,
> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Wed Jan  8 09:42:40 2014
> 
> Epoch Time           Version
> 2014-01-08 09:42:40      1 [127.0.0.1:7000, 127.0.0.1:7001,
> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Wed Jan  8 09:42:40 2014
> 
> Epoch Time           Version
> 2014-01-08 09:42:40      1 [127.0.0.1:7000, 127.0.0.1:7001,
> 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004]
> 
> # for i in `seq 0 4`;do dog node list -p 700$i;done
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000           128   16777343
>    1   127.0.0.1:7001           128   16777343
>    2   127.0.0.1:7002           128   16777343
>    3   127.0.0.1:7003           128   16777343
>    4   127.0.0.1:7004           128   16777343
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000           128   16777343
>    1   127.0.0.1:7001           128   16777343
>    2   127.0.0.1:7002           128   16777343
>    3   127.0.0.1:7003           128   16777343
>    4   127.0.0.1:7004           128   16777343
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000           128   16777343
>    1   127.0.0.1:7001           128   16777343
>    2   127.0.0.1:7002           128   16777343
>    3   127.0.0.1:7003           128   16777343
>    4   127.0.0.1:7004           128   16777343
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000           128   16777343
>    1   127.0.0.1:7001           128   16777343
>    2   127.0.0.1:7002           128   16777343
>    3   127.0.0.1:7003           128   16777343
>    4   127.0.0.1:7004           128   16777343
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000           128   16777343
>    1   127.0.0.1:7001           128   16777343
>    2   127.0.0.1:7002           128   16777343
>    3   127.0.0.1:7003           128   16777343
>    4   127.0.0.1:7004           128   16777343
> 
> 

Everything looks fine. It is very weird. And with 5 nodes, only 1 copy was written
successfully. I have no idea what happened, and I can't reproduce the problem on my
local machine.
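If it shows up again, one quick way to see how many replicas of each object actually landed
on disk is the sketch below; it only builds on the ls you already ran and assumes the same
/mnt/sheep/storage layout (with '-c 2' every object name should appear with a count of 2):

# count in how many of the five stores each object is present
ls /mnt/sheep/storage/*/ | grep -v ':$' | grep -v '^$' | sort | uniq -c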

Thanks
Yuan


