[sheepdog-users] sheepdog replication got stuck

Sat Jan 4 20:20:04 CET 2014

Hi,

I have done further investigation on that issue.

As long as I import the images one at a time, everything works as expected. But when I try to import all images at the same time, either the replication to the other node gets stuck and the import never finishes, or sheep crashes with a segmentation fault (see below).

It looks to me like some kind of race condition in the thread handling.

Upgrading to 0.7.6 doesn't change anything.

Sheep is running with the following options:

/usr/sbin/sheep --pidfile /var/run/sheep.pid -l 6 --nosync /var/lib/sheepdog/ /var/lib/sheepdog//disc1/data,/var/lib/sheepdog//disc2/data -w dir=/var/lib/sheepdog//cache size=100000

Regards

Gerald

Crash from 0.7.5:

Jan 01 11:03:58  EMERG [gway 82801] crash_handler(250) sheep exits unexpectedly (Segmentation fault).
Jan 01 11:04:03  EMERG [gway 82801] sd_backtrace(843) sheep.c:252: crash_handler
Jan 01 11:04:03  EMERG [gway 82801] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7f491ba3502f]
Jan 01 11:04:03  EMERG [gway 82801] sd_backtrace(843) object_cache.c:643: find_object_cache
Jan 01 11:04:03  EMERG [gway 82801] sd_backtrace(843) object_cache.c:1098: bypass_object_cache
Jan 01 11:04:04  EMERG [gway 82801] sd_backtrace(843) gateway.c:39: gateway_read_obj
Jan 01 11:04:04  EMERG [gway 82801] sd_backtrace(843) ops.c:1337: do_process_work
Jan 01 11:04:04  EMERG [gway 82801] sd_backtrace(843) work.c:294: worker_routine
Jan 01 11:04:04  EMERG [gway 82801] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b4f) [0x7f491ba2cb4f]
Jan 01 11:04:04  EMERG [gway 82801] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7f491b0e9a7c]
Jan 01 11:04:19  ERROR [main] crash_handler(490) sheep pid 81829 exited unexpectedly.

> -----Original Message-----
> From: sheepdog-users-bounces at lists.wpkg.org [mailto:sheepdog-users-bounces at lists.wpkg.org] On Behalf Of Gerald Richter - ECOS
> Sent: Monday, December 23, 2013 14:16
> To: Liu Yuan
> Cc: Lista sheepdog user
> Subject: Re: [sheepdog-users] sheepdog replication got stuck
> 
> Hi,
> 
> >
> > So what is the problem? 'qemu-img convert' hangs so that it never finishes?
> >
> 
> On the first host (where the qemu-img runs) I have:
> 
> vm-61025-disk-1     0   20 GB   15 GB  0.0 MB 2013-12-19 09:52   bb7a25     3
> 
> on the second one I have:
> 
> vm-61025-disk-1     0   20 GB   36 MB  0.0 MB 2013-12-19 09:52   bb7a25     3
> 
> Regardless of whether qemu-img hangs, I expect the second machine to show
> the same "Used" value as the first one (after the time it takes to push the
> cached content over the network).
> 
> The other question is why qemu-img hangs. I guess (but I could be wrong)
> that it issued a flush at the end of the import and is now waiting until the
> cache has been flushed to all nodes. That is how I understand from the docs
> that it should work.
> 
> At least, running strace and lsof on the qemu-img process shows that it is
> waiting for the sheepdog server (a select on the sheepdog socket connection).
> 
> Maybe it's relevant that I run qemu 1.4, because that is what ships with the
> distribution I use (Proxmox). It contains a bunch of patches, so it's not
> easy to compile from source.
> 
> But regardless of whether the hang of qemu-img is due to an old qemu, I
> would expect the cache to get flushed to the second node over time, or am I
> wrong?
> 
> Regards
> 
> Gerald