[Sheepdog] [Qemu-devel] coroutine bug?, was Re: [PATCH] sheepdog: use coroutines

Christoph Hellwig hch at lst.de
Mon Jan 2 16:39:59 CET 2012


On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> Are you building with gcc 4.5.3 or later?  (Earlier versions may
> mis-compile, see https://bugs.launchpad.net/qemu/+bug/902148.)

I'm using "gcc version 4.6.2 (Debian 4.6.2-9)", so that should not
be an issue. 

> If you can reproduce this bug and suspect coroutines are involved then I

It's entirely reproducible.

I've played around a bit and switched from the ucontext to the gthreads
coroutine implementation.  The result seems odd, but starts to make sense.

Running the workload I now get the following message from qemu:

Co-routine re-entered recursively

and the gdb backtrace looks like:

(gdb) bt
#0  0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f2fff372680 in *__GI_abort () at abort.c:92
#2  0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
    at qemu-coroutine.c:53
#3  0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
    at qemu-coroutine-lock.c:43
#4  0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
#5  0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
    at main-loop.c:472
#6  0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
#7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at /home/hch/work/qemu/vl.c:3479

Adding some printfs suggests this happens when add_aio_request is called
from aio_read_response, either when delaying creates or when updating
metadata, although not every time one of these cases occurs.

I've tried to understand how the recursive calling happens, but unfortunately
the whole coroutine code lacks any sort of documentation of how it should
behave or what it assumes about its callers.

> I don't have a sheepdog setup here but if there's an easy way to
> reproduce please let me know and I'll take a look.

With the small patch below applied to the sheepdog source I can reproduce
the issue on my laptop using the following setup:

for port in 7000 7001 7002; do
    mkdir -p /mnt/sheepdog/$port
    /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
    sleep 2
done

collie cluster format
collie vdi create test 20G

then start a qemu instance that uses the sheepdog volume using the
following device and drive lines:

	-drive if=none,file=sheepdog:test,cache=none,id=test \
	-device virtio-blk-pci,drive=test,id=testdev \

finally, in the guest run:

	dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct



