[Sheepdog] [Qemu-devel] coroutine bug?, was Re: [PATCH] sheepdog: use coroutines
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Fri Jan 6 12:16:16 CET 2012
At Tue, 3 Jan 2012 08:13:51 +0000,
Stefan Hajnoczi wrote:
>
> On Mon, Jan 02, 2012 at 10:38:11PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Jan 2, 2012 at 3:39 PM, Christoph Hellwig <hch at lst.de> wrote:
> > > On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> > >> If you can reproduce this bug and suspect coroutines are involved then I
> > >
> > > It's entirely reproducible.
> > >
> > > I've played around a bit and switched from the ucontext to the gthreads
> > > coroutine implementation. The result seems odd, but starts to make sense.
> > >
> > > Running the workload I now get the following message from qemu:
> > >
> > > Co-routine re-entered recursively
> > >
> > > and the gdb backtrace looks like:
> > >
> > > (gdb) bt
> > > #0 0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
> > > at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> > > #1 0x00007f2fff372680 in *__GI_abort () at abort.c:92
> > > #2 0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
> > > at qemu-coroutine.c:53
> > > #3 0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
> > > at qemu-coroutine-lock.c:43
> > > #4 0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
> > > #5 0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
> > > at main-loop.c:472
> > > #6 0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
> > > #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> > > at /home/hch/work/qemu/vl.c:3479
> > >
> > > Adding some printks suggests this happens when add_aio_request is called from
> > > aio_read_response, either when delaying creates or when updating metadata,
> > > although not every time one of these cases happens.
> > >
> > > I've tried to understand how the recursive calling happens, but unfortunately
> > > the whole coroutine code lacks any sort of documentation on how it should
> > > behave or what it expects from its callers.
> > >
> > >> I don't have a sheepdog setup here but if there's an easy way to
> > >> reproduce please let me know and I'll take a look.
> > >
> > > With the small patch below applied to the sheepdog source I can reproduce
> > > the issue on my laptop using the following setup:
> > >
> > > for port in 7000 7001 7002; do
> > >     mkdir -p /mnt/sheepdog/$port
> > >     /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
> > >     sleep 2
> > > done
> > >
> > > collie cluster format
> > > collie vdi create test 20G
> > >
> > > then start a qemu instance that uses the sheepdog volume with the
> > > following drive and device lines:
> > >
> > > -drive if=none,file=sheepdog:test,cache=none,id=test \
> > > -device virtio-blk-pci,drive=test,id=testdev \
> > >
> > > finally, in the guest run:
> > >
> > > dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct
> >
> > Thanks for these instructions. I can reproduce the issue here.
> >
> > It seems suspicious the way that BDRVSheepdogState->co_recv and
> > ->co_send work. The code adds select(2) read/write callback functions
> > on the sheepdog socket file descriptor. When the socket becomes
> > writeable or readable the co_send or co_recv coroutines are entered.
> > So far, so good, this is how a coroutine is integrated into the main
> > loop of QEMU.
> >
> > The problem is that this patch is mixing things. The co_recv
> > coroutine runs aio_read_response(), which invokes send_pending_req().
> > send_pending_req() invokes add_aio_request(). That function isn't
> > suitable for co_recv's context because it actually sends data and hits
> > a few blocking (yield) points. It takes a coroutine mutex - but the
> > select(2) read callback is still in place. We're now still in the
> > aio_read_response() call chain except we're actually not reading at
> > all, we're trying to write! And we'll get spurious wakeups if there
> > is any data readable on the socket.
> >
> > So the co_recv coroutine has two things in the system that will try to enter it:
> > 1. The select(2) read callback on the sheepdog socket.
> > 2. The add_aio_request() blocking operations, including a coroutine mutex.
> >
> > This is bad: a yielded coroutine should have only one thing that will
> > enter it. It's rare that it makes sense to have multiple things
> > entering a coroutine.
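(For reference, the first of those two wakeup sources looks roughly like the
sketch below. The handler name is made up for illustration; this is not the
actual block/sheepdog.c code, just the shape of a select(2) read callback
that unconditionally re-enters co_recv.)

/* Hypothetical read callback registered on the sheepdog socket.  It is
 * called whenever the fd is readable and re-enters co_recv, even if
 * co_recv last yielded somewhere inside add_aio_request(), e.g. while
 * waiting on the coroutine mutex - a spurious wakeup. */
static void co_read_response(void *opaque)
{
    BDRVSheepdogState *s = opaque;

    qemu_coroutine_enter(s->co_recv, NULL);
}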
> >
> > It's late here but I hope this gives Kazutaka some thoughts on what is
> > causing the issue with this patch.
>
> Poked around some more this morning. Here is a patch that isolates the
> bug:
>
> ---8<---
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index 26ad76b..0d7a03f 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -59,6 +59,16 @@ void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>      QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>      qemu_coroutine_yield();
>      assert(qemu_in_coroutine());
> +    {
> +        Coroutine *co;
> +
> +        QTAILQ_FOREACH(co, &queue->entries, co_queue_next) {
> +            assert(co != self);
> +        }
> +        QTAILQ_FOREACH(co, &unlock_bh_queue, co_queue_next) {
> +            assert(co != self);
> +        }
> +    }
>  }
> 
>  void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue)
> ---8<---
>
> When you run this with ucontext or GThread coroutines you hit this
> assertion failure:
>
> qemu-system-x86_64: qemu-coroutine-lock.c:66: qemu_co_queue_wait: Assertion `co != self' failed.
>
> Here is an explanation for what the asserts are checking and how we
> violate the constraint:
>
> qemu_co_queue_next() and qemu_co_queue_next_bh() take a waiting
> Coroutine off the wait queue and put it onto the unlock bh queue. When
> the BH executes the coroutines from the unlock bh queue are popped off
> that queue and entered. This is how coroutine wait queues work, and
> coroutine mutexes are built on coroutine wait queues.
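(A simplified sketch of that hand-off, to make the explanation above concrete;
this is not the verbatim qemu-coroutine-lock.c code, just the mechanism it
describes:)

static QTAILQ_HEAD(, Coroutine) unlock_bh_queue =
    QTAILQ_HEAD_INITIALIZER(unlock_bh_queue);
static QEMUBH *unlock_bh;

bool qemu_co_queue_next(CoQueue *queue)
{
    Coroutine *next = QTAILQ_FIRST(&queue->entries);

    if (next) {
        /* move the waiter from the wait queue to the unlock BH queue */
        QTAILQ_REMOVE(&queue->entries, next, co_queue_next);
        QTAILQ_INSERT_TAIL(&unlock_bh_queue, next, co_queue_next);
        qemu_bh_schedule(unlock_bh);
    }
    return next != NULL;
}

/* runs later from the main loop (frames #3/#4 in the backtrace above) */
static void qemu_co_queue_next_bh(void *opaque)
{
    Coroutine *next;

    while ((next = QTAILQ_FIRST(&unlock_bh_queue))) {
        QTAILQ_REMOVE(&unlock_bh_queue, next, co_queue_next);
        /* if "next" was woken up spuriously in the meantime and has
         * already returned, this enters a freed coroutine */
        qemu_coroutine_enter(next, NULL);
    }
}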
>
> The problem is that, due to the spurious wakeups introduced in the
> sheepdog patch, we acquire the mutex earlier than normal while a BH
> unlock is still pending. Our coroutine can then terminate by returning
> from sheepdog.c:aio_read_response() while the BH is still pending. When
> we get around to executing the BH we operate on a freed
> coroutine - BOOM!
Thanks for your detailed explanation. I confirmed the following hack
fixes this problem.
---8<---
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 17a79be..b27c770 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -626,6 +626,9 @@ static void coroutine_fn aio_read_response(void *opaque)
     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
+        /* this coroutine context is no longer suitable for co_recv
+         * because we may send data to update vdi objects */
+        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
---8<---
>
> The reason we get different failure behavior with ucontext and
> GThread is that they have different implementations and reuse
> coroutines differently. We've still done something illegal but the
> undefined behavior happens to be different - both ucontext and GThread
> are working correctly; the problem is in the sheepdog patch.
>
> I suggest implementing coroutine socket I/O functions for both send
> *and* receive. Then sheepdog request processing becomes simple - we'll
> have less callback and coroutine trickery because it's possible to write
> synchronous coroutine code.
>
> Right now the code is a mix of callbacks and some coroutine code (which
> has the flaw I explained above). Things will be much easier if the code
> is converted fully to coroutines.
Yes, the original patch aimed to make Sheepdog I/O asynchronous with a
small change, so the current sheepdog code is not clean. It looks
like a good time to make the sheepdog driver fully coroutine-based.
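For example, with a coroutine-based receive helper the response handling
could become a plain loop. A rough sketch of such a helper (the name
sd_co_recv is made up and this is untested, just to show the idea):

/* Read up to "count" bytes from a non-blocking socket inside a
 * coroutine, yielding until the fd becomes readable again. */
static coroutine_fn ssize_t sd_co_recv(int sockfd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t ret = recv(sockfd, (char *)buf + done, count - done, 0);

        if (ret > 0) {
            done += ret;
        } else if (ret == 0) {
            return done;                /* connection closed by peer */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            qemu_coroutine_yield();     /* fd handler re-enters us */
        } else if (errno != EINTR) {
            return -errno;
        }
    }
    return done;
}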
Thanks,
Kazutaka