[sheepdog] [PATCH] sheep: fix oid scheduling in recovery

Liu Yuan namei.unix at gmail.com
Tue Jun 5 13:17:37 CEST 2012


On 06/05/2012 07:14 PM, Liu Yuan wrote:

> Also, block/sheepdog.c in QEMU has a fatal race, which causes requests
> to be discarded by QEMU, or QEMU to segfault, under a high rate of
> bursting requests.


More info about this problem:

  It is highly reproducible:
  1) start sheep with async flush
  2) start qemu with cache=writeback
  3) install a new OS from iso (I installed RHEL 6)

The problem is that qemu prints an error or even segfaults, but when I
attach gdb the problem goes away, so I think it is a race.

For example:

qemu-system-x86_64: cannot find aio_req 76e


From sheep.log (I have patched sheep as follows):
diff --git a/sheep/sdnet.c b/sheep/sdnet.c
index 74d42f9..25242d9 100644
--- a/sheep/sdnet.c
+++ b/sheep/sdnet.c
@@ -502,6 +502,7 @@ static void init_tx_hdr(struct client_info *ci)

        rsp->epoch = sys->epoch;
        rsp->opcode = req->rq.opcode;
+       dprintf("0x%x\n", req->rq.id);
        rsp->id = req->rq.id;
 }

...
Apr 18 00:28:21 queue_request(275) 1
Apr 18 00:28:21 queue_request(275) 1
Apr 18 00:28:21 init_tx_hdr(505) 0x775
Apr 18 00:28:21 do_io_request(923) 1, 7c2b25000002ba , 1
Apr 18 00:28:21 init_tx_hdr(505) 0x76e
Apr 18 00:28:21 object_cache_rw(319) 000002ba, len 2424832, off 0
Apr 18 00:28:21 do_io_request(923) 1, 7c2b25000002b8 , 1
Apr 18 00:28:21 init_tx_hdr(505) 0x76e
Apr 18 00:28:21 object_cache_rw(319) 000002b8, len 1142784, off 3051520
...

We can see that the same aio_req->id value (0x76e) shows up twice, so
the second response causes qemu to print the error above.
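
To make the failure mode concrete, here is a minimal sketch of an
id-keyed table of in-flight requests. It is not the code in
block/sheepdog.c; struct inflight_table, inflight_add() and
inflight_complete() are hypothetical names made up for illustration. The
only assumption is that a response is matched to its request by id and
the request is retired once matched, so a second response carrying the
same id has nothing left to match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8

/* Hypothetical in-flight table; NOT QEMU's actual data structure. */
struct inflight_table {
    uint32_t ids[MAX_INFLIGHT];
    bool     used[MAX_INFLIGHT];
};

/* Record a request id as outstanding. Returns false if the table is full. */
static bool inflight_add(struct inflight_table *t, uint32_t id)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!t->used[i]) {
            t->used[i] = true;
            t->ids[i] = id;
            return true;
        }
    }
    return false;
}

/*
 * Handle a response: find the outstanding request with this id and
 * retire it. Returns false when no outstanding request has this id,
 * which is the situation a "cannot find aio_req" style message reports.
 */
static bool inflight_complete(struct inflight_table *t, uint32_t id)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (t->used[i] && t->ids[i] == id) {
            t->used[i] = false;
            return true;
        }
    }
    return false;
}
```

With ids 0x775 and 0x76e added once each, completing 0x775 and then
0x76e succeeds, while a second completion for 0x76e fails the lookup.
That matches the log pattern above, where 0x76e is sent back twice.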

IIUC, there should be no race on aioreq_seq_num (in
block/sheepdog.c)... but it seems there IS one...


