[sheepdog] [PATCH] sheep: fix handling of too old epoch in check_request
levin li
levin108 at gmail.com
Thu May 31 03:59:55 CEST 2012
On 05/29/2012 05:33 PM, Christoph Hellwig wrote:
> If we hit a too old epoch in check_request (which can only happen
> when resume_pending_requests) we currently try to use io_op_done
> to decide what to do. For objects that are store locally on a node
> that also acts as a gateway this is currently fatal:
>
> (a) io_op_done tries to remove the request from a list it has never
> been added to
> (b) io_op_done completes the request despite adding it to the
> wait_rw_queue list
>
> In addition it seems that
>
> (c) io_op_done was checking for a too large epoch, not a too small one
>
> If that last one was intentional it should at least be documented in there,
> but I can't come up with a good explanation for it.
>
> Fix these issues by opencoding the action we want in check_request.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
>
> diff --git a/sheep/sdnet.c b/sheep/sdnet.c
> index 7fcff4b..7f501aa 100644
> --- a/sheep/sdnet.c
> +++ b/sheep/sdnet.c
> @@ -188,10 +188,18 @@ static int check_request(struct request *req)
> if (before(req->rq.epoch, sys->epoch)) {
> eprintf("old node version %u, %u, %x\n",
> sys->epoch, req->rq.epoch, req->rq.opcode);
> - /* ask gateway to retry. */
> +
> + /* ask gateway to retry unless this is a local object */
> + if (req->rq.flags & SD_FLAG_CMD_IO_LOCAL) {
> + req->rp.result = SD_RES_OLD_NODE_VER;
> + req->rp.epoch = sys->epoch;
> + req_done(req);
> + return -1;
> + }
> +
> req->rp.result = SD_RES_OLD_NODE_VER;
> - req->rp.epoch = sys->epoch;
> - req->work.done(&req->work);
> + list_add_tail(&req->request_list,
> + &sys->wait_rw_queue);
> return -1;
> } else if (after(req->rq.epoch, sys->epoch)) {
> eprintf("new node version %u, %u, %x\n",
Indeed, it's problem here, but I don't think it should be handled like this,
my original plan is that when a request with old epoch comes, we directly
send back a response with SD_RES_OLD_NODE_VER in check_request() to make the
sender to retry, argee this?
Note that in check_request(), for a gateway request without SD_FLAG_CMD_IO_LOCAL,
the epoch of request is always equal to system epoch, for we have set it to
system epoch in queue_request() before entering check_request().
/*
* we set epoch for non direct requests here. Note that we need to
* sample sys->epoch before passing requests to worker threads as
* it can change anytime we return to processing membership change
* events.
*/
if (!(hdr->flags & SD_FLAG_CMD_IO_LOCAL))
hdr->epoch = sys->epoch;
So, no need to check the flag IO_LOCAL here if epoch is not equal to system epoch.
My mistake is that I called req->work.done(&req->work) to post the request to
io_op_done() instead of just send back a response, so I think the problem you mentioned
can just be fixed by this:
- req->work.done(&req->work);
+ req_done(req);
thanks,
levin
More information about the sheepdog
mailing list