[sheepdog] [PATCH] sheep: handle recovery request in check_request_in_recovery()
Christoph Hellwig
hch at infradead.org
Sat Jun 2 17:24:10 CEST 2012
On Sat, Jun 02, 2012 at 11:09:53PM +0800, Liu Yuan wrote:
> Then why old master passes this test. It seems that old master passes
> just because it wastes time on this case by trying to read non-exist
> object on one node, which gives other nodes enough time to recover the
> objects meanwhile. Kind of timing problem.
From looking at a bit more instrumentation, the problem is the following:
- the cluster only has nr_copies = 2
- one zone already went down, leaving only one copy of the object
- the sheep that has the copy stays in RW_INIT state for a while,
  so we get an SD_RES_OBJ_RECOVERING completion
> Well, I don't think we need special handling for this case, if other
> sheep can recover the targeted object, why can't this unfortunate one?
> Our recovery algorithm should assure to find the object if it exists
> either in working directory or snap cache of Farm.
The SD_RES_OBJ_RECOVERING case absolutely needs a special case - we
know the target sheep has the block, and it asks us to come back later.
Giving up trying to get a valid copy of the object is not a good idea.
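To make the special case concrete, here is a rough sketch of what "come back later" could look like. It is not the sheep code: the demo_* names, the result-code enum and the callback are all made up for illustration, with DEMO_RES_OBJ_RECOVERING standing in for SD_RES_OBJ_RECOVERING, and a real version would sleep or requeue the work between attempts instead of spinning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; placeholders, not sheepdog's actual values. */
enum demo_res {
	DEMO_RES_SUCCESS,
	DEMO_RES_NO_OBJ,
	DEMO_RES_OBJ_RECOVERING,	/* stands in for SD_RES_OBJ_RECOVERING */
};

/* Hypothetical callback that tries to read one object from one sheep. */
typedef enum demo_res (*demo_read_fn)(uint64_t oid, void *priv);

/*
 * Read oid through read_obj.  A "recovering" answer means the target has
 * the object but is not ready yet, so we ask again, up to max_retries
 * additional times.  Any other result is returned right away.
 * The number of calls made is stored in *attempts.
 */
static enum demo_res demo_read_with_retry(demo_read_fn read_obj, uint64_t oid,
					  void *priv, unsigned int max_retries,
					  unsigned int *attempts)
{
	enum demo_res ret;
	unsigned int tries = 0;

	assert(read_obj != NULL);
	assert(attempts != NULL);

	do {
		/* A real caller would wait or requeue here before retrying. */
		ret = read_obj(oid, priv);
		tries++;
	} while (ret == DEMO_RES_OBJ_RECOVERING && tries <= max_retries);

	*attempts = tries;
	return ret;
}

/*
 * Fake reader for illustration: priv points to an int counting how many
 * more times the sheep answers "recovering" before it succeeds.
 */
static enum demo_res demo_fake_read(uint64_t oid, void *priv)
{
	int *busy = priv;

	(void)oid;
	if (*busy > 0) {
		(*busy)--;
		return DEMO_RES_OBJ_RECOVERING;
	}
	return DEMO_RES_SUCCESS;
}
```

The retry bound is only there so the sketch terminates. Whether the real code should bound the retries at all, or simply requeue until the target leaves RW_INIT, is a separate question.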
What is a bit troublesome is trying to figure out how to deal with the
fact that we might have to recover multiple oids out of order. I think
the best would be to add a bitmap of recovered oids to struct
recovery_work. As a benefit, that should also allow us to perform recovery
for multiple oids in parallel and thus greatly speed it up.
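A minimal sketch of the bitmap idea, assuming one bit per scheduled oid indexed by its position in the work list. The struct and helper names below are hypothetical and are not the existing struct recovery_work layout. A parallel version would additionally need locking or atomic bit operations.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the recovery state; not the real struct. */
struct demo_recovery_work {
	size_t count;			/* number of oids scheduled */
	size_t nr_done;			/* number of bits set in recovered */
	unsigned long *recovered;	/* bit i set: oid at index i is done */
};

/* Allocate a zeroed bitmap large enough for count oids. */
static int demo_rw_init(struct demo_recovery_work *rw, size_t count)
{
	size_t words = (count + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;

	rw->recovered = calloc(words ? words : 1, sizeof(unsigned long));
	if (!rw->recovered)
		return -1;
	rw->count = count;
	rw->nr_done = 0;
	return 0;
}

static int demo_rw_is_recovered(const struct demo_recovery_work *rw, size_t idx)
{
	assert(idx < rw->count);
	return (int)((rw->recovered[idx / DEMO_BITS_PER_LONG] >>
		      (idx % DEMO_BITS_PER_LONG)) & 1UL);
}

/* Mark an oid as recovered; safe to call in any order, and idempotent. */
static void demo_rw_mark_recovered(struct demo_recovery_work *rw, size_t idx)
{
	assert(idx < rw->count);
	if (!demo_rw_is_recovered(rw, idx)) {
		rw->recovered[idx / DEMO_BITS_PER_LONG] |=
			1UL << (idx % DEMO_BITS_PER_LONG);
		rw->nr_done++;
	}
}

/* Return the first unrecovered index >= start, or rw->count if none. */
static size_t demo_rw_next_pending(const struct demo_recovery_work *rw,
				   size_t start)
{
	size_t i;

	for (i = start; i < rw->count; i++)
		if (!demo_rw_is_recovered(rw, i))
			return i;
	return rw->count;
}

static void demo_rw_free(struct demo_recovery_work *rw)
{
	free(rw->recovered);
	rw->recovered = NULL;
}
```

With this, an oid that returns a "recovering" result can be skipped and picked up on a later pass through demo_rw_next_pending(), while oids recovered out of order are simply marked as they complete.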