[sheepdog] read/write during recovery

Tue Jul 24 09:27:39 CEST 2012

> >> What do you mean by 'reject'? We can't simply return EIO to the guest;
> >> that is why we have wait queues, which re-queue the requests once
> >> some conditions are met.
> >
> > That was the question - Why can't we reject? We already do:
> >
> > 		if (is_recovery_init()) {
> > 			req->rp.result = SD_RES_OBJ_RECOVERING;
> >
> > so we already rely on gateway retry?
> 
> Yes, gateway is supposed to retry.
> 
> >
> > All we need to do is log write requests and apply them later, after
> > the object is recovered. IMHO, that would be much simpler than the current code.
> >
> >> Basically, there are two mechanisms: 1) use wait queues to retry when
> >> the targeted object is being migrated/recovered 2) schedule objects that
> >> are being requested with higher priority than those that aren't.
> >
> > My suggestion is to use a write journal for write during recovery. So
> > writes simply succeed and there is no need for queue/schedule code.
> >
> 
> Where do you store the write journal? If writes simply succeed with the
> journal stored on only one node, then what do you do if that node goes
> down permanently? We'll lose the data for sure, and this case is non-recoverable.

There is always at least one node with the actual data (otherwise you can't recover anyway), right?
Writes succeed at that node. This is also the node you recover the data from.

So let's do it by example. Say we have nodes A, B and C, with copies = 2.

The object 'obj' is stored on A and B, and node B now fails.

Nodes A and C now start recovery.

Now the gateway does a read request on 'obj':

1.) read 'obj' from C fails => retry on A
2.) read 'obj' from A => success
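
As a rough sketch of that read path (all names here are hypothetical and
made up for illustration, this is not actual sheepdog code): the gateway
tries each target in order and falls through to the next one when a node
does not have the object yet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, NOT sheepdog's real SD_RES_* values. */
enum read_result { READ_OK, READ_RECOVERING, READ_NO_COPY };

/* Hypothetical stand-in for one node's view of 'obj'. */
struct replica {
	int has_obj;	/* 0 while the object is still being recovered */
	int value;	/* stand-in for the object's contents */
};

/* A node without the object rejects the read instead of queueing it. */
static enum read_result replica_read(const struct replica *r, int *out)
{
	if (!r->has_obj)
		return READ_RECOVERING;
	*out = r->value;
	return READ_OK;
}

/* Gateway side: try each target in order, retry on the next one on failure. */
static enum read_result gateway_read(const struct replica *const *targets,
				     size_t nr, int *out)
{
	for (size_t i = 0; i < nr; i++) {
		if (replica_read(targets[i], out) == READ_OK)
			return READ_OK;
	}
	return READ_NO_COPY;
}
```

With targets ordered { C, A }, the read on C is rejected and the retry on A
returns the data, as in steps 1.) and 2.) above.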

When the gateway writes 'obj':

1.) write 'obj' to C => write a journal entry, succeed
2.) write 'obj' to A => succeed

Then node C recovers 'obj' from node A. After that, node C needs
to apply the above write journal to 'obj', just to make sure that
we do not lose any writes.
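
To make the idea concrete, here is a minimal in-memory sketch. Everything
in it (struct node, node_write, node_recover, the tiny fixed sizes) is
hypothetical and only illustrates the proposal; a real journal would have
to live on disk and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE	64	/* tiny object size, for illustration only */
#define MAX_JOURNAL	16	/* arbitrary cap on pending journal entries */

/* One logged write: where it goes and what it contains. */
struct journal_entry {
	uint32_t offset;
	uint32_t len;
	uint8_t data[OBJ_SIZE];
};

/* Hypothetical per-node state for a single object. */
struct node {
	int has_obj;		/* 1 once the object is present locally */
	uint8_t obj[OBJ_SIZE];
	struct journal_entry journal[MAX_JOURNAL];
	size_t nr_journal;
};

/*
 * Write to a node. If the object is present, apply the write directly;
 * otherwise log it in the journal. Returns 0 on success, -1 on error.
 */
static int node_write(struct node *n, uint32_t off, const void *buf,
		      uint32_t len)
{
	if (off > OBJ_SIZE || len > OBJ_SIZE - off)
		return -1;
	if (n->has_obj) {
		memcpy(n->obj + off, buf, len);
		return 0;
	}
	if (n->nr_journal == MAX_JOURNAL)
		return -1;
	struct journal_entry *e = &n->journal[n->nr_journal++];
	e->offset = off;
	e->len = len;
	memcpy(e->data, buf, len);
	return 0;
}

/* Copy the object from src, then replay the journaled writes in order. */
static int node_recover(struct node *dst, const struct node *src)
{
	if (!src->has_obj)
		return -1;
	memcpy(dst->obj, src->obj, OBJ_SIZE);
	for (size_t i = 0; i < dst->nr_journal; i++) {
		const struct journal_entry *e = &dst->journal[i];
		memcpy(dst->obj + e->offset, e->data, e->len);
	}
	dst->nr_journal = 0;
	dst->has_obj = 1;
	return 0;
}
```

In the example above, node_write() on C only adds a journal entry, while the
same call on A modifies the object; node_recover(&c, &a) then copies 'obj'
and replays the journal on top of it.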

- Dietmar