[sheepdog] read/write during recovery

Wed Jul 25 13:52:54 CEST 2012

At Wed, 25 Jul 2012 08:01:56 +0000,
Dietmar Maurer wrote:
> 
> > > >> My naïve patch looks like this (can be optimized further):
> > >
> > > >IIUC, your patch does not handle write requests because write
> > > >journaling is not implemented yet, yes?  I think it is not easy to
> > > >implement journaling across nodes.  Do you have any ideas to
> > > >implement it simply?
> > >
> > > > The idea is to simply discard those write requests. We can do that,
> > > because there is at least one node which has data locally, and that
> > > node applies all writes (we sync data from that node later).
> > 
> > How do you handle the following case?
> > 
> >  1. There are two nodes, A and B (redundancy level is 2), and each node
> >     has one object.
> >  2. Node C joins Sheepdog, and new placement of the object becomes
> >     node B and C.
> >  3. A VM writes data to the object, and node B completes the request
> >     but node C rejects it since recovery is not started.
> >  4. Node B crashes before node C gets the updated data from node B,
> >     and then the written data will be lost even though only one node
> >     fails.  In addition, the VM can read the old object after the
> >     failure, which breaks the block device semantics.
> 
> Sure, if all nodes with actual data crash you have a problem. So sheepdog
> tries to store data ASAP to make that unlikely? I guess I got it now ;-)
> 
> But using a journal for writes (during recovery) is still a good idea, because
> 
> - no delays on write when in recovery mode
> - use less memory
> 
> what do you think?

I like the idea, but I think it makes recovery more complex.

For example:

 1. There are two nodes, A and B.
 2. Node C joins Sheepdog, and journal data is written on node C until
    it finishes recovery.
 3. If node D joins Sheepdog before node C finishes recovery, node D
    reads actual data from nodes A and B, and journal data from node C.
    At the same time, node C also needs to write journal data locally
    to handle write requests.
 4. If node E joins Sheepdog before nodes C and D finish recovery, node
    E needs to read journal data from nodes C and D.  Node E needs to
    know which journal is newer to apply the journals in the correct
    order.
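
The ordering requirement in step 4 can be sketched as follows.  This is
only an illustration, not Sheepdog code: it assumes every journal entry
carries a hypothetical (epoch, seq) tag that is comparable across
nodes, which is exactly the part that is hard to provide in practice.
JournalEntry and replay() are names made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class JournalEntry:
    epoch: int    # assumption: cluster epoch when the write was journaled
    seq: int      # assumption: per-object write counter within that epoch
    offset: int   # byte offset of the write inside the object
    data: bytes   # payload of the write


def replay(base: bytes, journals: Iterable[List[JournalEntry]]) -> bytes:
    """Apply journals collected from several nodes to a recovered object."""
    obj = bytearray(base)
    # The same write may be journaled on more than one node, so
    # deduplicate first, then order by (epoch, seq).
    entries = sorted({e for j in journals for e in j},
                     key=lambda e: (e.epoch, e.seq))
    for e in entries:
        end = e.offset + len(e.data)
        if end > len(obj):
            # Grow the object with zero bytes if a write extends past it.
            obj.extend(b"\0" * (end - len(obj)))
        obj[e.offset:end] = e.data
    return bytes(obj)


# Node E reads the object from an up-to-date node and journals from C and D.
journal_c = [JournalEntry(2, 1, 0, b"AAAA"), JournalEntry(3, 1, 2, b"CC")]
journal_d = [JournalEntry(3, 1, 2, b"CC"), JournalEntry(3, 2, 0, b"D")]
result = replay(b"oldoldol", [journal_d, journal_c])
```

With such a tag the result does not depend on which journal node E
reads first.  Without it, node E cannot tell whether the entry from
node C or the one from node D should win.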

The situation becomes more complex if we have more nodes.  Do you have
any simple ideas for handling node failure with journal data?

Thanks,

Kazutaka