[Sheepdog] [PATCH 1/2] sheep: optimize sheep recovery logic

Fri Nov 25 07:59:12 CET 2011

At Fri, 25 Nov 2011 13:59:22 +0800,
Liu Yuan wrote:
> 
> On 11/25/2011 01:55 PM, Liu Yuan wrote:
> 
> > On 11/25/2011 01:04 AM, MORITA Kazutaka wrote:
> > 
> >> At Thu, 24 Nov 2011 20:03:17 +0800,
> >> Liu Yuan wrote:
> >>>
> >>> From: Liu Yuan <tailai.ly at taobao.com>
> >>>
> >>> We don't need to iterate from epoch 1 to hdr->tgt_epoch, since when the
> >>> node is recovered from a view (membership) change, the current epoch objects
> >>> have all the object information that is needed for subsequent view changes.
> >>>
> >>> prev_rw_epoch is needed because we need to handle the situation below:
> >>>
> >>> init: node A, B, C.
> >>>
> >>> then D, E joined the cluster.
> >>>
> >>>               t
> >>> epoch 1    2     3
> >>>       A    A     A
> >>>       B    B     B
> >>>       C    C     C
> >>>            D     D
> >>>                  E
> >>>
> >>> at the time t:
> >>> Since nodes now recover in parallel, A might have recovered fully
> >>> while B and C have not.
> >>>
> >>>    prev_rw_epoch  recovered_epoch
> >>> A      1                3
> >>> B      1                1
> >>> C      1                1
> >>>
> >>> Then B and C need to iterate from prev_rw_epoch to hdr->tgt_epoch, instead
> >>> of from recovered_epoch to hdr->tgt_epoch, to get the needed object list
> >>> information.
> >>
> >> This is not correct.  Note that new nodes can be added before
> >> recovery finishes on all nodes.
> >>
> >> For example:
> >>
> >>  1. There is only one node A at epoch 1.  Node A has one object 'obj'.
> >>
> >>        prev_rw_epoch  recovered_epoch  epoch
> >>     A      -                -            1
> >>
> >>  2. Node B joins, and the node storing 'obj' changes to node B.  Node A
> >>     finishes recovery, but node B has not yet.
> >>
> >>        prev_rw_epoch  recovered_epoch  epoch
> >>     A      -                2            2
> >>     B      -                -            2
> >>
> >>  3. Node C joins, and node A soon finishes recovery at epoch 3, but
> >>     node B has not yet finished recovery at epoch 2.
> > 
> > 
> > I doubt this happens in practice. In this case, A recovers successfully
> > twice while B does not complete even a single recovery.

This is a timing problem and can easily happen.  Node A will finish
object recovery quickly in the above case because it has no objects at
epochs 2 and 3.  Imagine there are 1000 objects node B needs to read
from node A at epoch 1, but there are no objects node A needs to have at
epochs 2 and 3.

The above example looks unnatural, but a similar situation can happen as
long as nodes can be added while some nodes are still recovering and
others are not.
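
The timing can be sketched with a toy model.  This is only an
illustration, not sheepdog code: the function name `recovered_epoch`,
the `work` and `epoch_start` tables, and the fetch rate are all
hypothetical, and it assumes a node recovers epochs one after another
at a fixed number of objects per tick.

```python
# Toy model of per-node recovery timing (hypothetical, not sheepdog code).

def recovered_epoch(work, epoch_start, now, fetch_per_tick=10):
    """Return the latest epoch fully recovered by tick `now`, or None.

    work:        {epoch: number of objects the node must fetch for that epoch}
    epoch_start: {epoch: tick at which that epoch began}
    """
    done = None
    t = 0
    for epoch in sorted(work):
        # Recovery for an epoch cannot begin before the epoch itself begins.
        t = max(t, epoch_start[epoch])
        # Ticks needed to fetch the objects, rounded up.
        t += -(-work[epoch] // fetch_per_tick)
        if t > now:
            break
        done = epoch
    return done


# Assumed timeline: B joins at tick 0 (epoch 2), C joins at tick 10 (epoch 3).
epoch_start = {2: 0, 3: 10}
work_a = {2: 0, 3: 0}      # node A has nothing to fetch at epochs 2 and 3
work_b = {2: 1000, 3: 0}   # node B must read 1000 objects from node A

print(recovered_epoch(work_a, epoch_start, now=10))  # node A
print(recovered_epoch(work_b, epoch_start, now=10))  # node B
```

With these assumed numbers, node A has recovered up to epoch 3 at tick
10, while node B has not completed even epoch 2.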

> > 
> > If this happens for real, I think we do need to have some recovery
> > information syncing between nodes.

That looks like a good starting point if we can implement it simply.
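
As a rough sketch of what such syncing could look like, assume each
node reports the last epoch it fully recovered (or nothing if it has
never finished), and recovery then starts iterating object lists from
the oldest epoch any member may still depend on.  The function name
`start_epoch` and the report format are hypothetical; this is not an
existing sheepdog interface.

```python
# Hypothetical sketch: pick the epoch to start iterating object lists from,
# given each node's reported recovered_epoch (None = never finished recovery).

def start_epoch(reports, first_epoch=1):
    """Return the oldest epoch whose object lists may still be needed."""
    if not reports:
        return first_epoch
    # A node that never finished recovery may still need data from first_epoch.
    return min(first_epoch if e is None else e for e in reports.values())


# State from step 3 of the example: A recovered epoch 3, B recovered nothing,
# and C has just joined.
reports = {"A": 3, "B": None, "C": None}
print(start_epoch(reports))
```

In this state the sketch falls back to epoch 1, because node B may
still need objects that only exist in node A's epoch 1 store.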

Thanks,

Kazutaka

> 
> 
> To be more precise, when node C joins as you describe, B has already
> read the objects of epoch 1 from A, I think.