[Sheepdog] [PATCH 1/2] sheep: optimize sheep recovery logic

Fri Nov 25 06:55:56 CET 2011

On 11/25/2011 01:04 AM, MORITA Kazutaka wrote:

> At Thu, 24 Nov 2011 20:03:17 +0800,
> Liu Yuan wrote:
>>
>> From: Liu Yuan <tailai.ly at taobao.com>
>>
>> We don't need to iterate from epoch 1 to hdr->tgt_epoch, since once a
>> node has recovered from a view (membership) change, the current epoch objects
>> have all the object information needed for subsequent view changes.
>>
>> prev_rw_epoch is needed because we have to handle the situation below:
>>
>> init: node A, B, C.
>>
>> then D, E joined the cluster.
>>
>>               t
>> epoch 1    2     3
>>       A    A     A
>>       B    B     B
>>       C    C     C
>>            D     D
>>                  E
>>
>> at the time t:
>> Since nodes now recover in parallel, A might have recovered fully
>> while B and C have not.
>>
>>    pre_rw_epoch  recovered_epoch
>> A      1               3
>> B      1               1
>> C      1               1
>>
>> Then B, C need to iterate from pre_rw_epoch to hdr->tgt_epoch, instead of from
>> recovered_epoch to hdr->tgt_epoch, to get the needed object list information.
> 
> This is not correct.  Note that new nodes can be added before
> finishing recovery on all nodes.
> 
> For example:
> 
>  1. There is only one node A at epoch 1.  Node A has one object 'obj'.
> 
>        pre_rw_epoch  recovered_epoch  epoch
>     A      -               -            1
> 
>  2. Node B joins, and the node storing 'obj' changes to node B.  Node A
>     finishes recovery, but node B has not yet.
> 
>        pre_rw_epoch  recovered_epoch  epoch
>     A      -               2            2
>     B      -               -            2
> 
>  3. Node C joins, and node A soon finishes recovery at epoch 3, but
>     node B has not yet finished recovery at epoch 2.

I doubt this happens in practice. In this case, A completes recovery
twice while B does not complete even a single recovery.

If it does happen, I think we need to sync some recovery
information between nodes.
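
To make the epoch range from the patch description concrete, here is a
minimal sketch in C. It is not the code from the patch: struct node_rw_state,
first_epoch_to_scan() and epochs_to_scan() are hypothetical names, the fields
only mirror the table columns above, and falling back to epoch 1 when no
state is recorded ("-" in the tables) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical per-node recovery state. The fields mirror the table
 * columns in this thread, not the real sheep data structures.
 * A value of 0 stands for "-" (nothing recorded yet).
 */
struct node_rw_state {
	uint32_t pre_rw_epoch;     /* value from the pre_rw_epoch column */
	uint32_t recovered_epoch;  /* last epoch whose recovery finished */
};

/* First epoch whose object list is read when recovering to tgt_epoch. */
static uint32_t first_epoch_to_scan(const struct node_rw_state *s,
				    uint32_t tgt_epoch)
{
	uint32_t start;

	assert(tgt_epoch >= 1);

	/* Assumption: with no recorded state, scan from epoch 1 as before. */
	start = s->pre_rw_epoch ? s->pre_rw_epoch : 1;

	/* Never start past the target epoch. */
	return start < tgt_epoch ? start : tgt_epoch;
}

/* Number of epochs in the range [first_epoch_to_scan, tgt_epoch]. */
static uint32_t epochs_to_scan(const struct node_rw_state *s,
			       uint32_t tgt_epoch)
{
	return tgt_epoch - first_epoch_to_scan(s, tgt_epoch) + 1;
}
```

For B in the first table (pre_rw_epoch 1, target epoch 3) this gives the
range 1..3. It starts from pre_rw_epoch, not from recovered_epoch, and it
says nothing about what other nodes have finished, which is the point
under discussion.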

Thanks,
Yuan