[Sheepdog] [PATCH 1/2] sheep: optimize sheep recovery logic

Liu Yuan namei.unix at gmail.com
Fri Nov 25 06:59:22 CET 2011


On 11/25/2011 01:55 PM, Liu Yuan wrote:

> On 11/25/2011 01:04 AM, MORITA Kazutaka wrote:
> 
>> At Thu, 24 Nov 2011 20:03:17 +0800,
>> Liu Yuan wrote:
>>>
>>> From: Liu Yuan <tailai.ly at taobao.com>
>>>
>>> We don't need to iterate from epoch 1 to hdr->tgt_epoch, since when the
>>> node has recovered from a view (membership) change, the current epoch objects
>>> have all the object information needed for subsequent view changes.
>>>
>>> prev_rw_epoch is needed because we have to handle the situation below:
>>>
>>> init: node A, B, C.
>>>
>>> then D, E joined the cluster.
>>>
>>>               t
>>> epoch 1    2     3
>>>       A    A     A
>>>       B    B     B
>>>       C    C     C
>>>            D     D
>>>                  E
>>>
>>> at the time t:
>>> Since nodes now recover in parallel, A might have recovered fully
>>> while B and C have not.
>>>
>>>    pre_rw_epoch  recovered_epoch
>>> A      1               3
>>> B      1               1
>>> C      1               1
>>>
>>> Then B, C need to iterate from pre_rw_epoch to hdr->tgt_epoch, instead of from
>>> recovered_epoch to hdr->tgt_epoch, to get the needed object list information.
>>
>> This is not correct.  Note that new nodes can be added before
>> finishing recovery on all nodes.
>>
>> For example:
>>
>>  1. There is only one node A at epoch 1.  Node A has one object 'obj'.
>>
>>        pre_rw_epoch  recovered_epoch  epoch
>>     A      -               -            1
>>
>>  2. Node B joins, and the store of 'obj' changes to node B.  Node A
>>     finishes recovery, but node B does not yet.
>>
>>        pre_rw_epoch  recovered_epoch  epoch
>>     A      -               2            2
>>     B      -               -            2
>>
>>  3. Node C joins, and node A finishes recovery at epoch 3 soon, but
>>     node B does not finish recovery at epoch 2 yet.
> 
> 
> I doubt this happens in practice. In this case, A recovers successfully
> twice while B does not complete even a single recovery.
> 
> If it does happen, I think we need some way to sync recovery
> information between nodes.


To be more precise, when node C joins as you describe, I think B has already
read the objects of epoch 1 from A.

Thanks,
Yuan


