[Sheepdog] [PATCH 1/2] sheep: optimize sheep recovery logic

Fri Nov 25 08:19:06 CET 2011

On 11/25/2011 02:59 PM, MORITA Kazutaka wrote:

> At Fri, 25 Nov 2011 13:59:22 +0800,
> Liu Yuan wrote:
>>
>> On 11/25/2011 01:55 PM, Liu Yuan wrote:
>>
>>> On 11/25/2011 01:04 AM, MORITA Kazutaka wrote:
>>>
>>>> At Thu, 24 Nov 2011 20:03:17 +0800,
>>>> Liu Yuan wrote:
>>>>>
>>>>> From: Liu Yuan <tailai.ly at taobao.com>
>>>>>
>>>>> We don't need to iterate from epoch 1 to hdr->tgt_epoch: once a node has
>>>>> recovered from a view (membership) change, the objects of the current epoch
>>>>> hold all the object information needed for subsequent view changes.
>>>>>
>>>>> prev_rw_epoch is needed to handle the following situation:
>>>>>
>>>>> init: node A, B, C.
>>>>>
>>>>> then D, E joined the cluster.
>>>>>
>>>>>               t
>>>>> epoch 1    2     3
>>>>>       A    A     A
>>>>>       B    B     B
>>>>>       C    C     C
>>>>>            D     D
>>>>>                  E
>>>>>
>>>>> at the time t:
>>>>> Since nodes now recover in parallel, A might have recovered fully
>>>>> while B and C have not.
>>>>>
>>>>>    pre_rw_epoch  recovered_epoch
>>>>> A      1               3
>>>>> B      1               1
>>>>> C      1               1
>>>>>
>>>>> Then B, C need to iterate from pre_rw_epoch to hdr->tgt_epoch, instead of from
>>>>> recovered_epoch to hdr->tgt_epoch, to get the needed object list information.
>>>>
>>>> This is not correct.  Note that new nodes can be added before
>>>> finishing recovery on all nodes.
>>>>
>>>> For example:
>>>>
>>>>  1. There is only one node A at epoch 1.  Node A has one object 'obj'.
>>>>
>>>>        pre_rw_epoch  recovered_epoch  epoch
>>>>     A      -               -            1
>>>>
>>>>  2. Node B joins, and 'obj' is now to be stored on node B.  Node A
>>>>     finishes recovery, but node B has not yet.
>>>>
>>>>        pre_rw_epoch  recovered_epoch  epoch
>>>>     A      -               2            2
>>>>     B      -               -            2
>>>>
>>>>  3. Node C joins, and node A soon finishes recovery at epoch 3, but
>>>>     node B has not yet finished recovery at epoch 2.
>>>
>>>
>>> I doubt this happens in practice. In this case, A completes recovery
>>> twice while B does not complete even a single recovery.
> 
> This is a timing problem and can easily happen.  NodeA will finish
> object recovery quickly in the above case because it has no object at
> epoch 2 and 3.  Imagine there are 1000 objects nodeB needs to read
> from nodeA at epoch 1, but there is no object nodeA needs to have at
> epoch 2 and 3.
> 

The current implementation can handle this case. There is a lock that
protects it from being removed while objects are read by get_obj_list().
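
For reference, the epoch range proposed in the patch description can be
sketched roughly as below. This is only an illustration, not the actual sheep
code: scan_epochs(), collect_fn and record_epoch() are hypothetical names, and
the fallback to epoch 1 when prev_rw_epoch is unset is an assumption.

```c
#include <stdint.h>

/* Hypothetical callback: collect the object list kept for one epoch. */
typedef void (*collect_fn)(uint32_t epoch, void *arg);

/*
 * Visit every epoch in [prev_rw_epoch, tgt_epoch] rather than
 * [1, tgt_epoch].  Returns the number of epochs visited.
 * Assumption: prev_rw_epoch == 0 means "not set", so start from epoch 1.
 */
int scan_epochs(uint32_t prev_rw_epoch, uint32_t tgt_epoch,
		collect_fn collect, void *arg)
{
	uint32_t start = prev_rw_epoch ? prev_rw_epoch : 1;
	int visited = 0;

	for (uint32_t epoch = start; epoch <= tgt_epoch; epoch++) {
		collect(epoch, arg);
		visited++;
	}
	return visited;
}

/* Example callback: add up the epoch numbers it is called with. */
void record_epoch(uint32_t epoch, void *arg)
{
	*(uint32_t *)arg += epoch;
}
```

With the values from the first table (prev_rw_epoch 1 on B and C, target
epoch 3), scan_epochs(1, 3, record_epoch, &sum) visits epochs 1, 2 and 3.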

Thanks,
Yuan