[Sheepdog] [PATCH 1/2] sheep: optimize sheep recovery logic
namei.unix at gmail.com
Fri Nov 25 06:55:56 CET 2011
On 11/25/2011 01:04 AM, MORITA Kazutaka wrote:
> At Thu, 24 Nov 2011 20:03:17 +0800,
> Liu Yuan wrote:
>> From: Liu Yuan <tailai.ly at taobao.com>
>> We don't need to iterate from epoch 1 to hdr->tgt_epoch, since once the
>> node has recovered from a view (membership) change, the current epoch's
>> objects carry all the object information needed for subsequent view changes.
>> prev_rw_epoch is needed because we have to handle the following situation:
>> Init: nodes A, B, C; then D and E join the cluster.
>>
>> epoch:   1   2   3
>>          A   A   A
>>          B   B   B
>>          C   C   C
>>              D   D
>> At time t: since nodes now recover in parallel, we might have A fully
>> recovered while B and C are not.
>>      prev_rw_epoch   recovered_epoch
>> A          1                3
>> B          1                1
>> C          1                1
>> Then B and C need to iterate from prev_rw_epoch to hdr->tgt_epoch, instead
>> of from recovered_epoch to hdr->tgt_epoch, to get the needed object list
>> information.
> This is not correct. Note that new nodes can be added before
> finishing recovery on all nodes.
> For example:
> 1. There is only one node A at epoch 1. Node A has one object, 'obj'.
>
>         prev_rw_epoch   recovered_epoch   epoch
>    A         -                -             1
> 2. Node B joins, and the store of 'obj' moves to node B. Node A
>    finishes recovery, but node B does not yet.
>
>         prev_rw_epoch   recovered_epoch   epoch
>    A         -                2             2
>    B         -                -             2
> 3. Node C joins, and node A finishes recovery at epoch 3 soon, but
> node B does not finish recovery at epoch 2 yet.
I doubt this happens in practice. In this case, A would recover successfully
twice while B fails to complete even a single recovery.
If this can happen in practice, I think we do need some recovery-information
syncing between nodes.