[sheepdog] [PATCH V2 00/11] INTRODUCE

Mon Aug 20 18:00:08 CEST 2012

On 08/20/2012 11:34 PM, Yunkai Zhang wrote:
> On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
> <morita.kazutaka at lab.ntt.co.jp> wrote:
>> At Thu,  9 Aug 2012 16:43:38 +0800,
>> Yunkai Zhang wrote:
>>>
>>> From: Yunkai Zhang <qiushu.zyk at taobao.com>
>>>
>>> V2:
>>> - fix a typo
>>> - when an object is updated, delete its old version
>>> - reset cluster recovery state in finish_recovery()
>>>
>>> Yunkai Zhang (11):
>>>   sheep: enable variale-length of join_message in response of join
>>>     event
>>>   sheep: share joining nodes with newly added sheep
>>>   sheep: delay to process recovery caused by LEAVE event just like JOIN
>>>     event
>>>   sheep: don't cleanup working directory when sheep joined back
>>>   sheep: read objects only from live nodes
>>>   sheep: write objects only on live nodes
>>>   sheep: mark dirty object that belongs to the leaving nodes
>>>   sheep: send dirty object list to each sheep when cluster do recovery
>>>   sheep: do recovery with dirty object list
>>>   collie: update 'collie cluster recover info' commands
>>>   collie: update doc about 'collie cluster recover disable'
>>>
>>>  collie/cluster.c          |  46 ++++++++---
>>>  include/internal_proto.h  |  32 ++++++--
>>>  include/sheep.h           |  23 ++++++
>>>  man/collie.8              |   2 +-
>>>  sheep/cluster.h           |  29 +------
>>>  sheep/cluster/accord.c    |   2 +-
>>>  sheep/cluster/corosync.c  |   9 ++-
>>>  sheep/cluster/local.c     |   2 +-
>>>  sheep/cluster/zookeeper.c |   2 +-
>>>  sheep/farm/trunk.c        |   2 +-
>>>  sheep/gateway.c           |  39 ++++++++-
>>>  sheep/group.c             | 202 +++++++++++++++++++++++++++++++++++++++++-----
>>>  sheep/object_list_cache.c | 182 +++++++++++++++++++++++++++++++++++++++--
>>>  sheep/ops.c               |  85 ++++++++++++++++---
>>>  sheep/recovery.c          | 133 +++++++++++++++++++++++++++---
>>>  sheep/sheep_priv.h        |  57 ++++++++++++-
>>>  16 files changed, 743 insertions(+), 104 deletions(-)
>>
>> I've looked into this series, and IMHO the change is too complex.
>>
>> With this series, when recovery is disabled and some nodes have left,
>> sheep can succeed in a write operation even if the data is not fully
>> replicated.  But, if we allow it, it is difficult to prevent VMs from
>> reading old data.  Actually this series puts a lot of effort into it.
> 
> We want to upgrade sheepdog without impacting online VMs, so we
> need to allow all VMs to perform write operations while recovery is
> disabled (this is important for a big cluster; we can't assume users
> will stop their work during this time). We also assume that this window
> is short: we should upgrade sheepdog as quickly as possible (< 5 minutes).
> 

Upgrading a cluster without stopping service is a nice feature, but I'm afraid Sheepdog
won't meet this expectation in the near future: development is moving fast and is likely
to break inter-sheep assumptions. At the very least, we should have the following in place
before claiming to support online upgrades:
 1) inter-sheep protocol compatibility check logic
 2) a relatively stable feature set and internal physical state (such as the config file)

That is, it is too early to talk about online upgrades for now.
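
To make point 1) a bit more concrete, here is a minimal sketch of what such a check
could look like. Everything in it is an assumption for illustration: struct proto_range
and proto_negotiate() are hypothetical names, not existing sheepdog code. The idea is
that each sheep advertises the range of inter-sheep protocol versions it can speak, and
a join is accepted only if the ranges overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: range of inter-sheep protocol versions one binary can speak. */
struct proto_range {
	uint8_t min_ver;
	uint8_t max_ver;
};

/*
 * Return true and store the highest version both sides support in *agreed
 * when the two ranges overlap; return false (leaving *agreed untouched)
 * when they do not, in which case the join should be rejected.
 */
bool proto_negotiate(const struct proto_range *local,
		     const struct proto_range *remote,
		     uint8_t *agreed)
{
	uint8_t lo, hi;

	assert(local->min_ver <= local->max_ver);
	assert(remote->min_ver <= remote->max_ver);

	/* The overlap is [max of the minimums, min of the maximums]. */
	lo = local->min_ver > remote->min_ver ? local->min_ver : remote->min_ver;
	hi = local->max_ver < remote->max_ver ? local->max_ver : remote->max_ver;

	if (lo > hi)
		return false;

	*agreed = hi;
	return true;
}
```

A rolling upgrade would then require every release to keep its min_ver low enough to
overlap with the previous release, which is exactly the stability we don't have yet.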

> This patch is implemented based on the assumptions above. It may be
> difficult, but its algorithm is clear, just three steps (taken from the
> description in the 9th patch's commit log):
> 
> 1) If a sheep joins back to the cluster, but some objects have been
>    deleted after this sheep left, such objects stay in its working
>    directory. After recovery starts, this sheep will send its object list
>    to the other sheep. So after fetching all object lists from the cluster,
>    each sheep should screen out these deleted objects.
> 
> 2) A sheep which has left and joined back should drop the old version
>    objects and recover the new ones from other sheep.
> 
> 3) The objects which have been updated should not be recovered from a
>    joined-back sheep.
> 
>>
>> I'd suggest allowing epoch increment even when recovery is
>> disabled.  If recovery work recovers only rw->prio_oids and delays the
>> recovery of rw->oids, I think we can get a similar benefit in a much
>> simpler way:
>>   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
> 
> In fact, I have considered this method, but we would face nearly the same problem:
> 
> After a sheep joins back, it needs to know which objects are dirty, and
> must do the cleanup work (because old-version objects stay in
> its working directory). This method doesn't seem to save any steps, but
> does extra recovery work.
> 

IMHO, the suggested method won't produce objects with different versions, because we actually
increment the epoch and handle the objects in rw->prio_oids, i.e. those being requested,
exactly as we do now. So for this kind of object we can still use the current code. For
objects that are not being requested at all (which might account for the majority of objects
in a short time window), we can do a trick: delay recovering them as long as possible, so
that a subsequent node join event cancels their migration. With this trick, I think we can
a) achieve the goal nicely: minimize the overhead of unnecessary object transfer between
   stop/join events, and
b) get another nice property: we don't sacrifice redundancy for the objects that are being
   requested (lower chance of losing updates).
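
Roughly, the trick could be modelled like this. This is a toy sketch only: struct
toy_recovery and the toy_*() helpers are made up for illustration and are not the real
recovery work structure or the code in sheep/recovery.c. Requested objects go on a
priority list and are recovered at once; everything else sits on a delayed list that a
later join event can simply drop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OIDS 16

/* Toy stand-in for a recovery work item (hypothetical, not sheepdog's struct). */
struct toy_recovery {
	uint64_t prio_oids[MAX_OIDS];    /* objects with I/O requests waiting */
	size_t nr_prio;
	uint64_t delayed_oids[MAX_OIDS]; /* objects nobody has asked for yet */
	size_t nr_delayed;
	size_t nr_transferred;           /* simulated object transfers so far */
};

/* Queue an object; it is prioritized only if a request is waiting on it. */
void toy_queue(struct toy_recovery *rw, uint64_t oid, int requested)
{
	if (requested) {
		assert(rw->nr_prio < MAX_OIDS);
		rw->prio_oids[rw->nr_prio++] = oid;
	} else {
		assert(rw->nr_delayed < MAX_OIDS);
		rw->delayed_oids[rw->nr_delayed++] = oid;
	}
}

/* Recover the requested objects right away; delayed ones are left alone. */
void toy_recover_prio(struct toy_recovery *rw)
{
	rw->nr_transferred += rw->nr_prio;
	rw->nr_prio = 0;
}

/* A request arrives for a delayed object: move it to the priority list. */
int toy_promote(struct toy_recovery *rw, uint64_t oid)
{
	size_t i;

	for (i = 0; i < rw->nr_delayed; i++) {
		if (rw->delayed_oids[i] != oid)
			continue;
		/* Fill the hole with the last entry; order does not matter. */
		rw->delayed_oids[i] = rw->delayed_oids[--rw->nr_delayed];
		assert(rw->nr_prio < MAX_OIDS);
		rw->prio_oids[rw->nr_prio++] = oid;
		return 1;
	}
	return 0;
}

/* The node joined back: drop delayed migrations, return how many we saved. */
size_t toy_cancel_delayed(struct toy_recovery *rw)
{
	size_t saved = rw->nr_delayed;

	rw->nr_delayed = 0;
	return saved;
}
```

With three objects queued and only one of them requested, a join event arriving before
the delayed list is processed means only the requested object is ever transferred.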

-- 
thanks,
Yuan