[sheepdog] [PATCH V2 00/11] INTRODUCE

Yunkai Zhang yunkai.me at gmail.com
Mon Aug 20 17:34:10 CEST 2012


On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
<morita.kazutaka at lab.ntt.co.jp> wrote:
> At Thu,  9 Aug 2012 16:43:38 +0800,
> Yunkai Zhang wrote:
>>
>> From: Yunkai Zhang <qiushu.zyk at taobao.com>
>>
>> V2:
>> - fix a typo
>> - when an object is updated, delete it old version
>> - reset cluster recovery state in finish_recovery()
>>
>> Yunkai Zhang (11):
>>   sheep: enable variale-length of join_message in response of join
>>     event
>>   sheep: share joining nodes with newly added sheep
>>   sheep: delay to process recovery caused by LEAVE event just like JOIN
>>     event
>>   sheep: don't cleanup working directory when sheep joined back
>>   sheep: read objects only from live nodes
>>   sheep: write objects only on live nodes
>>   sheep: mark dirty object that belongs to the leaving nodes
>>   sheep: send dirty object list to each sheep when cluster do recovery
>>   sheep: do recovery with dirty object list
>>   collie: update 'collie cluster recover info' commands
>>   collie: update doc about 'collie cluster recover disable'
>>
>>  collie/cluster.c          |  46 ++++++++---
>>  include/internal_proto.h  |  32 ++++++--
>>  include/sheep.h           |  23 ++++++
>>  man/collie.8              |   2 +-
>>  sheep/cluster.h           |  29 +------
>>  sheep/cluster/accord.c    |   2 +-
>>  sheep/cluster/corosync.c  |   9 ++-
>>  sheep/cluster/local.c     |   2 +-
>>  sheep/cluster/zookeeper.c |   2 +-
>>  sheep/farm/trunk.c        |   2 +-
>>  sheep/gateway.c           |  39 ++++++++-
>>  sheep/group.c             | 202 +++++++++++++++++++++++++++++++++++++++++-----
>>  sheep/object_list_cache.c | 182 +++++++++++++++++++++++++++++++++++++++--
>>  sheep/ops.c               |  85 ++++++++++++++++---
>>  sheep/recovery.c          | 133 +++++++++++++++++++++++++++---
>>  sheep/sheep_priv.h        |  57 ++++++++++++-
>>  16 files changed, 743 insertions(+), 104 deletions(-)
>
> I've looked into this series, and IMHO the change is too complex.
>
> With this series, when recovery is disabled and there are left nodes,
> sheep can succeed in a write operation even if the data is not fully
> replicated.  But, if we allow it, it is difficult to prevent VMs from
> reading old data.  Actually this series put a lot of effort into it.

We want to upgrade sheepdog without impacting any online VMs, so we
need to allow all VMs to perform write operations while recovery is
disabled. (This is important for a big cluster; we can't assume users
will stop their work during this time.) We also assume that this window
is short: the upgrade should finish as soon as possible (< 5 minutes).

This patch series is built on the assumptions above. It may be
difficult, but its algorithm is clear, just three steps (taken from the
description in the 9th patch's commit log):

1) If a sheep joins back to the cluster, objects that were deleted after it
   left still remain in its working directory. When recovery starts, this
   sheep sends its object list to the other sheep, so after fetching all
   object lists from the cluster, each sheep should screen out these
   deleted objects.

2) A sheep that left and joined back should drop its old-version
   objects and recover the new ones from other sheep.

3) Objects that have been updated should not be recovered from a sheep
   that joined back.
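
To make the three steps concrete, here is a minimal sketch in C. It is
not the code from the patches: struct dirty_entry, enum dirty_kind and
all function names below are hypothetical, and a flat array with a
linear search stands in for whatever the dirty object list really uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: why an object was marked dirty while a sheep was away. */
enum dirty_kind { DIRTY_DELETED, DIRTY_UPDATED };

/* Hypothetical dirty-list entry; not the structure used in the patches. */
struct dirty_entry {
	uint64_t oid;
	enum dirty_kind kind;
};

/* Linear search: is oid recorded in the dirty list with the given kind? */
static bool is_dirty(const struct dirty_entry *dirty, size_t nr_dirty,
		     uint64_t oid, enum dirty_kind kind)
{
	for (size_t i = 0; i < nr_dirty; i++)
		if (dirty[i].oid == oid && dirty[i].kind == kind)
			return true;
	return false;
}

/*
 * Step 1: screen deleted objects out of the merged object list.
 * Compacts oids[] in place and returns the number of entries kept.
 */
size_t screen_deleted(uint64_t *oids, size_t nr_oids,
		      const struct dirty_entry *dirty, size_t nr_dirty)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr_oids; i++)
		if (!is_dirty(dirty, nr_dirty, oids[i], DIRTY_DELETED))
			oids[kept++] = oids[i];
	return kept;
}

/*
 * Step 2: a sheep that joined back must drop its local copy of any
 * object that was updated while it was away, then recover the new one.
 */
bool must_drop_local_copy(uint64_t oid,
			  const struct dirty_entry *dirty, size_t nr_dirty)
{
	return is_dirty(dirty, nr_dirty, oid, DIRTY_UPDATED);
}

/*
 * Step 3: an updated object must not be recovered from a sheep that
 * joined back, because that sheep may still hold the old version.
 */
bool can_recover_from(uint64_t oid, bool source_rejoined,
		      const struct dirty_entry *dirty, size_t nr_dirty)
{
	return !(source_rejoined &&
		 is_dirty(dirty, nr_dirty, oid, DIRTY_UPDATED));
}
```

With a dirty list that marks one object as deleted and another as
updated, screen_deleted() removes the first from the merged list, and
can_recover_from() rejects the second only when the source is a sheep
that joined back.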

>
> I'd suggest allowing epoch increment even when recover is
> disabled.  If recovery work recovers only rw->prio_oids and delays the
> recovery of rw->oids, I think we can get the similar benefit with much
> simpler way:
>   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html

In fact, I have considered this method, but we would face nearly the same problem:

After a sheep joins back, it has to know which objects are dirty and
clean them up (because old-version objects remain in its working
directory). This method does not seem to save any steps, and it would
do extra recovery work.

>
> Thanks,
>
> Kazutaka



-- 
Yunkai Zhang
Work at Taobao


