[sheepdog] [PATCH V2 00/11] INTRODUCE

Mon Aug 20 15:00:42 CEST 2012

At Thu,  9 Aug 2012 16:43:38 +0800,
Yunkai Zhang wrote:
> 
> From: Yunkai Zhang <qiushu.zyk at taobao.com>
> 
> V2:
> - fix a typo
> - when an object is updated, delete its old version
> - reset cluster recovery state in finish_recovery()
> 
> Yunkai Zhang (11):
>   sheep: enable variale-length of join_message in response of join
>     event
>   sheep: share joining nodes with newly added sheep
>   sheep: delay to process recovery caused by LEAVE event just like JOIN
>     event
>   sheep: don't cleanup working directory when sheep joined back
>   sheep: read objects only from live nodes
>   sheep: write objects only on live nodes
>   sheep: mark dirty object that belongs to the leaving nodes
>   sheep: send dirty object list to each sheep when cluster do recovery
>   sheep: do recovery with dirty object list
>   collie: update 'collie cluster recover info' commands
>   collie: update doc about 'collie cluster recover disable'
> 
>  collie/cluster.c          |  46 ++++++++---
>  include/internal_proto.h  |  32 ++++++--
>  include/sheep.h           |  23 ++++++
>  man/collie.8              |   2 +-
>  sheep/cluster.h           |  29 +------
>  sheep/cluster/accord.c    |   2 +-
>  sheep/cluster/corosync.c  |   9 ++-
>  sheep/cluster/local.c     |   2 +-
>  sheep/cluster/zookeeper.c |   2 +-
>  sheep/farm/trunk.c        |   2 +-
>  sheep/gateway.c           |  39 ++++++++-
>  sheep/group.c             | 202 +++++++++++++++++++++++++++++++++++++++++-----
>  sheep/object_list_cache.c | 182 +++++++++++++++++++++++++++++++++++++++--
>  sheep/ops.c               |  85 ++++++++++++++++---
>  sheep/recovery.c          | 133 +++++++++++++++++++++++++++---
>  sheep/sheep_priv.h        |  57 ++++++++++++-
>  16 files changed, 743 insertions(+), 104 deletions(-)

I've looked into this series, and IMHO the change is too complex.

With this series, when recovery is disabled and some nodes have left,
sheep can succeed in a write operation even if the data is not fully
replicated.  But if we allow that, it is difficult to prevent VMs from
reading old data.  In fact, this series puts a lot of effort into that.

I'd suggest allowing the epoch to be incremented even when recovery is
disabled.  If the recovery work recovers only rw->prio_oids and delays
the recovery of rw->oids, I think we can get a similar benefit in a
much simpler way:
  http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
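
To make the idea a bit more concrete, here is a rough sketch of how the
selection of the next object could look.  It is not taken from the
sheep source: only the names rw->oids and rw->prio_oids come from the
text above, and the struct layout, the delay_bulk flag and both helper
functions are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, reduced stand-in for the recovery work.  Only "oids"
 * and "prio_oids" are names from the discussion; every other field is
 * an assumption for this sketch.
 */
struct recovery_work {
	const uint64_t *oids;      /* every object to recover in this epoch */
	size_t count;
	size_t done;               /* next index to look at in oids */

	const uint64_t *prio_oids; /* objects that I/O requests wait for */
	size_t nr_prio_oids;
	size_t prio_done;          /* next index to look at in prio_oids */

	bool delay_bulk;           /* assumed flag: recovery is "disabled" */
};

/* True if oid was already handed out through the prioritized list. */
static bool already_recovered(const struct recovery_work *rw, uint64_t oid)
{
	for (size_t i = 0; i < rw->prio_done; i++)
		if (rw->prio_oids[i] == oid)
			return true;
	return false;
}

/*
 * Pick the next object to recover.  Prioritized objects are always
 * served; the rest of rw->oids is served only when bulk recovery is
 * not delayed.  Returns false if there is nothing to do right now.
 */
static bool next_oid_to_recover(struct recovery_work *rw, uint64_t *oid)
{
	assert(rw != NULL && oid != NULL);

	if (rw->prio_done < rw->nr_prio_oids) {
		*oid = rw->prio_oids[rw->prio_done++];
		return true;
	}

	if (rw->delay_bulk)
		return false;

	while (rw->done < rw->count) {
		uint64_t candidate = rw->oids[rw->done++];

		if (already_recovered(rw, candidate))
			continue; /* recovered earlier on demand */
		*oid = candidate;
		return true;
	}
	return false;
}
```

With something like this, the epoch can advance as usual, objects that
VMs actually access are recovered on demand, and the remaining objects
are picked up once the delay is lifted.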

Thanks,

Kazutaka