[sheepdog] [PATCH 0/9] add erasure code support

MORITA Kazutaka morita.kazutaka at gmail.com
Tue Sep 24 09:58:43 CEST 2013


At Thu, 19 Sep 2013 18:42:44 +0800,
Liu Yuan wrote:
> 
> Introduction:
> 
> This is the first round of erasure code support for sheepdog. This patch set
> adds basic read/write/remove code for erasure coded vdis. The full-blown
> version is planned to support all the current features, such as snapshot, clone,
> incremental backup and cluster-wide backup.
> 
> As always, we support random read/write to erasure coded vdis.
> 
> Instead of storing a full copy on each replica, erasure coding spreads data
> across all the replicas to achieve the same fault tolerance while reducing the
> redundancy to a minimum (a redundancy level of 0.5 or less).
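
To make the idea of "spreading data plus parity" concrete, here is a minimal
sketch using a single XOR parity strip. This is NOT what the series does: the
patches add a forward error correction library (lib/fec.c) with multiple parity
strips, while plain XOR can only recover one lost strip. NR_DATA, STRIP,
xor_parity() and xor_rebuild() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_DATA 4	/* data strips per stripe (illustrative) */
#define STRIP   8	/* tiny strip size in bytes (illustrative) */

/*
 * Compute one parity strip as the XOR of all data strips.
 * 'data' is a flat buffer of NR_DATA * STRIP bytes.
 */
static void xor_parity(const unsigned char *data, unsigned char *parity)
{
	memset(parity, 0, STRIP);
	for (size_t i = 0; i < NR_DATA; i++)
		for (size_t j = 0; j < STRIP; j++)
			parity[j] ^= data[i * STRIP + j];
}

/*
 * Rebuild data strip 'lost' from the parity and the surviving data
 * strips. The bytes of the lost strip inside 'data' are never read.
 */
static void xor_rebuild(const unsigned char *data,
			const unsigned char *parity,
			size_t lost, unsigned char *out)
{
	assert(lost < NR_DATA);
	memcpy(out, parity, STRIP);
	for (size_t i = 0; i < NR_DATA; i++) {
		if (i == lost)
			continue;
		for (size_t j = 0; j < STRIP; j++)
			out[j] ^= data[i * STRIP + j];
	}
}
```

With one parity strip for four data strips, any single lost strip can be
rebuilt from the other four. Tolerating two or more failures, as in the 4:2
scheme, needs a stronger code than XOR.
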
> 
> Sheepdog will transparently support erasure coding for read/write/remove
> operations on the fly while clients are storing/retrieving data in sheepdog.
> No changes to the client APIs or protocols are needed.
> 
> In a simple test on my box, aligned writes get up to 1.5x faster than
> replication, while read performance drops to 60%, compared with copies=3
> (4:2 scheme).
> 
> How It Works:
> 
> /*
>  * Stripe: data strips + parity strips, spread on all replicas
>  * DS: data strip
>  * PS: parity strip
>  * R: Replica
>  *
>  *  +--------------------stripe ----------------------+
>  *  v                                                 v
>  * +----+----------------------------------------------+
>  * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
>  * +----+----------------------------------------------+
>  * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
>  * +----+----+----+----+----+ ... +----+----+-----+----+
>  *  R1    R2   R3   R4   R5   ...   Rn  Rn+1  Rn+2  Rn+3
>  */
> 
> We use replicas to hold data and parity strips. Suppose we have a 4:2 scheme,
> i.e. 4 data strips and 2 parity strips on 6 replicas, with strip size = 1k.
> We then generate 2k of parity for each 4k write, and we call this 6k as a
> whole a stripe. For writes, we spread data horizontally, not vertically as
> replication does. So for reads, we have to assemble the strips from all the
> data replicas, which is probably why reads are slower.
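
The horizontal layout described above can be sketched as a simple offset
calculation. This is only an illustration under assumptions, not code from the
series: locate(), replica_offset() and parity_bytes() are hypothetical helpers,
the constants match the 4:2 / 1k example, and I assume each replica stores its
strips back to back, one strip per stripe.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, taken from the 4:2 example in the text */
#define STRIP_SIZE	1024u	/* bytes per strip */
#define DATA_STRIPS	4u	/* data strips per stripe */
#define PARITY_STRIPS	2u	/* parity strips per stripe */

struct strip_loc {
	uint64_t stripe;	/* row in the diagram */
	uint32_t replica;	/* 0-based data replica index (column) */
	uint32_t offset;	/* byte offset inside the strip */
};

/* Map a byte offset in the object to the data strip that holds it */
static struct strip_loc locate(uint64_t off)
{
	struct strip_loc loc;
	uint64_t strip_no = off / STRIP_SIZE;

	loc.stripe = strip_no / DATA_STRIPS;
	loc.replica = (uint32_t)(strip_no % DATA_STRIPS);
	loc.offset = (uint32_t)(off % STRIP_SIZE);
	return loc;
}

/* Offset of that byte inside the replica's local storage (assumed layout) */
static uint64_t replica_offset(struct strip_loc loc)
{
	return loc.stripe * STRIP_SIZE + loc.offset;
}

/* Parity bytes generated by a write that covers whole stripes only */
static uint64_t parity_bytes(uint64_t len)
{
	assert(len % (STRIP_SIZE * DATA_STRIPS) == 0);
	return len / DATA_STRIPS * PARITY_STRIPS;
}
```

For example, byte 5000 of an object falls in stripe 1 on the first data
replica, 904 bytes into the strip, and parity_bytes(4096) gives the 2k of
parity mentioned above. A 4k read therefore touches four different replicas,
which fits the explanation for the slower reads.
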
> 
> Usage:
> 
> Just add one more option to 'dog vdi create':
> 
> $ dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning
> 
> For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'. But
> I plan to add '-e number', so that users can specify how many parity replicas
> they want, with a different erasure scheme for different vdis. E.g., we can have
> 
> -e 2 --> 4 : 2 (0.5 redundancy and can withstand 2 node failures)
> -e 3 --> 8 : 3 (0.375 redundancy and can withstand 3 node failures)
> -e 4 --> 8 : 4 (0.5 redundancy and can withstand 4 node failures)
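
The redundancy figures in this list are simply parity strips divided by data
strips. A small sketch of that arithmetic follows; struct ec_scheme and the
three helpers are hypothetical names for illustration, and tolerated_failures()
assumes, as the list does, that any 'parity' replicas can be lost.

```c
#include <assert.h>

/* Hypothetical description of a data:parity scheme, e.g. 4:2 */
struct ec_scheme {
	unsigned data;		/* number of data strips */
	unsigned parity;	/* number of parity strips */
};

/* Redundancy level as used in the list above: parity / data */
static double redundancy(struct ec_scheme s)
{
	assert(s.data > 0);
	return (double)s.parity / s.data;
}

/* Node failures the scheme survives, assuming any 'parity' strips may be lost */
static unsigned tolerated_failures(struct ec_scheme s)
{
	return s.parity;
}

/* Raw bytes stored across the cluster for 'logical' bytes of user data */
static double raw_bytes(struct ec_scheme s, double logical)
{
	assert(s.data > 0);
	return logical * (s.data + s.parity) / s.data;
}
```

By the same measure, copies=3 replication keeps two extra copies per object,
i.e. a redundancy of 2.0 for the same two-failure tolerance as 4:2, so a 10G
vdi would take up to 30G raw with replication versus 15G with 4:2.
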
> 
> TODOs:
> 1. add recovery code
> 2. support snapshot/clone/backup
> 3. support user-defined redundancy level
> 4. add tests
> 
> Liu Yuan (9):
>   add forward error correction for erasure code
>   gateway: remove init_target_nodes
>   gateway: remove shortcut for local node in gateway_forward_request
>   gateway: rename write_info as forward_info
>   gateway: support forward read request
>   gateway: allow sending different requests to replica
>   sheep: use copy_policy to control erasure vdi
>   erasure: add basic read/write code proper
>   erasure: add unaligned write support
> 
>  dog/dog.h                |    2 +-
>  dog/farm/farm.c          |   11 +-
>  dog/vdi.c                |   24 +-
>  include/Makefile.am      |    2 +-
>  include/fec.h            |  170 ++++++++++++++
>  include/sheepdog_proto.h |    7 +-
>  lib/Makefile.am          |    2 +-
>  lib/fec.c                |  578 ++++++++++++++++++++++++++++++++++++++++++++++
>  sheep/gateway.c          |  396 +++++++++++++++++++++++--------
>  sheep/group.c            |    3 +-
>  sheep/ops.c              |    9 +-
>  sheep/plain_store.c      |   13 +-
>  sheep/sheep_priv.h       |    9 +-
>  sheep/vdi.c              |   39 +++-
>  14 files changed, 1143 insertions(+), 122 deletions(-)
>  create mode 100644 include/fec.h
>  create mode 100644 lib/fec.c

Great job. :)

I'll start reviewing this now, but it looks like tests/functional/30
does not pass with this series.

Thanks,

Kazutaka


