[sheepdog] [PATCH 0/9] add erasure code support
MORITA Kazutaka
morita.kazutaka at gmail.com
Tue Sep 24 09:58:43 CEST 2013
At Thu, 19 Sep 2013 18:42:44 +0800,
Liu Yuan wrote:
>
> Introduction:
>
> This is the first round of adding erasure code support to sheepdog. This patch
> set adds basic read/write/remove code for erasure coded vdis. The full-blown
> version is planned to support all the current features, such as snapshot, clone,
> incremental backup and cluster-wide backup.
>
> As always, we support random read/write to erasure coded vdis.
>
> Instead of storing full copies on the replicas, erasure coding spreads data
> across all the replicas to achieve the same fault tolerance while keeping the
> redundancy minimal (less than a 0.5 redundancy level).
>
> Sheepdog will transparently support erasure coding for read/write/remove
> operations on the fly while clients are storing/retrieving data in sheepdog.
> No changes to the client APIs or protocols are required.
>
> In a simple test on my box, aligned writes got up to 1.5x faster than
> replication, while read performance dropped to 60%, compared with copies=3 (4:2 scheme).
>
> How It Works:
>
> /*
> * Stripe: data strips + parity strips, spread over all replicas
> * DS: data strip
> * PS: parity strip
> * R: Replica
> *
> * +--------------------stripe ----------------------+
> * v v
> * +----+----------------------------------------------+
> * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
> * +----+----------------------------------------------+
> * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
> * +----+----+----+----+----+ ... +----+----+-----+----+
> * R1 R2 R3 R4 R5 ... Rn Rn+1 Rn+2 Rn+3
> */
>
> We use the replicas to hold data and parity strips. Suppose we have a
> 4:2 scheme, i.e. 4 data strips and 2 parity strips on 6 replicas with a strip
> size of 1k: we'll generate 2k of parity for each 4k write, and we call the
> whole 6k a stripe. For writes we spread data horizontally, not vertically as
> replication does. So for reads we have to assemble the stripe from all the
> data replicas, which is probably why reads slow down.
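
To make the layout above concrete, here is a minimal sketch (in C; the helper
names are mine, not taken from this series) of how a logical offset inside an
erasure coded object could map to a replica and a strip offset under the 4:2,
1k-strip scheme described above. The two parity strips, which the series
computes via the new lib/fec.c, would land on R5 and R6 and are left out here:

#include <stdint.h>
#include <stdio.h>

#define STRIP_SIZE		1024	/* 1k per strip */
#define NR_DATA_STRIPS		4	/* data strips per stripe (4:2 scheme) */
#define STRIPE_DATA_SIZE	(STRIP_SIZE * NR_DATA_STRIPS)	/* 4k of data */

struct strip_addr {
	int replica;		/* which data replica, 0 .. NR_DATA_STRIPS - 1 */
	uint64_t offset;	/* byte offset within that replica */
};

/* Horizontal layout: consecutive 1k strips go to consecutive replicas. */
static struct strip_addr logical_to_strip(uint64_t logical)
{
	uint64_t stripe_no = logical / STRIPE_DATA_SIZE;
	uint64_t in_stripe = logical % STRIPE_DATA_SIZE;
	struct strip_addr addr = {
		.replica = (int)(in_stripe / STRIP_SIZE),
		.offset = stripe_no * STRIP_SIZE + in_stripe % STRIP_SIZE,
	};

	return addr;
}

int main(void)
{
	uint64_t off;

	/* A 4k aligned read touches every data replica exactly once. */
	for (off = 0; off < STRIPE_DATA_SIZE; off += STRIP_SIZE) {
		struct strip_addr a = logical_to_strip(off);

		printf("logical %4llu -> R%d, offset %llu\n",
		       (unsigned long long)off, a.replica + 1,
		       (unsigned long long)a.offset);
	}
	return 0;
}

This also shows why a single 4k read has to gather strips from R1..R4, whereas
plain replication would read the whole 4k from one node.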
>
> Usage:
>
> Just add one more option to 'dog vdi create':
>
> $ dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning
>
> For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'. But
> I plan to add '-e number', so that users can specify how many parity strips they
> want and use different erasure schemes for different vdis. E.g., we can have
>
> -e 2 --> 4 : 2 (0.5 redundancy, can withstand 2 node failures)
> -e 3 --> 8 : 3 (0.375 redundancy, can withstand 3 node failures)
> -e 4 --> 8 : 4 (0.5 redundancy, can withstand 4 node failures)
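
Assuming the mapping in the list above, the planned '-e number' option could
boil down to something like the following sketch (the struct and function
names are hypothetical, not from the series):

struct ec_scheme {
	int nr_data;	/* data strips per stripe */
	int nr_parity;	/* parity strips per stripe == node failures tolerated */
};

/* Map '-e number' (parity count) to a data:parity scheme, per the table above. */
static struct ec_scheme parity_to_scheme(int nr_parity)
{
	switch (nr_parity) {
	case 3:
		return (struct ec_scheme){ .nr_data = 8, .nr_parity = 3 }; /* 3/8 = 0.375 redundancy */
	case 4:
		return (struct ec_scheme){ .nr_data = 8, .nr_parity = 4 }; /* 4/8 = 0.5 redundancy */
	case 2:
	default:
		return (struct ec_scheme){ .nr_data = 4, .nr_parity = 2 }; /* 2/4 = 0.5 redundancy */
	}
}

Either way, the parity count is what bounds the number of simultaneous node
failures the vdi can survive.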
>
> TODOs:
> 1. add recovery code
> 2. support snapshot/clone/backup
> 3. support user-defined redundancy level
> 4. add tests
>
> Liu Yuan (9):
> add forward error correction for erasure code
> gateway: remove init_target_nodes
> gateway: remove shortcut for local node in gateway_forward_request
> gateway: rename write_info as forward_info
> gateway: support forward read request
> gateway: allow sending different requests to replica
> sheep: use copy_policy to control erasure vdi
> erasure: add basic read/write code proper
> erasure: add unaligned write support
>
> dog/dog.h | 2 +-
> dog/farm/farm.c | 11 +-
> dog/vdi.c | 24 +-
> include/Makefile.am | 2 +-
> include/fec.h | 170 ++++++++++++++
> include/sheepdog_proto.h | 7 +-
> lib/Makefile.am | 2 +-
> lib/fec.c | 578 ++++++++++++++++++++++++++++++++++++++++++++++
> sheep/gateway.c | 396 +++++++++++++++++++++++--------
> sheep/group.c | 3 +-
> sheep/ops.c | 9 +-
> sheep/plain_store.c | 13 +-
> sheep/sheep_priv.h | 9 +-
> sheep/vdi.c | 39 +++-
> 14 files changed, 1143 insertions(+), 122 deletions(-)
> create mode 100644 include/fec.h
> create mode 100644 lib/fec.c
Great job. :)
I'll try to review it from now, but it looks like tests/functional/30
cannot pass with this series.
Thanks,
Kazutaka