[sheepdog] [PATCH v2 00/11] add basic erasure code support
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed Oct 9 08:34:46 CEST 2013
At Thu, 26 Sep 2013 15:25:37 +0800,
Liu Yuan wrote:
>
> v2:
> - update fec.c based on the comments of Hitoshi and Kazutaka
> - change the data strip size to 1K so that it works with VMs
> - rename get_object_size as get_store_objsize
> - update some commits
> - remove unnecessary padding
> - change copy type from uint32_t to uint8_t
> - add one more patch to pass functional/030 tests
>
> Introduction:
>
> This is the first round of adding erasure code support to sheepdog. This patch set
> adds basic read/write/remove code for erasure coded vdis. The full-blown
> version is planned to support all the current features, such as snapshot, clone,
> incremental backup and cluster-wide backup.
>
> As always, we support random read/write to erasure coded vdis.
>
> Instead of storing a full copy on each replica, erasure coding spreads the data
> across all the replicas to achieve the same fault tolerance while reducing the
> redundancy to a minimum (less than a 0.5 redundancy level).
>
> Sheepdog will transparently support erasure coding for read/write/remove
> operations on the fly while clients are storing/retrieving data in sheepdog.
> No changes to the client APIs or protocols are required.
>
> In a simple test on my box, aligned 4K writes are up to 1.5x faster than replication,
> while reads are up to 1.15x faster, compared with copies=3 (4:2 scheme).
>
> For a 6-node cluster with 1000Mb/s NICs, I got the following results:
>
> replication(3 copies): write 36.5 MB/s, read 71.8 MB/s
> erasure code(4 : 2) : write 46.6 MB/s, read 82.9 MB/s
>
> How It Works:
>
> /*
> * Stripe: data strips + parity strips, spread on all replica
> * DS: data strip
> * PS: parity strip
> * R: Replica
> *
> * +--------------------stripe ----------------------+
> * v v
> * +----+----------------------------------------------+
> * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
> * +----+----------------------------------------------+
> * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
> * +----+----+----+----+----+ ... +----+----+-----+----+
> * R1 R2 R3 R4 R5 ... Rn Rn+1 Rn+2 Rn+3
> */
>
> We use the replicas to hold data and parity strips. Suppose we have a
> 4:2 scheme: 4 data strips and 2 parity strips on 6 replicas, with strip size = 1K.
> So basically we'll generate 2K of parity for each 4K write; we call this 6K as a
> whole a stripe. For writes, we spread the data horizontally, not vertically as
> replication does. So for reads, we have to assemble the stripe from all the data
> replicas.
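>
> To make the math concrete, here is a minimal sketch of the strip layout
> described above (hypothetical helper names and constants, not the actual
> sheepdog code), assuming the fixed 4:2 scheme with 1K strips:
>
> #include <stdint.h>
> #include <stdio.h>
>
> #define STRIP_SIZE	1024	/* 1K data strip */
> #define NR_DATA_STRIPS	4	/* the "4" in the 4:2 scheme */
> #define NR_PARITY_STRIPS	2	/* the "2" in the 4:2 scheme */
> #define STRIPE_DATA	(STRIP_SIZE * NR_DATA_STRIPS)	/* 4K of user data per stripe */
>
> /* Which data replica holds a given byte offset of the object. */
> static int offset_to_replica(uint64_t offset)
> {
> 	return offset / STRIP_SIZE % NR_DATA_STRIPS;
> }
>
> /* Offset of that byte inside the strip file stored on the replica. */
> static uint64_t offset_in_replica(uint64_t offset)
> {
> 	return offset / STRIPE_DATA * STRIP_SIZE + offset % STRIP_SIZE;
> }
>
> int main(void)
> {
> 	uint64_t off = 5000;	/* an arbitrary byte in the second stripe */
>
> 	printf("offset %llu -> data replica %d, strip offset %llu\n",
> 	       (unsigned long long)off, offset_to_replica(off),
> 	       (unsigned long long)offset_in_replica(off));
> 	printf("each %d-byte stripe write generates %d bytes of parity\n",
> 	       STRIPE_DATA, STRIP_SIZE * NR_PARITY_STRIPS);
> 	return 0;
> }
>
> So a full 4K stripe write touches all 4 data replicas plus both parity
> replicas, which is why a read has to assemble strips from all the data
> replicas.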
>
> The downsides of erasure coding are:
> 1. for recovery, we have to recover 0.5x more data
> 2. if any replica fails, we have to wait for its recovery before reads can complete
> 3. it needs at least 6 (4+2) nodes to work
>
> Usage:
>
> Just add one more option to 'dog vdi create':
>
> $dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning
>
> For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'.
> But I plan a '-e number' option, so that users can specify how many parity strips
> they want, with different erasure schemes for different vdis. E.g., we can have
> (see the sketch after this list):
>
> -e 2 --> 4 : 2 (0.5 redundancy, tolerates 2 node failures)
> -e 3 --> 8 : 3 (0.375 redundancy, tolerates 3 node failures)
> -e 4 --> 8 : 4 (0.5 redundancy, tolerates 4 node failures)
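>
> For illustration, the redundancy levels above are simply parity strips divided
> by data strips, and the number of node failures tolerated equals the number of
> parity strips. A small sketch of the planned (not yet implemented) mapping:
>
> #include <stdio.h>
>
> struct ec_scheme {
> 	int nr_data;	/* data strips per stripe */
> 	int nr_parity;	/* parity strips per stripe */
> };
>
> /* planned mapping from '-e N' to a scheme */
> static const struct ec_scheme schemes[] = {
> 	[2] = { 4, 2 },	/* -e 2 -> 4:2 */
> 	[3] = { 8, 3 },	/* -e 3 -> 8:3 */
> 	[4] = { 8, 4 },	/* -e 4 -> 8:4 */
> };
>
> int main(void)
> {
> 	for (int e = 2; e <= 4; e++)
> 		printf("-e %d: %d:%d, redundancy %.3f, tolerates %d failures\n",
> 		       e, schemes[e].nr_data, schemes[e].nr_parity,
> 		       (double)schemes[e].nr_parity / schemes[e].nr_data,
> 		       schemes[e].nr_parity);
> 	return 0;
> }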
>
> TODOs:
>
> 1. add recovery code
> 2. support snapshot/clone/backup
> 3. support user-defined redundancy level
> 4. add tests
Applied, thanks. I'd like to see the above implementation before the
0.8 release. :)
Kazutaka