[sheepdog] [PATCH v2 00/11] add basic erasure code support
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed Oct 9 08:34:46 CEST 2013
At Thu, 26 Sep 2013 15:25:37 +0800,
Liu Yuan wrote:
>
> v2:
> - update fec.c based on the comments of Hitoshi and Kazutaka
> - change the data strip size to 1K so that it works with VMs
> - rename get_object_size as get_store_objsize
> - update some commits
> - remove unnecessary padding
> - change copy type from uint32_t to uint8_t
> - add one more patch to pass functional/030 tests
>
> Introduction:
>
> This is the first round of adding erasure code support to sheepdog. This patch set
> adds basic read/write/remove code for erasure coded vdis. The full-blown
> version is planned to support all the current features, such as snapshot, clone,
> incremental backup and cluster-wide backup.
>
> As always, we support random read/write to erasure coded vdis.
>
> Instead of storing a full copy on each replica, erasure coding spreads the data
> across all the replicas to achieve the same fault tolerance while reducing the
> redundancy to a minimum (less than a 0.5 redundancy level).
>
> Sheepdog will transparently support erasure coding for read/write/remove
> operations on the fly while clients are storing/retrieving data in sheepdog.
> No changes to the client APIs or protocols are required.
>
> In a simple test on my box, aligned 4K writes are up to 1.5x faster than replication,
> while reads are up to 1.15x faster, compared with copies=3 (4:2 scheme).
>
> For a 6-node cluster with 1000Mb/s NICs, I got the following results:
>
> replication(3 copies): write 36.5 MB/s, read 71.8 MB/s
> erasure code(4 : 2) : write 46.6 MB/s, read 82.9 MB/s
>
> How It Works:
>
> /*
> * Stripe: data strips + parity strips, spread on all replica
> * DS: data strip
> * PS: parity strip
> * R: Replica
> *
> * +--------------------stripe ----------------------+
> * v v
> * +----+----------------------------------------------+
> * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
> * +----+----------------------------------------------+
> * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
> * +----+----+----+----+----+ ... +----+----+-----+----+
> * R1 R2 R3 R4 R5 ... Rn Rn+1 Rn+2 Rn+3
> */
>
> We use the replicas to hold data and parity strips. Suppose we have a
> 4:2 scheme: 4 data strips and 2 parity strips on 6 replicas, with strip size = 1K.
> So basically we'll generate 2K of parity for each 4K write; we call this 6K as a
> whole a stripe. For writes, we spread the data horizontally, not vertically as
> replication does. So for reads, we have to assemble the stripe from all the data
> replicas.
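>
> To make the math concrete, here is a minimal sketch of the strip layout
> described above (hypothetical helper names and constants, not the actual
> sheepdog code), assuming the fixed 4:2 scheme with 1K strips:
>
> #include <stdint.h>
> #include <stdio.h>
>
> #define STRIP_SIZE	1024	/* 1K data strip */
> #define NR_DATA_STRIPS	4	/* the "4" in the 4:2 scheme */
> #define NR_PARITY_STRIPS	2	/* the "2" in the 4:2 scheme */
> #define STRIPE_DATA	(STRIP_SIZE * NR_DATA_STRIPS)	/* 4K of user data per stripe */
>
> /* Which data replica holds a given byte offset of the object. */
> static int offset_to_replica(uint64_t offset)
> {
> 	return offset / STRIP_SIZE % NR_DATA_STRIPS;
> }
>
> /* Offset of that byte inside the strip file stored on the replica. */
> static uint64_t offset_in_replica(uint64_t offset)
> {
> 	return offset / STRIPE_DATA * STRIP_SIZE + offset % STRIP_SIZE;
> }
>
> int main(void)
> {
> 	uint64_t off = 5000;	/* an arbitrary byte in the second stripe */
>
> 	printf("offset %llu -> data replica %d, strip offset %llu\n",
> 	       (unsigned long long)off, offset_to_replica(off),
> 	       (unsigned long long)offset_in_replica(off));
> 	printf("each %d-byte stripe write generates %d bytes of parity\n",
> 	       STRIPE_DATA, STRIP_SIZE * NR_PARITY_STRIPS);
> 	return 0;
> }
>
> So a full 4K stripe write touches all 4 data replicas plus both parity
> replicas, which is why a read has to assemble strips from all the data
> replicas.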
>
> The downsides of erasure coding are:
> 1. for recovery, we have to recover 0.5x more data
> 2. if any replica fails, we have to wait for its recovery before reads can complete
> 3. it needs at least 6 (4+2) nodes to work
>
> Usage:
>
> Just add one more option to 'dog vdi create':
>
> $dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning
>
> For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'.
> But I plan a '-e number' option, so that users can specify how many parity strips
> they want, with different erasure schemes for different vdis. E.g., we can have
> (see the sketch after this list):
>
> -e 2 --> 4 : 2 (0.5 redundancy, tolerates 2 node failures)
> -e 3 --> 8 : 3 (0.375 redundancy, tolerates 3 node failures)
> -e 4 --> 8 : 4 (0.5 redundancy, tolerates 4 node failures)
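>
> For illustration, the redundancy levels above are simply parity strips divided
> by data strips, and the number of node failures tolerated equals the number of
> parity strips. A small sketch of the planned (not yet implemented) mapping:
>
> #include <stdio.h>
>
> struct ec_scheme {
> 	int nr_data;	/* data strips per stripe */
> 	int nr_parity;	/* parity strips per stripe */
> };
>
> /* planned mapping from '-e N' to a scheme */
> static const struct ec_scheme schemes[] = {
> 	[2] = { 4, 2 },	/* -e 2 -> 4:2 */
> 	[3] = { 8, 3 },	/* -e 3 -> 8:3 */
> 	[4] = { 8, 4 },	/* -e 4 -> 8:4 */
> };
>
> int main(void)
> {
> 	for (int e = 2; e <= 4; e++)
> 		printf("-e %d: %d:%d, redundancy %.3f, tolerates %d failures\n",
> 		       e, schemes[e].nr_data, schemes[e].nr_parity,
> 		       (double)schemes[e].nr_parity / schemes[e].nr_data,
> 		       schemes[e].nr_parity);
> 	return 0;
> }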
>
> TODOs:
>
> 1. add recovery code
> 2. support snapshot/clone/backup
> 3. support user-defined redundancy level
> 4. add tests
Applied, thanks. I'd like to see the above implementation before the
0.8 release. :)
Kazutaka