[sheepdog] [PATCH v2 00/11] add basic erasure code support

Liu Yuan namei.unix at gmail.com
Thu Sep 26 09:25:37 CEST 2013


v2:
 - update fec.c based on the comments of Hitoshi and Kazutaka
 - change data strip size to 1K so as to work with VMs
 - rename get_object_size as get_store_objsize
 - update some commits
 - remove unnecessary padding
 - change copy type from uint32_t to uint8_t
 - add one more patch to pass functional/030 tests

Introduction:
	
This is the first round of erasure code support for sheepdog. This patch set
adds basic read/write/remove code for erasure coded vdis. The full-blown
version is planned to support all the current features, such as snapshot, clone,
incremental backup and cluster-wide backup.

As always, we support random read/write to erasure coded vdis.

Instead of storing a full copy on each replica, erasure coding spreads data
across all the replicas to achieve the same fault tolerance while keeping the
redundancy minimal (a redundancy level of 0.5 or less).

Sheepdog will transparently support erasure coding for read/write/remove
operations on the fly while clients are storing/retrieving data in sheepdog.
No changes to the client APIs or protocols are needed.

In a simple test on my box, aligned 4k writes are up to 1.5x faster than with
replication, while reads are 1.15x faster, compared with copies=3 (4:2 scheme).

On a 6-node cluster with 1000Mb/s NICs, I got the following results:

replication(3 copies): write 36.5 MB/s, read 71.8 MB/s
erasure code(4 : 2)  : write 46.6 MB/s, read 82.9 MB/s

How It Works:

/*
 * Stripe: data strips + parity strips, spread across all replicas
 * DS: data strip
 * PS: parity strip
 * R: Replica
 *
 *  +--------------------stripe ----------------------+
 *  v                                                 v
 * +----+----------------------------------------------+
 * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
 * +----+----------------------------------------------+
 * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
 * +----+----+----+----+----+ ... +----+----+-----+----+
 *  R1    R2   R3   R4   R5   ...   Rn  Rn+1  Rn+2  Rn+3
 */

We use replicas to hold data and parity strips. Suppose we have a 4:2 scheme,
i.e. 4 data strips and 2 parity strips on 6 replicas, with strip size = 1k.
We then generate 2k of parity for each 4k write, and we call the resulting 6k
a stripe. For writes, we spread data horizontally across replicas, not
vertically as replication does. So for reads, we have to assemble the data from
all the data replicas.
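
To make the horizontal layout concrete, here is a minimal sketch in C of how a
byte offset in an object could map to a data replica under the 4:2 scheme with
1k strips. The layout (strip i of every stripe lives on data replica i, and the
strips of successive stripes are stored back to back) is an assumption read off
the diagram above, and struct strip_loc and locate() are hypothetical names,
not functions from this patch set.

```c
#include <assert.h>
#include <stdint.h>

#define STRIP_SIZE   1024u  /* strip size from the text (1k) */
#define DATA_STRIPS  4u     /* data strips per stripe in the 4:2 scheme */

/* Hypothetical helper type: where one byte of the object lands. */
struct strip_loc {
	uint32_t replica;  /* index of the data replica, 0..DATA_STRIPS-1 */
	uint64_t offset;   /* byte offset inside that replica's store */
};

/* Map a byte offset in the logical object to a data replica (assumed layout). */
static struct strip_loc locate(uint64_t obj_offset)
{
	/* Each stripe carries DATA_STRIPS * STRIP_SIZE = 4k of user data. */
	uint64_t stripe_data = (uint64_t)STRIP_SIZE * DATA_STRIPS;
	uint64_t stripe_no = obj_offset / stripe_data;
	uint64_t in_stripe = obj_offset % stripe_data;
	struct strip_loc loc;

	/* Which of the 4 data strips inside this stripe holds the byte. */
	loc.replica = (uint32_t)(in_stripe / STRIP_SIZE);
	/* One strip per stripe is stored on each replica, back to back. */
	loc.offset = stripe_no * STRIP_SIZE + in_stripe % STRIP_SIZE;

	assert(loc.replica < DATA_STRIPS);
	return loc;
}
```

Under this assumed layout, an aligned 4k write touches exactly one strip on
each of the 4 data replicas; for example, locate(4096) yields replica 0 at
offset 1024, the first strip of the second stripe.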

The downsides of erasure coding are:
1. for recovery, we have to recover 0.5x more data
2. if any replica fails, reads have to wait for its recovery
3. it needs at least 6 (4+2) nodes to work
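
Item 2 follows from the fact that a strip on a failed replica cannot simply be
copied from another replica; it has to be rebuilt from the surviving strips.
The sketch below illustrates that idea with a single XOR parity strip over 4
data strips. This is a deliberately simpler stand-in, not the forward error
correction code that this series adds in lib/fec.c, and it tolerates only one
lost strip rather than two. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP_SIZE  1024  /* strip size from the text (1k) */
#define DATA_STRIPS 4     /* data strips per stripe */

/* Compute one XOR parity strip over all data strips of a stripe. */
static void xor_parity(uint8_t data[DATA_STRIPS][STRIP_SIZE],
		       uint8_t parity[STRIP_SIZE])
{
	memset(parity, 0, STRIP_SIZE);
	for (size_t i = 0; i < DATA_STRIPS; i++)
		for (size_t j = 0; j < STRIP_SIZE; j++)
			parity[j] ^= data[i][j];
}

/*
 * Rebuild data strip 'lost' in place from the parity strip and the
 * surviving data strips. Every surviving strip has to be read, which is
 * why a read that hits a failed replica is more expensive than usual.
 */
static void xor_rebuild(uint8_t data[DATA_STRIPS][STRIP_SIZE],
			const uint8_t parity[STRIP_SIZE], size_t lost)
{
	assert(lost < DATA_STRIPS);

	memcpy(data[lost], parity, STRIP_SIZE);
	for (size_t i = 0; i < DATA_STRIPS; i++) {
		if (i == lost)
			continue;
		for (size_t j = 0; j < STRIP_SIZE; j++)
			data[lost][j] ^= data[i][j];
	}
}
```

A 4:2 scheme needs a stronger code than plain XOR so that any 2 lost strips can
be rebuilt, but the access pattern is the same: the lost strip is computed from
strips fetched from the other replicas.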

Usage:  

Just add one more option to 'dog vdi create':

$ dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning

For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'.
But I plan to add '-e number', so that users can specify how many parity
replicas they want, with a different erasure scheme for each vdi. E.g., we can have

-e 2 --> 4 : 2 (0.5 redundancy, tolerates 2 node failures)
-e 3 --> 8 : 3 (0.375 redundancy, tolerates 3 node failures)
-e 4 --> 8 : 4 (0.5 redundancy, tolerates 4 node failures)
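
The numbers in this table can be derived mechanically: the redundancy level is
parity strips divided by data strips, and the minimum cluster size is their
sum. A small sketch follows, assuming the planned '-e number' maps to schemes
exactly as listed above; struct ec_scheme and the helper names are hypothetical
and not part of this patch set.

```c
#include <assert.h>

/* Hypothetical description of an erasure scheme: data : parity strips. */
struct ec_scheme {
	unsigned data;
	unsigned parity;
};

/*
 * Map the proposed '-e number' (number of parity strips) to a scheme,
 * following the table in the text. Returns 0 on success, -1 if the
 * number is not one of the listed values.
 */
static int scheme_for_parity(unsigned nr_parity, struct ec_scheme *out)
{
	switch (nr_parity) {
	case 2:
		out->data = 4;
		break;
	case 3:
	case 4:
		out->data = 8;
		break;
	default:
		return -1;
	}
	out->parity = nr_parity;
	return 0;
}

/* Extra storage per byte of user data, e.g. 0.5 for 4:2. */
static double redundancy(const struct ec_scheme *s)
{
	return (double)s->parity / s->data;
}

/* One replica per strip, so a stripe needs data + parity nodes. */
static unsigned min_nodes(const struct ec_scheme *s)
{
	return s->data + s->parity;
}
```

For comparison, computing the same ratio for 3-copy replication (1 data copy
plus 2 extra copies) gives a redundancy level of 2.0 while tolerating 2 node
failures, which is the saving the 4:2 scheme is after.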
	
TODOs: 

1. add recovery code
2. support snapshot/clone/backup
3. support user-defined redundancy level
4. add tests

Liu Yuan (11):
  add forward error correction for erasure code
  gateway: remove init_target_nodes
  gateway: remove shortcut for local node in gateway_forward_request
  gateway: rename write_info as forward_info
  gateway: support forward read request
  gateway: allow sending different requests to replica
  sheep: use copy_policy to control erasure vdi
  erasure: add basic read/write code proper
  erasure: add unaligned write support
  change copy type from uint32_t to uint8_t
  make dog copy_policy aware

 dog/cluster.c            |   10 +-
 dog/common.c             |    4 +-
 dog/dog.h                |    4 +-
 dog/farm/farm.c          |   18 +-
 dog/farm/farm.h          |   10 +-
 dog/farm/object_tree.c   |   15 +-
 dog/vdi.c                |   67 ++++--
 include/Makefile.am      |    2 +-
 include/fec.h            |  173 ++++++++++++++
 include/sheepdog_proto.h |   24 +-
 include/util.h           |    2 +
 lib/Makefile.am          |    2 +-
 lib/fec.c                |  574 ++++++++++++++++++++++++++++++++++++++++++++++
 sheep/gateway.c          |  401 ++++++++++++++++++++++++--------
 sheep/group.c            |    3 +-
 sheep/ops.c              |    9 +-
 sheep/plain_store.c      |   13 +-
 sheep/sheep.c            |    2 +
 sheep/sheep_priv.h       |   11 +-
 sheep/vdi.c              |   40 +++-
 20 files changed, 1221 insertions(+), 163 deletions(-)
 create mode 100644 include/fec.h
 create mode 100644 lib/fec.c

-- 
1.7.9.5



