[sheepdog] [PATCH 0/9] add erasure code support

Liu Yuan namei.unix at gmail.com
Thu Sep 19 12:42:44 CEST 2013


Introduction:

This is the first round to add erasure code support for sheepdog. This patch set
adds basic read/write/remove code for erasure coded vdis. The full-blown
version is planned to support all the current features, such as snapshot, clone,
incremental backup and cluster-wide backup.

As always, we support random read/write to erasure coded vdis.

Instead of storing a full copy on each replica, erasure coding spreads data over
all the replicas to achieve the same fault tolerance while reducing the
redundancy to a minimum (a redundancy level below 0.5).

Sheepdog will transparently support erasure coding for read/write/remove
operations on the fly while clients are storing/retrieving data in sheepdog.
No changes to the client APIs or protocols are needed.

In a simple test on my box, aligned writes get up to 1.5x faster than
replication, while read performance drops to 60%, compared with copies=3
(4:2 scheme).

How It Works:

/*
 * Stripe: data strips + parity strips, spread over all replicas
 * DS: data strip
 * PS: parity strip
 * R: Replica
 *
 *  +--------------------stripe ----------------------+
 *  v                                                 v
 * +----+----------------------------------------------+
 * | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
 * +----+----------------------------------------------+
 * | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
 * +----+----+----+----+----+ ... +----+----+-----+----+
 *  R1    R2   R3   R4   R5   ...   Rn  Rn+1  Rn+2  Rn+3
 */

We use the replicas to hold data and parity strips. Suppose we have a
4:2 scheme: 4 data strips and 2 parity strips on 6 replicas, with a strip size
of 1k. So basically we generate 2k of parity for each 4k write, and we call
this 6k as a whole a stripe. For writes, we spread data horizontally, not
vertically as replication does. So for reads, we have to assemble the strips
from all the data replicas, which is probably why reads are slower.

Usage:

Just add one more option to 'dog vdi create':

$ dog vdi create -e test 10G # This will create an erasure coded vdi with thin provisioning

For now '-e' only uses a fixed scheme (4 data and 2 parity strips). But I plan
to add '-e number', so that users can specify how many parity replicas they
want, with different erasure schemes for different vdis. E.g., we can have

-e 2 --> 4 : 2 (0.5 redundancy, tolerates 2 node failures)
-e 3 --> 8 : 3 (0.375 redundancy, tolerates 3 node failures)
-e 4 --> 8 : 4 (0.5 redundancy, tolerates 4 node failures)

TODOs:
1. add recovery code
2. support snapshot/clone/backup
3. support user-defined redundancy level
4. add tests

Liu Yuan (9):
  add forward error correction for erasure code
  gateway: remove init_target_nodes
  gateway: remove shortcut for local node in gateway_forward_request
  gateway: rename write_info as forward_info
  gateway: support forward read request
  gateway: allow sending different requests to replica
  sheep: use copy_policy to control erasure vdi
  erasure: add basic read/write code proper
  erasure: add unaligned write support

 dog/dog.h                |    2 +-
 dog/farm/farm.c          |   11 +-
 dog/vdi.c                |   24 +-
 include/Makefile.am      |    2 +-
 include/fec.h            |  170 ++++++++++++++
 include/sheepdog_proto.h |    7 +-
 lib/Makefile.am          |    2 +-
 lib/fec.c                |  578 ++++++++++++++++++++++++++++++++++++++++++++++
 sheep/gateway.c          |  396 +++++++++++++++++++++++--------
 sheep/group.c            |    3 +-
 sheep/ops.c              |    9 +-
 sheep/plain_store.c      |   13 +-
 sheep/sheep_priv.h       |    9 +-
 sheep/vdi.c              |   39 +++-
 14 files changed, 1143 insertions(+), 122 deletions(-)
 create mode 100644 include/fec.h
 create mode 100644 lib/fec.c

-- 
1.7.9.5
