[sheepdog] [PATCH v2 00/11] add basic erasure code support
Liu Yuan
namei.unix at gmail.com
Thu Sep 26 09:25:37 CEST 2013
v2:
- update fec.c based on the comments from Hitoshi and Kazutaka
- change the data strip size to 1K so that it works with VMs
- rename get_object_size as get_store_objsize
- update some commits
- remove unnecessary padding
- change copy type from uint32_t to uint8_t
- add one more patch to pass the functional/030 test
Introduction:
This is the first round of adding erasure code support to sheepdog. This patch
set adds basic read/write/remove code for erasure-coded vdis. The full-blown
version is planned to support all the current features, such as snapshot, clone,
incremental backup and cluster-wide backup.
As always, we support random read/write to erasure coded vdis.
Instead of storing full copies in the replicas, erasure coding spreads data
across all the replicas to achieve the same fault tolerance while keeping the
redundancy minimal (a redundancy level of at most 0.5).
Sheepdog will transparently support erasure coding for read/write/remove
operations on the fly while clients store/retrieve data in sheepdog.
No changes to the client APIs or protocols are needed.
For a simple test on my box, aligned 4K writes are at most 1.5x faster than
replication, while reads are 1.15x faster, compared with copies=3 (4:2 scheme).
For a 6-node cluster with 1000Mb/s NICs, I got the following results:
replication (3 copies): write 36.5 MB/s, read 71.8 MB/s
erasure code (4 : 2)  : write 46.6 MB/s, read 82.9 MB/s
How It Works:
/*
* Stripe: data strips + parity strips, spread on all replica
* DS: data strip
* PS: parity strip
* R: Replica
*
* +--------------------stripe ----------------------+
* v v
* +----+----------------------------------------------+
* | ds | ds | ds | ds | ds | ... | ps | ps | ... | ps |
* +----+----------------------------------------------+
* | .. | .. | .. | .. | .. | ... | .. | .. | ... | .. |
* +----+----+----+----+----+ ... +----+----+-----+----+
* R1 R2 R3 R4 R5 ... Rn Rn+1 Rn+2 Rn+3
*/
We use the replicas to hold data and parity strips. Suppose we have a
4:2 scheme: 4 data strips and 2 parity strips on 6 replicas, with a strip size
of 1K. So basically we generate 2K of parity for each 4K write, and we call the
resulting 6K a stripe. For writes, we spread data horizontally, not vertically
as replication does. So for reads, we have to assemble the strips from all the
data replicas.
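To make the layout concrete, here is a small illustrative sketch of how a
logical offset maps to a data replica and a strip-local offset under the 4:2
scheme with 1K strips. This is not the actual gateway code; the names
STRIP_SIZE, NR_DATA_STRIPS and stripe_map() are made up for this example.

#include <inttypes.h>
#include <stdio.h>

#define STRIP_SIZE       1024  /* 1K data strip, as used in this series */
#define NR_DATA_STRIPS   4     /* 4:2 scheme */
#define NR_PARITY_STRIPS 2

/* Map a logical offset to (data replica index, offset within that strip). */
static void stripe_map(uint64_t offset, int *replica, uint64_t *strip_off)
{
        uint64_t row = offset / (STRIP_SIZE * NR_DATA_STRIPS);
        uint64_t in_row = offset % (STRIP_SIZE * NR_DATA_STRIPS);

        *replica = in_row / STRIP_SIZE;               /* data replica 0..3 */
        *strip_off = row * STRIP_SIZE + in_row % STRIP_SIZE;
}

int main(void)
{
        /* A 4K write at offset 0 lands on data replicas 0..3; the gateway
         * would additionally compute 2 x 1K of parity for replicas 4 and 5. */
        for (uint64_t off = 0; off < 4096; off += STRIP_SIZE) {
                int replica;
                uint64_t strip_off;

                stripe_map(off, &replica, &strip_off);
                printf("offset %" PRIu64 " -> replica %d, strip offset %" PRIu64 "\n",
                       off, replica, strip_off);
        }
        return 0;
}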
The downsides of erasure coding are:
1. for recovery, we have to recover 0.5x more data
2. if any replica fails, reads have to wait for its recovery
3. it needs at least 6 (4+2) nodes to work
Usage:
just add one more option to 'dog vdi create':
$ dog vdi create -e test 10G # this creates an erasure-coded vdi with thin provisioning
For now we only use a fixed scheme (4 data and 2 parity strips) with '-e'.
But I plan to support '-e number', so that users can specify how many parity
strips they want, with different erasure schemes for different vdis. E.g., we
could have (a rough sketch follows below):
-e 2 --> 4 : 2 (0.5 redundancy, tolerates 2 node failures)
-e 3 --> 8 : 3 (0.375 redundancy, tolerates 3 node failures)
-e 4 --> 8 : 4 (0.5 redundancy, tolerates 4 node failures)
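As a rough illustration of those numbers (purely hypothetical code, since
'-e number' is not implemented in this series), the redundancy level is just
parity strips divided by data strips, and the number of tolerated node
failures equals the number of parity strips:

#include <stdio.h>

/* Planned (data, parity) schemes from the table above; the mapping itself
 * is hypothetical and only mirrors the examples in this cover letter. */
struct scheme { int opt, data, parity; };

int main(void)
{
        static const struct scheme schemes[] = {
                { 2, 4, 2 },  /* -e 2 --> 4 : 2 */
                { 3, 8, 3 },  /* -e 3 --> 8 : 3 */
                { 4, 8, 4 },  /* -e 4 --> 8 : 4 */
        };

        for (int i = 0; i < 3; i++)
                printf("-e %d: %d:%d, redundancy %.3f, tolerates %d node failures\n",
                       schemes[i].opt, schemes[i].data, schemes[i].parity,
                       (double)schemes[i].parity / schemes[i].data,
                       schemes[i].parity);
        return 0;
}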
TODOs:
1. add recovery code
2. support snapshot/clone/backup
3. support user-defined redundancy level
4. add tests
Liu Yuan (11):
add forward error correction for erasure code
gateway: remove init_target_nodes
gateway: remove shortcut for local node in gateway_forward_request
gateway: rename write_info as forward_info
gateway: support forward read request
gateway: allow sending different requests to replica
sheep: use copy_policy to control erasure vdi
erasure: add basic read/write code proper
erasure: add unaligned write support
change copy type from uint32_t to uint8_t
make dog copy_policy aware
dog/cluster.c | 10 +-
dog/common.c | 4 +-
dog/dog.h | 4 +-
dog/farm/farm.c | 18 +-
dog/farm/farm.h | 10 +-
dog/farm/object_tree.c | 15 +-
dog/vdi.c | 67 ++++--
include/Makefile.am | 2 +-
include/fec.h | 173 ++++++++++++++
include/sheepdog_proto.h | 24 +-
include/util.h | 2 +
lib/Makefile.am | 2 +-
lib/fec.c | 574 ++++++++++++++++++++++++++++++++++++++++++++++
sheep/gateway.c | 401 ++++++++++++++++++++++++--------
sheep/group.c | 3 +-
sheep/ops.c | 9 +-
sheep/plain_store.c | 13 +-
sheep/sheep.c | 2 +
sheep/sheep_priv.h | 11 +-
sheep/vdi.c | 40 +++-
20 files changed, 1221 insertions(+), 163 deletions(-)
create mode 100644 include/fec.h
create mode 100644 lib/fec.c
--
1.7.9.5