[sheepdog] [PATCH v6 00/17] object reclaim based on generational reference counting

Hitoshi Mitake mitake.hitoshi at gmail.com
Wed Mar 5 16:28:41 CET 2014

The object reclaim doesn't support hypervolume yet. But hypervolume cannot be
used as a virtual disk (both of qemu and tgt don't support it) currently. And
the removal of old vdi deletion is acceptable for hypervolume because it doesn't
support snapshot, etc. So I think this patchset can be applied to the master

The same code is pushed to:

There are a problem reported by Valerio, but this patchset can pass
every existing test. So I think it is good time to start incremental

 - recycle old child_vdi_id and move btree_counter to the place
 - update CHANGELOG.md

 - rebase on master
 - update store version and break compatibility explicitly
 - remove coding style problems
 -- don't use pthread_mutex_t directly
 -- rename inode.gref -> inode.gref
 - resolve regression in http feature (10th patch)
 - forbid creating snapshot of hypervolumes temporally
 - enhance tests (14th patch)
 - remove various problems exposed by the patchset (1st - 3rd patches)

 - remove some bugs
 - refine tests
 -- the patchset can pass every test in vdi group

 - rebase onto master

 - rebase onto master
 - add test to check object reclaim feature
 - rename *_obj() to *_object()
 - add documentation about the algorithm

Currently, sheepdog doesn't reclaim objects until all the snapshots
are deleted.  For example:

 $ dog vdi create image
 $ dog vdi write image < some_data
 $ dog vdi snapshot image -s snap1
 $ dog vdi write image < some_data
 $ dog vdi delete image            <- this doesn't reclaim the objects
                                         of the image
 $ dog vdi delete image -s snap1   <- this reclaims all the data objects
                                         of both image and image:snap1

To resolve this problem we need to introduce something like reference
counting mechanism.  However, a naive approach doesn't work well on
Sheepdog.  It is because, when we create a snapshot or clone vdi, we
have to increment all the reference count of its data objects and send
many requests to all over the nodes.

This series uses generation reference counting [1], which is more
appropriate algorithm for distributed system like Sheepdog.  With this
series, we can reclaim objects just when they are no longer necessary.

[1] B. Goldberg, Generation reference counting: A reduced
communication distributed storage reclamation scheme, PLDI '89

Here is a brief explanation ob the object reclaim algorithm.  If you
want to know the details or correctness proof of the generational
reference counting, please refer the original paper:

Each data object contains the following structure in its inode object:

    struct generation_reference {
        int32_t generation;
        int32_t count;
    } gref;

The generation field identifies the generation of the reference and
the count field records the number of references copied from this
particular reference.

We also introduce a new type of object, a ledger object.  The ledger
object exists for each data object and contains an array of int32_t
values. The i-th element of the array contains information about the
number of outstanding data object references in the i-th generation.
Generational reference counting is performed as follows:

1. When a vdi A is created, the initial data object references are
   initialized as follows:

       for (i = 0; i < MAX_DATA_OBJS; i++) {
           A.gref[i].generation = 0;
           A.gref[i].count = 0;

   and a ledger object for each data object is initialized as follows:

       ledger_object[0] = 1;
       for (i = 1; i < MAX_DATA_OBJS; i++) {
           ledger_object[i] = 0;

   In practice, the ledger object is created and initialized when the
   object is accessed for the first time, so we don't have to create
   all the ledger objects when the vdi is created.

2. When a VDI B is created as a clone (or snapshot) of A, the new
   reference B is initialized as follows:

       for (i = 0; i < MAX_DATA_OBJS; i++) {
           B.gref[i].generation = A.gref[i].generation + 1;
           B.gref[i].count = 0;

   In addition, A.gref.count's are incremented:

       for (i = 0; i < MAX_DATA_OBJS; i++) {

3. When a object o is removed, a decrement message is sent to its
   ledger object containing the values of o.generation and o.count.

4. When a decrement message to o is received, in which the generation
   field has the value i and the count field has the value c, the
   following actions are performed:

   o.ledger[i + 1] += c;

   When every element of the ledger becomes zero (for all i >= 0,
   o.ledger[i] = 0), o can be reclaimed.

   Note that some elements of the ledger may hold negative values.

Hitoshi Mitake (17):
  tests/functional: use dog instead of qemu-img in 016
  dog: use nr_copies of inode object instead of command line parameter
  dog: forbid creating snapshot of hypervolume
  sheep: introduce generational reference counting for object reclaim
  sheep: introduce ledger objects
  sheep: decrement generational reference count on vdi deletion
  sheep: decrement generational reference count on copy-on-write
  sheep: remove max children limit
  sheep: introduce sparse objects
  sheep, http: don't allocate inode object on stack
  sheepdog proto: recycle old child_vdi_id for storing btree_counter
  sheep: update store format version to v5
  tests: show ledger objects in list
  tests/functional: add test for object reclaim
  doc: add documentation of object reclaim algorithm
  tests/functional: add a new test of the object reclaim for cloned VDIs
  update CHANGELOG.md for new object reclaim scheme

 CHANGELOG.md               |    3 +
 doc/object-reclaim.txt     |   73 +++
 dog/vdi.c                  |    7 +-
 include/internal_proto.h   |    2 +
 include/sheepdog_proto.h   |   59 ++-
 include/util.h             |    3 +
 lib/util.c                 |   62 +++
 sheep/config.c             |    2 +-
 sheep/gateway.c            |  113 ++++-
 sheep/http/kv.c            |   11 +-
 sheep/migrate.c            |   16 +
 sheep/ops.c                |   90 ++++
 sheep/plain_store.c        |   70 ++-
 sheep/sheep_priv.h         |   16 +
 sheep/store.c              |   25 ++
 sheep/vdi.c                |  370 +++------------
 tests/functional/016       |    2 +-
 tests/functional/016.out   |   14 +-
 tests/functional/042       |    4 +-
 tests/functional/042.out   |   44 +-
 tests/functional/044.out   |   14 +-
 tests/functional/057.out   |  140 +++---
 tests/functional/058.out   |   14 +-
 tests/functional/063.out   |   28 +-
 tests/functional/064.out   |   38 +-
 tests/functional/086       |   36 +-
 tests/functional/086.out   | 1068 ++------------------------------------------
 tests/functional/088       |   47 ++
 tests/functional/088.out   |   42 ++
 tests/functional/common.rc |   10 +-
 tests/functional/group     |    3 +-
 31 files changed, 915 insertions(+), 1511 deletions(-)
 create mode 100644 doc/object-reclaim.txt
 create mode 100755 tests/functional/088
 create mode 100644 tests/functional/088.out


More information about the sheepdog mailing list