[sheepdog] [PATCH v2 00/11] object reclaim based on generational reference counting

Liu Yuan namei.unix at gmail.com
Mon Jul 1 09:03:48 CEST 2013


On Wed, Jun 19, 2013 at 02:14:20AM +0900, MORITA Kazutaka wrote:
> From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> 
> v2:
>  - rebase onto master
>  - add test to check object reclaim feature
>  - rename *_obj() to *_object()
>  - add documentation about the algorithm
> 
> Currently, sheepdog doesn't reclaim objects until all the snapshots
> are deleted.  For example:
> 
>  $ collie vdi create image
>  $ collie vdi write image < some_data
>  $ collie vdi snapshot image -s snap1
>  $ collie vdi write image < some_data
>  $ collie vdi delete image            <- this doesn't reclaim the objects
>                                          of the image
>  $ collie vdi delete image -s snap1   <- this reclaims all the data objects
>                                          of both image and image:snap1
> 
> To resolve this problem we need to introduce something like reference
> counting mechanism.  However, a naive approach doesn't work well on
> Sheepdog.  It is because, when we create a snapshot or clone vdi, we
> have to increment all the reference count of its data objects and send
> many requests to all over the nodes.
> 
> This series uses generation reference counting [1], which is more
> appropriate algorithm for distributed system like Sheepdog.  With this
> series, we can reclaim objects just when they are no longer necessary.
> 
> [1] B. Goldberg, Generation reference counting: A reduced
> communication distributed storage reclamation scheme, PLDI '89
> 
> ==
> Here is a brief explanation ob the object reclaim algorithm.  If you
> want to know the details or correctness proof of the generational
> reference counting, please refer the original paper:
> 
> Each data object contains the following structure in its inode object:
> 
>     struct generation_reference {
>         int32_t generation;
>         int32_t count;
>     } data_ref;
> 
> The generation field identifies the generation of the reference and
> the count field records the number of references copied from this
> particular reference.
> 
> We also introduce a new type of object, a ledger object.  The ledger
> object exists for each data object and contains an array of int32_t
> values. The i-th element of the array contains information about the
> number of outstanding data object references in the i-th generation.
> Generational reference counting is performed as follows:
> 
> 1. When a vdi A is created, the initial data object references are
>    initialized as follows:
> 
>        for (i = 0; i < MAX_DATA_OBJS; i++) {
>            A.data_ref[i].generation = 0;
>            A.data_ref[i].count = 0;
>        }
> 
>    and a ledger object for each data object is initialized as follows:
> 
>        ledger_object[0] = 1;
>        for (i = 1; i < MAX_DATA_OBJS; i++) {
>            ledger_object[i] = 0;
>        }
> 
>    In practice, the ledger object is created and initialized when the
>    object is accessed for the first time, so we don't have to create
>    all the ledger objects when the vdi is created.
> 
> 2. When a VDI B is created as a clone (or snapshot) of A, the new
>    reference B is initialized as follows:
> 
>        for (i = 0; i < MAX_DATA_OBJS; i++) {
>            B.data_ref[i].generation = A.data_ref[i].generation + 1;
>            B.data_ref[i].count = 0;
>        }
> 
>    In addition, A.data_ref.count's are incremented:
> 
>        for (i = 0; i < MAX_DATA_OBJS; i++) {
>            A.data_ref[i].count++;
>        }
> 
> 3. When a object o is removed, a decrement message is sent to its
>    ledger object containing the values of o.generation and o.count.
> 
> 4. When a decrement message to o is received, in which the generation
>    field has the value i and the count field has the value c, the
>    following actions are performed:
> 
>    o.ledger[i]--;
>    o.ledger[i + 1] += c;
> 
>    When every element of the ledger becomes zero (for all i >= 0,
>    o.ledger[i] = 0), o can be reclaimed.
> 
>    Note that some elements of the ledger may hold negative values.

This is very fundamental change, I guess maybe we should merget it after we
choose the base for the stable branch 0.6.x.

By the way, when will the stable branch created?

Thanks
Yuan



More information about the sheepdog mailing list