[Sheepdog] partition recovery algorithm

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Mon Nov 2 07:04:24 CET 2009


On 11/01/2009 06:09 PM, Dietmar Maurer wrote:
> Is there some documentation about the implementation of the key/value storage system.

Some documentation is here:
http://www.osrg.net/sheepdog/design.html#object

However, it describes the general consistent-hashing algorithm,
not the Sheepdog-specific one.
We will add the Sheepdog implementation to the design page.
Until the documentation improves, please feel free to contact us
on this mailing list.

> Where and how are those 4MB blocks stored locally?

Each 4 MB block (we call it an "object") is simply stored as a file
named after its object id.
We are looking for a local key-value store that can store objects more
efficiently.
We have tried Berkeley DB as local storage, but its performance is
not good for 4 MB objects.
Berkeley DB seems to be tuned for much smaller blocks.
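
As a rough illustration of the file-per-object scheme described above,
here is a minimal Python sketch. It is not the Sheepdog code: the
function names, the store directory, and the 16-digit hex file name are
assumptions made only for this example.

```python
import os
import tempfile

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB per object, as described above


def object_path(store_dir: str, oid: int) -> str:
    # Assumption: the file name is the 64-bit object id as 16 hex digits.
    return os.path.join(store_dir, f"{oid:016x}")


def write_object(store_dir: str, oid: int, data: bytes) -> None:
    # One object is one plain file; reject anything larger than an object.
    if len(data) > OBJECT_SIZE:
        raise ValueError("object larger than 4 MB")
    with open(object_path(store_dir, oid), "wb") as f:
        f.write(data)


def read_object(store_dir: str, oid: int) -> bytes:
    with open(object_path(store_dir, oid), "rb") as f:
        return f.read()


if __name__ == "__main__":
    store = tempfile.mkdtemp()  # hypothetical store directory
    write_object(store, 0x2A, b"hello")
    print(object_path(store, 0x2A), read_object(store, 0x2A))
```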

> What kind of ID do you use to uniquely identify the blocks
> (as hash or a generation number)?

Sheepdog has three types of objects: the super object, vdi objects,
and data objects.
Each object has a 64-bit id.
The super object is unique in the cluster, and its id is zero.

A vdi object id is allocated by the following rules:
  - the first 18 bits are filled with zeros
  - the next 37 bits represent the vdi id. No two vdis have the same id.
    A vdi id is allocated uniquely by locking the super object.
  - the next 8 bits are reserved for future use.

A data object id is allocated by the following rules:
  - the first 18 bits represent the one-based index of the object within
    the provided block-level volume. Each vdi can use 2^18 data objects,
    so the maximum disk size is 1 TB by default (2^18 x 4 MB).
  - the next 37 bits represent the vdi id, that is, the vdi that
    allocated this data object.
  - the next 8 bits are reserved for future use.

The point is that each vdi can allocate its own data objects without locking.
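
The id rules above can be sketched in Python as follows. This is only
an illustration under stated assumptions: "first" is read here as the
lowest-order bits, the three listed fields (18 + 37 + 8 bits) are packed
upward from bit 0, and vdi id 0 is rejected so that no vdi object id
collides with the super object id of zero. The function names are
hypothetical, not Sheepdog's.

```python
INDEX_BITS = 18     # one-based data object index
VDI_BITS = 37       # vdi id
RESERVED_BITS = 8   # reserved, always zero in this sketch
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB

# Assumed placement: index in the lowest bits, vdi id above it.
VDI_SHIFT = INDEX_BITS
INDEX_MASK = (1 << INDEX_BITS) - 1
VDI_MASK = (1 << VDI_BITS) - 1


def vdi_object_id(vdi_id: int) -> int:
    # Index field is all zeros for a vdi object.
    if not 1 <= vdi_id <= VDI_MASK:
        raise ValueError("vdi id out of range")
    return vdi_id << VDI_SHIFT


def data_object_id(vdi_id: int, index: int) -> int:
    # The index is one-based, so 0 is never a valid data object index.
    if not 1 <= index <= INDEX_MASK:
        raise ValueError("data object index out of range")
    return vdi_object_id(vdi_id) | index


def split_object_id(oid: int) -> tuple:
    # Returns (vdi id, index); index 0 means the id names a vdi object.
    return (oid >> VDI_SHIFT) & VDI_MASK, oid & INDEX_MASK


# 2^18 objects x 4 MB (2^22 bytes) = 2^40 bytes, the 1 TB limit above.
MAX_VOLUME_BYTES = (1 << INDEX_BITS) * OBJECT_SIZE
```

Because the vdi id is part of every data object id, two vdis can never
produce the same data object id, which is why no lock is needed when a
vdi allocates a new data object.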

> And how does the partition recovery algorithm work?

When a failure occurs, new partition information is sent from
the JGroups master group, and the recovery thread moves objects in
the background based on the new partition information.
Vdi objects store the old partition version number together with each
data object id, so a VM can get the partition information
that was in use when the data object was stored.
By using the old partition information, a VM can access a data object
even before it has been recovered.
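
The versioned lookup described above might look roughly like the
following Python sketch. It is an illustration only: the class and
function names, the SHA-1 based ring, and the `copies` parameter are
assumptions for this example, and the real implementation is still
under development.

```python
import hashlib
from bisect import bisect_right


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def locate(nodes, oid, copies=1):
    # Generic consistent hashing: walk the ring clockwise from the object.
    ring = sorted((_hash(n), n) for n in nodes)
    points = [h for h, _ in ring]
    start = bisect_right(points, _hash(f"{oid:016x}"))
    count = min(copies, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


class Cluster:
    def __init__(self):
        self.partitions = {}  # partition version -> list of node names
        self.current = 0

    def update(self, nodes):
        # Called when new partition information arrives.
        self.current += 1
        self.partitions[self.current] = list(nodes)
        return self.current


class Vdi:
    def __init__(self):
        self.versions = {}  # data object id -> partition version it lives under

    def record(self, oid, version):
        # Set on write, and again once the recovery thread has moved the object.
        self.versions[oid] = version


def nodes_for_read(cluster, vdi, oid, copies=1):
    # Use the partition version stored with the object id, not the newest one.
    version = vdi.versions[oid]
    return locate(cluster.partitions[version], oid, copies)
```

Until the recovery thread has moved an object and its recorded version
is updated, `nodes_for_read` keeps resolving it against the old
partition information.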

I think this is one of the differences from Amazon Dynamo.
Dynamo does not require strong consistency, but Sheepdog does,
because Sheepdog provides low-level storage.

Most of the data recovery features are still under development.


Regards,

MORITA Kazutaka



