[Sheepdog] partition recovery algorithm

Mon Nov 2 08:45:48 CET 2009

> > Where and how are those 4MB blocks stored locally?
> 
> Each 4MB block (we call it an "object") is simply stored as a file
> named after its object id.
> We are looking for a local key-value store to store objects more
> efficiently.
> We have tried Berkeley DB as local storage, but its performance is
> not good for 4 MB objects.
> Berkeley DB seems to be tuned for smaller blocks.
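
For readers following the thread, here is a minimal sketch of the "one file per object, named after its object id" layout described above. It is not taken from the Sheepdog sources: the hex file-name format, the directory argument, and the helper names are all assumptions made for illustration.

```python
import os
import tempfile

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB object size, as stated in the thread


def object_path(store_dir, object_id):
    # Assumption: the file name is the object id as 16 hex digits.
    return os.path.join(store_dir, f"{object_id:016x}")


def write_object(store_dir, object_id, data):
    """Store one object as a single file named after its id."""
    if len(data) > OBJECT_SIZE:
        raise ValueError("object larger than 4 MB")
    # Write to a temporary file first, then rename, so a crash
    # never leaves a half-written object under the final name.
    fd, tmp = tempfile.mkstemp(dir=store_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, object_path(store_dir, object_id))
    except BaseException:
        os.unlink(tmp)
        raise


def read_object(store_dir, object_id):
    """Read back the whole object stored under this id."""
    with open(object_path(store_dir, object_id), "rb") as f:
        return f.read()
```

The write-then-rename step is a design choice of this sketch, not something the thread says Sheepdog does.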

And do you always write 4MB - or is it possible to write smaller blocks?

> > And how does the partition recovery algorithm work?
> 
> When a failure has occurred, new partition information is sent from
> the JGroups master group, and the recovery thread moves objects based
> on the new partition information in the background.
> Vdi objects store the old partition version numbers with each

But can't the 'partition version numbers' be the same even though the data is different (when the cluster was partitioned)?

> data object id, so the VM can get the old partition information
> that was used to store the data object at the time.
> By using the old partition information, the VM can access a data object
> even before it has been recovered.
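
A rough sketch of the versioned lookup described in the quoted text, as far as it can be read from this thread. Everything concrete here is an assumption: the `partitions` history, the `vdi` mapping, the hash-modulo placement rule, and the function names are hypothetical, and replication is left out entirely.

```python
import hashlib

# Hypothetical history: partition version -> nodes in that version.
partitions = {
    1: ["node-a", "node-b", "node-c"],
    2: ["node-a", "node-c"],  # version after a membership change
}

# Hypothetical vdi object: data object id -> partition version
# that was current when the data object was stored.
vdi = {100: 1, 101: 2}


def locate(object_id, version):
    """Pick the node for an object under a given partition version.

    Assumption: placement is hash(object id) modulo node count.
    """
    nodes = partitions[version]
    h = hashlib.sha1(str(object_id).encode()).digest()
    return nodes[int.from_bytes(h[:8], "big") % len(nodes)]


def node_for_read(object_id):
    # Reads use the version recorded in the vdi object, so an object
    # that has not been moved yet is still found at its old location.
    return locate(object_id, vdi[object_id])


def recover(object_id, new_version):
    """One background recovery step: return (source, target) and
    record the new version once the object counts as moved."""
    src = locate(object_id, vdi[object_id])
    dst = locate(object_id, new_version)
    vdi[object_id] = new_version
    return src, dst
```

Until `recover` runs for an object, `node_for_read` keeps pointing at the location given by the old partition information; afterwards it points at the new one.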

But how do you compare data? I mean, you need to make sure all nodes have exactly the same data. Are you using some kind of hash/digest (Tiger hash, Merkle tree)?
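
To make the question concrete, this is the kind of per-object digest comparison I have in mind. It is only a sketch: SHA-256 stands in for Tiger (which Python's hashlib does not provide), the `replicas` mapping and function names are hypothetical, and treating the majority digest as the good copy is an assumption, not a claim about what Sheepdog does.

```python
import hashlib
from collections import Counter


def digest(data):
    # SHA-256 used here in place of Tiger.
    return hashlib.sha256(data).hexdigest()


def divergent_replicas(replicas):
    """replicas: dict of node name -> object bytes.

    Returns the sorted names of nodes whose copy differs from the
    most common digest. Only digests need to cross the network,
    not the 4MB objects themselves.
    """
    digests = {node: digest(data) for node, data in replicas.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(n for n, d in digests.items() if d != majority)
```

A Merkle tree would extend the same idea by hashing groups of object digests, so that two nodes can find the differing objects without exchanging every digest.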

- Dietmar