[Sheepdog] partition recovery algorithm
Dietmar Maurer
dietmar at proxmox.com
Mon Nov 2 08:45:48 CET 2009
> > Where and how are those 4MB blocks stored locally?
>
> Each 4MB block (we call it an "object") is simply stored as a file
> named after its object id.
> We are looking for a local key-value store to store objects more
> efficiently.
> We have tried Berkeley DB as local storage, but its performance is
> not good for 4 MB objects.
> Berkeley DB looks like it is tuned for smaller blocks.
And do you always write 4MB - or is it possible to write smaller blocks?
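
To make the question concrete, here is a minimal sketch of a file-per-object store that accepts writes smaller than 4 MB at an arbitrary offset. It is not Sheepdog's code: the names `object_path`, `write_object`, `read_object` and the 16-hex-digit file name are assumptions made for illustration only.

```python
import os

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB per object, as stated in the thread


def object_path(store_dir, object_id):
    # Assumed naming scheme: the object id as 16 hex digits.
    return os.path.join(store_dir, "%016x" % object_id)


def write_object(store_dir, object_id, offset, data):
    """Write `data` into the object file at `offset` without rewriting the rest."""
    if offset < 0 or offset + len(data) > OBJECT_SIZE:
        raise ValueError("write outside object bounds")
    # Create the file if it is missing, but never truncate existing content.
    fd = os.open(object_path(store_dir, object_id), os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_object(store_dir, object_id, offset, length):
    """Read `length` bytes of the object starting at `offset`."""
    with open(object_path(store_dir, object_id), "rb") as f:
        f.seek(offset)
        return f.read(length)
```

With this layout a small write touches only the bytes it covers. Any gap before the offset in a newly created file reads back as zeros.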
> > And how does the partition recovery algorithm work?
>
> When a failure has occurred, new partition information is sent from
> the JGroups master group, and the recovery thread moves objects based
> on the new partition information in the background.
> Vdi objects store the old partition version numbers with each
But 'partition version numbers' can be the same, although the data is different (when the cluster was partitioned)?
> data object id, so the VM can get the old partition information
> which was used to store the data object at the time.
> By using the old partition information, the VM can access a data object
> even before its data has been recovered.
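
The lookup described in the quoted text can be sketched as below. This is only an illustration under stated assumptions: the data model (`partition_history`, `vdi`), the function names, and the modulo placement rule are all hypothetical and do not come from Sheepdog.

```python
# Assumed: partition version -> ordered list of nodes in that version.
partition_history = {
    1: ["node-a", "node-b", "node-c"],
    2: ["node-a", "node-c"],  # node-b has failed
}

# Assumed: the vdi object records, for each data object id,
# the partition version under which the object was stored.
vdi = {7: 1, 8: 2}


def locate(object_id, version, history, copies=2):
    """Return the nodes that held `object_id` under partition `version`.

    The modulo placement rule is a stand-in, not Sheepdog's actual rule.
    """
    nodes = history[version]
    start = object_id % len(nodes)
    count = min(copies, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def nodes_for(object_id, vdi_map, history):
    # Use the version stored in the vdi, not the current one, so objects
    # that recovery has not moved yet can still be found.
    return locate(object_id, vdi_map[object_id], history)
```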
But how do you compare data? I mean you need to make sure all nodes have exactly the same data. Are you using some kind of hash/digest (tiger hash, merkle tree)?
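
As an illustration of the kind of check meant here, the following sketch compares replica files by digest. It uses SHA-256 from Python's standard library in place of the Tiger hash, and the names `file_digest` and `replicas_agree` are hypothetical. Whether Sheepdog does anything like this is exactly the open question.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def replicas_agree(paths):
    """True if every replica file in `paths` has the same content digest."""
    return len({file_digest(p) for p in paths}) <= 1
```

Exchanging one digest per object lets nodes detect diverged replicas without shipping the full 4 MB. A Merkle tree would extend the same idea to locate which objects differ across a large set.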
- Dietmar