[Sheepdog] partition recovery algorithm

Mon Nov 2 22:52:30 CET 2009

On 11/02/2009 04:45 PM, Dietmar Maurer wrote:
>>> Where and how are those 4MB blocks stored locally?
>> Each 4 MB block (we call it an "object") is simply stored as a file
>> named after its object id.
>> We are looking for a local key-value store to store objects more
>> efficiently.
>> We have tried Berkeley DB as local storage, but its performance is
>> not good for 4 MB objects.
>> Berkeley DB seems to be tuned for smaller blocks.
> 
> And do you always write 4MB - or is it possible to write smaller blocks?

Sheepdog can write smaller blocks in most cases.
But in some cases (for example, when writing to a snapshot VDI image),
Sheepdog has to write the entire object.
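
To illustrate the "one file per object, named by object id" layout and a
partial write into it, here is a minimal sketch in Python. It is not
Sheepdog code: the directory layout, the 16-hex-digit file name format,
and the function names are assumptions made only for this example.

```python
import os

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB per object, as described above


def object_path(store_dir, oid):
    # Assumption: the file name is the object id as 16 hex digits.
    return os.path.join(store_dir, f"{oid:016x}")


def write_object(store_dir, oid, data, offset=0):
    """Write `data` at `offset` inside the object file (partial write)."""
    if offset < 0 or offset + len(data) > OBJECT_SIZE:
        raise ValueError("write does not fit inside one object")
    fd = os.open(object_path(store_dir, oid), os.O_RDWR | os.O_CREAT, 0o600)
    try:
        view = memoryview(data)
        written = 0
        # pwrite may write fewer bytes than requested, so loop until done.
        while written < len(view):
            written += os.pwrite(fd, view[written:], offset + written)
    finally:
        os.close(fd)


def read_object(store_dir, oid, length, offset=0):
    """Read `length` bytes at `offset` from the object file."""
    fd = os.open(object_path(store_dir, oid), os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

A whole-object write (the snapshot case) is the same call with offset 0
and a 4 MB buffer.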

>>> And how does the partition recovery algorithm work?
>> When a failure has occurred, new partition information is sent from
>> the JGroups master group, and the recovery thread moves objects based
>> on the new partition information in the background.
>> VDI objects store the old partition version numbers with each
> 
> But 'partition version numbers' can be the same, although the data is different (when the cluster was partitioned)?
> 
>> data object id, so a VM can get the old partition information
>> that was used to store the data object at the time.
>> By using the old partition information, a VM can access a data object
>> even before it has been recovered.
> 
> But how do you compare data? I mean you need to make sure all nodes have exactly the same data. Are you using some kind of hash/digest (tiger hash, merkle tree)?

Sorry, it seems I misunderstood your question.
Were you asking about the network partition problem (split-brain)?
Sheepdog does not tolerate network partitions
in the current implementation.
We are thinking of using a majority voting algorithm to deal with this
problem, but the details have not been discussed enough.
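
Just to show the general idea of majority voting (this is not a Sheepdog
design, only a generic sketch with hypothetical names): a group of nodes
continues to accept writes only if it can reach a strict majority of the
known cluster members, so two sides of a split cannot both proceed.

```python
def has_quorum(reachable, members):
    """Return True if `reachable` contains a strict majority of `members`.

    `members` is the full, previously agreed cluster membership.
    `reachable` is the set of nodes this side can currently talk to
    (including itself). Nodes outside `members` are ignored.
    """
    members = set(members)
    alive = set(reachable) & members
    # Strict majority: an even split leaves neither side with quorum.
    return len(alive) > len(members) // 2
```

With five members, a side that sees three nodes may continue and the side
that sees two must stop. With four members split two and two, neither side
continues.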

Regards,

MORITA Kazutaka