[sheepdog] reovery and consistency questions

Tue Feb 10 08:31:35 CET 2015

At Sun, 08 Feb 2015 21:15:19 +0100,
Corin Langosch wrote:
> 
> Hi guys,
> 
> I'm currently digging around in the sheepdog sources and have a few questions regarding recovery and object consistency.
> Please correct me if I'm wrong in anything I write here - it's all just read together from various documents and source
> files.
> 
> Sheepdog keeps track which nodes are alive at a given point in time in an epoch object. Every time a node joins/ leaves
> the cluster a new epoch is genereated. A history of all epochs is kept.  Objects are mapped to nodes using consistend
> hashing, the objects ec-chunks simply ordered to the neighbors nodes. Using the epoch history we can map the same object
> to the same node for any past cluster state.
> 
> As for recovery, please consider the following cluster history and an object A (2:1 ec):
> 
> E Nodes                         Placement of chunks
> 1 []
> - node1 joins
> 2 [node1]                       not enough nodes
> - node2 joins
> 3 [node1, node2]                A1=node2,A3=node1
> - node3 joins
> 4 [node1, node2, node3]         A1=node2,A2=node3,A3=node1
> - node4 joins, A3 is moved to the its new place
> 5 [node1, node2, node3, node4]  A1=node2,A2=node3,A3=node4
> - node4 crashes, A3 is recovered from A1+A2
> 6 [node1, node2, node3]         A1=node2,A2=node3,A3=node1
> - whole cluster crashes
> 7 []
> - node4 joins
> 8 [node4]                       A3=node4 (no access, not enough nodes)
> - node3 joins
> 9 [node3, node4]                A3=node4,A2=node3 (access, but A3 is outdated!!!)
> 
> How do you prevent that the outdated version of A3 on node4 is used? The latest version of A3 is on node1 (epoch 6), but
> how do we know this by only keeping track of the epochs? Afaik there's no central repository which holds all object/
> chunk versions?
> 
> Thank you in advance :)

At the epoch 8 and 9, client cannot access to sheepdog because all
members of latest healthy epoch (in this case, 6) aren't gathered yet.
In such a case, you can see an output of cluster info command like
below:

$ dog cluster info
(git)-[vid-overflow] 
Cluster status: Waiting for other nodes to join cluster
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Cluster created at Tue Feb 10 16:20:10 2015
...

So I/O to outdated objects are prevented in the above case. Access is
allowed after gathering node 1, 2, 3, and 4 in your above example.

After gathering enough members of the latest heathy epoch, sheeps run
recovery process. Recovery process is simple:
1. exchange information of owning objects each other
2. list up objects which should belong to me
3. E <- the latest epoch
4. read an object from sheeps based on epoch E, the sheeps are
   calculated based on consistent hashing
5. if no sheep processes have the object, E <- E - 1, go back to 3
and repeat the above 3 - 5 until completing recovery of all
objects. So you don't need to worry about access to outdated object :)

I understand your concern well. This is really subtle and important
point of distributed storage systems including sheepdog.

Thanks,
Hitoshi