[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recovery the shut-down cluster

Wed Sep 21 05:48:37 CEST 2011

On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> Looks great, but there seems to be some other cases we need to
> consider.  For example:
>
> 1. Start Sheepdog with three daemons.
>    $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster format
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> 2. Then, kill sheep daemons, and start again in the same order.
>
>    $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
>    $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      2 [10.68.14.1:7000]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> The first daemon regards the other two nodes as left nodes, and starts
> working.
>
> 3. Start the other two nodes again.
>
>    $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7000]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    $ collie cluster info -p 7001
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7000, 10.68.14.1:7002]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    $ collie cluster info -p 7002
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> The epoch information becomes inconsistent.  This is because the first
> node overwrote the epochs in the other nodes.  Similar situations
> could happen if we start from the daemon which doesn't have the latest
> epoch.
>
> We can get away with claiming that this doesn't happen if the
> administrator is careful enough.  But is there any good idea to solve
> this problem?
>

I am really puzzled by the semantics of 'collie cluster info'. From the
code, it only reads the local epoch information, but the name suggests
that it should report cluster-wide information. Every node may have its
own history, and that history can *differ* from the epoch history on
other nodes.
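
To make the divergence concrete, here is a rough sketch in Python (not
sheepdog code; the dictionary layout and the name `divergent_epochs` are
made up for illustration). It takes the three per-node histories from the
example quoted above, keyed by port, and reports the epochs on which the
nodes disagree:

```python
# Illustration only: this is not how sheepdog stores or compares epochs.
# Per-node epoch histories copied from the quoted 'collie cluster info'
# output, written as {port queried: {epoch: [member ports]}}.
histories = {
    7000: {1: [7000, 7001, 7002], 2: [7000],
           3: [7000, 7001], 4: [7000, 7001, 7002]},
    7001: {1: [7000, 7001, 7002], 2: [7000, 7002],
           3: [7000, 7001], 4: [7000, 7001, 7002]},
    7002: {1: [7000, 7001, 7002], 2: [7001, 7002],
           3: [7000, 7001], 4: [7000, 7001, 7002]},
}


def divergent_epochs(histories):
    """Return the sorted epochs whose membership differs between nodes."""
    all_epochs = set()
    for log in histories.values():
        all_epochs.update(log)

    bad = []
    for epoch in sorted(all_epochs):
        # One entry per distinct view of this epoch; None means the node
        # has no record of the epoch at all.
        views = {
            tuple(sorted(log[epoch])) if epoch in log else None
            for log in histories.values()
        }
        if len(views) > 1:
            bad.append(epoch)
    return bad


print(divergent_epochs(histories))  # prints [2]
```

Only epoch 2 differs, which matches the output above: each node recorded
its own view of that epoch.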

So I think we should get the same epoch history from any node of the
cluster. Kazutaka, what do you think about getting the cluster info from
a single node only (the master node, in my opinion)? If that is possible,
how do we deal with the local epoch on a node that is not the master? If
not, we will suffer the epoch inconsistency you ran into; we cannot get
rid of this inconsistency in *every* case.
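
As a rough sketch of the single-node idea (Python, illustration only; the
function names, the data layout and the choice of 10.68.14.1:7000 as
master are all assumptions, not existing sheepdog behaviour), the query
would be answered from the master's log alone, and the open question is
what to do with the nodes whose local log disagrees:

```python
# Hypothetical sketch: not sheepdog's implementation.
from typing import Dict, List

EpochLog = Dict[int, List[str]]  # epoch number -> member addresses


def cluster_info(local_logs: Dict[str, EpochLog], master: str) -> EpochLog:
    """Answer a cluster-info query from the master's log only."""
    return local_logs[master]


def nodes_disagreeing_with_master(local_logs: Dict[str, EpochLog],
                                  master: str) -> List[str]:
    """List non-master nodes whose local epoch log differs from the master's."""
    reference = cluster_info(local_logs, master)
    return sorted(
        node for node, log in local_logs.items()
        if node != master and log != reference
    )


# Epoch 2 as each node reported it in the quoted example.
logs = {
    "10.68.14.1:7000": {2: ["10.68.14.1:7000"]},
    "10.68.14.1:7001": {2: ["10.68.14.1:7000", "10.68.14.1:7002"]},
    "10.68.14.1:7002": {2: ["10.68.14.1:7001", "10.68.14.1:7002"]},
}

# Assumed master for the sake of the example.
print(nodes_disagreeing_with_master(logs, "10.68.14.1:7000"))
```

With this approach every caller sees one history, but the local logs on
the other two nodes still have to be either overwritten or ignored.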

Thanks,
Yuan