[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recovery the shut-down cluster

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Wed Sep 21 07:05:23 CEST 2011


At Wed, 21 Sep 2011 11:48:37 +0800,
Liu Yuan wrote:
> 
> On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > Looks great, but there seem to be some other cases we need to
> > consider.  For example:
> >
> > 1. Start Sheepdog with three daemons.
> >    $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> >    $ collie cluster format
> >    $ collie cluster info
> >    Cluster status: running
> >
> >    Creation time        Epoch Nodes
> >    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > 2. Then, kill sheep daemons, and start again in the same order.
> >
> >    $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> >    $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >    $ collie cluster info
> >    Cluster status: running
> >
> >    Creation time        Epoch Nodes
> >    2011-09-20 16:43:10      2 [10.68.14.1:7000]
> >    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > The first daemon regards the other two nodes as left nodes, and starts
> > working.
> >
> > 3. Start the other two nodes again.
> >
> >    $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >    $ collie cluster info
> >    Cluster status: running
> >
> >    Creation time        Epoch Nodes
> >    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >    2011-09-20 16:43:10      2 [10.68.14.1:7000]
> >    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >    $ collie cluster info -p 7001
> >    Cluster status: running
> >
> >    Creation time        Epoch Nodes
> >    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >    2011-09-20 16:43:10      2 [10.68.14.1:7000, 10.68.14.1:7002]
> >    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >    $ collie cluster info -p 7002
> >    Cluster status: running
> >
> >    Creation time        Epoch Nodes
> >    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >    2011-09-20 16:43:10      2 [10.68.14.1:7001, 10.68.14.1:7002]
> >    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > The epoch information becomes inconsistent, because the first node
> > overwrote the epochs on the other nodes.  Similar situations could
> > happen whenever we start from a daemon which doesn't have the latest
> > epoch.
> >
> > We could get away with claiming that this doesn't happen if the
> > administrator is careful enough.  But is there a good way to solve
> > this problem?
> >
> 
> I am really puzzled by the semantics of 'collie cluster info'... from
> the code, it fetches the local epoch information, but its name suggests
> the command should report cluster-wide information.  Every node may
> have its own history, and may end up with a *different* epoch history
> than the other nodes.
> 
> So I think we should see the same epoch history on any node of the
> cluster.  Kazutaka, what do you think about getting the cluster info
> only from a single node (the master node, in my opinion)?  If we do
> that, how do we handle the local epoch on nodes that are not the
> master?  If we don't, we will keep hitting the epoch inconsistency you
> ran into; we cannot get rid of this inconsistency in *every* case.

We should completely remove any SPOF, so it is not good to store the
cluster info only on the master node.  Even if we replicated it to
multiple nodes, the inconsistency problem could happen again.

Sheepdog also uses the epoch histories for consistent hashing, so every
node needs to keep the epochs locally.  In addition, to ensure strong
data consistency, I think we cannot allow different histories to
coexist in the cluster.
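
To make the consistent-hashing point concrete, here is a rough sketch
(not the actual Sheepdog code; the struct and function names are made
up for this example) of how object placement can be derived from the
node list recorded for an epoch.  Without a local copy of that epoch's
node list, a node cannot tell where an object was placed:

  #include <stdint.h>

  struct node {
          uint32_t addr;  /* IPv4 address, host byte order */
          uint16_t port;
  };

  /* FNV-1a over the 8 bytes of the object id; just an example hash. */
  static uint64_t hash_oid(uint64_t oid)
  {
          uint64_t h = 14695981039346656037ULL;
          int i;

          for (i = 0; i < 8; i++) {
                  h ^= (oid >> (i * 8)) & 0xff;
                  h *= 1099511628211ULL;
          }
          return h;
  }

  /*
   * Pick up to nr_copies replica nodes for an object, using the node
   * list recorded for the epoch in which the object was written.
   */
  static int get_replicas(uint64_t oid, const struct node *nodes,
                          int nr_nodes, int nr_copies,
                          const struct node **out)
  {
          int first, i;

          if (nr_nodes <= 0)
                  return -1;
          if (nr_copies > nr_nodes)
                  nr_copies = nr_nodes;

          first = hash_oid(oid) % nr_nodes;
          for (i = 0; i < nr_copies; i++)
                  out[i] = &nodes[(first + i) % nr_nodes];
          return nr_copies;
  }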

Currently, we force users to start the node which has the latest epoch
first.  This is necessary to support strong consistency.

For example:

Suppose Sheepdog was running with 3 nodes and was killed without
"collie cluster shutdown".  The epochs are as follows:

  Epochs on node1
    2  [node1, node3]
    1  [node1, node2, node3]

  Epochs on node2
    3  [node2]
    2  [node1, node3]
    1  [node1, node2, node3]

  Epochs on node3
    1  [node1, node2, node3]

If we run a daemon on node3 first and start working from epoch 2, we
risk losing data, because some data could have been written in epoch 3.

In this case, node2 has the latest epoch.  If we start a sheep daemon
on node2 first, the epochs will be something like the following:

  Epochs on node1
    5  [node1, node2, node3]
    4  [node1, node2]
    2  [node1, node3]
    1  [node1, node2, node3]

  Epochs on node2
    5  [node1, node2, node3]
    4  [node1, node2]
    3  [node2]
    2  [node1, node3]
    1  [node1, node2, node3]

  Epochs on node3
    5  [node1, node2, node3]
    1  [node1, node2, node3]


The three nodes now share the same epoch history, though node1 and
node3 are missing some entries.  In this case, Sheepdog can ensure
strong data consistency.
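
To follow that rule, the administrator first has to find out which
store holds the newest epoch before restarting anything.  Below is a
minimal sketch of that check, assuming one file per epoch number under
an epoch/ subdirectory of the store; the directory layout and file
naming are only assumptions for illustration, not necessarily the real
on-disk format:

  #include <dirent.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Return the highest epoch number found in epoch_dir, or 0 if none. */
  static uint32_t get_latest_local_epoch(const char *epoch_dir)
  {
          DIR *dir = opendir(epoch_dir);
          struct dirent *d;
          uint32_t latest = 0;

          if (!dir)
                  return 0;       /* no epoch directory -> no history */

          while ((d = readdir(dir)) != NULL) {
                  char *end;
                  unsigned long e = strtoul(d->d_name, &end, 10);

                  if (end == d->d_name || *end != '\0')
                          continue;  /* skip ".", "..", non-numeric names */
                  if (e > latest)
                          latest = (uint32_t)e;
          }
          closedir(dir);
          return latest;
  }

  int main(int argc, char **argv)
  {
          const char *dir = argc > 1 ? argv[1] : "/store/0/epoch";

          printf("latest local epoch: %u\n", get_latest_local_epoch(dir));
          return 0;
  }

Running this against each store (e.g. /store/0, /store/1, /store/2)
would tell which daemon should be restarted first.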

Sheepdog uses the virtual synchrony of corosync, which delivers node
membership changes to every node in the same order.  I guess we can
keep every node on the same epoch history if we implement the recovery
mechanism carefully.
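
One possible shape of such a recovery mechanism (only a sketch of the
idea; the enum and function names are made up) is to compare epochs at
join time and never let a node with an older history overwrite a newer
one:

  #include <stdint.h>

  enum join_result {
          JOIN_OK,        /* local history is a prefix of the cluster's;
                             fetch the missing epochs and recover */
          JOIN_WAIT,      /* joining node has newer epochs; the cluster
                             should wait for or restart from this node */
  };

  static enum join_result check_epoch_on_join(uint32_t local_latest,
                                              uint32_t cluster_latest)
  {
          /*
           * The node saw membership changes the running cluster never
           * did (like node2's epoch 3 above); starting from the older
           * epoch could silently drop data written in the newer one.
           */
          if (local_latest > cluster_latest)
                  return JOIN_WAIT;

          return JOIN_OK;
  }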


Thanks,

Kazutaka


