[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recover the shut-down cluster
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed Sep 21 07:22:24 CEST 2011
At Wed, 21 Sep 2011 14:05:23 +0900,
MORITA Kazutaka wrote:
>
> At Wed, 21 Sep 2011 11:48:37 +0800,
> Liu Yuan wrote:
> >
> > On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > > Looks great, but there seem to be some other cases we need to
> > > consider. For example:
> > >
> > > 1. Start Sheepdog with three daemons.
> > > $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster format
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > 2. Then, kill sheep daemons, and start again in the same order.
> > >
> > > $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> > > $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The first daemon regards the other two nodes as nodes that have left
> > > the cluster, and starts working.
> > >
> > > 3. Start the other two nodes again.
> > >
> > > $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > $ collie cluster info -p 7001
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > $ collie cluster info -p 7002
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The epoch information becomes inconsistent because the first node
> > > overwrote the epochs on the other nodes. Similar situations can
> > > happen whenever we start from a daemon which doesn't have the latest
> > > epoch.
> > >
> > > We can get away with claiming that this doesn't happen if the
> > > administrator is careful enough. But is there a good way to solve
> > > this problem?
> > >
> >
> > I am really puzzled by the semantics of 'collie cluster info'. From
> > the code it fetches the local epoch information, but the name suggests
> > the command should return cluster-wide information. Every node may
> > have its own history, which can *differ* from the other nodes'.
> >
> > So I think we should see the same epoch history on any node of the
> > cluster. Kazutaka, what do you think about getting the cluster info
> > only from a single node (the master node, in my opinion)? If we do
> > that, how do we handle the local epoch on a node that is not the
> > master? If we don't, we will keep hitting the epoch inconsistency you
> > ran into; we cannot get rid of this inconsistency in *every* case.
>
> We must not introduce a single point of failure, so it is not good to
> store the cluster info only on the master node. Even if we replicate it
> to multiple nodes, the inconsistency problem can happen again.
>
> Sheepdog also uses the epoch histories for consistent hashing, so every
> node needs to keep the epochs locally. In addition, to ensure strong
> data consistency, I think we cannot allow different histories to
> coexist in the cluster.
>
> Currently, we force users to start the node which has the latest epoch
> first. This is necessary to support strong consistency.
>
> For example:
>
> Sheepdog was running with 3 nodes and was killed without
> "collie cluster shutdown". The epochs are as follows:
>
> Epochs on node1
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node2
> 3 [node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node3
> 1 [node1, node2, node3]
>
> If we run a daemon on node3 first and start working from epoch 2, we
> risk losing data, because some data could already have been written in
> epoch 3.
>
> In this case, node2 has the latest epoch. If we start a sheep daemon
> from node2 first, the epochs will look something like the following:
>
> Epochs on node1
> 5 [node1, node2, node3]
> 4 [node1, node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node2
> 5 [node1, node2, node3]
> 4 [node1, node2]
> 3 [node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node3
> 5 [node1, node2, node3]
> 1 [node1, node2, node3]
>
>
> These nodes are now in the same epoch history, though node1 and node2 lost

That should be node1 and node3, sorry.

Kazutaka

> some epoch info. In this case, Sheepdog can ensure strong data
> consistency.
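
To make the "start the node with the latest epoch first" rule easier to
follow, it helps to know the highest epoch number each store directory
holds. Below is a minimal, self-contained sketch of such a check; it is
not sheep code, and it assumes a hypothetical one-numbered-file-per-epoch
layout under <store_dir>/epoch/ purely for illustration.

/*
 * Sketch only: report the highest epoch number found in a store
 * directory.  The layout (one numerically named file per epoch under
 * <store_dir>/epoch/) is an assumption made for this example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

static unsigned long latest_local_epoch(const char *store_dir)
{
	char path[1024];
	struct dirent *d;
	unsigned long latest = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "%s/epoch", store_dir);
	dir = opendir(path);
	if (!dir)
		return 0;

	while ((d = readdir(dir)) != NULL) {
		char *end;
		unsigned long e = strtoul(d->d_name, &end, 10);

		/* skip ".", "..", and anything that is not a plain number */
		if (end == d->d_name || *end != '\0')
			continue;
		if (e > latest)
			latest = e;
	}
	closedir(dir);

	return latest;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <store_dir>\n", argv[0]);
		return 1;
	}
	printf("latest epoch in %s: %lu\n", argv[1],
	       latest_local_epoch(argv[1]));
	return 0;
}

Run against the three store directories of the example above (before the
restart), it would report 2, 3 and 1, telling the administrator to bring
node2 up first.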
>
> Sheepdog uses the virtual synchrony of corosync, which delivers node
> membership changes in the same order to every node. I guess we can
> keep every node on the same epoch history if we carefully implement a
> recovery mechanism.
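
One possible shape for such a recovery mechanism is a join-time rule that
never lets a node with an older history overwrite a newer one. The sketch
below only illustrates that rule; the struct layout, the verdict names,
and the decision function are assumptions, not the actual sheep/corosync
join handler.

/*
 * Sketch of a join-time decision rule (not sheep code): compare the
 * joining node's latest epoch with the cluster's and decide who has to
 * catch up, so that an older history never overwrites a newer one.
 */
#include <stdio.h>
#include <stdint.h>

enum join_verdict {
	JOIN_OK,		/* histories already agree */
	JOIN_FETCH_EPOCHS,	/* joiner is behind: copy the missing epochs */
	JOIN_WAIT,		/* joiner is ahead: do not overwrite its history */
};

struct epoch_view {
	uint32_t latest_epoch;	/* highest epoch number the node has on disk */
};

static enum join_verdict decide_join(const struct epoch_view *cluster,
				     const struct epoch_view *joiner)
{
	if (joiner->latest_epoch < cluster->latest_epoch)
		return JOIN_FETCH_EPOCHS;
	if (joiner->latest_epoch > cluster->latest_epoch)
		return JOIN_WAIT;
	return JOIN_OK;
}

int main(void)
{
	/* node2 from the example above holds epoch 3, node3 only epoch 1 */
	struct epoch_view cluster = { .latest_epoch = 3 };
	struct epoch_view node3 = { .latest_epoch = 1 };

	switch (decide_join(&cluster, &node3)) {
	case JOIN_FETCH_EPOCHS:
		printf("node3 must fetch the missing epochs before serving I/O\n");
		break;
	case JOIN_WAIT:
		printf("the cluster is stale; do not overwrite node3's history\n");
		break;
	case JOIN_OK:
		printf("histories agree\n");
		break;
	}
	return 0;
}

Because corosync delivers membership changes in the same order everywhere,
applying a deterministic rule like this on every node should keep the
local histories from diverging.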
>
>
> Thanks,
>
> Kazutaka