[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recover the shut-down cluster
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed Sep 21 07:22:24 CEST 2011
At Wed, 21 Sep 2011 14:05:23 +0900,
MORITA Kazutaka wrote:
>
> At Wed, 21 Sep 2011 11:48:37 +0800,
> Liu Yuan wrote:
> >
> > On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > > Looks great, but there seem to be some other cases we need to
> > > consider. For example:
> > >
> > > 1. Start Sheepdog with three daemons.
> > > $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster format
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > 2. Then, kill sheep daemons, and start again in the same order.
> > >
> > > $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> > > $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The first daemon regards the other two nodes as nodes that have left
> > > the cluster, and starts working.
> > >
> > > 3. Start the other two nodes again.
> > >
> > > $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > > $ collie cluster info
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > $ collie cluster info -p 7001
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7000, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > $ collie cluster info -p 7002
> > > Cluster status: running
> > >
> > > Creation time Epoch Nodes
> > > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > > 2011-09-20 16:43:10 2 [10.68.14.1:7001, 10.68.14.1:7002]
> > > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The epoch information becomes inconsistent because the first node
> > > overwrote the epochs on the other nodes. Similar situations can
> > > happen whenever we start from a daemon which doesn't have the latest
> > > epoch.
> > >
> > > We can get away with claiming that this doesn't happen if the
> > > administrator is careful enough. But is there a good way to solve
> > > this problem?
> > >
> >
> > I am really puzzled by the semantics of 'collie cluster info'. From
> > the code it fetches the local epoch information, but the name suggests
> > the command should return cluster-wide information. Every node may
> > have its own history, which can *differ* from the other nodes'.
> >
> > So I think we should see the same epoch history on any node of the
> > cluster. Kazutaka, what do you think about getting the cluster info
> > only from a single node (the master node, in my opinion)? If we do
> > that, how do we handle the local epoch on a node that is not the
> > master? If we don't, we will keep hitting the epoch inconsistency you
> > ran into; we cannot get rid of this inconsistency in *every* case.
>
> We must not introduce a single point of failure, so it is not good to
> store the cluster info only on the master node. Even if we replicate it
> to multiple nodes, the inconsistency problem can happen again.
>
> Sheepdog also uses the epoch histories for consistent hashing, so every
> node needs to keep the epochs locally. In addition, to ensure strong
> data consistency, I think we cannot allow different histories to
> coexist in the cluster.
>
> Currently, we force users to start the node which has the latest epoch
> first. This is necessary to support strong consistency.
>
> For example:
>
> Sheepdog was running with 3 nodes and was killed without
> "collie cluster shutdown". The epochs are as follows:
>
> Epochs on node1
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node2
> 3 [node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node3
> 1 [node1, node2, node3]
>
> If we run a daemon on node3 first and start working from epoch 2, we
> risk losing data, because some data could already have been written in
> epoch 3.
>
> In this case, node2 has the latest epoch. If we start a sheep daemon
> from node2 first, the epochs will look something like the following:
>
> Epochs on node1
> 5 [node1, node2, node3]
> 4 [node1, node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node2
> 5 [node1, node2, node3]
> 4 [node1, node2]
> 3 [node2]
> 2 [node1, node3]
> 1 [node1, node2, node3]
>
> Epochs on node3
> 5 [node1, node2, node3]
> 1 [node1, node2, node3]
>
>
> These nodes are now in the same epoch history, though node1 and node2 lost

That should be node1 and node3, sorry.

Kazutaka

> some epoch info. In this case, Sheepdog can ensure strong data
> consistency.
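
To make the "start the node with the latest epoch first" rule easier to
follow, it helps to know the highest epoch number each store directory
holds. Below is a minimal, self-contained sketch of such a check; it is
not sheep code, and it assumes a hypothetical one-numbered-file-per-epoch
layout under <store_dir>/epoch/ purely for illustration.

/*
 * Sketch only: report the highest epoch number found in a store
 * directory.  The layout (one numerically named file per epoch under
 * <store_dir>/epoch/) is an assumption made for this example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

static unsigned long latest_local_epoch(const char *store_dir)
{
	char path[1024];
	struct dirent *d;
	unsigned long latest = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "%s/epoch", store_dir);
	dir = opendir(path);
	if (!dir)
		return 0;

	while ((d = readdir(dir)) != NULL) {
		char *end;
		unsigned long e = strtoul(d->d_name, &end, 10);

		/* skip ".", "..", and anything that is not a plain number */
		if (end == d->d_name || *end != '\0')
			continue;
		if (e > latest)
			latest = e;
	}
	closedir(dir);

	return latest;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <store_dir>\n", argv[0]);
		return 1;
	}
	printf("latest epoch in %s: %lu\n", argv[1],
	       latest_local_epoch(argv[1]));
	return 0;
}

Run against the three store directories of the example above (before the
restart), it would report 2, 3 and 1, telling the administrator to bring
node2 up first.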
>
> Sheepdog uses the virtual synchrony of corosync, which delivers node
> membership changes in the same order to every node. I guess we can
> keep every node on the same epoch history if we carefully implement a
> recovery mechanism.
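
One possible shape for such a recovery mechanism is a join-time rule that
never lets a node with an older history overwrite a newer one. The sketch
below only illustrates that rule; the struct layout, the verdict names,
and the decision function are assumptions, not the actual sheep/corosync
join handler.

/*
 * Sketch of a join-time decision rule (not sheep code): compare the
 * joining node's latest epoch with the cluster's and decide who has to
 * catch up, so that an older history never overwrites a newer one.
 */
#include <stdio.h>
#include <stdint.h>

enum join_verdict {
	JOIN_OK,		/* histories already agree */
	JOIN_FETCH_EPOCHS,	/* joiner is behind: copy the missing epochs */
	JOIN_WAIT,		/* joiner is ahead: do not overwrite its history */
};

struct epoch_view {
	uint32_t latest_epoch;	/* highest epoch number the node has on disk */
};

static enum join_verdict decide_join(const struct epoch_view *cluster,
				     const struct epoch_view *joiner)
{
	if (joiner->latest_epoch < cluster->latest_epoch)
		return JOIN_FETCH_EPOCHS;
	if (joiner->latest_epoch > cluster->latest_epoch)
		return JOIN_WAIT;
	return JOIN_OK;
}

int main(void)
{
	/* node2 from the example above holds epoch 3, node3 only epoch 1 */
	struct epoch_view cluster = { .latest_epoch = 3 };
	struct epoch_view node3 = { .latest_epoch = 1 };

	switch (decide_join(&cluster, &node3)) {
	case JOIN_FETCH_EPOCHS:
		printf("node3 must fetch the missing epochs before serving I/O\n");
		break;
	case JOIN_WAIT:
		printf("the cluster is stale; do not overwrite node3's history\n");
		break;
	case JOIN_OK:
		printf("histories agree\n");
		break;
	}
	return 0;
}

Because corosync delivers membership changes in the same order everywhere,
applying a deterministic rule like this on every node should keep the
local histories from diverging.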
>
>
> Thanks,
>
> Kazutaka