At Wed, 21 Sep 2011 14:05:23 +0900,
MORITA Kazutaka wrote:
>
> At Wed, 21 Sep 2011 11:48:37 +0800,
> Liu Yuan wrote:
> >
> > On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > > Looks great, but there seem to be some other cases we need to
> > > consider.  For example:
> > >
> > > 1. Start Sheepdog with three daemons.
> > >
> > >    $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> > >    $ collie cluster format
> > >    $ collie cluster info
> > >    Cluster status: running
> > >
> > >    Creation time        Epoch   Nodes
> > >    2011-09-20 16:43:10     1    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > 2. Then kill the sheep daemons, and start them again in the same order.
> > >
> > >    $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> > >    $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > >    $ collie cluster info
> > >    Cluster status: running
> > >
> > >    Creation time        Epoch   Nodes
> > >    2011-09-20 16:43:10     2    [10.68.14.1:7000]
> > >    2011-09-20 16:43:10     1    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The first daemon regards the other two nodes as left nodes, and starts
> > > working.
> > >
> > > 3. Start the other two nodes again.
> > >
> > >    $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > >    $ collie cluster info
> > >    Cluster status: running
> > >
> > >    Creation time        Epoch   Nodes
> > >    2011-09-20 16:43:10     4    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >    2011-09-20 16:43:10     3    [10.68.14.1:7000, 10.68.14.1:7001]
> > >    2011-09-20 16:43:10     2    [10.68.14.1:7000]
> > >    2011-09-20 16:43:10     1    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > >    $ collie cluster info -p 7001
> > >    Cluster status: running
> > >
> > >    Creation time        Epoch   Nodes
> > >    2011-09-20 16:43:10     4    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >    2011-09-20 16:43:10     3    [10.68.14.1:7000, 10.68.14.1:7001]
> > >    2011-09-20 16:43:10     2    [10.68.14.1:7000, 10.68.14.1:7002]
> > >    2011-09-20 16:43:10     1    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > >    $ collie cluster info -p 7002
> > >    Cluster status: running
> > >
> > >    Creation time        Epoch   Nodes
> > >    2011-09-20 16:43:10     4    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >    2011-09-20 16:43:10     3    [10.68.14.1:7000, 10.68.14.1:7001]
> > >    2011-09-20 16:43:10     2    [10.68.14.1:7001, 10.68.14.1:7002]
> > >    2011-09-20 16:43:10     1    [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > >
> > > The epoch information becomes inconsistent because the first node
> > > overwrote the epochs on the other nodes.  Similar situations could
> > > happen whenever we start from a daemon which doesn't have the latest
> > > epoch.
> > >
> > > We can get away with claiming that this doesn't happen if the
> > > administrator is careful enough.  But is there any good idea to solve
> > > this problem?
> > >
> >
> > I am really puzzled by the semantics of 'collie cluster info'.  From the
> > code, it fetches the local epoch information, but its name suggests the
> > command should report cluster-wide information.  Every node keeps its own
> > history, and that history can differ from the other nodes'.
> >
> > So I think we should get the same epoch history on any node of the
> > cluster.  Kazutaka, what do you think about getting the cluster info
> > from a single node only (the master node, in my opinion)?  If that is
> > possible, how do we handle the local epoch on a node that is not the
> > master?  If it is not, we will keep suffering the epoch inconsistency
> > you met;
> > we cannot get rid of this inconsistency in *every* case.
>
> We should completely remove SPOFs, so it is not good to store the
> cluster info only in the master node.  Although we could replicate it
> to multiple nodes, the inconsistency problem could happen again.
>
> Sheepdog uses the epoch histories for consistent hashing too, so every
> node needs to keep the epochs locally.  In addition, to ensure strong
> data consistency, I think we cannot allow different histories to
> coexist in the cluster.
>
> Currently, we force users to start the node which has the latest epoch
> first.  This is necessary to support strong consistency.
>
> For example:
>
> Sheepdog was working with 3 nodes, and was wrongly killed without
> "collie cluster shutdown".  The epochs are as follows:
>
>   Epochs on node1
>     2 [node1, node3]
>     1 [node1, node2, node3]
>
>   Epochs on node2
>     3 [node2]
>     2 [node1, node3]
>     1 [node1, node2, node3]
>
>   Epochs on node3
>     1 [node1, node2, node3]
>
> If we run a daemon on node3 first and start working from epoch 2, we
> would risk data loss, because some data could have been written in
> epoch 3.
>
> In this case, node2 has the latest epoch.  If we start a sheep daemon
> from node2 first, the epochs will be something like the following:
>
>   Epochs on node1
>     5 [node1, node2, node3]
>     4 [node1, node2]
>     2 [node1, node3]
>     1 [node1, node2, node3]
>
>   Epochs on node2
>     5 [node1, node2, node3]
>     4 [node1, node2]
>     3 [node2]
>     2 [node1, node3]
>     1 [node1, node2, node3]
>
>   Epochs on node3
>     5 [node1, node2, node3]
>     1 [node1, node2, node3]
>
> These epoch histories are consistent with each other, though node1 and node2 lost

should be node1 and node3, sorry.

Kazutaka

> some epoch info.  In this case, Sheepdog can ensure strong data
> consistency.
>
> Sheepdog uses the virtual synchrony of corosync, which delivers node
> membership changes in the same order to every node.  I guess we can
> keep every node on the same epoch history if we carefully implement a
> recovery mechanism.
>
>
> Thanks,
>
> Kazutaka
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
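
The rule discussed above ("start the node which has the latest epoch first")
boils down to comparing the highest epoch number each node has recorded
locally.  The following is a minimal, self-contained C sketch of that
decision, not code from the Sheepdog tree: the struct, the pick_first_node()
helper, and the hard-coded epoch numbers (taken from the node1/node2/node3
example above) are assumptions for illustration only.

#include <stdio.h>

/*
 * Hypothetical example, not Sheepdog code: each entry holds the highest
 * epoch number found in a node's local store (names and values are taken
 * from the example in the mail above).
 */
struct node_epoch {
	const char *name;     /* node identifier */
	unsigned int latest;  /* highest epoch recorded locally */
};

/* Return the node that must be started first: the one with the latest epoch. */
static const struct node_epoch *pick_first_node(const struct node_epoch *nodes,
						int nr_nodes)
{
	const struct node_epoch *best = NULL;
	int i;

	for (i = 0; i < nr_nodes; i++)
		if (!best || nodes[i].latest > best->latest)
			best = &nodes[i];

	return best;
}

int main(void)
{
	struct node_epoch nodes[] = {
		{ "node1", 2 },  /* lost epoch 3 */
		{ "node2", 3 },  /* has the latest epoch */
		{ "node3", 1 },  /* lost epochs 2 and 3 */
	};
	const struct node_epoch *first =
		pick_first_node(nodes, sizeof(nodes) / sizeof(nodes[0]));

	printf("start %s first (epoch %u)\n", first->name, first->latest);
	return 0;
}

With the epochs from the example, this prints "start node2 first (epoch 3)",
which matches the manual restart order described above.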