At Tue, 20 Sep 2011 17:33:03 +0800, Liu Yuan wrote:
> 
> On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > Looks great, but there seem to be some other cases we need to
> > consider.  For example:
> > 
> > 1. Start Sheepdog with three daemons.
> > 
> >  $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> >  $ collie cluster format
> >  $ collie cluster info
> >  Cluster status: running
> > 
> >  Creation time        Epoch Nodes
> >  2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 
> > 2. Then, kill the sheep daemons, and start them again in the same order.
> > 
> >  $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> >  $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >  $ collie cluster info
> >  Cluster status: running
> > 
> >  Creation time        Epoch Nodes
> >  2011-09-20 16:43:10      2 [10.68.14.1:7000]
> >  2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 
> > The first daemon regards the other two nodes as left nodes, and starts
> > working.
> > 
> > 3. Start the other two nodes again.
> > 
> >  $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> >  $ collie cluster info
> >  Cluster status: running
> > 
> >  Creation time        Epoch Nodes
> >  2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >  2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >  2011-09-20 16:43:10      2 [10.68.14.1:7000]
> >  2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 
> >  $ collie cluster info -p 7001
> >  Cluster status: running
> > 
> >  Creation time        Epoch Nodes
> >  2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >  2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >  2011-09-20 16:43:10      2 [10.68.14.1:7000, 10.68.14.1:7002]
> >  2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 
> >  $ collie cluster info -p 7002
> >  Cluster status: running
> > 
> >  Creation time        Epoch Nodes
> >  2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >  2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
> >  2011-09-20 16:43:10      2 [10.68.14.1:7001, 10.68.14.1:7002]
> >  2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 
> > The epoch information becomes inconsistent.  This is because the first
> > node overwrote the epochs in the other nodes.  Similar situations
> > could happen if we start from a daemon which doesn't have the latest
> > epoch.
> > 
> > We can get away with claiming that this doesn't happen if the
> > administrator is careful enough.  But is there any good idea to solve
> > this problem?
> 
> Good catch.  But actually, this patch set doesn't deal with the
> older/newer epoch problem at startup.
> 
> This patch just resolves the cluster startup problem when the nodes
> were *shut down* by the 'collie cluster shutdown' command.  That is,
> the epoch number is the same, but with corrupted epoch content or a
> different ctime.
> 
> I think this case (all nodes going down abnormally instead of being
> shut down, e.g. by a power outage) should be solved by another patch,
> because it is, IMHO, a different problem.

Probably, we should store the node shutdown status (safely shut down
or unexpectedly stopped), so that we can take a different approach
when starting up.  Though this need not be done in this patch.
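To illustrate, here is a very rough sketch of what I have in mind (the
marker path and all function names below are made up for illustration;
they are not from the sheepdog tree): write a marker file on a clean
shutdown, then check for it, and remove it, at startup.

/*
 * Sketch only -- SHUTDOWN_MARK_PATH and these function names are
 * hypothetical, not taken from the actual sheepdog source.
 */
#include <stdio.h>
#include <unistd.h>

#define SHUTDOWN_MARK_PATH "/store/shutdown_mark"	/* hypothetical */

/* Called from the clean shutdown path ('collie cluster shutdown'). */
static int mark_clean_shutdown(void)
{
	FILE *fp = fopen(SHUTDOWN_MARK_PATH, "w");

	if (!fp)
		return -1;
	fputs("clean\n", fp);
	fflush(fp);
	fsync(fileno(fp));	/* make the marker survive a power loss */
	fclose(fp);
	return 0;
}

/*
 * Called early at startup.  Returns 1 if the previous stop was a safe
 * shutdown, 0 otherwise.  The marker is removed so that a later crash
 * is detected correctly on the next startup.
 */
static int was_clean_shutdown(void)
{
	if (access(SHUTDOWN_MARK_PATH, F_OK) != 0)
		return 0;	/* no marker: unexpected stop */
	unlink(SHUTDOWN_MARK_PATH);
	return 1;
}

int main(void)
{
	mark_clean_shutdown();	/* pretend we shut down safely */

	if (was_clean_shutdown())
		printf("safe shutdown: restore the saved node list\n");
	else
		printf("unexpected stop: fall back to epoch recovery\n");
	return 0;
}

With something like this, a node that finds no marker at startup would
know it cannot trust its local view, and could, for example, refuse to
overwrite the epochs of other nodes.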
> When nodes with a newer or older epoch should *not* be regarded as
> _leave nodes_, they should be processed as soon as they are started
> up.  Though, this patch set wrongly takes them as leave nodes.
> 
> I'll cook a different patch targeting this problem, well, based on
> this shutdown patch set.

Good!  Thanks a lot.

Kazutaka