[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recover the shut-down cluster
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Tue Sep 20 12:03:20 CEST 2011
At Tue, 20 Sep 2011 17:33:03 +0800,
Liu Yuan wrote:
>
> On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> > Looks great, but there seems to be some other cases we need to
> > consider. For example:
> >
> > 1. Start Sheepdog with three daemons.
> > $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
> > $ collie cluster format
> > $ collie cluster info
> > Cluster status: running
> >
> > Creation time Epoch Nodes
> > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > 2. Then, kill sheep daemons, and start again in the same order.
> >
> > $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> > $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > $ collie cluster info
> > Cluster status: running
> >
> > Creation time Epoch Nodes
> > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > The first daemon regards the other two nodes as left nodes, and starts
> > working.
> >
> > 3. Start the other two nodes again.
> >
> > $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
> > $ collie cluster info
> > Cluster status: running
> >
> > Creation time Epoch Nodes
> > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > 2011-09-20 16:43:10 2 [10.68.14.1:7000]
> > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > $ collie cluster info -p 7001
> > Cluster status: running
> >
> > Creation time Epoch Nodes
> > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > 2011-09-20 16:43:10 2 [10.68.14.1:7000, 10.68.14.1:7002]
> > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > $ collie cluster info -p 7002
> > Cluster status: running
> >
> > Creation time Epoch Nodes
> > 2011-09-20 16:43:10 4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> > 2011-09-20 16:43:10 3 [10.68.14.1:7000, 10.68.14.1:7001]
> > 2011-09-20 16:43:10 2 [10.68.14.1:7001, 10.68.14.1:7002]
> > 2011-09-20 16:43:10 1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
> >
> > The epoch information becomes inconsistent. This is because the first
> > node overwrote the epochs on the other nodes. Similar situations
> > could happen if we start from a daemon which doesn't have the latest
> > epoch.
> >
> > We could get away with claiming that this doesn't happen if the
> > administrator is careful enough. But is there any good way to solve
> > this problem?
> >
>
> Good catch. But actually, this patch set doesn't deal with the problem
> of an older or newer epoch when the cluster is started up.
>
> This patch just resolves the cluster startup problem when the nodes are
> *shut down* by the 'collie cluster shutdown' command. That is, the epoch
> number is the same, but the epoch content is corrupted or the ctime differs.
>
> I think this case (all nodes going down abnormally instead of being shut
> down cleanly, e.g. a power outage) should be solved by another patch,
> because it is, IMHO, a different problem.
Probably, we should store information about the node shutdown status
(safely shut down or unexpectedly stopped) so that we can take a
different approach when starting up. Though this need not be done in
this patch.
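
Just to illustrate the idea, a rough sketch of such a marker could look
like the code below. Note that the marker path and the function names
are only placeholders made up for illustration; they are not existing
sheep code.

/*
 * Rough sketch, not actual sheep code: record whether the last stop
 * was a clean shutdown, and check it at startup.
 */
#include <fcntl.h>
#include <unistd.h>

/* hypothetical location inside the store directory */
#define SHUTDOWN_MARKER "/store/0/.clean_shutdown"

/* Called on the 'collie cluster shutdown' path, just before exiting. */
static int write_shutdown_marker(void)
{
	int fd = open(SHUTDOWN_MARKER, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	close(fd);
	return 0;
}

/*
 * Called at startup.  If the marker exists, the previous stop was a
 * clean shutdown and the stored epoch can be trusted; otherwise the
 * node stopped unexpectedly (e.g. a power outage) and needs a more
 * careful recovery path.
 */
static int was_clean_shutdown(void)
{
	int clean = (access(SHUTDOWN_MARKER, F_OK) == 0);

	if (clean)
		unlink(SHUTDOWN_MARKER);	/* consume the marker */
	return clean;
}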
>
> Nodes with a newer or older epoch should *not* be regarded as
> _leave nodes_; they should be processed as soon as they are started up.
> Though, this patch set wrongly takes them as leave nodes.
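
For what it's worth, the join-time decision could then be driven by a
simple epoch comparison, roughly like the sketch below. The enum and
function names are placeholders for illustration, not existing sheep
code.

#include <stdint.h>

/* Rough sketch, not actual sheep code. */
enum join_action {
	JOIN_AS_MEMBER,		/* epochs match: join normally */
	JOIN_NEEDS_UPDATE,	/* joining node is behind: bring its epoch up to
				   date instead of marking it as a leave node */
	JOIN_HAS_NEWER_EPOCH,	/* joining node is ahead: adopt its epoch
				   history instead of overwriting it with ours */
};

static enum join_action decide_join(uint32_t local_epoch, uint32_t joining_epoch)
{
	if (joining_epoch == local_epoch)
		return JOIN_AS_MEMBER;
	if (joining_epoch < local_epoch)
		return JOIN_NEEDS_UPDATE;
	return JOIN_HAS_NEWER_EPOCH;
}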
>
> I'll cook a different patch targeting this problem, based on this
> shutdown patch set.
Good! Thanks a lot.
Kazutaka