[sheepdog] [PATCH v2 04/11] sheep: don't check nodes information for joined nodes

Sat Sep 21 17:38:31 CEST 2013

At Sat, 21 Sep 2013 00:54:10 +0800,
Liu Yuan wrote:
> 
> On Sat, Sep 21, 2013 at 01:48:27AM +0900, MORITA Kazutaka wrote:
> > At Sat, 21 Sep 2013 00:39:31 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Sat, Sep 21, 2013 at 01:22:46AM +0900, MORITA Kazutaka wrote:
> > > > At Thu, 19 Sep 2013 17:08:08 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Thu, Sep 19, 2013 at 02:42:54AM +0900, MORITA Kazutaka wrote:
> > > > > > At Sat, 14 Sep 2013 18:34:24 +0800,
> > > > > > Liu Yuan wrote:
> > > > > > > 
> > > > > > > cluster_join_check is basically used to check a newly joining node. But the old
> > > > > > > code also checks the node states passed by cinfo against sys->cinfo. After we have
> > > > > > > struct rb_node rb in the sd_node, this check will never pass.
> > > > > > > 
> > > > > > > Instead of making the check work with more complex code, this patch simply removes
> > > > > > > the check, since the node states of already joined nodes are always the same.
> > > > > > 
> > > > > > Is it true?  E.g. if network partition happens and two subclusters are
> > > > > > merged, the state of the joining node doesn't match.  The current code
> > > > > > can detect it, but this patch removes the check? 
> > > > > > 
> > > > > 
> > > > > Why don't we allow nodes that were network partitioned to join back? If users ask
> > > > > to join the node, I think we should allow the node to join back, no?
> > > > 
> > > > We cannot allow nodes to rejoin after a network partition happens.  E.g.
> > > > 
> > > >   1. Sheepdog is running with nodes {A, B, C, D, E} at epoch 1.
> > > >   2. The cluster is split into {A, B, C} and {D, E}.  Both are at
> > > >      epoch 2.
> > > >   3. The node A fails.  Then {B, C} is at epoch 3, and {D, E} is at
> > > >      epoch 2.
> > > >   4. If the two clusters are merged, their epoch numbers are not
> > > >      consistent.
> > > > 
> > > > I think we should at least keep the epoch number check.
> > > > 
> > > 
> > > I don't think we can do an epoch check either. Suppose
> > > 
> > > {A, B, C} at epoch 1
> > > C goes down, then {A, B} with epoch = 2
> > > when we add C back with epoch 1, C should join the cluster.
> > 
> > The example is too simple.  When talking about network partition,
> > let's take into account the case where both subclusters have at least
> > two nodes.
> > 
> > {A, B, C, D} at epoch 1.
> > {A, B} and {C, D} at epoch 2.
> > {A, B} at epoch 2, and {C} at epoch 3 (the node D has gone).
> > 
> > Then, if {A, B} and {C} are merged, epoch numbers become inconsistent.
> 
> What do you mean by inconsistent epochs? With the current code, in your case

I meant the result of 'dog cluster info' could be different among the
nodes.

> C's cinfo will be reinitialized from A or B in update_cluster_info in the
> accept handler. What exactly is the problem? If we simply check that epochs are equal, then we

update_cluster_info updates only the latest epoch and doesn't update
the past epoch history.  We assume that every node has the same epoch
history, especially in the recovery code.

> can't let a node join back if it fails, no?

No, we can't, but I don't see another way.  I don't think the current
sheepdog code can handle my example correctly, and, IMHO, it's safe to
prevent the invalid node from joining Sheepdog.
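To make the epoch check concrete, here is a minimal sketch in C under stated assumptions. struct node_view, record_epoch() and join_check() are hypothetical and are not sheepdog's actual cluster_join_check(); node IDs are single letters and each epoch simply records its member set as a string. The rule rejects a joining node unless its latest epoch and its whole epoch history match the cluster's, because update_cluster_info only refreshes the latest epoch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_EPOCH 8
#define MAX_NODES 8

/* Hypothetical record of one epoch: member node IDs as a string, e.g. "ABCD". */
struct epoch_entry {
	char nodes[MAX_NODES + 1];
};

/* Hypothetical per-node view: epochs 1..latest_epoch (index 0 is unused). */
struct node_view {
	int latest_epoch;
	struct epoch_entry history[MAX_EPOCH + 1];
};

/* Record the membership of @epoch and make it the latest epoch. */
static void record_epoch(struct node_view *v, int epoch, const char *nodes)
{
	assert(epoch >= 1 && epoch <= MAX_EPOCH);
	assert(strlen(nodes) <= MAX_NODES);
	strcpy(v->history[epoch].nodes, nodes);
	v->latest_epoch = epoch;
}

/*
 * Conservative join rule (hypothetical): the joining node must agree with
 * the cluster on the latest epoch number and on every past epoch, since
 * only the latest epoch is refreshed when a node is accepted.
 */
static bool join_check(const struct node_view *cluster,
		       const struct node_view *joining)
{
	if (joining->latest_epoch != cluster->latest_epoch)
		return false;
	for (int e = 1; e <= cluster->latest_epoch; e++)
		if (strcmp(joining->history[e].nodes,
			   cluster->history[e].nodes) != 0)
			return false;
	return true;
}
```

With my example, {A, B} at epoch 2 rejects {C} at epoch 3 on the epoch number alone.  It also rejects {C, D} at epoch 2, before D goes away: the epoch numbers agree, but the epoch 2 memberships differ, so an epoch-number-only check would not be enough.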

Thanks,

Kazutaka