[sheepdog] [PATCH] sheep: remove master node

Tue Jul 16 10:35:26 CEST 2013

At Sun, 14 Jul 2013 14:25:12 +0800,
Liu Yuan wrote:
> 
> On Sun, Jul 14, 2013 at 12:08:46AM +0900, MORITA Kazutaka wrote:
> > From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > 
> > The current procedure to handle sheep join is as follows.
> > 
> >  1. The joining node sends a join request.
> >  2. The master node accepts the request.
> >  3. All the nodes update cluster members.
> > 
> > This procedure has some problems:
> > 
> >  - The master election is too complex to maintain.
> >    It is very difficult to make sure that the implementation is
> >    correct.
> > 
> >  - The master node can fail while it is accepting the joining node.
> >    The newly elected master has to take over the process, but it's
> >    usually difficult to implement because we have to know what the
> >    previous master did and what it did not before its failure.
> > 
> > This patch changes the sheep join procedure to the following.
> > 
> >  1. The joining node sends a join request.
> >  2. Some of the existing nodes accept the request.
> 
> Seems that all the nodes in the cluster accept the request, no?

If the join event in the cluster event queue has already been updated
by another node's accept message, the node doesn't call
sd_accept_handler().  However, the current implementation doesn't
check for arriving messages while it dispatches events, so in practice
all the nodes seem to call sd_accept_handler() with the corosync and
zookeeper drivers.
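
To illustrate the intended check, here is a minimal sketch in C.  The
type and function names (cluster_event, must_accept_locally,
mark_accepted) are made up for this mail and are not the actual
driver code; it only shows that a node skips the accept step once the
queued join event has been updated by someone else.

```c
#include <stdbool.h>

/* Hypothetical event states; the real drivers define their own types. */
enum ev_state { EV_JOIN_PENDING, EV_JOIN_ACCEPTED };

struct cluster_event {
	enum ev_state state;
	int joiner_id;		/* id of the joining node */
};

/*
 * Decide whether the local node still has to accept the join itself.
 * Returns true while the event is pending, i.e. no accept message
 * from another node has updated it yet.
 */
static bool must_accept_locally(const struct cluster_event *ev)
{
	return ev->state == EV_JOIN_PENDING;
}

/* Stand-in for an accept message from another node arriving first. */
static void mark_accepted(struct cluster_event *ev)
{
	ev->state = EV_JOIN_ACCEPTED;
}
```

If arriving messages are not examined during dispatch, every node
still sees the pending state and the check always returns true, which
matches the behaviour described above.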

> 
> >  3. All the nodes update cluster members.
> > 
> > Multiple nodes are allowed to call sd_accept_handler() against the
> > same join request, but at least one node must do it.  With this
> > change, we can eliminate the master, and a node failure while
> > accepting a node join is also tolerated.
> > 
> 
> Why is sd_accept_handler reentrant from the cluster's point of view? I noticed that, e.g.,

It is because if only one node calls sd_accept_handler(), that node
becomes a SPOF while it is processing the joining node.  Allowing
multiple nodes to call sd_accept_handler() looks like the simplest way
to me.  I agree that this needs a discussion, though.

> push_join_response() of the zk driver is called on all the nodes too. So if
> the following case happens, can sheep handle it?
> 
> 2 nodes in the cluster {A, B}. And C is joining the cluster.
> 
> A -> push_join_response() returns quickly; the watchers of A, B and C are
>      called to handle EVENT_ACCEPT from A.
> B -> push_join_response() returns slowly because of the network; A, B and C
>      handle EVENT_ACCEPT from B.
> 
> Simply put, can sheep handle multiple EVENT_ACCEPT for the same node?

I think the answer is yes.

 - local: The event queue is an mmapped file and guarded by flock, so
   concurrent sd_accept_handler() calls don't happen.

 - corosync: cdrv_cpg_deliver() ignores an arriving
   COROSYNC_MSG_TYPE_ACCEPT message if there is no JOIN event in the
   queue.

 - zookeeper: push_join_response() just overwrites the znode with
   EVENT_ACCEPT, so multiple calls of push_join_response() are not a
   problem.
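
As a rough illustration of the corosync-style behaviour, here is a
small C sketch of the {A, B} + C case.  All names (struct cluster,
find_join, deliver_accept) are hypothetical and the real code is
structured differently; in particular, updating the member list
directly inside the delivery function is a simplification.  The point
is only that a second accept for the same joiner finds no JOIN event
left and is dropped.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8
#define MAX_NODES  8

/* Hypothetical event types, not the real driver enums. */
enum ev_type { EV_NONE, EV_JOIN, EV_ACCEPT };

struct event {
	enum ev_type type;
	int joiner;			/* id of the joining node */
};

struct cluster {
	struct event queue[MAX_EVENTS];
	int members[MAX_NODES];
	size_t nr_members;
};

/* Return the pending JOIN event for `joiner`, or NULL if there is none. */
static struct event *find_join(struct cluster *c, int joiner)
{
	for (size_t i = 0; i < MAX_EVENTS; i++)
		if (c->queue[i].type == EV_JOIN && c->queue[i].joiner == joiner)
			return &c->queue[i];
	return NULL;
}

/*
 * Deliver one accept message.  Returns true if it changed the cluster
 * state, false if it was ignored (duplicate accept or no room left).
 */
static bool deliver_accept(struct cluster *c, int joiner)
{
	struct event *ev = find_join(c, joiner);

	if (!ev)
		return false;	/* no JOIN left: an earlier accept already won */
	if (c->nr_members == MAX_NODES)
		return false;	/* sketch only: member table is full */

	ev->type = EV_ACCEPT;
	c->members[c->nr_members++] = joiner;
	return true;
}
```

With members {1, 2} and a queued JOIN for node 3, the first
deliver_accept() adds node 3 and the second one is a no-op, so the
member list contains node 3 exactly once.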

Thanks,

Kazutaka