[sheepdog] [PATCH] sheep: remove master node

Tue Jul 16 11:37:55 CEST 2013

On Tue, Jul 16, 2013 at 05:35:26PM +0900, MORITA Kazutaka wrote:
> At Sun, 14 Jul 2013 14:25:12 +0800,
> Liu Yuan wrote:
> > 
> > On Sun, Jul 14, 2013 at 12:08:46AM +0900, MORITA Kazutaka wrote:
> > > From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > > 
> > > The current procedure to handle sheep join is as follows.
> > > 
> > >  1. The joining node sends a join request.
> > >  2. The master node accepts the request.
> > >  3. All the nodes update cluster members.
> > > 
> > > This procedure has some problems:
> > > 
> > >  - The master election is too complex to maintain.
> > >    It is very difficult to make sure that the implementation is
> > >    correct.
> > > 
> > >  - The master node can fail while it is accepting the joining node.
> > >    The newly elected master has to take over the process, but it's
> > >    usually difficult to implement because we have to know what the
> > >    previous master did and what it did not before its failure.
> > > 
> > > This patch changes the sheep join procedure to the following.
> > > 
> > >  1. The joining node sends a join request.
> > >  2. Some of the existing nodes accept the request.
> > 
> > Seems that all the nodes in the cluster accept the request, no?
> 
> If the join event in the cluster event queue is updated by another
> node's accept message, the node doesn't call sd_accept_handler().
> However, the current implementation doesn't check arriving messages
> while it dispatches events, so all the nodes seem to call
> sd_accept_handler() with the corosync and zookeeper drivers.
> 
> > 
> > >  3. All the nodes update cluster members.
> > > 
> > > Multiple nodes are allowed to call sd_accept_handler() against the
> > > same join request, but at least one node must do it.  With this
> > > change, we can eliminate the master, and a node failure while
> > > accepting a node join is also tolerated.
> > > 
> > 
> > Why is sd_accept_handler() reentrant across the cluster? I noticed that, e.g.,
> 
> It is because if only one node calls sd_accept_handler(), the node can
> be a SPOF while processing the joining node.  Allowing multiple nodes
> to call sd_accept_handler() looks like the simplest way to me.  I agree
> that this needs a discussion, though.
> 

I think asking all the nodes to call sd_accept_handler() is fine.

>
> > push_join_response() of the zk driver is called on all the nodes too. So if
> > the following case happens, can sheep handle it?
> > 
> > 2 nodes in the cluster {A, B}. And C is joining the cluster.
> > 
> > A -> push_join_response() returns quickly; the watchers of A, B, C are
> >      called to handle EVENT_ACCEPT from A.
> > B -> push_join_response() returns slowly because of the network; A, B, C
> >      then handle EVENT_ACCEPT from B.
> > 
> > Simply put, can sheep handle multiple EVENT_ACCEPT events for the same node?
> 
> I think the answer is yes.
> 
>  - local: The event queue is a mmapped file and guarded by flock, so
>    concurrent sd_accept_handler() calls don't happen.
> 
>  - corosync: cdrv_cpg_deliver() ignores the arriving
>    COROSYNC_MSG_TYPE_ACCEPT if there is no JOIN event in the queue.

Corosync actually never tries to send EVENT_ACCEPT more than once in the current
code, so no worries about corosync.
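
For reference, the "ignore ACCEPT when there is no JOIN event in the queue"
behaviour quoted above can be sketched roughly as below. This is only an
illustration under assumptions: the types and the names queue_join() and
deliver_accept() are hypothetical and are not the corosync driver's real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types; not the real sheepdog definitions. */
enum ev_type { EV_JOIN, EV_ACCEPT };

struct event {
	enum ev_type type;
	int node_id;		/* id of the joining node */
};

#define QUEUE_MAX 16

struct event_queue {
	struct event ev[QUEUE_MAX];
	size_t nr;
};

/* Queue a join request.  Returns false if the queue is full. */
bool queue_join(struct event_queue *q, int node_id)
{
	assert(q != NULL);
	if (q->nr == QUEUE_MAX)
		return false;
	q->ev[q->nr].type = EV_JOIN;
	q->ev[q->nr].node_id = node_id;
	q->nr++;
	return true;
}

/*
 * Deliver an accept message for node_id.  Only the first accept finds a
 * pending JOIN event and turns it into an ACCEPT event; later duplicates
 * find no JOIN event and are dropped.  Returns true if the message was
 * applied, false if it was ignored.
 */
bool deliver_accept(struct event_queue *q, int node_id)
{
	assert(q != NULL);
	for (size_t i = 0; i < q->nr; i++) {
		if (q->ev[i].type == EV_JOIN && q->ev[i].node_id == node_id) {
			q->ev[i].type = EV_ACCEPT;
			return true;
		}
	}
	return false;
}
```

With this shape, a second accept for the same joining node is a no-op, so it
does not matter how many nodes send one.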

> 
>  - zookeeper: push_join_response() just overwrites the znode with
>    EVENT_ACCEPT, and multiple calls of push_join_response() are no
>    problem.

I noticed that zookeeper sends just one event to the watcher on my test box even
if there are multiple updaters of one member of the queue. But I think there is
still a problem like the example above. I think we need to check inside
push_join_response() whether someone has already updated the join event in the
queue, so that there is only one updater and thus only one update event delivered
to the watchers of all nodes.
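
A minimal sketch of such a check, assuming the join event carries a version
that is bumped on every write (ZooKeeper keeps a per-znode version, and its
C client's zoo_set() takes an expected version). Everything below is an
in-memory stand-in with hypothetical names (struct znode,
znode_set_if_version(), push_join_response_once()); it is not the zookeeper
driver's code and does not talk to ZooKeeper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the zookeeper driver's real types. */
enum zk_ev_type { ZK_EV_JOIN, ZK_EV_ACCEPT };

struct znode {
	enum zk_ev_type type;
	int version;		/* bumped on every successful write */
};

/* Conditional write: succeeds only if the caller saw the latest version. */
bool znode_set_if_version(struct znode *zn, enum zk_ev_type type,
			  int expected_version)
{
	assert(zn != NULL);
	if (zn->version != expected_version)
		return false;	/* someone else wrote in between */
	zn->type = type;
	zn->version++;
	return true;
}

/*
 * Guarded variant of the accept update: read the join event, skip it if
 * it has already been accepted, otherwise try a conditional write.
 * Returns true only for the one caller whose update was applied.
 */
bool push_join_response_once(struct znode *zn)
{
	assert(zn != NULL);
	int seen = zn->version;	/* version observed at read time */

	if (zn->type == ZK_EV_ACCEPT)
		return false;	/* already updated by another node */
	return znode_set_if_version(zn, ZK_EV_ACCEPT, seen);
}
```

The type check alone would still leave a window between the read and the
write; the version comparison is what limits the update to a single writer
in this sketch.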

Thanks
Yuan