[sheepdog] [PATCH] sheep: remove master node
Liu Yuan
namei.unix at gmail.com
Tue Jul 16 11:37:55 CEST 2013
On Tue, Jul 16, 2013 at 05:35:26PM +0900, MORITA Kazutaka wrote:
> At Sun, 14 Jul 2013 14:25:12 +0800,
> Liu Yuan wrote:
> >
> > On Sun, Jul 14, 2013 at 12:08:46AM +0900, MORITA Kazutaka wrote:
> > > From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > >
> > > The current procedure to handle sheep join is as follows.
> > >
> > > 1. The joining node sends a join request.
> > > 2. The master node accepts the request.
> > > 3. All the nodes update cluster members.
> > >
> > > This procedure has some problems:
> > >
> > > - The master election is too complex to maintain.
> > > It is very difficult to make sure that the implementation is
> > > correct.
> > >
> > > - The master node can fail while it is accepting the joining node.
> > > The newly elected master has to take over the process, but it's
> > > usually difficult to implement because we have to know what the
> > > previous master did and what it did not before its failure.
> > >
> > > This patch changes the sheep join procedure to the following.
> > >
> > > 1. The joining node sends a join request.
> > > 2. Some of the existing nodes accept the request.
> >
> > Seems that all the nodes in the cluster accept the request, no?
>
> If the join event in the cluster event queue is updated by another
> node's accept message, the node doesn't call sd_accept_handler().
> However, the current implementation doesn't check arriving messages
> while it dispatches events, so all the nodes seem to call
> sd_accept_handler() with the corosync and zookeeper drivers.
>
> >
> > > 3. All the nodes update cluster members.
> > >
> > > Multiple nodes are allowed to call sd_accept_handler() against
> > > the same join request, but at least one node must do
> > > it. With this change, we can eliminate a master, and node failure
> > > while accepting node join is also allowed.
> > >
> >
> > Why is sd_accept_handler() reentrant across the cluster? I noticed that, e.g.,
>
> It is because if only one node calls sd_accept_handler(), the node can
> be a SPOF while processing the joining node. Allowing multiple nodes
> to call sd_accept_handler() looks like the simplest way to me. I agree
> that this needs a discussion, though.
>
I think asking all the nodes to call sd_accept_handler() is fine.
>
> > push_join_response() of the zk driver is called on all the nodes too. So if
> > the following case happens, can sheep handle it?
> >
> > 2 nodes in the cluster {A, B}. And C is joining the cluster.
> >
> > A -> push_join_response() returns quickly, and the watchers of A, B and C
> > are called to handle EVENT_ACCEPT from A.
> > B -> push_join_response() returns slowly because of the network, and A, B
> > and C then handle EVENT_ACCEPT from B.
> >
> > Simply put, can sheep handle multiple EVENT_ACCEPT events for the same node?
>
> I think the answer is yes.
>
> - local: The event queue is an mmapped file guarded by flock, so
> concurrent sd_accept_handler() calls don't happen.
>
> - corosync: cdrv_cpg_deliver() ignores the arriving
> COROSYNC_MSG_TYPE_ACCEPT() if there is no JOIN event in the queue.
With the current code, the corosync driver never actually tries to send
EVENT_ACCEPT more than once, so there are no worries about corosync.
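
To make the "ignore ACCEPT if there is no JOIN event in the queue" rule
concrete, here is a rough sketch. It is a simplified model only, not the real
cdrv_cpg_deliver() code: the types, the fixed-size queue and the function names
(queue_push_join, deliver_accept) are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified event types; not the real sheepdog definitions. */
enum ev_type { EV_JOIN, EV_ACCEPT };

struct ev {
	enum ev_type type;
	int node_id;	/* the joining node this event refers to */
};

#define QUEUE_MAX 16

struct ev_queue {
	struct ev events[QUEUE_MAX];
	size_t nr;
};

/* Append a JOIN event for node_id; returns false if the queue is full. */
static bool queue_push_join(struct ev_queue *q, int node_id)
{
	assert(q != NULL);
	if (q->nr == QUEUE_MAX)
		return false;
	q->events[q->nr].type = EV_JOIN;
	q->events[q->nr].node_id = node_id;
	q->nr++;
	return true;
}

/*
 * Deliver an ACCEPT message for node_id. The first ACCEPT turns the pending
 * JOIN into an ACCEPT event; any later ACCEPT for the same node finds no
 * pending JOIN and is ignored. Returns true only if the queue was changed.
 */
static bool deliver_accept(struct ev_queue *q, int node_id)
{
	assert(q != NULL);
	for (size_t i = 0; i < q->nr; i++) {
		struct ev *e = &q->events[i];

		if (e->type == EV_JOIN && e->node_id == node_id) {
			e->type = EV_ACCEPT;
			return true;
		}
	}
	return false;	/* no pending JOIN: duplicate ACCEPT, ignore it */
}
```

In this model a second deliver_accept() for the same node is a no-op, which is
the property that makes duplicate accepts harmless.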
>
> - zookeeper: push_join_response() just overwrites the znode with
> EVENT_ACCEPT, and multiple calls of push_join_response() are no
> problem.
On my test box I noticed that zookeeper sends just one event to the watcher
even if there are multiple updaters of one member of the queue. But I think
there is still a problem like the example above. I think we need to check
inside push_join_response() whether someone has already updated the join event
in the queue, so that only one updater is allowed and thus only one update
event reaches the watchers of all nodes.
Thanks
Yuan