On Tue, Jul 16, 2013 at 05:35:26PM +0900, MORITA Kazutaka wrote:
> At Sun, 14 Jul 2013 14:25:12 +0800,
> Liu Yuan wrote:
> >
> > On Sun, Jul 14, 2013 at 12:08:46AM +0900, MORITA Kazutaka wrote:
> > > From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > >
> > > The current procedure to handle sheep join is as follows.
> > >
> > > 1. The joining node sends a join request.
> > > 2. The master node accepts the request.
> > > 3. All the nodes update cluster members.
> > >
> > > This procedure has some problems:
> > >
> > >  - The master election is too complex to maintain.  It is very
> > >    difficult to make sure that the implementation is correct.
> > >
> > >  - The master node can fail while it is accepting the joining node.
> > >    The newly elected master has to take over the process, but that is
> > >    usually difficult to implement because we have to know what the
> > >    previous master did and did not do before its failure.
> > >
> > > This patch changes the sheep join procedure to the following.
> > >
> > > 1. The joining node sends a join request.
> > > 2. Some of the existing nodes accept the request.
> >
> > Seems that all the nodes in the cluster accept the request, no?
>
> If the join event in the cluster event queue is updated by another
> node's accept message, the node doesn't call sd_accept_handler().
> However, the current implementation doesn't check arriving messages
> while it dispatches events, so all the nodes seem to call
> sd_accept_handler() with the corosync and zookeeper drivers.
>
> > > 3. All the nodes update cluster members.
> > >
> > > Multiple nodes are allowed to call sd_accept_handler() against the
> > > same join request, but at least one node must do it.  With this
> > > change, we can eliminate the master, and node failure while
> > > accepting a node join is also tolerated.
> >
> > Why is sd_accept_handler() reentrant in the cluster aspect?  I noticed
> > that, e.g.,
>
> It is because if only one node calls sd_accept_handler(), that node can
> be a SPOF while processing the joining node.  Allowing multiple nodes
> to call sd_accept_handler() looks like the simplest way to me.  I agree
> that this needs a discussion, though.

I think asking all the nodes to call sd_accept_handler() is fine.

> > push_join_response() of the zk driver is called on all the nodes too.
> > So if the following case happens, can sheep handle it?
> >
> > 2 nodes in the cluster {A, B}, and C is joining the cluster.
> >
> > A -> push_join_response() returns quickly; the watcher of A, B and C
> >      is called to handle EVENT_ACCEPT from A.
> > B -> push_join_response() returns slowly because of the network; A, B
> >      and C handle EVENT_ACCEPT from B.
> >
> > Simply put, can sheep handle multiple EVENT_ACCEPTs for the same node?
>
> I think the answer is yes.
>
>  - local: The event queue is an mmapped file guarded by flock, so
>    concurrent sd_accept_handler() calls don't happen.
>
>  - corosync: cdrv_cpg_deliver() ignores an arriving
>    COROSYNC_MSG_TYPE_ACCEPT if there is no JOIN event in the queue.

Corosync actually never tries to send EVENT_ACCEPT more than once in the
current code, so no worries about corosync.

>  - zookeeper: push_join_response() just overwrites the znode with
>    EVENT_ACCEPT, and multiple calls of push_join_response() are no
>    problem.

I noticed that zookeeper delivers just one event to the watcher on my test
box even if there are multiple updaters of one member of the queue
(presumably because the updates coalesce under the one-shot watch), but I
think there is still a problem like the example above.
I think we need to check inside push_join_response() whether someone has
already updated the join event in the queue, to allow only one updater and
thus only one update event to the watchers of all nodes.

Thanks,
Yuan
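
P.S. A rough sketch of the kind of check I mean, assuming we can use a
version-checked zoo_set() on the queue znode.  This is untested; the struct,
constants, helper name and path handling below are simplified stand-ins for
illustration, not the real sheepdog zk driver code:

#include <zookeeper/zookeeper.h>

/* Simplified stand-in for the real zk_event layout in the zk driver. */
enum zk_event_type {
	EVENT_JOIN = 1,
	EVENT_ACCEPT,
};

struct zk_event {
	int type;
	/* ... node info, join response payload, ... */
};

/*
 * Turn the queued join event into EVENT_ACCEPT only if nobody else has
 * done it yet.  Passing stat.version to zoo_set() makes the update
 * conditional: the first updater wins, later callers get ZBADVERSION
 * and back off, so the watchers see exactly one update for this join.
 */
static int push_join_response_once(zhandle_t *zh, const char *path,
				   struct zk_event *resp)
{
	struct zk_event cur;
	int len = sizeof(cur);
	struct Stat stat;
	int rc;

	rc = zoo_get(zh, path, 0, (char *)&cur, &len, &stat);
	if (rc != ZOK)
		return rc;

	/* Someone has already accepted this join; nothing to do. */
	if (cur.type != EVENT_JOIN)
		return ZOK;

	resp->type = EVENT_ACCEPT;
	rc = zoo_set(zh, path, (const char *)resp, sizeof(*resp),
		     stat.version);
	if (rc == ZBADVERSION)
		return ZOK;	/* lost the race to another node */

	return rc;
}

The point is just that the first updater wins and everyone else backs off,
so only one EVENT_ACCEPT update is ever pushed for a given join event.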