[sheepdog] [PATCH v3] sheep: remove master node

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Tue Jul 23 10:44:39 CEST 2013


At Tue, 23 Jul 2013 16:30:33 +0800,
Kai Zhang wrote:
> 
> 
> On Jul 23, 2013, at 4:00 PM, MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> wrote:
> 
> > At Sat, 20 Jul 2013 15:21:55 +0800,
> > Kai Zhang wrote:
> >> 
> >> On Jul 19, 2013, at 12:01 PM, MORITA Kazutaka <morita.kazutaka at gmail.com> wrote:
> >> 
> >>> This patch changes the sheep join procedure to the following.
> >>> 
> >>> 1. The joining node sends a join request.
> >>> 2. Some of the existing nodes accept the request.
> >>> 3. All the nodes update cluster members.
> >>> 
> >>> Multiple nodes are allowed to call sd_join_handler() against the
> >>> same join request, but at least one node must do it.  With this
> >>> change, we can eliminate the master, and a node failure while
> >>> accepting a node join is also tolerated.
> >>> 
> >>> Removing the master from zookeeper is not easy since the driver
> >>> doesn't expect multiple nodes to send EVENT_ACCEPT.  I'll leave
> >>> this for another day.
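
To make this concrete: because several existing nodes may accept the
same join request, the accept path has to be idempotent.  A minimal
sketch of that property in C (the member list and helpers below are
made up for illustration, not the actual sheepdog code):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8
static char members[MAX_NODES][32];     /* made-up member list */
static int nr_members;

static bool has_member(const char *name)
{
        for (int i = 0; i < nr_members; i++)
                if (strcmp(members[i], name) == 0)
                        return true;
        return false;
}

/*
 * Several nodes may call this for the same join request; only the
 * first call changes the member list, so duplicate accepts are
 * harmless as long as at least one node performs the accept.
 */
static void accept_join(const char *name)
{
        if (has_member(name) || nr_members >= MAX_NODES)
                return;         /* already accepted by another node */
        snprintf(members[nr_members++], sizeof(members[0]), "%s", name);
        printf("accepted %s, %d member(s)\n", name, nr_members);
}

int main(void)
{
        accept_join("A");
        accept_join("A");       /* duplicate accept: a no-op */
        return 0;
}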
> >> 
> >> 
> >> Here are 2 questions in my mind:
> >> 
> >> 1. Based on the current implementation of the cluster driver, we
> >> accept all join requests while the cluster is running.
> >> However, consider the following scenario:
> >> - cluster runs with A, B, C, D
> >> - A quits for some reasons
> >> - after A quits, lots of data operations happened
> >> - B, C, D all quit for some reasons
> >> - A comes back with old data
> >> - B, C, D come back
> >> In this scenario, old data will overwrite new data, no?
> > 
> > Yes, but it can happen even without my series.  The current
> > implementation allows the client to see the old data, and that's the
> > reason I added an sd_printf with high priority like
> > 
> >  sd_printf(SDOG_ALERT, "clients may see old data");
> > 
> > Previously, I tried to add code to stop removing stale objects when
> > the above situation happens, to leave a chance to recover the
> > correct objects manually.
> > 
> >  http://lists.wpkg.org/pipermail/sheepdog/2013-May/009869.html
> > 
> > However, we agreed that stopping a sheepdog cluster and doing manual
> > recovery is not an acceptable approach for service providers, and we
> > are still looking for a better way to recover from the above
> > problem.
> > 
> 
> I think in the old implementation, when B, C, and D come back, A will
> kill itself because its epoch is lower than the others'.  This saves
> the data to some extent.  However, if a client connected to A before
> B, C, and D came back, the client would receive old data.

Ah, sorry.  Node A doesn't start until nodes B, C, and D come back,
because the latest epoch on node A includes B, C, and D.
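
In other words, at startup a node checks its latest locally stored
epoch against the currently joined nodes and waits until every member
recorded in that epoch is back.  As a rough sketch (the structures and
node_eq() below are illustrative, not the real startup code):

#include <stdbool.h>

struct node {
        int id;
};

static bool node_eq(const struct node *a, const struct node *b)
{
        return a->id == b->id;
}

/*
 * Node A's latest epoch lists {A, B, C, D}, so A must wait until
 * B, C, and D have all come back before it starts serving.
 */
static bool epoch_members_back(const struct node *epoch, int nr_epoch,
                               const struct node *joined, int nr_joined)
{
        for (int i = 0; i < nr_epoch; i++) {
                bool found = false;

                for (int j = 0; j < nr_joined; j++)
                        if (node_eq(&epoch[i], &joined[j]))
                                found = true;
                if (!found)
                        return false;   /* keep waiting */
        }
        return true;    /* all members of the latest epoch are back */
}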

> 
> 
> >> 
> >> 2. All sheep that join an empty cluster at the same time will
> >> always succeed.
> > 
> > Even if multiple sheep join the empty cluster at the same time, the
> > cluster driver orders the join events.
> 
> Yes, but zookeeper cannot handle it right now.
> However, I think there is a way to remove it.
> 
> I would like to try after this patch is merged.

Thanks a lot.

> 
> 
> > 
> >> Is this safe?
> >> 
> >> 
> >> By the way, in terms of zookeeper, I think it can work well when
> >> there are multiple EVENT_ACCEPT events for one join request.
> >> This is because an update event only triggers the zookeeper driver
> >> to fetch an event from the queue.
> >> If the event is "accept", it handles it by calling
> >> sd_accept_handler and moves to the next node.
> > 
> > How can we guarantee that more than one sheep can fetch the accept
> > event at the same time?
> 
> Well, the "fetch" does not remove the event from the queue.
> The queue is shared by all sheep, and no one can remove any item from
> it except the administrator.
> Each sheep uses an integer to record its own head position in the
> queue.
> One sheep fetching the event does not conflict with the others.
> And finally, all sheep will fetch the event and handle it.
> So this should not be a problem.

Okay, thanks for your explanation.
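
For the archive, that scheme can be pictured as an append-only queue
shared by all sheep, with each sheep advancing only its own cursor.
A toy sketch (the names below are made up, not the actual zookeeper
driver):

#include <stddef.h>

struct event {
        int type;               /* e.g. join, accept, leave */
};

#define QUEUE_LEN 128
static struct event queue[QUEUE_LEN];   /* shared by all sheep */
static int queue_tail;                  /* next free slot */

struct sheep {
        int head;               /* this sheep's private cursor */
};

/*
 * Fetching only reads the shared queue and advances the caller's
 * own cursor, so concurrent fetches never conflict and every sheep
 * eventually sees (and handles) every event.
 */
static const struct event *fetch_event(struct sheep *s)
{
        if (s->head == queue_tail)
                return NULL;    /* no new event yet */
        return &queue[s->head++];
}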

Should I include my zookeeper patch in this series?  As you know, my
zookeeper patch doesn't remove the zk master either way.  But if you
want to base your work on my patch, I'll include my zk patch in the
v4 series again.

Thanks,

Kazutaka


