[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request

Shevek shevek at anarres.org
Fri Apr 27 09:12:26 CEST 2012


On Fri, 2012-04-27 at 12:54 +0800, Liu Yuan wrote:
> Hi, Shevek
> On 04/27/2012 08:15 AM, Shevek wrote:
> 
> > 
> > A problem arises if a node joins the cluster and generates a
> > confchg event, then crashes or leaves without sending a join
> > request and receiving a join response. The second node to join
> > never becomes master, and the entire cluster hangs.
> > 
> > This patch allows a node to detect whether it should promote itself
> > to master after an arbitrary confchg event. Every node except the
> > master creates a blocked JOIN event for every node that joined
> > after itself, therefore the master is the node which has a JOIN
> > event for every node in the members list.
> > 
> > A following patch will handle the case where a join request
> > is sent, but the master crashes before sending a join response.
> > 
> 
> 
> Thanks for your patch
> 
> I think the commit (c4e3559758b2e) is dedicated to this problem:
> mastership is actually transferred to the second sheep. So I suspect
> that the hang is caused by another bug.

Our patch is correct and is required in addition to the c4e3 patch.
Christoph Hellwig and I verified it with gdb, and we have detailed log
files and memory dumps that confirm it.

Our patch is required because if the master never sends a join_response,
the secondary sheep is left with a blocked JOIN event for itself in its
queue; it never sets join_finished, so it never builds cpg_nodes and
cannot set .gone. The c4e3 patch only works if the master has already
unblocked the JOIN event by sending a join_response.
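
To make the promotion rule concrete, here is a minimal sketch of the
check our patch performs after a confchg event. The struct layout and
the helper have_blocked_join_event() are illustrative placeholders, not
the exact identifiers in the sheepdog sources:

/* Sketch only: the self-promotion rule described above, with
 * illustrative names and layout, not the actual sheepdog code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cpg_node {
    uint32_t nodeid;
    uint32_t pid;
};

/* Assumed helper, stubbed for illustration: the real code would walk
 * the event queue looking for a blocked JOIN event created on behalf
 * of 'member'. */
static bool have_blocked_join_event(const struct cpg_node *member)
{
    (void)member;
    return false;   /* placeholder */
}

static bool should_become_master(const struct cpg_node *members,
                                 size_t nr_members,
                                 const struct cpg_node *myself)
{
    size_t i;

    for (i = 0; i < nr_members; i++) {
        if (members[i].nodeid == myself->nodeid &&
            members[i].pid == myself->pid)
            continue;   /* skip our own entry */

        /* A non-master node queues a blocked JOIN event for every node
         * that joined after it; if some member has no such event, an
         * older node is still alive and is (or will become) the master. */
        if (!have_blocked_join_event(&members[i]))
            return false;
    }

    /* We hold a JOIN event for every other member in the list, so we
     * are the oldest surviving node: promote ourselves to master. */
    return true;
}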

> Is there a way to confirm or reproduce this hang reliably?

Yes, we did it in our lab.

If you want to reproduce the cluster hang without a complicated network
setup, hack a master sheep as follows:
a) start a cluster
b) let the master join itself fully
c) format the cluster
d) wait for the confchg event for a secondary sheep
e) exit/crash/net-fail before sending any join_response messages to the
second sheep (a sketch of such a hack follows this list) - and the
cluster will hang. When this patch is applied, the cluster will un-hang.
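
The hack for step (e) can be as crude as forcing the master to exit
inside its confchg handling once a second member appears. A minimal
sketch, with placeholder names rather than the real sheepdog
identifiers:

/* Illustrative hack only: kill the master after it has seen the confchg
 * for a second sheep but before it ever sends a join_response. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

static bool we_are_master;  /* would be set when this sheep became master */

static void simulated_confchg(size_t nr_members)
{
    /* ... normal confchg handling would run here ... */

    if (we_are_master && nr_members >= 2) {
        fprintf(stderr, "simulating master crash before join_response\n");
        /* Without this patch the second sheep now hangs forever; with
         * the patch it detects the situation and promotes itself. */
        exit(1);
    }
}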

The .gone patch is not sufficient, as it does not unblock the message
queue if the crashing master has not yet sent a join response to the
second sheep. The .gone patch only works if the master accepted the
second sheep before crashing or losing network connection.

> > There is a third outstanding issue if two clusters merge, also to be
> > addressed in a following patch.
> 
> 
> What kind of issue?

Issue 1: There is a race condition around inc_epoch which causes sheep
to end up with mismatched epochs. We have a good patch which solves this
problem; we will publish it in the next couple of days.

Issue 2: Sheepdog builds the cpg_nodes array on the assumption that
corosync always delivers messages in the same order to every node, but
there are circumstances where corosync can deliver (especially confchg)
messages in a different order to different sheep. We have reproduced
this; it is not a bug in corosync, it is just a fact of networking.

If corosync is started very rapidly after sheepdog, if the network
partitions and then rejoins, or if sheepdog is started before the
corosync ring has coalesced, then different sheep end up with
inconsistent copies of cpg_nodes, and therefore with an inconsistent
view of who the master is. This causes really weird effects.
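
To make the assumption concrete, the membership list is built in the
order the confchg callbacks happen to arrive, roughly as in the sketch
below (names and layout are simplified placeholders, not the exact
sheepdog code):

/* Simplified sketch: each sheep appends members in its own confchg
 * delivery order, so the array is only consistent across sheep if
 * every sheep sees the same order. */
#include <stddef.h>
#include <stdint.h>

#define SD_MAX_NODES 1024   /* placeholder limit for this sketch */

struct cpg_node {
    uint32_t nodeid;
    uint32_t pid;
};

static struct cpg_node cpg_nodes[SD_MAX_NODES];
static size_t nr_cpg_nodes;

static void on_confchg_joined(const struct cpg_node *joined)
{
    /* If two sheep see the join events in a different order, their
     * cpg_nodes arrays differ, and so does their idea of which entry
     * is the master. */
    cpg_nodes[nr_cpg_nodes++] = *joined;
}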

We do not yet have a patch for this, but we have a good understanding of
the problem, and we can reproduce it in our lab, so we will write a
patch soon.

> +	// Exactly one non-master member has seen join events for all other
> +	// members, because events are ordered.
> +	for (i = 0; i < member_list_entries; i++) {
> +		struct cpg_node member = {
> 
> please use /* */ to comment multiple lines.

I will do that in future.

Thank you.

S.



