[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request

Liu Yuan namei.unix at gmail.com
Fri Apr 27 06:54:40 CEST 2012


Hi, Shevek
On 04/27/2012 08:15 AM, Shevek wrote:

> 
> A problem arises if a node joins the cluster and generates a
> confchg event, then crashes or leaves without sending a join
> request and receiving a join response. The second node to join
> never becomes master, and the entire cluster hangs.
> 
> This patch allows a node to detect whether it should promote itself
> to master after an arbitrary confchg event. Every node except the
> master creates a blocked JOIN event for every node that joined
> after itself, therefore the master is the node which has a JOIN
> event for every node in the members list.
> 
> A following patch will handle the case where a join request
> is sent, but the master crashes before sending a join response.
> 
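
The promotion rule described in the quoted text can be sketched roughly
as follows. This is only a minimal standalone illustration, not code
from the patch: struct node_id, has_join_event() and should_promote()
are hypothetical names, and the sketch assumes that a node does not
need a JOIN event for itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a cluster member identity. */
struct node_id {
	uint32_t nodeid;
	uint32_t pid;
};

static bool node_id_equal(const struct node_id *a, const struct node_id *b)
{
	return a->nodeid == b->nodeid && a->pid == b->pid;
}

/* Return true if 'node' appears in the list of blocked JOIN events. */
static bool has_join_event(const struct node_id *joins, size_t nr_joins,
			   const struct node_id *node)
{
	for (size_t i = 0; i < nr_joins; i++)
		if (node_id_equal(&joins[i], node))
			return true;
	return false;
}

/*
 * A node promotes itself to master when it holds a JOIN event for every
 * other node in the current members list (assumption: 'self' is skipped).
 */
static bool should_promote(const struct node_id *self,
			   const struct node_id *members, size_t nr_members,
			   const struct node_id *joins, size_t nr_joins)
{
	for (size_t i = 0; i < nr_members; i++) {
		if (node_id_equal(&members[i], self))
			continue;
		if (!has_join_event(joins, nr_joins, &members[i]))
			return false;
	}
	return true;
}
```

With three nodes joining in the order A, B, C, node B holds a JOIN
event only for C. Once A disappears from the members list, B passes the
check and C does not.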


Thanks for your patch.

I think commit c4e3559758b2e is dedicated to this problem: mastership
is actually transferred to the second sheep. So I suspect that the hang
is caused by another bug.

Is there a way to confirm or reproduce this hang reliably?

> There is a third outstanding issue if two clusters merge, also to be
> addressed in a following patch.


What kind of issue?


+	// Exactly one non-master member has seen join events for all other
+	// members, because events are ordered.
+	for (i = 0; i < member_list_entries; i++) {
+		struct cpg_node member = {

Please use /* */ for multi-line comments.
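
For example, the comment from the hunk above would look like this in
that style. The text is unchanged; only the comment markers differ, and
the leading-asterisk layout is assumed from the usual kernel-style
convention:

```c
/*
 * Exactly one non-master member has seen join events for all other
 * members, because events are ordered.
 */
```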

Thanks,
Yuan


