[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request

Liu Yuan namei.unix at gmail.com
Sat Apr 28 05:02:06 CEST 2012


On 04/27/2012 10:43 PM, Liu Yuan wrote:

> On 04/27/2012 08:15 AM, Shevek wrote:
> 
>> +	// Exactly one non-master member has seen join events for all other
>> +	// members, because events are ordered.
>> +	for (i = 0; i < member_list_entries; i++) {
>> +		struct cpg_node member = {
>> +			.nodeid = member_list[i].nodeid,
>> +			.pid = member_list[i].pid,
>> +			};
>> +		cevent = find_block_event(COROSYNC_EVENT_TYPE_JOIN, &member);
>> +		if (cevent == NULL) {
>> +			dprintf("Not promoting because member is not in our event list.");
>> +			goto nopromote;
>> +		}
>> +	}
>> +
>> +	list_for_each_entry(cevent, &corosync_event_list, list) {
>> +		dprintf("Setting first_node on event %p.", cevent);
>> +		cevent->first_node = 1;
>> +	}
>> +nopromote:
>> +
> 
> 
> I think the fix is way too hacky. It abuses the 'first node'
> notion, which, IIUC, means 'first node in the cluster' or
> 'first group of nodes in the cluster'.
> 
> I am not quite sure about this, but the fix really confuses me; it makes
> the join phase hard to follow. Kazum, what do you think of it?
> 
> Thanks,
> Yuan


Hi, Shevek,
	I have drafted a patch for this issue. Could you please try the
following patch?

Thanks,
Yuan

From 3ace66e31e3a5dd6886b340e54a37f31594d25dc Mon Sep 17 00:00:00 2001
From: Liu Yuan <tailai.ly at taobao.com>
Date: Sat, 28 Apr 2012 10:41:57 +0800
Subject: [PATCH] corosync: fix cluster hang

Consider a two-node cluster where A is the master and B is going
to join.
                A: I'm the master
B: send_join_msg
                A: crashed before sending join_response
B: blocked forever

The fix is to let B proceed as far as possible, to the point where B
can find that it has become the master itself.

Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
 sheep/cluster/corosync.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/sheep/cluster/corosync.c b/sheep/cluster/corosync.c
index 4a588e9..7f52385 100644
--- a/sheep/cluster/corosync.c
+++ b/sheep/cluster/corosync.c
@@ -393,7 +393,7 @@ static void __corosync_dispatch(void)
                        }
                }

-               if (join_finished)
+               if (join_finished || is_master(&cevent->sender))
                        done = __corosync_dispatch_one(cevent);
                else
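
To make the intent of the one-line change above easier to follow, here is a
minimal, self-contained sketch of the gating condition. It is only a model
under stated assumptions: struct node_id, struct event, struct dispatcher,
same_node() and may_dispatch() are hypothetical stand-ins invented for
illustration, not the real types or helpers in sheep/cluster/corosync.c, and
same_node() merely approximates whatever is_master() actually checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real sheepdog types. */
struct node_id {
	unsigned int nodeid;
	unsigned int pid;
};

struct event {
	struct node_id sender;	/* node that sent this event */
};

struct dispatcher {
	bool join_finished;	/* true once our join has completed */
	struct node_id master;	/* node currently believed to be master */
};

static bool same_node(const struct node_id *a, const struct node_id *b)
{
	return a->nodeid == b->nodeid && a->pid == b->pid;
}

/*
 * Models the patched condition:
 *     join_finished || is_master(&cevent->sender)
 * An event is dispatched if our join has finished, or if the event's
 * sender is the node currently believed to be master.
 */
static bool may_dispatch(const struct dispatcher *d, const struct event *ev)
{
	assert(d != NULL && ev != NULL);
	return d->join_finished || same_node(&d->master, &ev->sender);
}
```

In this model, while A is still believed to be master, an event sent by B is
held back until join_finished is set. Once A has crashed and B is considered
the master, the same event passes the check even though join_finished is
still false, which is what lets B make progress instead of blocking.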


