On 04/27/2012 10:43 PM, Liu Yuan wrote: > On 04/27/2012 08:15 AM, Shevek wrote: > >> + // Exactly one non-master member has seen join events for all other >> + // members, because events are ordered. >> + for (i = 0; i < member_list_entries; i++) { >> + struct cpg_node member = { >> + .nodeid = member_list[i].nodeid, >> + .pid = member_list[i].pid, >> + }; >> + cevent = find_block_event(COROSYNC_EVENT_TYPE_JOIN, &member); >> + if (cevent == NULL) { >> + dprintf("Not promoting because member is not in our event list."); >> + goto nopromote; >> + } >> + } >> + >> + list_for_each_entry(cevent, &corosync_event_list, list) { >> + dprintf("Setting first_node on event %p.", cevent); >> + cevent->first_node = 1; >> + } >> +nopromote: >> + > > > I think the fix is the way too hacky. The fix here abuse the 'first > node' denotation which is to mean, IIUC, 'first node in the cluster' or > 'first group of nodes in the cluster'. > > I am not quit sure about this, but the fix really confuses me, it makes > the join phase elusive. Kazum, how do you think of it? > > Thanks, > Yuan Hi, Shevek, I have drafted a patch against the issue, could you please try the following patch? Thanks, Yuan >From 3ace66e31e3a5dd6886b340e54a37f31594d25dc Mon Sep 17 00:00:00 2001 From: Liu Yuan <tailai.ly at taobao.com> Date: Sat, 28 Apr 2012 10:41:57 +0800 Subject: [PATCH] corosync: fix cluster hang Consider two nodes cluster, A is the master and B is going to join. A: I'm the master B: send_join_msg A: crached before sending join_response B: blocked for ever The fix is let B go as far as possible, where B can find himself become the master. Signed-off-by: Liu Yuan <tailai.ly at taobao.com> --- sheep/cluster/corosync.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/sheep/cluster/corosync.c b/sheep/cluster/corosync.c index 4a588e9..7f52385 100644 --- a/sheep/cluster/corosync.c +++ b/sheep/cluster/corosync.c @@ -393,7 +393,7 @@ static void __corosync_dispatch(void) } } - if (join_finished) + if (join_finished || is_master(&cevent->sender)) done = __corosync_dispatch_one(cevent); else |