[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request
Liu Yuan
namei.unix at gmail.com
Sat Apr 28 05:02:06 CEST 2012
On 04/27/2012 10:43 PM, Liu Yuan wrote:
> On 04/27/2012 08:15 AM, Shevek wrote:
>
>> + // Exactly one non-master member has seen join events for all other
>> + // members, because events are ordered.
>> + for (i = 0; i < member_list_entries; i++) {
>> + struct cpg_node member = {
>> + .nodeid = member_list[i].nodeid,
>> + .pid = member_list[i].pid,
>> + };
>> + cevent = find_block_event(COROSYNC_EVENT_TYPE_JOIN, &member);
>> + if (cevent == NULL) {
>> + dprintf("Not promoting because member is not in our event list.");
>> + goto nopromote;
>> + }
>> + }
>> +
>> + list_for_each_entry(cevent, &corosync_event_list, list) {
>> + dprintf("Setting first_node on event %p.", cevent);
>> + cevent->first_node = 1;
>> + }
>> +nopromote:
>> +
>
>
> I think the fix is way too hacky. It abuses the 'first node'
> designation, which is meant, IIUC, to denote the 'first node in the
> cluster' or the 'first group of nodes in the cluster'.
>
> I am not quite sure about this, but the fix really confuses me; it makes
> the join phase elusive. Kazum, what do you think of it?
>
> Thanks,
> Yuan
Hi, Shevek,
I have drafted a patch for the issue. Could you please try the
following patch?
Thanks,
Yuan
From 3ace66e31e3a5dd6886b340e54a37f31594d25dc Mon Sep 17 00:00:00 2001
From: Liu Yuan <tailai.ly at taobao.com>
Date: Sat, 28 Apr 2012 10:41:57 +0800
Subject: [PATCH] corosync: fix cluster hang
Consider a two-node cluster where A is the master and B is going
to join.
A: I'm the master
B: send_join_msg
A: crashed before sending join_response
B: blocked forever
The fix is to let B proceed as far as possible, to the point where B can
find that it has become the master.
Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
sheep/cluster/corosync.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/sheep/cluster/corosync.c b/sheep/cluster/corosync.c
index 4a588e9..7f52385 100644
--- a/sheep/cluster/corosync.c
+++ b/sheep/cluster/corosync.c
@@ -393,7 +393,7 @@ static void __corosync_dispatch(void)
}
}
- if (join_finished)
+ if (join_finished || is_master(&cevent->sender))
done = __corosync_dispatch_one(cevent);
else
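
To make the scenario from the commit message concrete, here is a minimal standalone sketch of the patched condition. It is not sheepdog code: struct node_id, same_node(), is_master_sketch() and may_dispatch() are hypothetical names, and the sketch assumes the master is simply the first entry in the current member list, which may differ from what is_master() really does in corosync.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node identity (a nodeid/pid pair). */
struct node_id {
	unsigned int nodeid;
	unsigned int pid;
};

static bool same_node(const struct node_id *a, const struct node_id *b)
{
	return a->nodeid == b->nodeid && a->pid == b->pid;
}

/*
 * Assumption for this sketch only: the master is the first entry
 * of the current member list.
 */
static bool is_master_sketch(const struct node_id *members, size_t nr,
			     const struct node_id *who)
{
	return nr > 0 && same_node(&members[0], who);
}

/*
 * Models the patched test: an event may be dispatched if the join
 * has finished, or if its sender is (now) the master.
 */
static bool may_dispatch(bool join_finished,
			 const struct node_id *members, size_t nr,
			 const struct node_id *sender)
{
	return join_finished || is_master_sketch(members, nr, sender);
}
```

Under that assumption, while A is still in the member list {A, B}, an event sent by B with join_finished unset is held back. Once A is gone and the list is {B}, the same event passes the check, so B is no longer stuck waiting for a join_response that will never arrive.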