[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request

Sat Apr 28 12:58:11 CEST 2012

On 04/28/2012 06:15 PM, Liu Yuan wrote:

> I don't think so.
> 
> See the example below (with only my previous c4e3559758b2e fix):
> 
> we have 2 nodes in the cluster (A, B), so
> 
> init: nr_cpg_nodes = 2, A is the master
> 	C joins
> 1: A, B, C get a join_message(C)
>   for nr_cpg_nodes, A = 2, B = 2, C = 0
> 2: A crashes before sending the response
> 3: B, C get a leave_message(A)
>   for nr_cpg_nodes, B = 2, C = 0
>   for is_master(), B = 1, C = 0
>   for join_finished, B = 1, C = 0
>   so now B is elected master and is responsible for send_response()
> 4: everything goes okay.
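
To make steps 2-4 concrete, here is a rough model of the election. It is only a sketch and not the real corosync.c code: struct view, view_is_master() and view_handle_leave() are made-up names, and I assume a departed node is marked as gone rather than removed, which is why B's count stays at 2 in step 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical per-node membership view; not sheepdog's real structs. */
struct member {
	char id;
	bool gone;	/* set when a leave message for this node arrives */
};

struct view {
	char self;				/* node that owns this view */
	struct member members[MAX_NODES];	/* in join order */
	size_t nr_members;			/* 0 until our own join completes */
	bool join_finished;
};

/* Master = first member that is not gone, and only if we finished joining. */
static bool view_is_master(const struct view *v)
{
	size_t i;

	assert(v != NULL);
	if (!v->join_finished)
		return false;
	for (i = 0; i < v->nr_members; i++)
		if (!v->members[i].gone)
			return v->members[i].id == v->self;
	return false;
}

/* Mark a departed node as gone; the member count is left unchanged. */
static void view_handle_leave(struct view *v, char id)
{
	size_t i;

	assert(v != NULL);
	for (i = 0; i < v->nr_members; i++)
		if (v->members[i].id == id)
			v->members[i].gone = true;
}
```

With B's view holding {A, B} and C's view still empty, handling the leave of A makes view_is_master() true for B only, so B is the one that has to answer C's pending join.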


I finally found some time to do more testing. I ran the test below to
confirm that this is correct:

First, patch the master so that it panics right before sending a JOIN
response for the third time:

diff --git a/sheep/cluster/corosync.c b/sheep/cluster/corosync.c
index 4a588e9..e960088 100644
--- a/sheep/cluster/corosync.c
+++ b/sheep/cluster/corosync.c
@@ -280,6 +280,7 @@ static int __corosync_dispatch_one(struct corosync_event *cevent)
        enum cluster_join_result res;
        struct sd_node entries[SD_MAX_NODES];
        int idx;
+       static int i;

        switch (cevent->type) {
        case COROSYNC_EVENT_TYPE_JOIN:
@@ -300,6 +301,12 @@ static int __corosync_dispatch_one(struct corosync_event *cevent)
                        if (res == CJ_RES_MASTER_TRANSFER)
                                nr_cpg_nodes = 0;

+                       i++;
+                       if (i == 3) {
+                               dprintf("%d\n", i);
+                               panic("Okay, I am forced out\n");
+                       }
+
                        send_message(COROSYNC_MSG_TYPE_JOIN_RESPONSE, res,
                                     &cevent->sender, cpg_nodes, nr_cpg_nodes,
                                     cevent->msg, cevent->msg_len);
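
The counter trick in the hunk above can also be written as a small reusable helper. This is only a sketch of the same idea, not something that exists in the tree: struct fault_point and fault_point_hit() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fault-injection point: fires exactly once, on the n-th hit. */
struct fault_point {
	int hits;	/* how many times the point has been reached */
	int trigger_at;	/* 1-based hit number on which to fire */
};

/* Count one pass through the point; return true only on the chosen pass. */
static bool fault_point_hit(struct fault_point *fp)
{
	assert(fp != NULL);
	fp->hits++;
	return fp->hits == fp->trigger_at;
}
```

A caller would keep a static struct fault_point with trigger_at set to 3 and call panic() when fault_point_hit() returns true, which gives the same behaviour as the static int i in the test patch.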


Then run the following script:
for i in 0 1; do sheep/sheep -a -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
echo simulate master is down before sending response
for i in 2; do sheep/sheep -a -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done

The logs then show that nodes 1 and 2 join the cluster without any
problem.

Thanks,
Yuan