[Sheepdog] [PATCH v2] sheep: tame sheep to recover the crash cluster

Sun Sep 25 17:01:02 CEST 2011

At Sun, 25 Sep 2011 12:00:20 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly at taobao.com>
> 
> Hi Kazum,
>         would this solve the data loss problem you mentioned when there is no epoch overlap? This patch would allow the cluster to recover as if the master (the last failed node) had never crashed.

The concept of mastership transfer is great!  I think this is the
right way to go.

But your patch has some problems.  Here is a test case to reproduce
the problems:

    #!/bin/bash

    # create a directory which has a different creation time
    sheep /store/1 -p 7001
    sleep 1
    collie cluster format -p 7001
    collie cluster shutdown -p 7001
    sleep 1

    # start Sheepdog
    sheep /store/0 -p 7000
    sleep 1
    collie cluster format -p 7000

    while true; do
        sheep /store/1 -p 7001
        sheep /store/2 -p 7002

        # wait for node join
        while [ "`collie cluster info -p 7002 -r 2>&1 | head -1`" != 'running' ]; do
            sleep 0.1
        done

        if [ "`collie node list -p 7002 -r | wc -l`" -ne 2 ]; then
            # break if the result is not correct
            break
        fi

        pkill -f "sheep /store/2"
    done

    # show results
    collie cluster info -p 7000
    collie cluster info -p 7002

The detailed reasons are below.

> @@ -975,17 +992,6 @@ static void __sd_deliver(struct cpg_event *cevent)
>  		addr_to_str(name, sizeof(name), m->from.addr, m->from.port),
>  		m->pid);
>  
> -	/*
> -	 * we don't want to perform any deliver events until we
> -	 * join; we wait for our JOIN message.
> -	 */
> -	if (!sys->join_finished) {
> -		if (m->pid != sys->this_pid || m->nodeid != sys->this_nodeid) {
> -			cevent->skip = 1;
> -			return;
> -		}
> -	}
> -

Sheepdog assumes that only joined nodes handle the delivered messages,
so we cannot remove this block.  You should pass only mastership
transfer events here.
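
For example, the check could be kept and relaxed only for master
transfer events.  This is just a rough sketch: the structs are
simplified stand-ins, and OP_MASTER_TRANSFER and should_skip_deliver()
are hypothetical names, not existing sheep code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the sheep structures; not the real ones. */
enum deliver_op {
	OP_JOIN,
	OP_LEAVE,
	OP_MASTER_TRANSFER,	/* hypothetical new message type */
};

struct deliver_msg {
	enum deliver_op op;
	uint32_t nodeid;
	uint32_t pid;
};

struct join_state {
	bool join_finished;
	uint32_t this_nodeid;
	uint32_t this_pid;
};

/*
 * Return true if the delivered message must be skipped.
 *
 * A node which has not joined yet handles only its own messages,
 * plus master transfer events from any node.
 */
static bool should_skip_deliver(const struct join_state *s,
				const struct deliver_msg *m)
{
	if (s->join_finished)
		return false;
	if (m->op == OP_MASTER_TRANSFER)
		return false;
	return m->pid != s->this_pid || m->nodeid != s->this_nodeid;
}
```

With this, leave messages from other nodes are still skipped until
the node has joined.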

>  	if (m->op == SD_MSG_JOIN) {
>  		uint32_t nodeid = m->nodeid;
>  		uint32_t pid = m->pid;
> @@ -1052,7 +1058,15 @@ static void send_join_response(struct work_deliver *w)
>  			jm->nr_leave_nodes++;
>  		}
>  		print_node_list(&sys->leave_list);
> +	} else if (jm->result != SD_RES_SUCCESS &&
> +			jm->epoch > sys->epoch &&
> +			jm->cluster_status == SD_STATUS_WAIT_FOR_JOIN) {
> +		eprintf("Transfer mastership.\n");
> +		leave_cluster();
> +		eprintf("Restart me later when master is up, please.Bye.\n");
> +		exit(1);
>  	}
> +	jm->epoch = sys->epoch;
>  	send_message(sys->handle, m);
>  }
>  
> @@ -1090,15 +1104,23 @@ static void __sd_deliver_done(struct cpg_event *cevent)
>  				lm = (struct leave_message *)m;
>  				add_node_to_leave_list(m);
>  
> -				if (lm->epoch > sys->leave_epoch)
> -					sys->leave_epoch = lm->epoch;
> +				/* Sheep needs this to identify itself as master.
> +				 * Now mastership transfer is done.
> +				 */
> +				if (!sys->join_finished) {
> +					sys->join_finished = 1;
> +					move_node_to_sd_list(sys->this_nodeid, sys->this_pid, sys->this_node);
> +					sys->epoch = get_latest_epoch();
> +				}

IIUC, this code assumes that all other nodes will send leave messages
because this node has a newer epoch, so this node can become the
master.  But the assumption is wrong, because a node which has
completely wrong epoch information (e.g. a node with a different
creation time) also sends a leave message.

My suggestion is to introduce another message type, something like
SD_MSG_MASTER_TRANSFER.  I think we should clearly distinguish master
transfer events from leave messages.
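
Roughly, I mean something like the following.  It is only a sketch
under my own assumptions: the structs are simplified stand-ins, and
PENDING_MASTER_TRANSFER, is_master and handle_pending_msg() are
hypothetical names that do not exist in the current code.

```c
#include <stdbool.h>

/* Simplified stand-ins; not the real sheep definitions. */
enum pending_op {
	PENDING_LEAVE,
	PENDING_MASTER_TRANSFER,	/* hypothetical new message type */
};

struct pending_msg {
	enum pending_op op;
};

struct master_state {
	bool join_finished;
	bool is_master;
	int nr_leave_nodes;
};

/*
 * Handle a message received while waiting for the cluster to start.
 *
 * A leave message only records that the sender has left; a node with
 * wrong epoch information sends it too, so it must not promote us.
 * Only an explicit master transfer makes this node the master.
 */
static void handle_pending_msg(struct master_state *s,
			       const struct pending_msg *m)
{
	switch (m->op) {
	case PENDING_LEAVE:
		s->nr_leave_nodes++;
		break;
	case PENDING_MASTER_TRANSFER:
		if (!s->join_finished) {
			s->join_finished = true;
			s->is_master = true;
		}
		break;
	}
}
```

In this way, a leave message from a node with a different creation
time cannot turn the receiver into a master.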

Thanks,

Kazutaka

>  
>  				nr_local = get_nodes_nr_epoch(sys->epoch);
>  				nr = get_nodes_nr_from(&sys->sd_node_list);
>  				nr_leave = get_nodes_nr_from(&sys->leave_list);
> +
> +				dprintf("%d == %d + %d \n", nr_local, nr, nr_leave);
>  				if (nr_local == nr + nr_leave) {
>  					sys->status = SD_STATUS_OK;
> -					sys->epoch = sys->leave_epoch + 1;
> +					sys->epoch = sys->epoch;
>  					update_epoch_log(sys->epoch);
>  					update_epoch_store(sys->epoch);
>  				}
> @@ -1931,7 +1953,6 @@ join_retry:
>  	sys->handle = cpg_handle;
>  	sys->this_nodeid = nodeid;
>  	sys->this_pid = getpid();
> -	sys->leave_epoch = 0;
>  
>  	ret = set_addr(nodeid, port);
>  	if (ret)
> diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
> index 6680f79..4711cdd 100644
> --- a/sheep/sheep_priv.h
> +++ b/sheep/sheep_priv.h
> @@ -144,7 +144,6 @@ struct cluster_info {
>  	int nr_outstanding_reqs;
>  
>  	uint32_t recovered_epoch;
> -	uint32_t leave_epoch; /* The highest number in the clsuter */
>  
>  	int use_directio;
>  
> -- 
> 1.7.6.1