[Sheepdog] [PATCH v2 1/3] sheep: introduce SD_STATUS_HALT

Wed Oct 12 13:28:08 CEST 2011

At Tue, 11 Oct 2011 17:27:11 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly at taobao.com>
> 
> Currently, sheepdog will serve IO requests even if the number of nodes is less than 'copies'.
> 
> When the number of nodes (or zones) is less than the number of copies specified by
> the collie-cluster-format command, the sheepdog cluster should stop serving IO requests.
> 
> This is necessary to handle the following subtle case:
> 
> + good nodes, - failed nodes.
> 
> 0       1      2     3
> +       -      -     +
> +  -->  - -->  - --> +
> +       +      -     # <-- permanently down.
>         ^
>         |
> this node has the latest data
> 
> At stage 3, we will have a cluster recovered without the data written at stage 1.
> 
> While the nodes are in SD_STATUS_HALT, sheepdog can still handle configuration changes
> and do the recovery job.
> 
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
>  include/sheep.h          |    1 +
>  include/sheepdog_proto.h |    1 +
>  sheep/group.c            |   27 ++++++++++++++++++++++-----
>  sheep/sheep_priv.h       |    1 +
>  4 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/include/sheep.h b/include/sheep.h
> index 31516d9..943cdf7 100644
> --- a/include/sheep.h
> +++ b/include/sheep.h
> @@ -254,6 +254,7 @@ static inline const char *sd_strerror(int err)
>  		{SD_RES_WAIT_FOR_FORMAT, "Waiting for a format operation"},
>  		{SD_RES_WAIT_FOR_JOIN, "Waiting for other nodes joining"},
>  		{SD_RES_JOIN_FAILED, "The node had failed to join sheepdog"},
> +		{SD_RES_HALT, "The node is stopped doing IO, short of living nodes"},
>  
>  		{SD_RES_OLD_NODE_VER, "Remote node has an old epoch"},
>  		{SD_RES_NEW_NODE_VER, "Remote node has a new epoch"},
> diff --git a/include/sheepdog_proto.h b/include/sheepdog_proto.h
> index 2b042f4..a5a41d0 100644
> --- a/include/sheepdog_proto.h
> +++ b/include/sheepdog_proto.h
> @@ -58,6 +58,7 @@
>  #define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format operation */
>  #define SD_RES_WAIT_FOR_JOIN    0x17 /* Sheepdog is waiting for other nodes joining */
>  #define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
> +#define SD_RES_HALT 0x19 /* Target node is stopped doing IO */
>  
>  /*
>   * Object ID rules
> diff --git a/sheep/group.c b/sheep/group.c
> index f6743f5..59293b2 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -335,6 +335,9 @@ void cluster_queue_request(struct work *work, int idx)
>  		case SD_STATUS_JOIN_FAILED:
>  			ret = SD_RES_JOIN_FAILED;
>  			break;
> +		case SD_STATUS_HALT:
> +			ret = SD_RES_HALT;
> +			break;
>  		default:
>  			ret = SD_RES_SYSTEM_ERROR;
>  			break;
> @@ -639,6 +642,10 @@ static int get_cluster_status(struct sheepdog_node_list_entry *from,
>  		break;
>  	case SD_STATUS_SHUTDOWN:
>  		return SD_RES_SHUTDOWN;
> +	case SD_STATUS_HALT:
> +		if (inc_epoch)
> +			*inc_epoch = 1;
> +		break;

We should check epoch and ctime of the joining node.  Otherwise,
invalid nodes can join the cluster.
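
To illustrate the kind of check I mean, here is a minimal, self-contained
sketch. The struct, the enum and check_joining_node() are hypothetical names
made up for this mail, not existing sheepdog code, and the policy (reject a
different ctime, reject a joining node whose epoch is ahead of ours) is an
assumption about what "valid" should mean in the HALT state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
enum join_check {
	JOIN_OK = 0,
	JOIN_INVALID_CTIME,	/* node was formatted as part of another cluster */
	JOIN_EPOCH_AHEAD,	/* node has seen a newer epoch than we have */
};

/* Hypothetical, simplified view of the cluster state a node holds. */
struct cluster_view {
	uint32_t epoch;		/* latest epoch the node knows about */
	uint64_t ctime;		/* creation time recorded at cluster format */
};

/* Decide whether a joining node may enter a halted cluster. */
static enum join_check check_joining_node(const struct cluster_view *local,
					   const struct cluster_view *joining)
{
	assert(local && joining);

	/* A different ctime means the node belongs to another cluster. */
	if (joining->ctime != local->ctime)
		return JOIN_INVALID_CTIME;

	/* Assumed policy: a node ahead of us must not join silently. */
	if (joining->epoch > local->epoch)
		return JOIN_EPOCH_AHEAD;

	return JOIN_OK;
}
```

With something like this in the SD_STATUS_HALT case, inc_epoch would only be
set when the check returns JOIN_OK.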

>  	default:
>  		break;
>  	}
> @@ -810,12 +817,13 @@ static void update_cluster_info(struct join_message *msg)
>  				sheepid_to_str(&msg->nodes[i].sheepid));
>  	}
>  
> -	if (msg->cluster_status != SD_STATUS_OK)
> +	if (msg->cluster_status == SD_STATUS_WAIT_FOR_JOIN)
>  		add_node_to_leave_list((struct message_header *)msg);
>  
>  	sys->join_finished = 1;
>  
> -	if (msg->cluster_status == SD_STATUS_OK && msg->inc_epoch)
> +	if ((msg->cluster_status == SD_STATUS_OK || msg->cluster_status == SD_STATUS_HALT)
> +	     && msg->inc_epoch)
>  		update_epoch_log(sys->epoch);
>  
>  join_finished:
> @@ -840,6 +848,12 @@ join_finished:
>  		}
>  	}
>  
> +	if (msg->cluster_status == SD_STATUS_HALT && msg->inc_epoch) {
> +		sys->epoch++;
> +		update_epoch_log(sys->epoch);
> +		update_epoch_store(sys->epoch);
> +	}
> +

We need to call set_global_nr_copies() and set_cluster_ctime() here
for newly added nodes.

Apart from the above, I think we must also replace "sys->status == SD_STATUS_OK"
with "sys->status == SD_STATUS_OK || sys->status == SD_STATUS_HALT"
in del_node() and __sd_notify_done().
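
One way to avoid repeating that condition would be a small predicate. The
sketch below is only an illustration: the enum values and the name
status_tracks_membership() are hypothetical, and the real SD_STATUS_*
constants would be used instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SD_STATUS_* constants. */
enum status_sketch {
	ST_OK = 1,
	ST_WAIT_FOR_FORMAT,
	ST_WAIT_FOR_JOIN,
	ST_SHUTDOWN,
	ST_JOIN_FAILED,
	ST_HALT,
};

/*
 * True when the cluster must keep processing membership changes:
 * either it is fully working (OK) or it has stopped IO but still
 * handles joins, leaves and recovery (HALT).
 */
static bool status_tracks_membership(enum status_sketch status)
{
	/* Guard against values outside the known states. */
	assert(status >= ST_OK && status <= ST_HALT);

	return status == ST_OK || status == ST_HALT;
}
```

del_node() and __sd_notify_done() could then test
status_tracks_membership(sys->status) instead of comparing against
SD_STATUS_OK directly.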

Thanks,

Kazutaka