[Sheepdog] [PATCH 1/3] collie: add manual recover subcommand for cluster

Mon Oct 24 08:11:09 CEST 2011

At Sat, 22 Oct 2011 13:32:42 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly at taobao.com>
> 
> Currently, a sheepdog cluster cannot be recovered under the following conditions:
> 
> 1) the master node is physically down after the cluster crashes with
>    different epochs during recovery.
> 2) some of the nodes are physically down after the cluster is shut down
>    during recovery.
> 
> This patch adds a manual recovery mechanism. With this patch, you can manually
> recover the cluster from any live node by:
> 
> $ collie cluster recover
> 
> [Use with Caution]
> 
> This command will increment the cluster epoch by 1!
> 
> For case 1), because of the mastership mechanism, you first need to try
> starting the nodes one after another until the master node is up. If the
> master node cannot be brought up, you can simply run the recover command.
> After that, you can freely join the other good nodes.
> 
> For case 2), you should first try to start all the nodes to see whether any
> of them is physically down. If any is, you can simply run the recover command.

How about printing a warning message and asking for confirmation before
doing cluster recovery? I guess newbies could run 'collie cluster recover'
by mistake without first looking for the previous master node.
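
Something along these lines could work. This is only a rough sketch, not
tested against the collie tree: confirm_recover() is a hypothetical helper
name, and the warning text is just a suggestion. It prints the warning,
reads one line, and treats anything other than an explicit "yes"
(including EOF) as a refusal.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: print a warning on `out`
 * and read one line from `in`.  Returns 1 only if the user typed "yes".
 */
static int confirm_recover(FILE *in, FILE *out)
{
	char buf[16];

	fprintf(out,
		"Caution! This forces recovery and increments the cluster epoch by 1.\n"
		"Are you sure you want to continue? [yes/no]: ");
	fflush(out);

	if (!fgets(buf, sizeof(buf), in))
		return 0;	/* EOF or read error: treat as "no" */

	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */

	return strcmp(buf, "yes") == 0;
}
```

cluster_recover() could then call confirm_recover(stdin, stdout) before
connect_to() and return early when it yields 0. Passing the streams in
keeps the helper easy to test; whether an option to skip the prompt for
scripted use is wanted is a separate question.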

Thanks,

Kazutaka

> 
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
>  collie/cluster.c |   36 ++++++++++++++++++++++++++++++++++++
>  include/sheep.h  |    1 +
>  2 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/collie/cluster.c b/collie/cluster.c
> index adc3b5f..90ff125 100644
> --- a/collie/cluster.c
> +++ b/collie/cluster.c
> @@ -169,6 +169,40 @@ static int cluster_shutdown(int argc, char **argv)
>  	return EXIT_SUCCESS;
>  }
>  
> +static int cluster_recover(int argc, char **argv)
> +{
> +	int fd, ret;
> +	struct sd_req hdr;
> +	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
> +	unsigned rlen, wlen;
> +
> +	fd = connect_to(sdhost, sdport);
> +	if (fd < 0)
> +		return EXIT_SYSFAIL;
> +
> +	memset(&hdr, 0, sizeof(hdr));
> +
> +	hdr.opcode = SD_OP_RECOVER;
> +	hdr.epoch = node_list_version;
> +
> +	rlen = 0;
> +	wlen = 0;
> +	ret = exec_req(fd, &hdr, NULL, &wlen, &rlen);
> +	close(fd);
> +
> +	if (ret) {
> +		fprintf(stderr, "failed to connect\n");
> +		return EXIT_SYSFAIL;
> +	}
> +
> +	if (rsp->result != SD_RES_SUCCESS) {
> +		fprintf(stderr, "%s\n", sd_strerror(rsp->result));
> +		return EXIT_FAILURE;
> +	}
> +
> +	return EXIT_SUCCESS;
> +}
> +
>  static struct subcommand cluster_cmd[] = {
>  	{"info", NULL, "aprh", "show cluster information",
>  	 0, cluster_info},
> @@ -176,6 +210,8 @@ static struct subcommand cluster_cmd[] = {
>  	 0, cluster_format},
>  	{"shutdown", NULL, "aph", "stop Sheepdog",
>  	 SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
> +	{"recover", NULL, "aph", "manually recover the cluster",
> +	 0, cluster_recover},
>  	{NULL,},
>  };
>  
> diff --git a/include/sheep.h b/include/sheep.h
> index e06d34b..46ecf96 100644
> --- a/include/sheep.h
> +++ b/include/sheep.h
> @@ -36,6 +36,7 @@
>  #define SD_OP_STAT_CLUSTER   0x87
>  #define SD_OP_KILL_NODE      0x88
>  #define SD_OP_GET_VDI_ATTR   0x89
> +#define SD_OP_RECOVER	     0x8A
>  
>  #define SD_FLAG_CMD_IO_LOCAL   0x0010
>  #define SD_FLAG_CMD_RECOVERY 0x0020
> -- 
> 1.7.6.1