[Sheepdog] [PATCH 1/3] collie: add manual recover subcommand for cluster
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Mon Oct 24 08:11:09 CEST 2011
At Sat, 22 Oct 2011 13:32:42 +0800,
Liu Yuan wrote:
>
> From: Liu Yuan <tailai.ly at taobao.com>
>
> Currently, the sheepdog cluster cannot be recovered under the following
> conditions:
>
> 1) the master node is physically down after the cluster crashes with
> different epochs during recovery.
> 2) some of the nodes are physically down after the cluster is shut down
> during recovery.
>
> This patch adds a manual recovery mechanism. With this patch, you can manually
> recover the cluster from any live node by:
>
> $ collie cluster recover
>
> [Use with Caution]
>
> This command will increment the cluster epoch by 1!
>
> For case 1), you need to try to start up the nodes in sequence for the first
> round until the master node is up, thanks to the mastership mechanism. If
> that unfortunately fails, you can simply run the recover command. After that,
> you can freely join the other good nodes.
>
> For case 2), you should first try to start up all the nodes to see whether any
> of them is physically down. If any is, you can simply run the recover command.
How about printing a warning message and asking for confirmation before
doing cluster recovery? I guess newbies could run 'collie cluster recover'
by mistake without first looking for the previous master node.
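
Something like the following is what I have in mind. It is only a rough
sketch, not tested against the collie tree: answer_is_yes() and
confirm_recover() are hypothetical helper names, and the wording of the
warning is just a placeholder.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 only when the line is exactly "yes"
 * (a trailing newline is ignored), 0 for anything else.
 */
static int answer_is_yes(const char *line)
{
	size_t len = strcspn(line, "\r\n");

	return len == 3 && strncmp(line, "yes", 3) == 0;
}

/*
 * Hypothetical helper: print a warning on 'out', read one line from
 * 'in', and return 1 if the user confirmed, 0 otherwise (EOF included).
 */
static int confirm_recover(FILE *in, FILE *out)
{
	char buf[16];

	fprintf(out,
		"Caution! This will increment the cluster epoch by 1.\n"
		"Run it only if the previous master node cannot be started.\n"
		"Are you sure you want to continue? [yes/no]: ");
	fflush(out);

	if (!fgets(buf, sizeof(buf), in))
		return 0;

	return answer_is_yes(buf);
}
```

cluster_recover() could then call confirm_recover(stdin, stdout) before
connect_to() and return without sending SD_OP_RECOVER when it returns 0.
Requiring a full "yes" rather than "y" is deliberate, since the command
changes the epoch.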
Thanks,
Kazutaka
>
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
> collie/cluster.c | 36 ++++++++++++++++++++++++++++++++++++
> include/sheep.h | 1 +
> 2 files changed, 37 insertions(+), 0 deletions(-)
>
> diff --git a/collie/cluster.c b/collie/cluster.c
> index adc3b5f..90ff125 100644
> --- a/collie/cluster.c
> +++ b/collie/cluster.c
> @@ -169,6 +169,40 @@ static int cluster_shutdown(int argc, char **argv)
>  	return EXIT_SUCCESS;
>  }
>
> +static int cluster_recover(int argc, char **argv)
> +{
> +	int fd, ret;
> +	struct sd_req hdr;
> +	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
> +	unsigned rlen, wlen;
> +
> +	fd = connect_to(sdhost, sdport);
> +	if (fd < 0)
> +		return EXIT_SYSFAIL;
> +
> +	memset(&hdr, 0, sizeof(hdr));
> +
> +	hdr.opcode = SD_OP_RECOVER;
> +	hdr.epoch = node_list_version;
> +
> +	rlen = 0;
> +	wlen = 0;
> +	ret = exec_req(fd, &hdr, NULL, &wlen, &rlen);
> +	close(fd);
> +
> +	if (ret) {
> +		fprintf(stderr, "failed to connect\n");
> +		return EXIT_SYSFAIL;
> +	}
> +
> +	if (rsp->result != SD_RES_SUCCESS) {
> +		fprintf(stderr, "%s\n", sd_strerror(rsp->result));
> +		return EXIT_FAILURE;
> +	}
> +
> +	return EXIT_SUCCESS;
> +}
> +
>  static struct subcommand cluster_cmd[] = {
>  	{"info", NULL, "aprh", "show cluster information",
>  	 0, cluster_info},
> @@ -176,6 +210,8 @@ static struct subcommand cluster_cmd[] = {
>  	 0, cluster_format},
>  	{"shutdown", NULL, "aph", "stop Sheepdog",
>  	 SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
> +	{"recover", NULL, "aph", "manually recover the cluster",
> +	 0, cluster_recover},
>  	{NULL,},
>  };
>
> diff --git a/include/sheep.h b/include/sheep.h
> index e06d34b..46ecf96 100644
> --- a/include/sheep.h
> +++ b/include/sheep.h
> @@ -36,6 +36,7 @@
> #define SD_OP_STAT_CLUSTER 0x87
> #define SD_OP_KILL_NODE 0x88
> #define SD_OP_GET_VDI_ATTR 0x89
> +#define SD_OP_RECOVER 0x8A
>
> #define SD_FLAG_CMD_IO_LOCAL 0x0010
> #define SD_FLAG_CMD_RECOVERY 0x0020
> --
> 1.7.6.1