At Sat, 22 Oct 2011 13:32:42 +0800,
Liu Yuan wrote:
>
> From: Liu Yuan <tailai.ly at taobao.com>
>
> Currently, the sheepdog cluster cannot be recovered under the following
> conditions:
>
> 1) the master node is physically down after the cluster crashes with
>    different epochs during recovery.
> 2) some of the nodes are physically down after the cluster is shut down
>    during recovery.
>
> This patch adds a manual recovery mechanism. With this patch, you can manually
> recover the cluster at any live node by:
>
>     $ collie cluster recover
>
> [Use with Caution]
>
> This command will increment the cluster epoch by 1!
>
> For case 1), you need to try to start up the nodes in sequence for the first
> round until the master node is up, thanks to the mastership mechanism. If
> that unfortunately fails, you can simply run the recover command. After that,
> you can freely join the other good nodes.
>
> For case 2), you'd better try to start up all the nodes to see whether any of
> them are physically down. If any are, you can simply run the recover command.

How about prompting a warning message before doing cluster recovery?
I guess newbies could run 'collie cluster recover' by mistake without
first looking for the previous master node.
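A minimal sketch of such a prompt, assuming a hypothetical helper
confirm_recover() (it is not part of the posted patch, and the wording of the
message is only a suggestion). It prints the warning, reads one line, and
treats anything other than a literal "yes" as a refusal:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the posted patch: warn about the epoch
 * bump and ask for explicit confirmation. Returns 1 only if the user types
 * "yes"; anything else, including EOF, counts as a refusal.
 */
static int confirm_recover(FILE *in, FILE *out)
{
	char buf[16];

	assert(in != NULL && out != NULL);

	fputs("Caution! This increments the cluster epoch by 1.\n"
	      "Make sure the previous master node really cannot be started.\n"
	      "Are you sure you want to continue? [yes/no]: ", out);
	fflush(out);

	/* EOF or a read error is treated as "no" */
	if (!fgets(buf, sizeof(buf), in))
		return 0;

	buf[strcspn(buf, "\r\n")] = '\0';	/* strip the trailing newline */
	return strcmp(buf, "yes") == 0;
}
```

cluster_recover() could then call confirm_recover(stdin, stdout) before
connect_to() and return early when it yields 0, so that no SD_OP_RECOVER
request is sent unless the user has confirmed.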
Thanks,

Kazutaka

>
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
>  collie/cluster.c |   36 ++++++++++++++++++++++++++++++++++++
>  include/sheep.h  |    1 +
>  2 files changed, 37 insertions(+), 0 deletions(-)
>
> diff --git a/collie/cluster.c b/collie/cluster.c
> index adc3b5f..90ff125 100644
> --- a/collie/cluster.c
> +++ b/collie/cluster.c
> @@ -169,6 +169,40 @@ static int cluster_shutdown(int argc, char **argv)
>  	return EXIT_SUCCESS;
>  }
>
> +static int cluster_recover(int argc, char **argv)
> +{
> +	int fd, ret;
> +	struct sd_req hdr;
> +	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
> +	unsigned rlen, wlen;
> +
> +	fd = connect_to(sdhost, sdport);
> +	if (fd < 0)
> +		return EXIT_SYSFAIL;
> +
> +	memset(&hdr, 0, sizeof(hdr));
> +
> +	hdr.opcode = SD_OP_RECOVER;
> +	hdr.epoch = node_list_version;
> +
> +	rlen = 0;
> +	wlen = 0;
> +	ret = exec_req(fd, &hdr, NULL, &wlen, &rlen);
> +	close(fd);
> +
> +	if (ret) {
> +		fprintf(stderr, "failed to connect\n");
> +		return EXIT_SYSFAIL;
> +	}
> +
> +	if (rsp->result != SD_RES_SUCCESS) {
> +		fprintf(stderr, "%s\n", sd_strerror(rsp->result));
> +		return EXIT_FAILURE;
> +	}
> +
> +	return EXIT_SUCCESS;
> +}
> +
>  static struct subcommand cluster_cmd[] = {
>  	{"info", NULL, "aprh", "show cluster information",
>  	 0, cluster_info},
> @@ -176,6 +210,8 @@ static struct subcommand cluster_cmd[] = {
>  	 0, cluster_format},
>  	{"shutdown", NULL, "aph", "stop Sheepdog",
>  	 SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
> +	{"recover", NULL, "aph", "manually recover the cluster",
> +	 0, cluster_recover},
>  	{NULL,},
>  };
>
> diff --git a/include/sheep.h b/include/sheep.h
> index e06d34b..46ecf96 100644
> --- a/include/sheep.h
> +++ b/include/sheep.h
> @@ -36,6 +36,7 @@
>  #define SD_OP_STAT_CLUSTER   0x87
>  #define SD_OP_KILL_NODE      0x88
>  #define SD_OP_GET_VDI_ATTR   0x89
> +#define SD_OP_RECOVER        0x8A
>
>  #define SD_FLAG_CMD_IO_LOCAL   0x0010
>  #define SD_FLAG_CMD_RECOVERY   0x0020
> --
> 1.7.6.1