[Sheepdog] [PATCH 1/3] collie: add manual recover subcommand for cluster

Sat Oct 22 07:32:42 CEST 2011

From: Liu Yuan <tailai.ly at taobao.com>

Currently, a Sheepdog cluster cannot be recovered in the following cases:

1) the master node is physically down after the cluster crashes with
   different epochs during recovery.
2) some of the nodes are physically down after the cluster is shut down
   during recovery.

This patch adds a manual recovery mechanism. With this patch, you can manually
recover the cluster from any live node by running:

$ collie cluster recover

[Use with Caution]

This command increments the cluster epoch by 1!

For case 1), first try to start up the nodes in sequence for one round until
the master node is up (this is required by the mastership mechanism). If the
master node does not come up, simply run the recover command. After that, you
can freely join the other good nodes.

For case 2), first try to start up all the nodes to see whether any of them is
physically down. If any node is, simply run the recover command.

Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
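The procedure described in the commit message can be sketched as a small
decision helper. This is only an illustrative model and is not part of the
patch: the names failure_case, startup_result, needs_manual_recover and
epoch_after_recover are hypothetical and do not exist in Sheepdog. It assumes
the operator has already attempted one round of node startup and recorded the
outcome.

```c
#include <stdbool.h>

/* Hypothetical model of the two cases in the commit message. */
enum failure_case {
	CASE_MASTER_LOST = 1,	/* case 1): master may be physically down */
	CASE_NODES_LOST = 2	/* case 2): some nodes may be physically down */
};

/* Outcome of one round of trying to start the nodes (assumed input). */
struct startup_result {
	int nodes_expected;	/* nodes that were in the cluster */
	int nodes_started;	/* nodes that actually came up */
	bool master_up;		/* did the master node come up? */
};

/* Returns true when "collie cluster recover" would be the next step. */
bool needs_manual_recover(enum failure_case c, const struct startup_result *r)
{
	switch (c) {
	case CASE_MASTER_LOST:
		/* Only recover manually if the master never came up. */
		return !r->master_up;
	case CASE_NODES_LOST:
		/* Only recover manually if some node is really gone. */
		return r->nodes_started < r->nodes_expected;
	}
	return false;
}

/* The recover command increments the cluster epoch by 1. */
unsigned epoch_after_recover(unsigned epoch)
{
	return epoch + 1;
}
```

The helper only captures when the manual command is warranted; in both cases
a normal startup is attempted first, because the command bumps the epoch.
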
 collie/cluster.c |   36 ++++++++++++++++++++++++++++++++++++
 include/sheep.h  |    1 +
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/collie/cluster.c b/collie/cluster.c
index adc3b5f..90ff125 100644
--- a/collie/cluster.c
+++ b/collie/cluster.c
@@ -169,6 +169,40 @@ static int cluster_shutdown(int argc, char **argv)
 	return EXIT_SUCCESS;
 }
 
+static int cluster_recover(int argc, char **argv)
+{
+	int fd, ret;
+	struct sd_req hdr;
+	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
+	unsigned rlen, wlen;
+
+	fd = connect_to(sdhost, sdport);
+	if (fd < 0)
+		return EXIT_SYSFAIL;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.opcode = SD_OP_RECOVER;
+	hdr.epoch = node_list_version;
+
+	rlen = 0;
+	wlen = 0;
+	ret = exec_req(fd, &hdr, NULL, &wlen, &rlen);
+	close(fd);
+
+	if (ret) {
+		fprintf(stderr, "failed to send the recover request\n");
+		return EXIT_SYSFAIL;
+	}
+
+	if (rsp->result != SD_RES_SUCCESS) {
+		fprintf(stderr, "%s\n", sd_strerror(rsp->result));
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
+
 static struct subcommand cluster_cmd[] = {
 	{"info", NULL, "aprh", "show cluster information",
 	 0, cluster_info},
@@ -176,6 +210,8 @@ static struct subcommand cluster_cmd[] = {
 	 0, cluster_format},
 	{"shutdown", NULL, "aph", "stop Sheepdog",
 	 SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
+	{"recover", NULL, "aph", "manually recover the cluster",
+	 0, cluster_recover},
 	{NULL,},
 };
 
diff --git a/include/sheep.h b/include/sheep.h
index e06d34b..46ecf96 100644
--- a/include/sheep.h
+++ b/include/sheep.h
@@ -36,6 +36,7 @@
 #define SD_OP_STAT_CLUSTER   0x87
 #define SD_OP_KILL_NODE      0x88
 #define SD_OP_GET_VDI_ATTR   0x89
+#define SD_OP_RECOVER        0x8A
 
 #define SD_FLAG_CMD_IO_LOCAL   0x0010
 #define SD_FLAG_CMD_RECOVERY 0x0020
-- 
1.7.6.1