[Sheepdog] [PATCH V2 2/2] sheep: teach sheepdog to better recover the cluster

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Fri Sep 23 13:49:56 CEST 2011


At Thu, 22 Sep 2011 15:05:27 +0800,
Liu Yuan wrote:
> 
> On 09/22/2011 02:34 PM, Liu Yuan wrote:
> > On 09/22/2011 02:01 PM, MORITA Kazutaka wrote:
> >> At Wed, 21 Sep 2011 14:59:26 +0800,
> >> Liu Yuan wrote:
> >>> Kazutaka,
> >>>       I guess this patch addresses the inconsistency problem you mentioned.
> >>> The other comments are addressed too.
> >> Thanks, this solves the inconsistency problem in a nice way!  I've
> >> applied 3 patches in the v3 patchset.
> >>
> >
> > Umm, actually, this only resolves the special case you mentioned
> > (the first node we start up has to be the first one that went down,
> > because its epoch has the full node information stored).
> >
> > Currently, we cannot *correctly* recover the cluster if we start up
> > nodes other than the node that went down first, and in my opinion we
> > cannot even handle this situation in software.  Sheepdog itself cannot
> > determine which node has the epoch with the full node information;
> > however, from the outside, the admin can find it by hand.  So, I'm
> > afraid, Sheepdog will have to rely on outside knowledge to handle some
> > recovery cases.
> >
> For example, below we get an inconsistent epoch history even though the
> cluster comes up.  As you mentioned, an inconsistent epoch history will
> result in data loss.
> 
> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i;sleep 1;done
> root@taobao:/home/dev/sheepdog# collie/collie cluster format
> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
> root@taobao:/home/dev/sheepdog# for i in 1 0 0 2; do ./sheep/sheep /store/$i -z $i -p 700$i;sleep 1;done
> root@taobao:/home/dev/sheepdog# for i in 0 1 2; do ./collie/collie cluster info -p 700$i;done
> Cluster status: running
> 
> Creation time        Epoch Nodes
> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
> 2011-09-22 15:03:22      2 [192.168.0.1:7001]
> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> Cluster status: running
> 
> Creation time        Epoch Nodes
> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
> 2011-09-22 15:03:22      2 [192.168.0.1:7001, 192.168.0.1:7002]
> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> Cluster status: running
> 
> Creation time        Epoch Nodes
> 2011-09-22 15:03:22      4 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]
> 2011-09-22 15:03:22      3 [192.168.0.1:7000, 192.168.0.1:7001]
> 2011-09-22 15:03:22      2 [192.168.0.1:7001, 192.168.0.1:7002]
> 2011-09-22 15:03:22      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002]

Hi Yuan,

How about the patch below?  I guess this would solve all the problems
we've discussed.

=
From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
Subject: [PATCH] sheep: wait for correct nodes when starting cluster

Currently, we cannot recover the cluster when we start up nodes other
than the first failed node or the last failed one.  To solve this
problem, this patch introduces 'wait_node_list' in 'struct cluster_info'.

Each daemon creates its wait_node_list from the node list recorded in
the latest epoch stored locally.  During the start-up phase, the master
node accepts only the nodes in wait_node_list.  This prevents Sheepdog
from starting up with a wrong node list.

In addition, a newly added node sends its wait_node_list to the master
node when it joins.  This is necessary when there are nodes that were
added after the master node failed.
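
Roughly, the rule could be sketched like this (a standalone illustration
only; 'node_addr', 'handle_join', and 'ready_to_start' are simplified
stand-ins, not the sheep code in the patch below):

  /* Standalone sketch -- simplified types and helpers, not the sheep API. */
  #include <stdbool.h>
  #include <string.h>

  #define MAX_NODES 128

  struct node_addr { char addr[32]; };    /* e.g. "10.68.14.1:7000" */

  /* Nodes recorded in the latest local epoch; the cluster must wait for them. */
  static struct node_addr wait_nodes[MAX_NODES];
  static int nr_wait_nodes;

  static bool in_wait_list(const struct node_addr *e)
  {
          int i;

          for (i = 0; i < nr_wait_nodes; i++)
                  if (strcmp(wait_nodes[i].addr, e->addr) == 0)
                          return true;
          return false;
  }

  static void add_wait_node(const struct node_addr *e)
  {
          if (!in_wait_list(e) && nr_wait_nodes < MAX_NODES)
                  wait_nodes[nr_wait_nodes++] = *e;
  }

  /*
   * Master-side handling of a join request while the cluster is still
   * waiting.  The joiner ships its own wait list, so the master also
   * learns about nodes that were added after the master itself failed.
   * A node outside the wait list is not accepted.
   */
  static bool handle_join(const struct node_addr *joiner,
                          const struct node_addr *joiner_wait, int nr_joiner_wait)
  {
          int i;

          for (i = 0; i < nr_joiner_wait; i++)
                  add_wait_node(&joiner_wait[i]);

          return in_wait_list(joiner);
  }

  /*
   * The cluster leaves the waiting state only when every node in the
   * wait list is accounted for, either joined or explicitly left.
   */
  static bool ready_to_start(int nr_joined, int nr_left)
  {
          return nr_joined + nr_left == nr_wait_nodes;
  }

The actual patch keeps the wait list as a list of 'struct node' inside
'struct cluster_info' and reuses the existing join message to carry it.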

An example of starting up from the second failed node:

  $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
  $ ./collie/collie cluster format
  $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
  $ for i in 1 0 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
  $ for i in 0 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
  $ for i in 0 1 2; do ./collie/collie cluster info -p 700$i; done
  Cluster status: running

  Creation time        Epoch Nodes
  2011-09-23 18:35:57      6 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      5 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-09-23 18:35:57      4 [10.68.14.1:7001]
  2011-09-23 18:35:57      3 [10.68.14.1:7002]
  2011-09-23 18:35:57      2 [10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
  Cluster status: running

  Creation time        Epoch Nodes
  2011-09-23 18:35:57      6 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      5 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-09-23 18:35:57      4 [10.68.14.1:7001]
  2011-09-23 18:35:57      3 [10.68.14.1:7002]
  2011-09-23 18:35:57      2 [10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
  Cluster status: running

  Creation time        Epoch Nodes
  2011-09-23 18:35:57      6 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      5 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-09-23 18:35:57      4 [10.68.14.1:7001]
  2011-09-23 18:35:57      3 [10.68.14.1:7002]
  2011-09-23 18:35:57      2 [10.68.14.1:7001, 10.68.14.1:7002]
  2011-09-23 18:35:57      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]

Signed-off-by: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
---
 sheep/group.c      |  152 +++++++++++++++++++++++++++++++++++----------------
 sheep/sheep_priv.h |    3 +
 2 files changed, 107 insertions(+), 48 deletions(-)

diff --git a/sheep/group.c b/sheep/group.c
index 9d05fdd..f31dabd 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -70,6 +70,12 @@ struct join_message {
 		uint32_t pid;
 		struct sheepdog_node_list_entry ent;
 	} leave_nodes[SD_MAX_NODES];
+	uint32_t nr_wait_nodes;
+	struct {
+		uint32_t nodeid;
+		uint32_t pid;
+		struct sheepdog_node_list_entry ent;
+	} wait_nodes[SD_MAX_NODES];
 };
 
 struct leave_message {
@@ -414,6 +420,7 @@ static inline int get_nodes_nr_from(struct list_head *l)
 	return nr;
 }
 
+#ifdef COMPILE_UNUSED_CODE
 static int get_nodes_nr_epoch(int epoch)
 {
 	struct sheepdog_node_list_entry nodes[SD_MAX_NODES];
@@ -423,18 +430,41 @@ static int get_nodes_nr_epoch(int epoch)
 	nr /= sizeof(nodes[0]);
 	return nr;
 }
+#endif
 
-static struct node *find_leave_node(struct node *node)
+static struct node *find_node_with_addr(struct list_head *node_list,
+					struct sheepdog_node_list_entry *e)
 {
 	struct node *n;
-	list_for_each_entry(n, &sys->leave_list, list) {
-		if (node_cmp(&n->ent, &node->ent) == 0)
+
+	list_for_each_entry(n, node_list, list) {
+		if (node_cmp(&n->ent, e) == 0)
 			return n;
-		dprintf("%d\n", node_cmp(&n->ent, &node->ent));
 	}
 	return NULL;
+}
+
+static int add_node_to_list(struct sheepdog_node_list_entry *e,
+			    struct list_head *list)
+{
+	struct node *n;
+
+	if (find_node_with_addr(list, e))
+		return SD_RES_SUCCESS;
+
+	n = zalloc(sizeof(*n));
+	if (!n) {
+		eprintf("oom\n");
+		return SD_RES_NO_MEM;
+	}
+
+	n->ent = *e;
+
+	list_add_tail(&n->list, list);
 
+	return SD_RES_SUCCESS;
 }
+
 static int add_node_to_leave_list(struct message_header *msg)
 {
 	int ret = SD_RES_SUCCESS;
@@ -445,51 +475,39 @@ static int add_node_to_leave_list(struct message_header *msg)
 
 	switch (msg->op) {
 	case SD_MSG_LEAVE:
-		n = zalloc(sizeof(*n));
-		if (!n) {
-			ret = SD_RES_NO_MEM;
-			goto err;
-		}
-
-		n->nodeid = msg->nodeid;
-		n->pid = msg->pid;
-		n->ent = msg->from;
-		if (find_leave_node(n)) {
-			free(n);
-			goto ret;
-		}
-
-		list_add_tail(&n->list, &sys->leave_list);
+		print_node_list(&sys->wait_node_list);
+		if (find_node_with_addr(&sys->wait_node_list, &msg->from))
+			ret = add_node_to_list(&msg->from, &sys->leave_list);
 		goto ret;
 	case SD_MSG_JOIN:
 		jm = (struct join_message *)msg;
-		nr = jm->nr_leave_nodes;
-		break;
-	default:
-		ret = SD_RES_INVALID_PARMS;
-		goto err;
-	}
 
-	for (i = 0; i < nr; i++) {
-		n = zalloc(sizeof(*n));
-		if (!n) {
-			ret = SD_RES_NO_MEM;
-			goto free;
+		nr = jm->nr_wait_nodes;
+		for (i = 0; i < nr; i++) {
+			ret = add_node_to_list(&jm->wait_nodes[i].ent, &tmp_list);
+			if (ret != SD_RES_SUCCESS)
+				goto free;
 		}
+		list_splice_init(&tmp_list, &sys->wait_node_list);
+		print_node_list(&sys->wait_node_list);
 
-		n->nodeid = jm->leave_nodes[i].nodeid;
-		n->pid = jm->leave_nodes[i].pid;
-		n->ent = jm->leave_nodes[i].ent;
-		if (find_leave_node(n)) {
-			free(n);
-			continue;
-		}
+		nr = jm->nr_leave_nodes;
+		for (i = 0; i < nr; i++) {
+			if (!find_node_with_addr(&sys->wait_node_list,
+						 &jm->leave_nodes[i].ent))
+				continue;
 
-		list_add_tail(&n->list, &tmp_list);
+			ret = add_node_to_list(&jm->leave_nodes[i].ent,&tmp_list);
+			if (ret != SD_RES_SUCCESS)
+				goto free;
+		}
+		list_splice_init(&tmp_list, &sys->leave_list);
+		print_node_list(&sys->leave_list);
+		goto ret;
+	default:
+		ret = SD_RES_INVALID_PARMS;
+		goto err;
 	}
-	list_splice_init(&tmp_list, &sys->leave_list);
-	goto ret;
-
 free:
 	list_for_each_entry_safe(n, t, &tmp_list, list) {
 		free(n);
@@ -507,7 +525,7 @@ static int get_cluster_status(struct sheepdog_node_list_entry *from,
 			      uint32_t *status, uint8_t *inc_epoch)
 {
 	int i;
-	int nr_local_entries, nr_leave_entries;
+	int nr_local_entries, nr_leave_entries, nr_wait_entries;
 	struct sheepdog_node_list_entry local_entries[SD_MAX_NODES];
 	struct node *node;
 	uint32_t local_epoch;
@@ -586,8 +604,9 @@ static int get_cluster_status(struct sheepdog_node_list_entry *from,
 		nr_entries = get_nodes_nr_from(&sys->sd_node_list) + 1;
 
 		if (nr_entries != nr_local_entries) {
+			nr_wait_entries = get_nodes_nr_from(&sys->wait_node_list);
 			nr_leave_entries = get_nodes_nr_from(&sys->leave_list);
-			if (nr_local_entries == nr_entries + nr_leave_entries) {
+			if (nr_wait_entries == nr_entries + nr_leave_entries) {
 				/* Even though some nodes leave, we can make do with it.
 				 * Order cluster to do recovery right now.
 				 */
@@ -1054,6 +1073,15 @@ static void send_join_response(struct work_deliver *w)
 			jm->nr_leave_nodes++;
 		}
 		print_node_list(&sys->leave_list);
+
+		jm->nr_wait_nodes = 0;
+		list_for_each_entry(node, &sys->wait_node_list, list) {
+			jm->wait_nodes[jm->nr_wait_nodes].nodeid = node->nodeid;
+			jm->wait_nodes[jm->nr_wait_nodes].pid = node->pid;
+			jm->wait_nodes[jm->nr_wait_nodes].ent = node->ent;
+			jm->nr_wait_nodes++;
+		}
+		print_node_list(&sys->wait_node_list);
 	}
 	send_message(sys->handle, m);
 }
@@ -1062,11 +1090,13 @@ static void __sd_deliver_done(struct cpg_event *cevent)
 {
 	struct work_deliver *w = container_of(cevent, struct work_deliver, cev);
 	struct message_header *m;
+	struct join_message *jm;
 	struct leave_message *lm;
 	char name[128];
 	int do_recovery;
 	struct node *node, *t;
-	int nr, nr_local, nr_leave;
+	int i, nr, nr_wait, nr_leave;
+	LIST_HEAD(tmp_list);
 
 	m = w->msg;
 
@@ -1095,10 +1125,10 @@ static void __sd_deliver_done(struct cpg_event *cevent)
 				if (lm->epoch > sys->leave_epoch)
 					sys->leave_epoch = lm->epoch;
 
-				nr_local = get_nodes_nr_epoch(sys->epoch);
+				nr_wait = get_nodes_nr_from(&sys->wait_node_list);
 				nr = get_nodes_nr_from(&sys->sd_node_list);
 				nr_leave = get_nodes_nr_from(&sys->leave_list);
-				if (nr_local == nr + nr_leave) {
+				if (nr_wait == nr + nr_leave) {
 					sys->status = SD_STATUS_OK;
 					sys->epoch = sys->leave_epoch + 1;
 					update_epoch_log(sys->epoch);
@@ -1123,6 +1153,16 @@ static void __sd_deliver_done(struct cpg_event *cevent)
 	if (m->state == DM_INIT && is_master()) {
 		switch (m->op) {
 		case SD_MSG_JOIN:
+			jm = (struct join_message *)m;
+			if (sys->status == SD_STATUS_WAIT_FOR_JOIN) {
+				nr = jm->nr_wait_nodes;
+				for (i = 0; i < nr; i++)
+					add_node_to_list(&jm->wait_nodes[i].ent,
+							 &sys->wait_node_list);
+				print_node_list(&sys->wait_node_list);
+
+			}
+
 			send_join_response(w);
 			break;
 		case SD_MSG_VDI_OP:
@@ -1307,6 +1347,7 @@ static void send_join_request(struct cpg_address *addr, struct work_confchg *w)
 	struct join_message msg;
 	struct sheepdog_node_list_entry entries[SD_MAX_NODES];
 	int nr_entries, i, ret;
+	struct node *n;
 
 	/* if I've just joined in cpg, I'll join in sheepdog. */
 	if (!is_my_cpg_addr(addr))
@@ -1331,6 +1372,11 @@ static void send_join_request(struct cpg_address *addr, struct work_confchg *w)
 			msg.nodes[i].ent = entries[i];
 	}
 
+	print_node_list(&sys->wait_node_list);
+	msg.nr_wait_nodes = 0;
+	list_for_each_entry(n, &sys->wait_node_list, list)
+		msg.wait_nodes[msg.nr_wait_nodes++].ent = n->ent;
+
 	send_message(sys->handle, (struct message_header *)&msg);
 
 	vprintf(SDOG_INFO "%x %u\n", sys->this_nodeid, sys->this_pid);
@@ -1894,11 +1940,12 @@ static int set_addr(unsigned int nodeid, int port)
 
 int create_cluster(int port, int64_t zone)
 {
-	int fd, ret;
+	int i, fd, ret, latest_epoch, nr;
 	cpg_handle_t cpg_handle;
 	struct cpg_name group = { 8, "sheepdog" };
 	cpg_callbacks_t cb = {&sd_deliver, &sd_confchg};
 	unsigned int nodeid = 0;
+	struct sheepdog_node_list_entry nodes[SD_MAX_NODES];
 
 	ret = cpg_initialize(&cpg_handle, &cb);
 	if (ret != CS_OK) {
@@ -1946,7 +1993,8 @@ join_retry:
 		sys->this_node.zone = zone;
 	dprintf("zone id = %u\n", sys->this_node.zone);
 
-	if (get_latest_epoch() == 0)
+	latest_epoch = get_latest_epoch();
+	if (latest_epoch == 0)
 		sys->status = SD_STATUS_WAIT_FOR_FORMAT;
 	else
 		sys->status = SD_STATUS_WAIT_FOR_JOIN;
@@ -1954,6 +2002,14 @@ join_retry:
 	INIT_LIST_HEAD(&sys->cpg_node_list);
 	INIT_LIST_HEAD(&sys->pending_list);
 	INIT_LIST_HEAD(&sys->leave_list);
+	INIT_LIST_HEAD(&sys->wait_node_list);
+	nr = epoch_log_read(latest_epoch, (char *)nodes, sizeof(nodes));
+	nr /= sizeof(nodes[0]);
+	for (i = 0; i < nr; i++) {
+		ret = add_node_to_list(nodes + i, &sys->wait_node_list);
+		if (ret != SD_RES_SUCCESS)
+			return 1;
+	}
 
 	INIT_LIST_HEAD(&sys->outstanding_req_list);
 	INIT_LIST_HEAD(&sys->req_wait_for_obj_list);
diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
index 1068e03..4a5a474 100644
--- a/sheep/sheep_priv.h
+++ b/sheep/sheep_priv.h
@@ -123,6 +123,9 @@ struct cluster_info {
 	 */
 	struct list_head leave_list;
 
+	/* this list contains necessary and sufficient nodes to start with */
+	struct list_head wait_node_list;
+
 	/* this array contains a list of ordered virtual nodes */
 	struct sheepdog_vnode_list_entry vnodes[SD_MAX_VNODES];
 	int nr_vnodes;
-- 
1.7.2.5



