[sheepdog] [PATCH v3 1/2] sheep/recovery: allocate old vinfo by using sys->cinfo

Robin Dong robin.k.dong at gmail.com
Fri Apr 25 11:40:15 CEST 2014


From: Robin Dong <sanbai at taobao.com>

Scenario:

  1. start up 4 sheep daemons in one cluster

  2. write data into the cluster

  3. "dog kill node 2" and wait for recovery complete
     current epoch status ('O' for update epoch, 'X' for stale epoch):

        node:  1  2  3  4
        epoch: O  X  O  O

  4. kill all nodes

  5. start up all 4 sheep daemons again and wait for recovery to complete

    5.1: start up nodes 1, 2, 3

        node:  1   2   3
        epoch: O   X   O  (waiting-for-join status)

    5.2: node 4 joins
        node:  1   2   3   4
        epoch: O   X   O   O (OK status)

    But now node 4's old cluster info is wrong:

        wrong old cinfo: [1, 2, 3]
        right old cinfo: [1, 3, 4]


If we then read the data back, we will find that it is corrupted.
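
As a reference, the scenario can be reproduced on one machine roughly as
follows. This is only a sketch: the store paths, ports, zones, copies count
and the use of the local cluster driver are illustrative, and the exact
sheep/dog options may differ between versions ("node N" in the picture above
is index N-1 here):

    # step 1: start a 4-node cluster on one machine
    for i in 0 1 2 3; do
        mkdir -p /store/$i
        sheep -c local -z $i -p 700$i /store/$i
    done
    dog cluster format -c 3

    # step 2: write some data
    dd if=/dev/urandom of=/tmp/data bs=1M count=64
    dog vdi create test 1G
    dog vdi write test < /tmp/data

    # step 3: kill "node 2" (the second entry of "dog node list") and
    # wait until recovery has completed
    dog node kill 1

    # step 4: kill all remaining nodes
    pkill sheep

    # step 5.1: restart nodes 1, 2 and 3 first
    for i in 0 1 2; do
        sheep -c local -z $i -p 700$i /store/$i
    done

    # step 5.2: let node 4 join and wait for recovery to complete
    sheep -c local -z 3 -p 7003 /store/3

    # without this patch, the data read back here is corrupted
    dog vdi read test > /tmp/out
    cmp /tmp/data /tmp/out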

The reason is that the code uses the node information reported by the cluster
driver as the old vinfo, but that information contains all 4 nodes, not the
previous 3-node status.  We do not need to worry about node 2, whose epoch is
stale: it will locate the oids correctly during recovery because
current_vnode_info is passed as the 'cur_info' argument of start_recovery().

To solve this problem, allocate the old vinfo from the node information stored
in the epoch (which has already been loaded into sys->cinfo) instead of from
the node list read from the cluster driver (zookeeper/corosync, etc.).

Cc: Liu Yuan <namei.unix at gmail.com>
Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
Signed-off-by: Robin Dong <sanbai at taobao.com>
---
v1-->v2:
  1. modify patch comment
  2. add 'many nodes fail' test case
v2-->v3:
  1. add ascii-picture in patch comment
  2. add comment in code to explain old_vnode_info

 sheep/group.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/sheep/group.c b/sheep/group.c
index b0873d0..4114dfb 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -555,22 +555,21 @@ int inc_and_log_epoch(void)
 				sys->cinfo.nr_nodes);
 }
 
-static struct vnode_info *alloc_old_vnode_info(const struct sd_node *joined,
-					       const struct rb_root *nroot)
+static struct vnode_info *alloc_old_vnode_info(void)
 {
 	struct rb_root old_root = RB_ROOT;
-	struct sd_node *n;
 	struct vnode_info *old;
 
-	/* exclude the newly added one */
-	rb_for_each_entry(n, nroot, rb) {
+	/*
+	 * If the previous cluster had a failed node (for example, 3 good
+	 * nodes and 1 failed node), the node list reported by the cluster
+	 * driver contains all 4 nodes after this cluster is shut down and
+	 * restarted, which is incorrect.  We should rebuild old_vnode_info
+	 * from the old node information stored in the epoch (sys->cinfo).
+	 */
+	for (int i = 0; i < sys->cinfo.nr_nodes; i++) {
 		struct sd_node *new = xmalloc(sizeof(*new));
-
-		*new = *n;
-		if (node_eq(joined, new)) {
-			free(new);
-			continue;
-		}
+		*new = sys->cinfo.nodes[i];
 		if (rb_insert(&old_root, new, rb, node_cmp))
 			panic("node hash collision");
 	}
@@ -669,15 +668,16 @@ static void update_cluster_info(const struct cluster_info *cinfo,
 			set_cluster_config(&sys->cinfo);
 
 		if (nr_nodes != cinfo->nr_nodes) {
-			int ret = inc_and_log_epoch();
+			int ret;
+			if (old_vnode_info)
+				put_vnode_info(old_vnode_info);
+
+			old_vnode_info = alloc_old_vnode_info();
+			ret = inc_and_log_epoch();
 			if (ret != 0)
 				panic("cannot log current epoch %d",
 				      sys->cinfo.epoch);
 
-			if (!old_vnode_info)
-				old_vnode_info = alloc_old_vnode_info(joined,
-								      nroot);
-
 			start_recovery(main_thread_get(current_vnode_info),
 				       old_vnode_info, true);
 		} else if (!was_cluster_shutdowned()) {
-- 
1.7.12.4
