[sheepdog] [PATCH v3 1/2] sheep/recovery: allocate old vinfo by using sys->cinfo
Liu Yuan
namei.unix at gmail.com
Fri Apr 25 15:19:34 CEST 2014
On Fri, Apr 25, 2014 at 05:40:15PM +0800, Robin Dong wrote:
> From: Robin Dong <sanbai at taobao.com>
>
> Scenario:
>
> 1. start up 4 sheep daemons in one cluster
>
> 2. write data into the cluster
>
> 3. "dog kill node 2" and wait for recovery complete
> current epoch status ('O' for update epoch, 'X' for stale epoch):
>
> node: 1 2 3 4
> epoch: O X O O
>
> 4. kill all nodes
>
> 5. start up all 4 sheep daemons again and wait for recovery to complete
>
> 5.1: start up nodes 1, 2, 3
>
> node: 1 2 3
> epoch: O X O (cluster in wait-for-join status)
>
> 5.2: join node 4
> node: 1 2 3 4
> epoch: O X O O (okay status)
>
> But now node 4's old cluster info is wrong:
>
> wrong old cinfo: [1, 2, 3]
> right old cinfo: [1, 3, 4]
>
>
> Then we read the data back and find it is corrupted.
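>
> As a rough intuition for why this corrupts reads, here is a toy sketch
> (pick_source() and its plain modulo mapping are made up for illustration;
> real sheepdog placement uses consistent hashing over vnodes): recovery
> asks the old view where an object used to live, so a wrong old view makes
> it fetch the object from the wrong node.
>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   /* toy mapping, NOT sheepdog's consistent hashing */
>   static int pick_source(uint64_t oid, const int *nodes, int nr_nodes)
>   {
>           return nodes[oid % nr_nodes];
>   }
>
>   int main(void)
>   {
>           int wrong_old[] = { 1, 2, 3 };  /* built from the new membership */
>           int right_old[] = { 1, 3, 4 };  /* recorded in the last epoch */
>           uint64_t oid = 7;
>
>           /* with the wrong view, recovery pulls from node 2, whose copy
>            * is stale in the scenario above */
>           printf("wrong old view -> fetch from node %d\n",
>                  pick_source(oid, wrong_old, 3));
>           printf("right old view -> fetch from node %d\n",
>                  pick_source(oid, right_old, 3));
>           return 0;
>   }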
>
> The reason is that the code uses the node information from the cluster
> driver as the old vinfo, but that information presents 4 nodes, not the
> previous 3-node state. We don't need to worry about "node 2", whose epoch
> is stale: it will find the oids correctly during recovery because it uses
> current_vnode_info as the 'cur_info' argument of start_recovery().
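>
> Concretely, the two candidate "old" views come from different places; a
> standalone sketch with plain int arrays instead of the real sd_node and
> rb-tree structures (epoch_nodes/driver_nodes are made-up names):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>           /* node list recorded in the last logged epoch (sys->cinfo) */
>           int epoch_nodes[] = { 1, 3, 4 };
>           /* membership reported by the cluster driver after the restart */
>           int driver_nodes[] = { 1, 2, 3, 4 };
>           int joined = 4;
>
>           printf("old view = driver membership minus joined node (wrong):");
>           for (int i = 0; i < 4; i++)
>                   if (driver_nodes[i] != joined)
>                           printf(" %d", driver_nodes[i]);
>           printf("\n");
>
>           printf("old view = node list of the logged epoch (right):");
>           for (int i = 0; i < 3; i++)
>                   printf(" %d", epoch_nodes[i]);
>           printf("\n");
>           return 0;
>   }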
>
> To solve this problem, we allocate the old vinfo from the node information
> stored in the epoch (which has been loaded into sys->cinfo) instead of the
> node list read from the new cluster driver (zookeeper/corosync, etc.).
>
> Cc: Liu Yuan <namei.unix at gmail.com>
> Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> Signed-off-by: Robin Dong <sanbai at taobao.com>
> ---
> v1-->v2:
> 1. modify patch comment
> 2. add 'many nodes fail' test case
> v2-->v3:
> 1. add ascii-picture in patch comment
> 2. add comment in code to explain old_vnode_info
>
> sheep/group.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/sheep/group.c b/sheep/group.c
> index b0873d0..4114dfb 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -555,22 +555,21 @@ int inc_and_log_epoch(void)
> sys->cinfo.nr_nodes);
> }
>
> -static struct vnode_info *alloc_old_vnode_info(const struct sd_node *joined,
> - const struct rb_root *nroot)
> +static struct vnode_info *alloc_old_vnode_info(void)
> {
> struct rb_root old_root = RB_ROOT;
> - struct sd_node *n;
> struct vnode_info *old;
>
> - /* exclude the newly added one */
> - rb_for_each_entry(n, nroot, rb) {
> + /*
> + * If the previous cluster had a failed node (for example, 3 good nodes
> + * and 1 failed node), 'nroot' will present 4 good nodes after we shut
> + * down and restart this 4-node cluster, which is incorrect. We should
> + * use the old node information stored in the epoch (sys->cinfo) to
> + * rebuild old_vnode_info.
> + */
> + for (int i = 0; i < sys->cinfo.nr_nodes; i++) {
> struct sd_node *new = xmalloc(sizeof(*new));
> -
> - *new = *n;
> - if (node_eq(joined, new)) {
> - free(new);
> - continue;
> - }
> + *new = sys->cinfo.nodes[i];
> if (rb_insert(&old_root, new, rb, node_cmp))
> panic("node hash collision");
> }
> @@ -669,15 +668,16 @@ static void update_cluster_info(const struct cluster_info *cinfo,
> set_cluster_config(&sys->cinfo);
>
> if (nr_nodes != cinfo->nr_nodes) {
> - int ret = inc_and_log_epoch();
> + int ret;
> + if (old_vnode_info)
> + put_vnode_info(old_vnode_info);
> +
> + old_vnode_info = alloc_old_vnode_info();
> + ret = inc_and_log_epoch();
> if (ret != 0)
> panic("cannot log current epoch %d",
> sys->cinfo.epoch);
>
> - if (!old_vnode_info)
> - old_vnode_info = alloc_old_vnode_info(joined,
> - nroot);
> -
> start_recovery(main_thread_get(current_vnode_info),
> old_vnode_info, true);
> } else if (!was_cluster_shutdowned()) {
> --
> 1.7.12.4
>
Applied these two, thanks
Yuan