[sheepdog] [PATCH v3 1/2] sheep/recovery: allocate old vinfo by using sys->cinfo
Liu Yuan
namei.unix at gmail.com
Fri Apr 25 15:19:34 CEST 2014
On Fri, Apr 25, 2014 at 05:40:15PM +0800, Robin Dong wrote:
> From: Robin Dong <sanbai at taobao.com>
>
> Scenario:
>
> 1. start up 4 sheep daemons in one cluster
>
> 2. write data into the cluster
>
> 3. "dog kill node 2" and wait for recovery complete
> current epoch status ('O' for update epoch, 'X' for stale epoch):
>
> node: 1 2 3 4
> epoch: O X O O
>
> 4. kill all nodes
>
> 5. start up all 4 sheep daemons again and wait for recovery to complete
>
> 5.1: start up nodes 1, 2, 3
>
> node: 1 2 3
> epoch: O X O (cluster in wait-for-join status)
>
> 5.2: join node 4
> node: 1 2 3 4
> epoch: O X O O (okay status)
>
> But now node 4's old cluster info is wrong:
>
> wrong old cinfo: [1, 2, 3]
> right old cinfo: [1, 3, 4]
>
>
> Then we read the data back and find it is corrupted.
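>
> As a rough intuition for why this corrupts reads, here is a toy sketch
> (pick_source() and its plain modulo mapping are made up for illustration;
> real sheepdog placement uses consistent hashing over vnodes): recovery
> asks the old view where an object used to live, so a wrong old view makes
> it fetch the object from the wrong node.
>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   /* toy mapping, NOT sheepdog's consistent hashing */
>   static int pick_source(uint64_t oid, const int *nodes, int nr_nodes)
>   {
>           return nodes[oid % nr_nodes];
>   }
>
>   int main(void)
>   {
>           int wrong_old[] = { 1, 2, 3 };  /* built from the new membership */
>           int right_old[] = { 1, 3, 4 };  /* recorded in the last epoch */
>           uint64_t oid = 7;
>
>           /* with the wrong view, recovery pulls from node 2, whose copy
>            * is stale in the scenario above */
>           printf("wrong old view -> fetch from node %d\n",
>                  pick_source(oid, wrong_old, 3));
>           printf("right old view -> fetch from node %d\n",
>                  pick_source(oid, right_old, 3));
>           return 0;
>   }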
>
> The reason is that the code uses the node information from the cluster
> driver as the old vinfo, but that information presents 4 nodes, not the
> previous 3-node state. We don't need to worry about "node 2", whose epoch
> is stale: it will find the oids correctly during recovery because it uses
> current_vnode_info as the 'cur_info' argument of start_recovery().
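>
> Concretely, the two candidate "old" views come from different places; a
> standalone sketch with plain int arrays instead of the real sd_node and
> rb-tree structures (epoch_nodes/driver_nodes are made-up names):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>           /* node list recorded in the last logged epoch (sys->cinfo) */
>           int epoch_nodes[] = { 1, 3, 4 };
>           /* membership reported by the cluster driver after the restart */
>           int driver_nodes[] = { 1, 2, 3, 4 };
>           int joined = 4;
>
>           printf("old view = driver membership minus joined node (wrong):");
>           for (int i = 0; i < 4; i++)
>                   if (driver_nodes[i] != joined)
>                           printf(" %d", driver_nodes[i]);
>           printf("\n");
>
>           printf("old view = node list of the logged epoch (right):");
>           for (int i = 0; i < 3; i++)
>                   printf(" %d", epoch_nodes[i]);
>           printf("\n");
>           return 0;
>   }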
>
> To solve this problem, we allocate the old vinfo from the node information
> stored in the epoch (which has been loaded into sys->cinfo) instead of the
> node list read from the new cluster driver (zookeeper/corosync, etc.).
>
> Cc: Liu Yuan <namei.unix at gmail.com>
> Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> Signed-off-by: Robin Dong <sanbai at taobao.com>
> ---
> v1-->v2:
> 1. modify patch comment
> 2. add 'many nodes fail' test case
> v2-->v3:
> 1. add ascii-picture in patch comment
> 2. add comment in code to explain old_vnode_info
>
> sheep/group.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/sheep/group.c b/sheep/group.c
> index b0873d0..4114dfb 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -555,22 +555,21 @@ int inc_and_log_epoch(void)
> sys->cinfo.nr_nodes);
> }
>
> -static struct vnode_info *alloc_old_vnode_info(const struct sd_node *joined,
> - const struct rb_root *nroot)
> +static struct vnode_info *alloc_old_vnode_info(void)
> {
> struct rb_root old_root = RB_ROOT;
> - struct sd_node *n;
> struct vnode_info *old;
>
> - /* exclude the newly added one */
> - rb_for_each_entry(n, nroot, rb) {
> + /*
> + * If the previous cluster had a failed node (for example, 3 good nodes
> + * and 1 failed node), 'nroot' will present 4 good nodes after we shut
> + * down and restart this 4-node cluster, which is incorrect. We should
> + * use the old node information stored in the epoch (sys->cinfo) to
> + * rebuild old_vnode_info.
> + */
> + for (int i = 0; i < sys->cinfo.nr_nodes; i++) {
> struct sd_node *new = xmalloc(sizeof(*new));
> -
> - *new = *n;
> - if (node_eq(joined, new)) {
> - free(new);
> - continue;
> - }
> + *new = sys->cinfo.nodes[i];
> if (rb_insert(&old_root, new, rb, node_cmp))
> panic("node hash collision");
> }
> @@ -669,15 +668,16 @@ static void update_cluster_info(const struct cluster_info *cinfo,
> set_cluster_config(&sys->cinfo);
>
> if (nr_nodes != cinfo->nr_nodes) {
> - int ret = inc_and_log_epoch();
> + int ret;
> + if (old_vnode_info)
> + put_vnode_info(old_vnode_info);
> +
> + old_vnode_info = alloc_old_vnode_info();
> + ret = inc_and_log_epoch();
> if (ret != 0)
> panic("cannot log current epoch %d",
> sys->cinfo.epoch);
>
> - if (!old_vnode_info)
> - old_vnode_info = alloc_old_vnode_info(joined,
> - nroot);
> -
> start_recovery(main_thread_get(current_vnode_info),
> old_vnode_info, true);
> } else if (!was_cluster_shutdowned()) {
> --
> 1.7.12.4
>
Applied these two, thanks
Yuan