[sheepdog] [PATCH v2 1/2] sheep/recovery: allocate old vinfo by using sys->cinfo

Fri Apr 25 11:16:10 CEST 2014

On Fri, Apr 25, 2014 at 03:00:25PM +0800, Robin Dong wrote:
> From: Robin Dong <sanbai at taobao.com>
> 
> Scene:
>   1. start up 4 sheep daemons in one cluster
>   2. write data into the cluster
>   3. "dog kill node 2" and wait for recovery complete
>      current epoch status ('O' for update epoch, 'X' for stale epoch):
> 
>         node: 1  2  3  4
>        epoch: O  X  O  O
> 
>   4. kill all nodes
>   5. start up all 4 sheep daemons again and wait for recovery complete
> 
> then we read out the data and find out it is corrupted.
> 
> The reason is the code use information from cluster as old vinfo, but the
> information from cluster present 4 nodes, not previous 3 nodes status.
> We don't need to worry about "node 2" who's epoch is stale, it will find
> out oid correctly in recovery process because it use current_vnode_info as
> 'cur_info' argument in start_recovery().

step 1: start up node 1, 2, 3

      node:  1   2   3
      epoch: O   X   O  (wait for join status)

step 2: join the node 4
      node:  1   2   3   4
      epoch: O   X   O   O (okay status) 

but now node 4's old cluster info is wrong

      wrong old cinfo: [1, 2, 3]
      right old cinfo: [1, 3, 4]

with the patch set, node 4 will have the consistent epoch history 

Having a picture like above should explain better what is problem.

> 
> To solve this problem, we allocate old vinfo by using nodes information stored
> in epoch (which has been loaded into sys->cinfo) instead of which read out from
> new cluster (zookeeper/corosync, etc.).
> 
> Cc: Liu Yuan <namei.unix at gmail.com>
> Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> Signed-off-by: Robin Dong <sanbai at taobao.com>
> ---
>  sheep/group.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/sheep/group.c b/sheep/group.c
> index b0873d0..2c027fc 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -555,22 +555,14 @@ int inc_and_log_epoch(void)
>  				sys->cinfo.nr_nodes);
>  }
>  
> -static struct vnode_info *alloc_old_vnode_info(const struct sd_node *joined,
> -					       const struct rb_root *nroot)
> +static struct vnode_info *alloc_old_vnode_info(void)
>  {
>  	struct rb_root old_root = RB_ROOT;
> -	struct sd_node *n;
>  	struct vnode_info *old;
>  
> -	/* exclude the newly added one */
> -	rb_for_each_entry(n, nroot, rb) {
> +	for (int i = 0; i < sys->cinfo.nr_nodes; i++) {
>  		struct sd_node *new = xmalloc(sizeof(*new));
> -
> -		*new = *n;
> -		if (node_eq(joined, new)) {
> -			free(new);
> -			continue;
> -		}
> +		*new = sys->cinfo.nodes[i];
>  		if (rb_insert(&old_root, new, rb, node_cmp))
>  			panic("node hash collision");
>  	}

I think we should comment in the source file why we should use sys->cinfo instead
of passed nroot. Especially noting when this function called, sys->cinfo is
consistent all over the nodes so it is safe to make use of it.

Thanks
Yuan