[sheepdog] [PATCH v2 1/2] sheep/recovery: allocate old vinfo by using sys->cinfo
Liu Yuan
namei.unix at gmail.com
Fri Apr 25 11:16:10 CEST 2014
On Fri, Apr 25, 2014 at 03:00:25PM +0800, Robin Dong wrote:
> From: Robin Dong <sanbai at taobao.com>
>
> Scene:
> 1. start up 4 sheep daemons in one cluster
> 2. write data into the cluster
> 3. "dog kill node 2" and wait for recovery complete
> current epoch status ('O' for update epoch, 'X' for stale epoch):
>
> node: 1 2 3 4
> epoch: O X O O
>
> 4. kill all nodes
> 5. start up all 4 sheep daemons again and wait for recovery complete
>
> then we read out the data and find out it is corrupted.
>
> The reason is the code use information from cluster as old vinfo, but the
> information from cluster present 4 nodes, not previous 3 nodes status.
> We don't need to worry about "node 2" who's epoch is stale, it will find
> out oid correctly in recovery process because it use current_vnode_info as
> 'cur_info' argument in start_recovery().
step 1: start up node 1, 2, 3
node: 1 2 3
epoch: O X O (wait for join status)
step 2: join the node 4
node: 1 2 3 4
epoch: O X O O (okay status)
but now node 4's old cluster info is wrong
wrong old cinfo: [1, 2, 3]
right old cinfo: [1, 3, 4]
with the patch set, node 4 will have the consistent epoch history
Having a picture like above should explain better what is problem.
>
> To solve this problem, we allocate old vinfo by using nodes information stored
> in epoch (which has been loaded into sys->cinfo) instead of which read out from
> new cluster (zookeeper/corosync, etc.).
>
> Cc: Liu Yuan <namei.unix at gmail.com>
> Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> Signed-off-by: Robin Dong <sanbai at taobao.com>
> ---
> sheep/group.c | 25 +++++++++----------------
> 1 file changed, 9 insertions(+), 16 deletions(-)
>
> diff --git a/sheep/group.c b/sheep/group.c
> index b0873d0..2c027fc 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -555,22 +555,14 @@ int inc_and_log_epoch(void)
> sys->cinfo.nr_nodes);
> }
>
> -static struct vnode_info *alloc_old_vnode_info(const struct sd_node *joined,
> - const struct rb_root *nroot)
> +static struct vnode_info *alloc_old_vnode_info(void)
> {
> struct rb_root old_root = RB_ROOT;
> - struct sd_node *n;
> struct vnode_info *old;
>
> - /* exclude the newly added one */
> - rb_for_each_entry(n, nroot, rb) {
> + for (int i = 0; i < sys->cinfo.nr_nodes; i++) {
> struct sd_node *new = xmalloc(sizeof(*new));
> -
> - *new = *n;
> - if (node_eq(joined, new)) {
> - free(new);
> - continue;
> - }
> + *new = sys->cinfo.nodes[i];
> if (rb_insert(&old_root, new, rb, node_cmp))
> panic("node hash collision");
> }
I think we should comment in the source file why we should use sys->cinfo instead
of passed nroot. Especially noting when this function called, sys->cinfo is
consistent all over the nodes so it is safe to make use of it.
Thanks
Yuan
More information about the sheepdog
mailing list