[sheepdog] [PATCH v2] recovery: fix incomplete recovery because of faulty oid scheduling
MORITA Kazutaka
morita.kazutaka at gmail.com
Mon Apr 29 14:49:24 CEST 2013
At Mon, 29 Apr 2013 20:31:02 +0800,
Liu Yuan wrote:
>
> From: Liu Yuan <tailai.ly at taobao.com>
>
> tests/010 sometimes fails, which exposes this bug.
>
> When auto-recovery is disabled, calling prepare_schedule_oid() on the same oid
> multiple times causes finish_schedule_oid() to wrongly squeeze some victim
> oids out of the rw->oids array, and this node then never gets a chance to
> recover the ejected oids.
>
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
> sheep/recovery.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/sheep/recovery.c b/sheep/recovery.c
> index 23babe0..d1085aa 100644
> --- a/sheep/recovery.c
> +++ b/sheep/recovery.c
> @@ -237,18 +237,14 @@ static inline void prepare_schedule_oid(uint64_t oid)
> oid);
> return;
> }
> - /*
> - * When auto recovery is enabled, the oid is currently being
> - * recovered
> - */
> - if (!sys->disable_recovery && rw->oids[rw->done] == oid)
> + /* When auto-recovery is enabled, oid is currently being recovered */
> + if (!rw->suspended && rw->oids[rw->done] == oid)
> return;
The comment is no longer correct, because sheep now returns from this
function here even when auto-recovery is disabled.
How about s/auto-recovery is enabled/recovery is not suspended/ ?
> +
> rw->nr_prio_oids++;
> rw->prio_oids = xrealloc(rw->prio_oids,
> rw->nr_prio_oids * sizeof(uint64_t));
> rw->prio_oids[rw->nr_prio_oids - 1] = oid;
> - resume_suspended_recovery();
> -
> sd_dprintf("%"PRIx64" nr_prio_oids %d", oid, rw->nr_prio_oids);
> }
>
> @@ -291,6 +287,7 @@ bool oid_in_recovery(uint64_t oid)
> }
>
> prepare_schedule_oid(oid);
> + resume_suspended_recovery();
Why did you move resume_suspended_recovery()? If the oid is not
scheduled, we don't have to call this function.
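To illustrate the concern, here is a minimal standalone sketch, not the
actual sheep code. It assumes a hypothetical variant in which the
scheduling function reports whether it really queued the oid, so the
caller resumes recovery only in that case. struct rw_model,
schedule_oid_model(), request_oid_model() and resume_stub() are made-up
names for illustration; only the early-return condition mirrors the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the recovery state. */
struct rw_model {
	bool suspended;		/* recovery is currently suspended */
	uint64_t current;	/* stands in for rw->oids[rw->done] */
	int nr_prio_oids;	/* number of queued priority oids */
	int nr_resumes;		/* counts calls to the resume stub */
};

/* Stub: the real resume_suspended_recovery() does more than this. */
static void resume_stub(struct rw_model *rw)
{
	rw->suspended = false;
	rw->nr_resumes++;
}

/* Returns true only when the oid was actually queued. */
static bool schedule_oid_model(struct rw_model *rw, uint64_t oid)
{
	assert(rw != NULL);

	/* Recovery is not suspended and oid is currently being recovered */
	if (!rw->suspended && rw->current == oid)
		return false;

	rw->nr_prio_oids++;
	return true;
}

/* The caller resumes recovery only when something was queued. */
static void request_oid_model(struct rw_model *rw, uint64_t oid)
{
	if (schedule_oid_model(rw, oid))
		resume_stub(rw);
}
```

With this shape, a request for the oid that is already being recovered
neither queues anything nor triggers a resume, while any queued oid
still resumes a suspended recovery.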
Thanks,
Kazutaka