[sheepdog] [PATCH v2] recovery: fix incomplete recovery because of faulty oid scheduling

Liu Yuan namei.unix at gmail.com
Mon Apr 29 14:31:02 CEST 2013


From: Liu Yuan <tailai.ly at taobao.com>

tests/010 sometimes fails, which exposes this bug.

When auto-recovery is disabled, calling prepare_schedule_oid() on the same
oid multiple times causes finish_schedule_oid() to wrongly squeeze victim
oids out of the rw->oids array. This node then never gets a chance to
recover the ejected oids.

Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
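To illustrate the failure mode described above, here is a minimal toy model.
It is not the code in sheep/recovery.c: contains() and toy_reschedule() are
hypothetical names, and the model simply assumes that the rescheduled array
has room for exactly `count` entries, with prioritized oids written first.
Under that assumption, a duplicated prioritized oid takes a slot that
belongs to another oid, and the last unprioritized oid falls off the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if oid occurs in list[0..n). */
static bool contains(const uint64_t *list, size_t n, uint64_t oid)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] == oid)
			return true;
	return false;
}

/*
 * Toy reschedule (an assumption, not sheepdog's implementation):
 * write the prioritized oids first, then every oid that is not
 * prioritized, into out[], which holds exactly `count` entries.
 * Returns true only if every oid from oids[] is still present in out[].
 */
static bool toy_reschedule(const uint64_t *oids, size_t count,
			   const uint64_t *prio, size_t nr_prio,
			   uint64_t *out)
{
	size_t idx = 0;

	assert(oids && prio && out);

	/* Prioritized oids go to the front, duplicates included. */
	for (size_t i = 0; i < nr_prio && idx < count; i++)
		out[idx++] = prio[i];

	/* The remaining oids fill whatever room is left. */
	for (size_t i = 0; i < count && idx < count; i++)
		if (!contains(prio, nr_prio, oids[i]))
			out[idx++] = oids[i];

	/* Report whether any oid was squeezed out. */
	for (size_t i = 0; i < count; i++)
		if (!contains(out, idx, oids[i]))
			return false;
	return true;
}
```

In this model, scheduling oid 3 once over {1, 2, 3, 4} keeps all four oids,
while scheduling it twice leaves no slot for oid 4.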
 sheep/recovery.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/sheep/recovery.c b/sheep/recovery.c
index 23babe0..d1085aa 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -237,18 +237,14 @@ static inline void prepare_schedule_oid(uint64_t oid)
 				   oid);
 			return;
 		}
-	/*
-	 * When auto recovery is enabled, the oid is currently being
-	 * recovered
-	 */
-	if (!sys->disable_recovery && rw->oids[rw->done] == oid)
+	/* When auto-recovery is enabled, oid is currently being recovered */
+	if (!rw->suspended && rw->oids[rw->done] == oid)
 		return;
+
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
 	rw->prio_oids[rw->nr_prio_oids - 1] = oid;
-	resume_suspended_recovery();
-
 	sd_dprintf("%"PRIx64" nr_prio_oids %d", oid, rw->nr_prio_oids);
 }
 
@@ -291,6 +287,7 @@ bool oid_in_recovery(uint64_t oid)
 	}
 
 	prepare_schedule_oid(oid);
+	resume_suspended_recovery();
 	return true;
 }
 
-- 
1.7.9.5



