[sheepdog] [PATCH v4 7/8] recovery: fix a race condition in recovery

levin li levin108 at gmail.com
Fri May 25 04:30:59 CEST 2012


From: levin li <xingke.lwp at taobao.com>

Consider the following scenario:

Nodes A and B are both in recovery.
A is recovering object x from B,
but B has not yet recovered object x.
B is recovering object y from A,
but A has not yet recovered object y.

B then responds to A with SD_RES_NEW_NODE_VER, and A likewise
responds to B with SD_RES_NEW_NODE_VER. A and B keep retrying to
recover objects x and y, but always get SD_RES_NEW_NODE_VER and never
succeed. This is a deadlock that prevents recovery from completing.

Fix this by not making requests flagged SD_FLAG_CMD_RECOVERY wait in
check_request(). If such a request fails, recovery falls back on its
own retry mechanism.

Signed-off-by: levin li <xingke.lwp at taobao.com>
---
 sheep/sdnet.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sheep/sdnet.c b/sheep/sdnet.c
index 83baae2..a5b3e28 100644
--- a/sheep/sdnet.c
+++ b/sheep/sdnet.c
@@ -212,7 +212,11 @@ static int check_request(struct request *req)
 	if (!req->local_oid)
 		return 0;
 
-	if (is_recoverying_oid(req->local_oid)) {
+	/* Recovery I/O requests must not wait here, or recovery may
+	   deadlock; if such a request fails, recovery falls back on
+	   its own retry mechanism. */
+	if (is_recoverying_oid(req->local_oid) &&
+	    !(req->rq.flags & SD_FLAG_CMD_RECOVERY)) {
 		if (req->rq.flags & SD_FLAG_CMD_IO_LOCAL) {
 			/* Sheep peer request */
 			if (is_recovery_init()) {
-- 
1.7.10



