From: levin li <xingke.lwp at taobao.com>

Consider this scenario:

Nodes A and B are both in recovery.
A is recovering object x from B, but B has not yet recovered object x.
B is recovering object y from A, but A has not yet recovered object y.

B then responds to A with SD_RES_NEW_NODE_VER, and A likewise responds
to B with SD_RES_NEW_NODE_VER. A and B keep retrying to recover objects
x and y, but each always gets SD_RES_NEW_NODE_VER and neither succeeds.
This is a deadlock that prevents recovery from completing.

Signed-off-by: levin li <xingke.lwp at taobao.com>
---
 sheep/sdnet.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sheep/sdnet.c b/sheep/sdnet.c
index 83baae2..a5b3e28 100644
--- a/sheep/sdnet.c
+++ b/sheep/sdnet.c
@@ -212,7 +212,11 @@ static int check_request(struct request *req)
 	if (!req->local_oid)
 		return 0;
 
-	if (is_recoverying_oid(req->local_oid)) {
+	/* IO request of recovery should not wait, or else it may cause
+	   dead lock of recovery, if fails, recovery will take its own
+	   retrying mechanism. */
+	if (is_recoverying_oid(req->local_oid) &&
+	    !(req->rq.flags & SD_FLAG_CMD_RECOVERY)) {
 		if (req->rq.flags & SD_FLAG_CMD_IO_LOCAL) {
 			/* Sheep peer request */
 			if (is_recovery_init()) {
-- 
1.7.10
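
For illustration only, the condition changed by this patch can be modelled outside sheep as a pair of predicates. This is a minimal sketch, not code from the tree: the helper names must_wait_old() and must_wait_new() are hypothetical, and the flag values are placeholders assumed for the example (the real definitions are in the sheepdog headers and may differ).

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values, assumed for this sketch only. */
#define SD_FLAG_CMD_IO_LOCAL  0x0010
#define SD_FLAG_CMD_RECOVERY  0x0020

/* Hypothetical model of the check before the patch: every request that
 * targets an object still under recovery is held back, including reads
 * issued by a peer's recovery. */
static bool must_wait_old(bool oid_recovering, uint16_t flags)
{
	(void)flags;	/* flags were not consulted at this point */
	return oid_recovering;
}

/* Hypothetical model of the check after the patch: requests carrying
 * SD_FLAG_CMD_RECOVERY skip the wait path, so two nodes recovering
 * from each other no longer block one another here. */
static bool must_wait_new(bool oid_recovering, uint16_t flags)
{
	return oid_recovering && !(flags & SD_FLAG_CMD_RECOVERY);
}
```

In this model, a recovery read for an object that is still being recovered is held back by must_wait_old() but let through by must_wait_new(), while an ordinary peer I/O request for the same object still waits in both cases.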