[sheepdog] [PATCH] sheep: fix a bug that hangs the cluster during recovery

levin li levin108 at gmail.com
Mon Aug 27 12:45:07 CEST 2012


From: levin li <xingke.lwp at taobao.com>

During recovery, a VDI creation request may wait for recovery to
complete. VDI creation is a cluster request, which prevents other
cluster requests from being processed. When recovery reaches
notify_recovery_completion_work, it issues another cluster request,
SD_OP_COMPLETE_RECOVERY, which is blocked by the VDI creation. As a
result, notify_recovery_completion_work blocks recovery_wqueue, so a
newly arriving recovery is blocked as well. If a VDI creation request
is waiting for that new recovery to complete, the cluster deadlocks.

Fix this by queueing notify_recovery_completion_work on a dedicated
work queue, recovery_notify_wqueue, so that it can no longer block
recovery_wqueue.

Signed-off-by: levin li <xingke.lwp at taobao.com>
---
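For reviewers, the dependency cycle described above can be sketched
with a toy, single-threaded model. This is only an illustration under
simplifying assumptions, not sheepdog code: struct ordered_queue,
push(), run_head() and simulate() are hypothetical names, and an
"ordered" queue is assumed to run only its head item, so a stuck head
stalls everything queued behind it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-threaded model; none of these names are sheepdog APIs. */
enum { MAX_WORK = 4 };

struct state {
	bool recovery_done; /* the newly queued recovery has finished */
	bool vdi_done;      /* the VDI creation cluster request has finished */
	bool notify_done;   /* SD_OP_COMPLETE_RECOVERY has been processed */
};

/* A work item returns true once it has run to completion. */
typedef bool (*work_fn)(struct state *);

/* Ordered queue: only the head may run, so a stuck head stalls the rest. */
struct ordered_queue {
	work_fn items[MAX_WORK];
	size_t head, tail;
};

static void push(struct ordered_queue *q, work_fn fn)
{
	assert(q->tail < MAX_WORK);
	q->items[q->tail++] = fn;
}

/* Try the head item once; pop it and report progress if it completed. */
static bool run_head(struct ordered_queue *q, struct state *s)
{
	if (q->head == q->tail || !q->items[q->head](s))
		return false;
	q->head++;
	return true;
}

/* The notification is a cluster request, so it waits behind VDI creation. */
static bool notify_work(struct state *s)
{
	if (!s->vdi_done)
		return false;
	s->notify_done = true;
	return true;
}

/* The new recovery needs nothing else in this model. */
static bool recovery_work(struct state *s)
{
	s->recovery_done = true;
	return true;
}

/* Returns true if everything completes, false if the model deadlocks. */
bool simulate(bool separate_notify_queue)
{
	struct state s = { false, false, false };
	struct ordered_queue recovery_q = { { NULL }, 0, 0 };
	struct ordered_queue notify_q = { { NULL }, 0, 0 };
	bool progress;

	/* The notification is queued first, then a new recovery arrives. */
	push(separate_notify_queue ? &notify_q : &recovery_q, notify_work);
	push(&recovery_q, recovery_work);

	do {
		progress = run_head(&recovery_q, &s);
		progress |= run_head(&notify_q, &s);
		/* VDI creation completes once the recovery it waits for is done. */
		if (s.recovery_done && !s.vdi_done) {
			s.vdi_done = true;
			progress = true;
		}
	} while (progress);

	return s.recovery_done && s.vdi_done && s.notify_done;
}
```

In this model, simulate(false) corresponds to the old layout, where the
notification shares recovery_wqueue and the loop stops with nothing
completed, while simulate(true) corresponds to a dedicated notify queue,
where the recovery, the VDI creation and the notification all finish.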
 sheep/recovery.c   |    2 +-
 sheep/sheep.c      |    1 +
 sheep/sheep_priv.h |    1 +
 3 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/sheep/recovery.c b/sheep/recovery.c
index 2232110..59ac9d6 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -373,7 +373,7 @@ static inline void finish_recovery(struct recovery_work *rw)
 	/* notify recovery completion to other nodes */
 	rw->work.fn = notify_recovery_completion_work;
 	rw->work.done = notify_recovery_completion_main;
-	queue_work(sys->recovery_wqueue, &rw->work);
+	queue_work(sys->recovery_notify_wqueue, &rw->work);
 
 	dprintf("recovery complete: new epoch %"PRIu32"\n",
 		sys->recovered_epoch);
diff --git a/sheep/sheep.c b/sheep/sheep.c
index 31af42c..10c0501 100644
--- a/sheep/sheep.c
+++ b/sheep/sheep.c
@@ -370,6 +370,7 @@ int main(int argc, char **argv)
 	sys->gateway_wqueue = init_work_queue("gateway", false);
 	sys->io_wqueue = init_work_queue("io", false);
 	sys->recovery_wqueue = init_work_queue("recovery", true);
+	sys->recovery_notify_wqueue = init_work_queue("recovery notify", true);
 	sys->deletion_wqueue = init_work_queue("deletion", true);
 	sys->block_wqueue = init_work_queue("block", true);
 	sys->sockfd_wqueue = init_work_queue("sockfd", true);
diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
index 1f5a1bd..90006f6 100644
--- a/sheep/sheep_priv.h
+++ b/sheep/sheep_priv.h
@@ -115,6 +115,7 @@ struct cluster_info {
 	struct work_queue *io_wqueue;
 	struct work_queue *deletion_wqueue;
 	struct work_queue *recovery_wqueue;
+	struct work_queue *recovery_notify_wqueue;
 	struct work_queue *block_wqueue;
 	struct work_queue *sockfd_wqueue;
 	struct work_queue *reclaim_wqueue;
-- 
1.7.1



