[sheepdog] [PATCH v2] sheep: don't let outstanding IO reqs block confchg events

Mon May 28 10:07:36 CEST 2012

From: Liu Yuan <tailai.ly at taobao.com>

v2:
 - remove unnecessary else
---------------------------------------------- >8

We already treat an object with in-flight IO as a busy object, which sits on
sys->outstanding_req_list. A recovery request for such an object is therefore
queued on sys->req_wait_for_obj_list, from where it is resumed later.

So there is no need to block confchg events on outstanding IO, and confchg
can be processed as soon as possible. Confchg should take precedence over IO
requests because:

Suppose every node is doing heavy IO while the cluster is in recovery, so
each node is issuing IO requests and running recovery at the same time.
Outstanding IO and the unfinished confchg event block each other (nearly
a deadlock), all nodes are busy retrying the pending I/Os (a livelock),
and recovery requests are mostly denied service. Neither the outstanding
IO nor the recovery moves on to completion.
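
The change in gating can be sketched roughly as below. This is a standalone
illustration, not sheep code: struct toy_cluster and both helpers are
hypothetical names that only mirror the condition tested in
process_event_queue() before and after this patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields the old check looked at. */
struct toy_cluster {
	bool event_running;
	int nr_outstanding_io;
};

/* Old rule: a confchg event waits for any running event and for all IO. */
static bool toy_can_run_event_old(const struct toy_cluster *c)
{
	return !c->event_running && c->nr_outstanding_io == 0;
}

/* New rule: only another running event delays a confchg event. */
static bool toy_can_run_event_new(const struct toy_cluster *c)
{
	return !c->event_running;
}
```

With three I/Os outstanding and no event running, the old rule holds the
event back while the new rule lets it run.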

The epoch check in farm_write() functions as a safeguard for the following
case raised by Kazutaka:

If there is 1 node, A, and the number of copies is 1, how does
Farm handle the following case?

 - the user adds a second node B, and there are in-flight I/Os on
   node A
 - node A increments the epoch from 1 to 2, and node B recovers
   objects from epoch 1 on node A
 - after node B receives the objects into epoch 2, the in-flight I/Os
   on node A update objects in epoch 1 on node A.
 - node A sends responses to clients as success, but the updated data
   will be lost??
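
The scenario above can be replayed with a minimal standalone sketch of the
stale-epoch check. Nothing here is sheep code: struct toy_iocb,
toy_sys_epoch and toy_farm_write() are hypothetical stand-ins for
struct siocb, sys_epoch() and farm_write(), and the numeric values of the
result codes are made up for the example.

```c
#include <stdint.h>

/* Hypothetical result codes; the values are illustrative only. */
enum toy_result { TOY_RES_SUCCESS = 0, TOY_RES_OLD_NODE_VER = 1 };

/* Stand-in for struct siocb: only the epoch the request was issued in. */
struct toy_iocb {
	uint32_t epoch;
};

/* Stand-in for the node's current epoch (sys_epoch() in sheep). */
static uint32_t toy_sys_epoch = 1;

/* Refuse a write that was issued in an epoch older than the current one. */
static int toy_farm_write(const struct toy_iocb *iocb)
{
	if (iocb->epoch < toy_sys_epoch)
		return TOY_RES_OLD_NODE_VER; /* caller is expected to retry */
	/* ... the real write would happen here ... */
	return TOY_RES_SUCCESS;
}

/* Replay: an I/O starts in epoch 1, then node A moves on to epoch 2. */
static int toy_replay_scenario(void)
{
	struct toy_iocb inflight = { .epoch = 1 };

	toy_sys_epoch = 2; /* node B joined, so node A bumped its epoch */
	return toy_farm_write(&inflight);
}
```

In this sketch the late write is rejected instead of silently updating
epoch 1, so the client is not told that the write succeeded.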

Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
 sheep/farm/farm.c  |    4 ++++
 sheep/group.c      |    3 +--
 sheep/sdnet.c      |    2 --
 sheep/sheep_priv.h |    1 -
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/sheep/farm/farm.c b/sheep/farm/farm.c
index 1575d24..568fad2 100644
--- a/sheep/farm/farm.c
+++ b/sheep/farm/farm.c
@@ -110,6 +110,10 @@ static int farm_write(uint64_t oid, struct siocb *iocb, int create)
 	char path[PATH_MAX];
 	ssize_t size;
 
+	if (iocb->epoch < sys_epoch()) {
+		dprintf("%"PRIu32" sys %"PRIu32"\n", iocb->epoch, sys_epoch());
+		return SD_RES_OLD_NODE_VER;
+	}
 	if (is_vdi_obj(oid))
 		flags &= ~O_DIRECT;
 
diff --git a/sheep/group.c b/sheep/group.c
index 50a53c1..f3b95cc 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -1096,7 +1096,6 @@ static void process_request_queue(void)
 		if (is_io_op(req->op)) {
 			list_add_tail(&req->request_list,
 				      &sys->outstanding_req_list);
-			sys->nr_outstanding_io++;
 
 			if (need_consistency_check(req))
 				set_consistency_check(req);
@@ -1125,7 +1124,7 @@ static inline void process_event_queue(void)
 	 * we need to serialize events so we don't call queue_work
 	 * if one event is running by executing event_fn() or event_done().
 	 */
-	if (event_running || sys->nr_outstanding_io)
+	if (event_running)
 		return;
 
 	cevent = list_first_entry(&sys->event_queue,
diff --git a/sheep/sdnet.c b/sheep/sdnet.c
index 4224220..f4408f7 100644
--- a/sheep/sdnet.c
+++ b/sheep/sdnet.c
@@ -97,7 +97,6 @@ static void io_op_done(struct work *work)
 	struct sd_req *hdr = &req->rq;
 
 	list_del(&req->request_list);
-	sys->nr_outstanding_io--;
 
 	switch (req->rp.result) {
 	case SD_RES_OLD_NODE_VER:
@@ -193,7 +192,6 @@ static int check_request(struct request *req)
 		/* ask gateway to retry. */
 		req->rp.result = SD_RES_OLD_NODE_VER;
 		req->rp.epoch = sys->epoch;
-		sys->nr_outstanding_io++;
 		req->work.done(&req->work);
 		return -1;
 	} else if (after(req->rq.epoch, sys->epoch)) {
diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
index 69ece1c..6114a21 100644
--- a/sheep/sheep_priv.h
+++ b/sheep/sheep_priv.h
@@ -140,7 +140,6 @@ struct cluster_info {
 	struct list_head wait_rw_queue;
 	struct list_head wait_obj_queue;
 	struct event_struct *cur_cevent;
-	int nr_outstanding_io;
 	int nr_outstanding_reqs;
 	unsigned int outstanding_data_size;
 
-- 
1.7.10.2