[sheepdog] [PATCH] sheep: fix an epoch mismatch bug

Liu Yuan namei.unix at gmail.com
Mon May 28 11:24:52 CEST 2012


From: Liu Yuan <tailai.ly at taobao.com>

This is a nasty fallout from removing register/un-register of group_fd. It can
be observed with the following script:

Join a new node while another node leaves in the meantime
==============================
for i in 0 1 2; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
collie/collie cluster format -c 3
collie/collie vdi create test0 100M -P
sleep 1
for i in 3; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
for i in 1; do pkill -f "sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i";done;
==============================

The culprit is that we fail to increment sys->epoch, because the system status
of the newly joined node is still SD_STATUS_WAIT_FOR_FORMAT until
__sd_join_done() is called.

The fix is simple: add a new status to indicate that "I have already joined;
although other state still needs to be updated, I am capable of recovering."
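
As a rough illustration of the reasoning above, here is a simplified,
standalone model. It is not the actual sheep code: struct model_system,
can_recover_old(), can_recover_new() and model_leave() are hypothetical names,
and the values used for SD_STATUS_OK and SD_STATUS_WAIT_FOR_FORMAT are
assumptions (only the HALT and WAIT_FOR_JOIN_DONE values appear in this patch).
It also assumes that the epoch is bumped on a leave event only when the node
reports that it can recover.

```c
#include <stdint.h>

/* Assumed values, not shown in this patch */
#define SD_STATUS_OK                 0x00000001
#define SD_STATUS_WAIT_FOR_FORMAT    0x00000002
/* Values taken from sheep_priv.h as changed by this patch */
#define SD_STATUS_HALT               0x00000020
#define SD_STATUS_WAIT_FOR_JOIN_DONE 0x00000040

/* Hypothetical stand-in for the parts of 'sys' that matter here */
struct model_system {
	uint32_t status;
	uint32_t epoch;
};

/* Recovery check before the patch: only OK or HALT qualifies */
static int can_recover_old(const struct model_system *s)
{
	return (s->status & (SD_STATUS_OK | SD_STATUS_HALT)) != 0;
}

/* Recovery check after the patch: WAIT_FOR_JOIN_DONE qualifies too */
static int can_recover_new(const struct model_system *s)
{
	return (s->status & (SD_STATUS_OK | SD_STATUS_HALT |
			     SD_STATUS_WAIT_FOR_JOIN_DONE)) != 0;
}

/*
 * Hypothetical leave handling: the epoch is incremented only when the
 * node considers itself able to recover (assumption, see above).
 */
static void model_leave(struct model_system *s,
			int (*can_recover)(const struct model_system *))
{
	if (can_recover(s))
		s->epoch++;
}
```

In this model, a node that is still in SD_STATUS_WAIT_FOR_FORMAT when another
node leaves keeps its old epoch under can_recover_old(), while a node marked
SD_STATUS_WAIT_FOR_JOIN_DONE gets its epoch incremented under
can_recover_new(), so it stays in step with the rest of the cluster.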

Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
 sheep/group.c      |    5 +++++
 sheep/sheep_priv.h |    9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/sheep/group.c b/sheep/group.c
index f38446d..7f0dab1 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -1145,6 +1145,11 @@ void sd_join_handler(struct sd_node *joined, struct sd_node *members,
 			break;
 
 		update_cluster_info(jm, joined, members, nr_members);
+		/* This node has already joined. join_done() still has to update
+		 * other state, but the node is capable of recovering from now on.
+		 */
+		if (jm->cluster_status == SD_STATUS_OK)
+			sys_stat_set(SD_STATUS_WAIT_FOR_JOIN_DONE);
 
 		w = zalloc(sizeof(*w));
 		if (!w)
diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
index 69ece1c..1e6f600 100644
--- a/sheep/sheep_priv.h
+++ b/sheep/sheep_priv.h
@@ -33,6 +33,7 @@
 #define SD_STATUS_SHUTDOWN          0x00000008
 #define SD_STATUS_JOIN_FAILED       0x00000010
 #define SD_STATUS_HALT              0x00000020
+#define SD_STATUS_WAIT_FOR_JOIN_DONE 0x00000040
 
 #define SD_RES_NETWORK_ERROR    0x81 /* Network error between sheep */
 
@@ -410,9 +411,15 @@ static inline uint32_t sys_stat_get(void)
 	return sys->status;
 }
 
+static inline uint32_t sys_stat_wait_join_done(void)
+{
+	return sys->status & SD_STATUS_WAIT_FOR_JOIN_DONE;
+}
 static inline int sys_can_recover(void)
 {
-	return sys_stat_ok() || sys_stat_halt();
+	return (sys_stat_ok() ||
+		sys_stat_halt() ||
+		sys_stat_wait_join_done());
 }
 
 static inline int sys_can_halt(void)
-- 
1.7.10.2



