[sheepdog] [PATCH] sheep: fix an epoch mismatch bug
Liu Yuan
namei.unix at gmail.com
Mon May 28 11:24:52 CEST 2012
From: Liu Yuan <tailai.ly at taobao.com>
This is a nasty fallout from removing the register/un-register of group_fd.
It can be observed with the following script:
Join a new node while another node leaves in the meantime
==============================
for i in 0 1 2; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
collie/collie cluster format -c 3
collie/collie vdi create test0 100M -P
sleep 1
for i in 3; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
for i in 1; do pkill -f "sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i";done;
==============================
The culprit is that we fail to increment sys->epoch, because the sys status
of the newly joined node is still SD_STATUS_WAIT_FOR_FORMAT until
__sd_join_done() is called.
The fix is simple: add a new status that indicates "I have already joined;
although I still need to update other state, I am capable of recovering".
Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
---
sheep/group.c | 5 +++++
sheep/sheep_priv.h | 9 ++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/sheep/group.c b/sheep/group.c
index f38446d..7f0dab1 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -1145,6 +1145,11 @@ void sd_join_handler(struct sd_node *joined, struct sd_node *members,
break;
update_cluster_info(jm, joined, members, nr_members);
+ /* I'm already joined, though need update other states by
+ * join_done() but I'm capable of recovering since now.
+ */
+ if (jm->cluster_status == SD_STATUS_OK)
+ sys_stat_set(SD_STATUS_WAIT_FOR_JOIN_DONE);
w = zalloc(sizeof(*w));
if (!w)
diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
index 69ece1c..1e6f600 100644
--- a/sheep/sheep_priv.h
+++ b/sheep/sheep_priv.h
@@ -33,6 +33,7 @@
#define SD_STATUS_SHUTDOWN 0x00000008
#define SD_STATUS_JOIN_FAILED 0x00000010
#define SD_STATUS_HALT 0x00000020
+#define SD_STATUS_WAIT_FOR_JOIN_DONE 0x00000040
#define SD_RES_NETWORK_ERROR 0x81 /* Network error between sheep */
@@ -410,9 +411,15 @@ static inline uint32_t sys_stat_get(void)
return sys->status;
}
+static inline uint32_t sys_stat_wait_join_done(void)
+{
+ return sys->status & SD_STATUS_WAIT_FOR_JOIN_DONE;
+}
static inline int sys_can_recover(void)
{
- return sys_stat_ok() || sys_stat_halt();
+ return (sys_stat_ok() ||
+ sys_stat_halt() ||
+ sys_stat_wait_join_done());
}
static inline int sys_can_halt(void)
--
1.7.10.2
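The race described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the sheep code itself: the model_* names and struct model_sys are hypothetical, the values of SD_STATUS_OK and SD_STATUS_WAIT_FOR_FORMAT are assumed (they are not shown in the hunk), and the leave handler is reduced to "bump the epoch only if the node can recover".

```c
#include <assert.h>
#include <stdint.h>

/* Status bits. The HALT and WAIT_FOR_JOIN_DONE values come from the patch;
 * the OK and WAIT_FOR_FORMAT values are assumptions for this sketch. */
#define SD_STATUS_OK                 0x00000001 /* assumed value */
#define SD_STATUS_WAIT_FOR_FORMAT    0x00000002 /* assumed value */
#define SD_STATUS_HALT               0x00000020
#define SD_STATUS_WAIT_FOR_JOIN_DONE 0x00000040

/* Hypothetical stand-in for the fields of the global "sys" that matter here. */
struct model_sys {
	uint32_t status;
	uint32_t epoch;
};

/* Mirrors sys_can_recover(): with the fix, WAIT_FOR_JOIN_DONE also counts. */
static int model_can_recover(const struct model_sys *s, int with_fix)
{
	uint32_t mask = SD_STATUS_OK | SD_STATUS_HALT;

	assert(s);
	if (with_fix)
		mask |= SD_STATUS_WAIT_FOR_JOIN_DONE;
	return (s->status & mask) != 0;
}

/* Mirrors the hunk added to sd_join_handler(): mark the node as joined
 * before the join-done work has run. */
static void model_join_handler(struct model_sys *s, uint32_t cluster_status,
			       int with_fix)
{
	assert(s);
	if (with_fix && cluster_status == SD_STATUS_OK)
		s->status = SD_STATUS_WAIT_FOR_JOIN_DONE;
}

/* Simplified leave handling (an assumption): the epoch is incremented only
 * when the node considers itself able to recover. */
static void model_leave_handler(struct model_sys *s, int with_fix)
{
	assert(s);
	if (model_can_recover(s, with_fix))
		s->epoch++;
}

/* A node joins a formatted cluster, then another node leaves before the
 * join-done step has run. Returns the joining node's resulting epoch. */
uint32_t model_epoch_after_race(int with_fix)
{
	struct model_sys s = { SD_STATUS_WAIT_FOR_FORMAT, 1 };

	model_join_handler(&s, SD_STATUS_OK, with_fix);
	model_leave_handler(&s, with_fix);
	return s.epoch;
}
```

In this model, model_epoch_after_race(0) leaves the epoch unchanged, which corresponds to the mismatch, while model_epoch_after_race(1) advances it by one.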