[sheepdog] [PATCH] sheep: sheep aborted during startup, still joined in cluster

Ruoyu liangry at ucweb.com
Mon May 5 06:18:11 CEST 2014


Currently, create_cluster() is called before any thread is created.
If any of the subsequent startup steps fails, sheep calls exit() or
abort() directly, so leave_cluster() is never called. The other nodes
then still consider the dead node alive, which causes many problems.

Below is a reproducible case involving the journal file; a hot fix is
also submitted with this mail. But should we re-arrange the startup
steps? And should we avoid panic(), since it is dangerous?
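
To illustrate the ordering, here is a toy, self-contained model (the
stubs below are hypothetical, not the real sheepdog functions): the
node joins first, and a later init failure that ends in exit() or
abort() never runs the leave step unless we call it explicitly.

/* Toy model of the startup ordering issue; these stubs are
 * hypothetical, not the real sheepdog functions. */
#include <stdio.h>
#include <stdlib.h>

static void create_cluster(void) { printf("joined cluster\n"); }
static void leave_cluster(void)  { printf("left cluster\n"); }

/* pretend a later init step fails, like journal replay in step 8 */
static int journal_file_init(void) { return -1; }

int main(void)
{
	create_cluster();		/* join happens before other init */

	if (journal_file_init() != 0) {
		/* without this explicit call, exit()/abort() would leave
		 * the node registered and other nodes would keep treating
		 * it as alive */
		leave_cluster();
		exit(1);
	}
	return 0;
}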

Steps:

1. Start three sheep daemons. The cluster manager is zookeeper.
for i in `seq 0 2`; do
    sheep/sheep /tmp/sd$i -y 127.0.0.1 -c zookeeper:127.0.0.1:2181 -z $i \
        -p 700$i -j size=64M -n
done

2. Format the cluster and create a vdi. The data object file
007c2b2500000000 always lands on sd1 according to the sheepdog hash
algorithm (a sketch of the object-id layout follows the commands).
$ dog cluster format -c 1
$ dog vdi create test 4M -P
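
For reference, my understanding of where that object name comes from
(a minimal sketch; the shift constant and the vdi id 0x7c2b25 are taken
from my reading of sheepdog_proto.h and from this run, so treat them as
assumptions):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define VDI_SPACE_SHIFT 32	/* assumed, see sheepdog_proto.h */

/* a data object id is the 32-bit vdi id in the high bits plus the
 * object index in the low bits */
static uint64_t vid_to_data_oid(uint32_t vid, uint32_t idx)
{
	return ((uint64_t)vid << VDI_SPACE_SHIFT) | idx;
}

int main(void)
{
	/* 0x7c2b25 is the vdi id of "test" here; index 0 is its first
	 * 4 MB data object */
	printf("%016" PRIx64 "\n", vid_to_data_oid(0x7c2b25, 0));
	return 0;
}

Which node stores that oid is then decided by hashing it over the node
list; in this setup it is always sd1.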

3. Write the vdi continuously.
for i in `seq 0 4194303`; do
    echo $i
    echo "a" | dog vdi write test $i 1
done

4. Kill all sheep daemons from another terminal while the vdi is being
written.
$ killall sheep

5. Sometimes the journal files are not cleaned up when the sheep
daemons exit. If they are not there, repeat steps 3 and 4.
$ ls /tmp/sd*/journal*
/tmp/sd0/journal_file0  /tmp/sd0/journal_file1
/tmp/sd1/journal_file0  /tmp/sd1/journal_file1
/tmp/sd2/journal_file0  /tmp/sd2/journal_file1

6. Remove the data object file to simulate the case where the WAL has
been written but the data object file was not created.
$ rm /tmp/sd1/obj/007c2b2500000000

7. Start the three sheep daemons again. sd0 and sd2 come up, but sd1
stays down.

8. The program log (sheep.log) shows that the sheep process for sd1
has aborted; a sketch of the replay step it failed in follows the log.

 INFO [main] md_add_disk(337) /tmp/sd1/obj, vdisk nr 261, total disk 1
 INFO [main] send_join_request(787) IPv4 ip:127.0.0.1 port:7001
 INFO [main] replay_journal_entry(159) /tmp/sd1/obj/007c2b2500000000, ...
ERROR [main] replay_journal_entry(166) open No such file or directory
EMERG [main] check_recover_journal_file(262)
    PANIC: recoverying from journal file (new) failed
EMERG [main] crash_handler(268) sheep exits unexpectedly (Aborted).
EMERG [main] sd_backtrace(833) sheep() [0x406157]
...
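
A rough, self-contained sketch of the replay step that fails here
(simplified from my reading of sheep/journal.c, not the exact code):
replay opens the object that the journal entry refers to and re-applies
the logged write; with the object removed in step 6 and the create flag
unset, open() fails with ENOENT, which is the error logged above.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* replay one journal entry: open the target object file and re-apply
 * the logged write at the recorded offset */
static int replay_one_entry(const char *path, const void *buf,
			    size_t size, off_t off, int create)
{
	int flags = O_WRONLY | (create ? O_CREAT : 0);
	int fd = open(path, flags, 0644);

	if (fd < 0) {
		/* this is the "open No such file or directory" above */
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return -1;
	}
	if (pwrite(fd, buf, size, off) != (ssize_t)size) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

In the current code, check_recover_journal_file() treats that failure
as fatal, so panic()/abort() runs and leave_cluster() is never reached.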

9. However, the dog commands show that the node is still in the cluster!

$ dog cluster info
Cluster status: running, auto-recovery enabled

Cluster created at Mon May  5 10:33:26 2014

Epoch Time           Version
2014-05-05 10:33:26      1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]

$ dog node list
  Id   Host:Port         V-Nodes       Zone
   0   127.0.0.1:7000      	128          0
   1   127.0.0.1:7001      	128          1
   2   127.0.0.1:7002      	128          2

Signed-off-by: Ruoyu <liangry at ucweb.com>
---
 sheep/journal.c | 29 +++++++++++++++++++----------
 sheep/sheep.c   |  4 +++-
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/sheep/journal.c b/sheep/journal.c
index 57502b6..3c70c13 100644
--- a/sheep/journal.c
+++ b/sheep/journal.c
@@ -151,9 +151,11 @@ static int replay_journal_entry(struct journal_descriptor *jd)
 		return 0;
 	}
 
-	if (jd->flag != JF_STORE)
-		panic("flag is not JF_STORE, the journaling file is broken."
+	if (jd->flag != JF_STORE) {
+		sd_emerg("flag is not JF_STORE, the journaling file is broken."
 		      " please remove the journaling file and restart sheep daemon");
+		return -1;
+	}
 
 	sd_info("%s, size %" PRIu64 ", off %" PRIu64 ", %d", path, jd->size,
 		jd->offset, jd->create);
@@ -245,21 +247,27 @@ skip:
  * we actually only recover one jfile, the other would be empty. This process
  * is fast with buffered IO that only take several secends at most.
  */
-static void check_recover_journal_file(const char *p)
+static int check_recover_journal_file(const char *p)
 {
 	int old = 0, new = 0;
 
 	if (get_old_new_jfile(p, &old, &new) < 0)
-		return;
+		return -1;
 
 	/* No journal file found */
 	if (old == 0)
-		return;
+		return 0;
 
-	if (do_recover(old) < 0)
-		panic("recoverying from journal file (old) failed");
-	if (do_recover(new) < 0)
-		panic("recoverying from journal file (new) failed");
+	if (do_recover(old) < 0) {
+		sd_emerg("recoverying from journal file (old) failed");
+		return -1;
+	}
+	if (do_recover(new) < 0) {
+		sd_emerg("recoverying from journal file (new) failed");
+		return -1;
+	}
+
+	return 0;
 }
 
 int journal_file_init(const char *path, size_t size, bool skip)
@@ -267,7 +275,8 @@ int journal_file_init(const char *path, size_t size, bool skip)
 	int fd;
 
 	if (!skip)
-		check_recover_journal_file(path);
+		if (check_recover_journal_file(path) != 0)
+			return -1;
 
 	jfile_size = size / 2;
 
diff --git a/sheep/sheep.c b/sheep/sheep.c
index 74d1aaf..3bb71d2 100644
--- a/sheep/sheep.c
+++ b/sheep/sheep.c
@@ -875,8 +875,10 @@ int main(int argc, char **argv)
 			memcpy(jpath, dir, strlen(dir));
 		sd_debug("%s, %"PRIu64", %d", jpath, jsize, jskip);
 		ret = journal_file_init(jpath, jsize, jskip);
-		if (ret)
+		if (ret) {
+			leave_cluster();
 			exit(1);
+		}
 	}
 
 	init_fec();
-- 
1.8.3.2




