[sheepdog] [PATCH] sheep: sheep aborted during startup, still joined in cluster

Liu Yuan namei.unix at gmail.com
Tue May 6 05:17:12 CEST 2014


On Mon, May 05, 2014 at 12:18:11PM +0800, Ruoyu wrote:
> Currently, create_cluster() is called before any thread is created.
> If any of the following startup steps fails, sheep calls exit() or
> abort() directly, so leave_cluster() is never called. The other nodes
> then still consider that node alive, which causes many problems.
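> 
> A minimal sketch of the failing pattern (illustrative, not the exact
> sheep.c flow):
> 
>     /* startup: create_cluster() has already announced this node */
>     if (do_recover(new) < 0)
>         /* panic() logs and then abort()s, so leave_cluster() never
>          * runs and zookeeper keeps listing the dead node */
>         panic("recoverying from journal file (new) failed");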
> 
> Here is a reproducible case using the journal file. A hot fix is also
> submitted. But should we rearrange the startup steps? And should we
> avoid panic(), since it is dangerous?
> 
> Steps:
> 
> 1. Start three sheeps. Cluster manager is zookeeper.
> for i in `seq 0 2`; do
>     sheep/sheep /tmp/sd$i -y 127.0.0.1 -c zookeeper:127.0.0.1:2181 -z $i \
>         -p 700$i -j size=64M -n
> done
> 
> 2. Format the cluster and create a vdi. The data object file
> 007c2b2500000000 is always written to sd1 according to the sheepdog
> hash algorithm.
> $ dog cluster format -c 1
> $ dog vdi create test 4M -P
> 
> 3. Write the vdi continuously.
> for i in `seq 0 4194303`; do
>     echo $i
>     echo "a" | dog vdi write test $i 1
> done
> 
> 4. Kill all sheeps in another terminal while the vdi is being written.
> $ killall sheep
> 
> 5. Sometimes journal files are not cleaned up when sheeps exit.
> If they are not present, repeat steps 3 and 4 until they are.
> $ ls /tmp/sd*/journal*
> /tmp/sd0/journal_file0  /tmp/sd0/journal_file1
> /tmp/sd1/journal_file0  /tmp/sd1/journal_file1
> /tmp/sd2/journal_file0  /tmp/sd2/journal_file1
> 
> 6. Remove the data object file to simulate the case where the WAL has
> finished but the data object file has not been created.
> $ rm /tmp/sd1/obj/007c2b2500000000
> 
> 7. Start the three sheeps again. We find that sd0 and sd2 are up, but sd1 is down.
> 
> 8. From the program log (sheep.log), we can see that the sheep process
> of sd1 has aborted.
> 
>  INFO [main] md_add_disk(337) /tmp/sd1/obj, vdisk nr 261, total disk 1
>  INFO [main] send_join_request(787) IPv4 ip:127.0.0.1 port:7001
>  INFO [main] replay_journal_entry(159) /tmp/sd1/obj/007c2b2500000000, ...
> ERROR [main] replay_journal_entry(166) open No such file or directory
> EMERG [main] check_recover_journal_file(262)
>     PANIC: recoverying from journal file (new) failed
> EMERG [main] crash_handler(268) sheep exits unexpectedly (Aborted).
> EMERG [main] sd_backtrace(833) sheep() [0x406157]
> ...
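> 
> The open failure comes from replay_journal_entry() reopening the object
> that the journal entry points at before re-applying the logged write;
> roughly (simplified, the flags shown are illustrative):
> 
>     fd = open(path, jd->create ? O_CREAT | O_WRONLY : O_WRONLY, 0644);
>     if (fd < 0)
>         return -1;  /* step 6 removed the object, hence ENOENT */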
> 
> 9. However, the dog command shows the node is still in the cluster!
> 
> $ dog cluster info
> Cluster status: running, auto-recovery enabled
> 
> Cluster created at Mon May  5 10:33:26 2014
> 
> Epoch Time           Version
> 2014-05-05 10:33:26      1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]
> 
> $ dog node list
>   Id   Host:Port         V-Nodes       Zone
>    0   127.0.0.1:7000        128          0
>    1   127.0.0.1:7001        128          1
>    2   127.0.0.1:7002        128          2
> 
> Signed-off-by: Ruoyu <liangry at ucweb.com>
> ---
>  sheep/journal.c | 29 +++++++++++++++++++----------
>  sheep/sheep.c   |  4 +++-
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/sheep/journal.c b/sheep/journal.c
> index 57502b6..3c70c13 100644
> --- a/sheep/journal.c
> +++ b/sheep/journal.c
> @@ -151,9 +151,11 @@ static int replay_journal_entry(struct journal_descriptor *jd)
>  		return 0;
>  	}
>  
> -	if (jd->flag != JF_STORE)
> -		panic("flag is not JF_STORE, the journaling file is broken."
> +	if (jd->flag != JF_STORE) {
> +		sd_emerg("flag is not JF_STORE, the journaling file is broken."
>  		      " please remove the journaling file and restart sheep daemon");
> +		return -1;
> +	}
>  
>  	sd_info("%s, size %" PRIu64 ", off %" PRIu64 ", %d", path, jd->size,
>  		jd->offset, jd->create);
> @@ -245,21 +247,27 @@ skip:
>   * we actually only recover one jfile, the other would be empty. This process
>   * is fast with buffered IO that only take several secends at most.
>   */
> -static void check_recover_journal_file(const char *p)
> +static int check_recover_journal_file(const char *p)
>  {
>  	int old = 0, new = 0;
>  
>  	if (get_old_new_jfile(p, &old, &new) < 0)
> -		return;
> +		return -1;
>  
>  	/* No journal file found */
>  	if (old == 0)
> -		return;
> +		return 0;
>  
> -	if (do_recover(old) < 0)
> -		panic("recoverying from journal file (old) failed");
> -	if (do_recover(new) < 0)
> -		panic("recoverying from journal file (new) failed");
> +	if (do_recover(old) < 0) {
> +		sd_emerg("recoverying from journal file (old) failed");
> +		return -1;
> +	}
> +	if (do_recover(new) < 0) {
> +		sd_emerg("recoverying from journal file (new) failed");
> +		return -1;
> +	}
> +
> +	return 0;
>  }
>  
>  int journal_file_init(const char *path, size_t size, bool skip)
> @@ -267,7 +275,8 @@ int journal_file_init(const char *path, size_t size, bool skip)
>  	int fd;
>  
>  	if (!skip)
> -		check_recover_journal_file(path);
> +		if (check_recover_journal_file(path) != 0)
> +			return -1;
>  
>  	jfile_size = size / 2;
>  
> diff --git a/sheep/sheep.c b/sheep/sheep.c
> index 74d1aaf..3bb71d2 100644
> --- a/sheep/sheep.c
> +++ b/sheep/sheep.c
> @@ -875,8 +875,10 @@ int main(int argc, char **argv)
>  			memcpy(jpath, dir, strlen(dir));
>  		sd_debug("%s, %"PRIu64", %d", jpath, jsize, jskip);
>  		ret = journal_file_init(jpath, jsize, jskip);
> -		if (ret)
> +		if (ret) {
> +			leave_cluster();

I think you should goto line 945 in sheep.c for graceful cleanup of resources, as sketched below.
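
Something like this (a sketch only; the label name is made up, the idea
is to reuse the existing teardown instead of exiting early):

	ret = journal_file_init(jpath, jsize, jskip);
	if (ret)
		goto cleanup;	/* hypothetical label for the teardown near line 945 */
	...

cleanup:
	leave_cluster();
	/* plus whatever other resource cleanup already lives there */
	return 1;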

Thanks
Yuan


