[sheepdog] [PATCH] sheep: sheep aborted during startup, still joined in cluster

Ruoyu liangry at ucweb.com
Tue May 6 15:14:27 CEST 2014


On 2014-05-06 11:17, Liu Yuan wrote:
> On Mon, May 05, 2014 at 12:18:11PM +0800, Ruoyu wrote:
>> Currently, create_cluster() is called before any thread is created.
>> If any one of the subsequent startup steps fails,
>> sheep calls exit() or abort() directly, so leave_cluster() is not called.
>> The other nodes still consider that node to be alive,
>> which causes many problems.
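>>
>> A minimal sketch of the call order at issue (the argument lists below are
>> placeholders, only the ordering matters):
>>
>> 	/* sheep.c main(), heavily simplified */
>> 	ret = create_cluster(port, zone, nr_vnodes);
>> 	if (ret)
>> 		exit(1);	/* not joined yet, a plain exit() is harmless */
>>
>> 	ret = journal_file_init(jpath, jsize, jskip);
>> 	if (ret)
>> 		exit(1);	/* already joined, but leave_cluster() is never
>> 				 * called, so the other nodes keep seeing this
>> 				 * node as alive */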
>>
>> Below is a reproducible case involving the journal file; a hot fix is also
>> submitted. But should we rearrange the startup steps? And should we avoid
>> panic(), since it is dangerous?
>>
>> Steps:
>>
>> 1. Start three sheeps. The cluster manager is zookeeper.
>> for i in `seq 0 2`; do
>>      sheep/sheep /tmp/sd$i -y 127.0.0.1 -c zookeeper:127.0.0.1:2181 -z $i \
>>          -p 700$i -j size=64M -n
>> done
>>
>> 2. Format the cluster and create a vdi. Data object file 007c2b2500000000
>> is always written to sd1 according to the sheepdog hash algorithm.
>> $ dog cluster format -c 1
>> $ dog vdi create test 4M -P
>>
>> 3. Write the vdi continuously.
>> for i in `seq 0 4194303`; do
>>      echo $i
>>      echo "a" | dog vdi write test $i 1
>> done
>>
>> 4. Kill all sheeps in another terminal while the vdi is being written.
>> $ killall sheep
>>
>> 5. Sometimes the journal files are not cleaned up when the sheeps exit.
>> If they are not found, repeat step 3 and step 4.
>> $ ls /tmp/sd*/journal*
>> /tmp/sd0/journal_file0  /tmp/sd0/journal_file1
>> /tmp/sd1/journal_file0  /tmp/sd1/journal_file1
>> /tmp/sd2/journal_file0  /tmp/sd2/journal_file1
>>
>> 6. Remove the data object file to simulate the case where the WAL is finished
>> but the data object file has not been created.
>> $ rm /tmp/sd1/obj/007c2b2500000000
>>
>> 7. Start the three sheeps again. We find that sd0 and sd2 are up, but sd1 is down.
>>
>> 8. From the program log (sheep.log), we can see that the sheep process of sd1
>> has already aborted.
>>
>>   INFO [main] md_add_disk(337) /tmp/sd1/obj, vdisk nr 261, total disk 1
>>   INFO [main] send_join_request(787) IPv4 ip:127.0.0.1 port:7001
>>   INFO [main] replay_journal_entry(159) /tmp/sd1/obj/007c2b2500000000, ...
>> ERROR [main] replay_journal_entry(166) open No such file or directory
>> EMERG [main] check_recover_journal_file(262)
>>      PANIC: recoverying from journal file (new) failed
>> EMERG [main] crash_handler(268) sheep exits unexpectedly (Aborted).
>> EMERG [main] sd_backtrace(833) sheep() [0x406157]
>> ...
>>
>> 9. However, the dog command shows the node is still in the cluster!
>>
>> $ dog cluster info
>> Cluster status: running, auto-recovery enabled
>>
>> Cluster created at Mon May  5 10:33:26 2014
>>
>> Epoch Time           Version
>> 2014-05-05 10:33:26      1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]
>>
>> $ dog node list
>>    Id   Host:Port         V-Nodes       Zone
>>     0   127.0.0.1:7000      	128          0
>>     1   127.0.0.1:7001      	128          1
>>     2   127.0.0.1:7002      	128          2
>>
>> Signed-off-by: Ruoyu <liangry at ucweb.com>
>> ---
>>   sheep/journal.c | 29 +++++++++++++++++++----------
>>   sheep/sheep.c   |  4 +++-
>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/sheep/journal.c b/sheep/journal.c
>> index 57502b6..3c70c13 100644
>> --- a/sheep/journal.c
>> +++ b/sheep/journal.c
>> @@ -151,9 +151,11 @@ static int replay_journal_entry(struct journal_descriptor *jd)
>>   		return 0;
>>   	}
>>   
>> -	if (jd->flag != JF_STORE)
>> -		panic("flag is not JF_STORE, the journaling file is broken."
>> +	if (jd->flag != JF_STORE) {
>> +		sd_emerg("flag is not JF_STORE, the journaling file is broken."
>>   		      " please remove the journaling file and restart sheep daemon");
>> +		return -1;
>> +	}
>>   
>>   	sd_info("%s, size %" PRIu64 ", off %" PRIu64 ", %d", path, jd->size,
>>   		jd->offset, jd->create);
>> @@ -245,21 +247,27 @@ skip:
>>    * we actually only recover one jfile, the other would be empty. This process
>>    * is fast with buffered IO that only take several secends at most.
>>    */
>> -static void check_recover_journal_file(const char *p)
>> +static int check_recover_journal_file(const char *p)
>>   {
>>   	int old = 0, new = 0;
>>   
>>   	if (get_old_new_jfile(p, &old, &new) < 0)
>> -		return;
>> +		return -1;
>>   
>>   	/* No journal file found */
>>   	if (old == 0)
>> -		return;
>> +		return 0;
>>   
>> -	if (do_recover(old) < 0)
>> -		panic("recoverying from journal file (old) failed");
>> -	if (do_recover(new) < 0)
>> -		panic("recoverying from journal file (new) failed");
>> +	if (do_recover(old) < 0) {
>> +		sd_emerg("recoverying from journal file (old) failed");
>> +		return -1;
>> +	}
>> +	if (do_recover(new) < 0) {
>> +		sd_emerg("recoverying from journal file (new) failed");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>>   }
>>   
>>   int journal_file_init(const char *path, size_t size, bool skip)
>> @@ -267,7 +275,8 @@ int journal_file_init(const char *path, size_t size, bool skip)
>>   	int fd;
>>   
>>   	if (!skip)
>> -		check_recover_journal_file(path);
>> +		if (check_recover_journal_file(path) != 0)
>> +			return -1;
>>   
>>   	jfile_size = size / 2;
>>   
>> diff --git a/sheep/sheep.c b/sheep/sheep.c
>> index 74d1aaf..3bb71d2 100644
>> --- a/sheep/sheep.c
>> +++ b/sheep/sheep.c
>> @@ -875,8 +875,10 @@ int main(int argc, char **argv)
>>   			memcpy(jpath, dir, strlen(dir));
>>   		sd_debug("%s, %"PRIu64", %d", jpath, jsize, jskip);
>>   		ret = journal_file_init(jpath, jsize, jskip);
>> -		if (ret)
>> +		if (ret) {
>> +			leave_cluster();
> I think you should goto line 945 in sheep.c for graceful cleanup of resources.
If we want to clean up resources gracefully, more changes are needed.

1. Assign an exit code that stands for the failure; normally it should be 1,
not 0.
2. Adjust the cleanup order. For example, the pid file is created later than
the other initialization steps, so I think it should be cleaned up first.
3. Adjust some of the initialization steps (a rough sketch follows).
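
A rough sketch of the kind of goto-based error path Yuan suggests (the label
and helpers such as sd_run_main_loop() and remove_pidfile() are placeholders
here, not the actual code around line 945):

	/* sheep.c main(), heavily simplified */
	ret = journal_file_init(jpath, jsize, jskip);
	if (ret) {
		ret = 1;		/* 1. exit with 1 on failure, not 0 */
		goto cleanup;
	}

	ret = sd_run_main_loop();	/* rest of startup and the event loop */

cleanup:
	remove_pidfile();		/* 2. pid file is created last, remove it first */
	leave_cluster();		/* so the other nodes drop this node */
	return ret;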

I will try to patch it later.
>
> Thanks
> Yuan




