[sheepdog] [PATCH 1/2] sheep: fix error handling in epoch_log_remote

Tue Jun 5 15:47:18 CEST 2012

On 06/05/2012 07:53 PM, Christoph Hellwig wrote:

> Make sure we exit if any operations fails, operations that can fail are:
> 
>  - get_latest_epoch - because local_stat_cluster is a a force operation
>    we might not have any epoch log yet.  In that case there is no point
>    to continue.
>  - epoch_log_read - can fail in any number of ways, although once we
>    have a valid epoch it generally shouldn't.
>  - if we didn't manage to read the epoch log from any node we should
>    return -1 as there is no valid data returned.  If we have a valid
>    node list of size 0 one of the nodes will return that to us.
> 
> Also adjust get_vnodes_from_epoch to deal with the negative return value,
> local_stat_cluster currently doesn't have any error handling and will
> be fixed in the next patch.
> 
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> 
> Index: sheepdog/sheep/store.c
> ===================================================================
> --- sheepdog.orig/sheep/store.c	2012-06-05 07:40:55.336111143 -0400
> +++ sheepdog/sheep/store.c	2012-06-05 07:41:00.272111270 -0400
> @@ -102,10 +102,17 @@ void do_io_request(struct work *work)
>  int epoch_log_read_remote(uint32_t epoch, char *buf, int len)
>  {
>  	int i, ret;
> -	unsigned int nr, le = get_latest_epoch();
> +	unsigned int nr, le;
>  	struct sd_node nodes[SD_MAX_NODES];
>  
> +	le = get_latest_epoch();
> +	if (!le)
> +		return 0;
> +
>  	nr = epoch_log_read(le, (char *)nodes, sizeof(nodes));
> +	if (nr < 0)
> +		return -1;
> +
>  	nr /= sizeof(nodes[0]);
>  
>  	for (i = 0; i < nr; i++) {
> @@ -135,16 +142,11 @@ int epoch_log_read_remote(uint32_t epoch
>  		ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
>  		close(fd);
>  
> -		if (ret)
> -			continue;
> -		if (rsp->result == SD_RES_SUCCESS) {
> -			ret = rsp->data_length;
> -			goto out;
> -		}
> +		if (!ret && rsp->result == SD_RES_SUCCESS)
> +			return rsp->data_length;
>  	}
> -	ret = 0; /* If no one has targeted epoch file, we can safely return 0 */
> -out:
> -	return ret;
> +
> +	return -1;
>  }
>  
>  int epoch_log_read_nr(uint32_t epoch, char *buf, int len)
> Index: sheepdog/sheep/recovery.c
> ===================================================================
> --- sheepdog.orig/sheep/recovery.c	2012-06-05 07:32:25.608098136 -0400
> +++ sheepdog/sheep/recovery.c	2012-06-05 07:41:15.808111665 -0400
> @@ -63,7 +63,7 @@ static struct vnode_info *get_vnodes_fro
>  	if (nr_nodes < 0) {
>  		nr_nodes = epoch_log_read_remote(epoch, (void *)nodes,
>  						 sizeof(nodes));
> -		if (nr_nodes == 0)
> +		if (nr_nodes <= 0)
>  			return NULL;
>  		nr_nodes /= sizeof(nodes[0]);
>  	}

We really should return 0 if we don't get any epoch file. Without your
patch, if we lost one of epochs file in all nodes, say bellow case, I
deliberately remove epoch 3, we can till get other informatoin:

--------
tailai.ly at taobao:~/sheepdog$ collie cluster info
Cluster status: running

Cluster created at Tue Jun  5 21:44:21 2012

Epoch Time           Version
2012-06-05 21:44:24      4 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005]
1970-01-01 07:00:00      3 []
2012-06-05 21:44:22      2 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003]
2012-06-05 21:44:21      1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]

but with your patch, we can only get:
tailai.ly at taobao:~/sheepdog$ collie cluster info
Cluster status: running

Cluster created at Tue Jun  5 21:43:41 2012

Epoch Time           Version
2012-06-05 21:43:44      4 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005]

------

Thanks,
Yuan