[sheepdog] [PATCH 1/2] sheep: fix error handling in epoch_log_remote
Liu Yuan
namei.unix at gmail.com
Tue Jun 5 15:47:18 CEST 2012
On 06/05/2012 07:53 PM, Christoph Hellwig wrote:
> Make sure we exit if any operations fails, operations that can fail are:
>
> - get_latest_epoch - because local_stat_cluster is a a force operation
> we might not have any epoch log yet. In that case there is no point
> to continue.
> - epoch_log_read - can fail in any number of ways, although once we
> have a valid epoch it generally shouldn't.
> - if we didn't manage to read the epoch log from any node we should
> return -1 as there is no valid data returned. If we have a valid
> node list of size 0 one of the nodes will return that to us.
>
> Also adjust get_vnodes_from_epoch to deal with the negative return value,
> local_stat_cluster currently doesn't have any error handling and will
> be fixed in the next patch.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
>
> Index: sheepdog/sheep/store.c
> ===================================================================
> --- sheepdog.orig/sheep/store.c 2012-06-05 07:40:55.336111143 -0400
> +++ sheepdog/sheep/store.c 2012-06-05 07:41:00.272111270 -0400
> @@ -102,10 +102,17 @@ void do_io_request(struct work *work)
> int epoch_log_read_remote(uint32_t epoch, char *buf, int len)
> {
> int i, ret;
> - unsigned int nr, le = get_latest_epoch();
> + unsigned int nr, le;
> struct sd_node nodes[SD_MAX_NODES];
>
> + le = get_latest_epoch();
> + if (!le)
> + return 0;
> +
> nr = epoch_log_read(le, (char *)nodes, sizeof(nodes));
> + if (nr < 0)
> + return -1;
> +
> nr /= sizeof(nodes[0]);
>
> for (i = 0; i < nr; i++) {
> @@ -135,16 +142,11 @@ int epoch_log_read_remote(uint32_t epoch
> ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
> close(fd);
>
> - if (ret)
> - continue;
> - if (rsp->result == SD_RES_SUCCESS) {
> - ret = rsp->data_length;
> - goto out;
> - }
> + if (!ret && rsp->result == SD_RES_SUCCESS)
> + return rsp->data_length;
> }
> - ret = 0; /* If no one has targeted epoch file, we can safely return 0 */
> -out:
> - return ret;
> +
> + return -1;
> }
>
> int epoch_log_read_nr(uint32_t epoch, char *buf, int len)
> Index: sheepdog/sheep/recovery.c
> ===================================================================
> --- sheepdog.orig/sheep/recovery.c 2012-06-05 07:32:25.608098136 -0400
> +++ sheepdog/sheep/recovery.c 2012-06-05 07:41:15.808111665 -0400
> @@ -63,7 +63,7 @@ static struct vnode_info *get_vnodes_fro
> if (nr_nodes < 0) {
> nr_nodes = epoch_log_read_remote(epoch, (void *)nodes,
> sizeof(nodes));
> - if (nr_nodes == 0)
> + if (nr_nodes <= 0)
> return NULL;
> nr_nodes /= sizeof(nodes[0]);
> }
We really should return 0 if we don't get any epoch file. Without your
patch, if we lost one of epochs file in all nodes, say bellow case, I
deliberately remove epoch 3, we can till get other informatoin:
--------
tailai.ly at taobao:~/sheepdog$ collie cluster info
Cluster status: running
Cluster created at Tue Jun 5 21:44:21 2012
Epoch Time Version
2012-06-05 21:44:24 4 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005]
1970-01-01 07:00:00 3 []
2012-06-05 21:44:22 2 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003]
2012-06-05 21:44:21 1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]
but with your patch, we can only get:
tailai.ly at taobao:~/sheepdog$ collie cluster info
Cluster status: running
Cluster created at Tue Jun 5 21:43:41 2012
Epoch Time Version
2012-06-05 21:43:44 4 [127.0.0.1:7000, 127.0.0.1:7001,
127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005]
------
Thanks,
Yuan
More information about the sheepdog
mailing list