On 06/05/2012 07:53 PM, Christoph Hellwig wrote: > Make sure we exit if any operations fails, operations that can fail are: > > - get_latest_epoch - because local_stat_cluster is a a force operation > we might not have any epoch log yet. In that case there is no point > to continue. > - epoch_log_read - can fail in any number of ways, although once we > have a valid epoch it generally shouldn't. > - if we didn't manage to read the epoch log from any node we should > return -1 as there is no valid data returned. If we have a valid > node list of size 0 one of the nodes will return that to us. > > Also adjust get_vnodes_from_epoch to deal with the negative return value, > local_stat_cluster currently doesn't have any error handling and will > be fixed in the next patch. > > Signed-off-by: Christoph Hellwig <hch at lst.de> > > Index: sheepdog/sheep/store.c > =================================================================== > --- sheepdog.orig/sheep/store.c 2012-06-05 07:40:55.336111143 -0400 > +++ sheepdog/sheep/store.c 2012-06-05 07:41:00.272111270 -0400 > @@ -102,10 +102,17 @@ void do_io_request(struct work *work) > int epoch_log_read_remote(uint32_t epoch, char *buf, int len) > { > int i, ret; > - unsigned int nr, le = get_latest_epoch(); > + unsigned int nr, le; > struct sd_node nodes[SD_MAX_NODES]; > > + le = get_latest_epoch(); > + if (!le) > + return 0; > + > nr = epoch_log_read(le, (char *)nodes, sizeof(nodes)); > + if (nr < 0) > + return -1; > + > nr /= sizeof(nodes[0]); > > for (i = 0; i < nr; i++) { > @@ -135,16 +142,11 @@ int epoch_log_read_remote(uint32_t epoch > ret = exec_req(fd, &hdr, buf, &wlen, &rlen); > close(fd); > > - if (ret) > - continue; > - if (rsp->result == SD_RES_SUCCESS) { > - ret = rsp->data_length; > - goto out; > - } > + if (!ret && rsp->result == SD_RES_SUCCESS) > + return rsp->data_length; > } > - ret = 0; /* If no one has targeted epoch file, we can safely return 0 */ > -out: > - return ret; > + > + return -1; > } > > int epoch_log_read_nr(uint32_t epoch, char *buf, int len) > Index: sheepdog/sheep/recovery.c > =================================================================== > --- sheepdog.orig/sheep/recovery.c 2012-06-05 07:32:25.608098136 -0400 > +++ sheepdog/sheep/recovery.c 2012-06-05 07:41:15.808111665 -0400 > @@ -63,7 +63,7 @@ static struct vnode_info *get_vnodes_fro > if (nr_nodes < 0) { > nr_nodes = epoch_log_read_remote(epoch, (void *)nodes, > sizeof(nodes)); > - if (nr_nodes == 0) > + if (nr_nodes <= 0) > return NULL; > nr_nodes /= sizeof(nodes[0]); > } We really should return 0 if we don't get any epoch file. Without your patch, if we lost one of epochs file in all nodes, say bellow case, I deliberately remove epoch 3, we can till get other informatoin: -------- tailai.ly at taobao:~/sheepdog$ collie cluster info Cluster status: running Cluster created at Tue Jun 5 21:44:21 2012 Epoch Time Version 2012-06-05 21:44:24 4 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005] 1970-01-01 07:00:00 3 [] 2012-06-05 21:44:22 2 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002, 127.0.0.1:7003] 2012-06-05 21:44:21 1 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002] but with your patch, we can only get: tailai.ly at taobao:~/sheepdog$ collie cluster info Cluster status: running Cluster created at Tue Jun 5 21:43:41 2012 Epoch Time Version 2012-06-05 21:43:44 4 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002, 127.0.0.1:7003, 127.0.0.1:7004, 127.0.0.1:7005] ------ Thanks, Yuan |