[Sheepdog] [PATCH 3/3] farm: avoid unnecessary IO operation when recovering

Liu Yuan namei.unix at gmail.com
Mon Mar 12 12:48:47 CET 2012


On 03/12/2012 07:03 PM, Li Wenpeng wrote:

> From: levin li <xingke.lwp at taobao.com>
> 
> When the cluster is recovering, we only need to write the objects
> which no longer belong to the node to the snapshot, instead of writing
> all the objects, by which we decreased the IO operation.
> 
> When we try to read an object, we first read it from the local
> object directory, if not found, then read it from the snapshot.
> 
> Signed-off-by: levin li <xingke.lwp at taobao.com>
> ---
>  sheep/farm/farm.c  |  139 +++++++++++++++++++++++++++++----------------------
>  sheep/farm/farm.h  |    3 +-
>  sheep/farm/snap.c  |    9 +---
>  sheep/farm/trunk.c |   63 +++++++++++++++++++++++
>  sheep/store.c      |   14 +++--
>  5 files changed, 154 insertions(+), 74 deletions(-)
> 
> diff --git a/sheep/farm/farm.c b/sheep/farm/farm.c
> index 6db7dab..e0b22b1 100644
> --- a/sheep/farm/farm.c
> +++ b/sheep/farm/farm.c
> @@ -279,6 +279,40 @@ static int farm_get_objlist(struct siocb *iocb)
>  	return SD_RES_SUCCESS;
>  }
>  
> +
> +static void *read_working_object(uint64_t oid, int length)
> +{
> +	void *buf = NULL;
> +	char path[PATH_MAX];
> +	int fd, ret;
> +
> +	snprintf(path, sizeof(path), "%s%016" PRIx64, obj_path, oid);
> +
> +	fd = open(path, O_RDONLY, def_fmode);
> +	if (fd < 0) {
> +		eprintf("failed to open %s: %m\n", path);
> +		goto out;
> +	}
> +
> +	buf = malloc(length);
> +	if (!buf) {
> +		eprintf("no memory to allocate buffer.\n");
> +		goto out;
> +	}
> +
> +	ret = xread(fd, buf, length);
> +	if (length != ret) {
> +		eprintf("object read error.\n");
> +		free(buf);
> +		buf = NULL;
> +		goto out;
> +	}
> +	close(fd);
> +
> +out:
> +	return buf;
> +}
> +
>  static void *retrieve_object_from_snap(uint64_t oid, int epoch)
>  {
>  	struct sha1_file_hdr hdr;
> @@ -299,11 +333,11 @@ static void *retrieve_object_from_snap(uint64_t oid, int epoch)
>  		struct sha1_file_hdr h;
>  		if (trunk_buf->oid != oid)
>  			continue;
> +
>  		buffer = sha1_file_read(trunk_buf->sha1, &h);
> -		if (!buffer)
> -			goto out;
>  		break;
>  	}
> +
>  out:
>  	dprintf("oid %"PRIx64", epoch %d, %s\n", oid, epoch, buffer ? "succeed" : "fail");
>  	free(trunk_free);
> @@ -312,8 +346,25 @@ out:
>  
>  static int farm_read(uint64_t oid, struct siocb *iocb)
>  {
> +	int i;
> +
>  	if (iocb->epoch < sys->epoch) {
> -		void *buffer = retrieve_object_from_snap(oid, iocb->epoch);
> +		void *buffer;
> +		buffer = read_working_object(oid, iocb->length);
> +		if (!buffer) {
> +			/* Here if read the object from the current epoch failed,


I think 'current epoch' should be 'targeted epoch' here.

> +			 * we need to read from the later epoch, because in some epoch
> +			 * we doesn't write the object to the snapshot, we assume
> +			 * it in the current local object directory, but maybe
> +			 * in the next epoch we removed it from the local directory.
> +			 * in this case, trying to read object from the older epoch
> +			 * will fail. */


'in this case, trying to read object...' would be better rephrased as
'in this case, we should try to retrieve the object upwards, since when
the object is to be removed, it gets written to the snapshot at a later
epoch'.
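
To make the upward retrieval concrete, here is a minimal, self-contained
sketch. It is not the farm code itself: struct snap_entry, snap_lookup() and
find_upwards() are hypothetical names used only for illustration,
snap_lookup() merely stands in for retrieve_object_from_snap(), and each
epoch's snapshot is assumed to hold at most one object.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-epoch snapshot: entry i describes
 * epoch i + 1 and holds at most one object (an assumption made to keep
 * the example small). */
struct snap_entry {
	uint64_t oid;		/* object stored in this snapshot, 0 if none */
	const char *data;	/* its contents, NULL if none */
};

/* Hypothetical lookup standing in for retrieve_object_from_snap(). */
static const char *snap_lookup(const struct snap_entry *snaps, int nr,
			       uint64_t oid, int epoch)
{
	if (epoch < 1 || epoch > nr)
		return NULL;
	return snaps[epoch - 1].oid == oid ? snaps[epoch - 1].data : NULL;
}

/* Search the snapshots from tgt_epoch upwards, stopping before
 * cur_epoch, and report the epoch in which the object was found. */
static const char *find_upwards(const struct snap_entry *snaps, int nr,
				uint64_t oid, int tgt_epoch, int cur_epoch,
				int *found_epoch)
{
	for (int e = tgt_epoch; e < cur_epoch; e++) {
		const char *data = snap_lookup(snaps, nr, oid, e);

		if (data) {
			if (found_epoch)
				*found_epoch = e;
			return data;
		}
	}
	return NULL;
}
```

With an object that was written to the snapshot only at epoch 3, a search
starting at targeted epoch 1 with current epoch 4 finds it at epoch 3, while
the same search bounded by current epoch 3 finds nothing.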

> +			for (i = iocb->epoch; i < sys->epoch; i++) {
> +				buffer = retrieve_object_from_snap(oid, i);
> +				if (buffer)
> +					break;
> +			}
> +		}
>  		if (!buffer)
>  			return SD_RES_NO_OBJ;
>  		memcpy(iocb->buf, buffer, iocb->length);
> @@ -368,31 +419,45 @@ out:
>  static int farm_link(uint64_t oid, struct siocb *iocb, int tgt_epoch)
>  {
>  	int ret = SD_RES_EIO;
> -	void *buf;
> +	void *buf = NULL;
>  	struct siocb io = { 0 };
> +	int i;
>  
>  	dprintf("try link %"PRIx64" from snapshot with epoch %d\n", oid, tgt_epoch);
> -	buf = retrieve_object_from_snap(oid, tgt_epoch);
> +
> +	buf = read_working_object(oid, iocb->length);
> +	if (buf)
> +		goto out;


When farm_link is called, we are already in the case where the object is
*not* in the working directory, so we don't need to call
read_working_object() here.
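
As a rough illustration of that precondition, the sketch below shows a caller
that checks the working directory once and only falls back to linking on a
miss, so the link step itself has no need to repeat the check. Everything here
is hypothetical: in_working_dir(), link_from_snapshot(), recover_object() and
the two counters are made-up names for this example, and the real call path
in sheep may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters so the example can show which path ran. */
static int working_dir_checks;
static int snapshot_links;

/* Hypothetical stand-in: pretend only oid 1 is in the working directory. */
static bool in_working_dir(uint64_t oid)
{
	working_dir_checks++;
	return oid == 1;
}

/* Hypothetical stand-in for the link step: it goes straight to the
 * snapshot and does not look at the working directory again. */
static int link_from_snapshot(uint64_t oid)
{
	(void)oid;
	snapshot_links++;
	return 0;
}

/* The caller checks the working directory once; the link step is only
 * reached on a miss and relies on that. */
static int recover_object(uint64_t oid)
{
	if (in_working_dir(oid))
		return 0;
	return link_from_snapshot(oid);
}
```

Each recovered object costs exactly one working-directory check, whether or
not it ends up being linked from the snapshot.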

The rest looks good to me.

Thanks,
Yuan


