[sheepdog] [PATCH v0, RFC] sheep: writeback cache semantics in backend store
Liu Yuan
namei.unix at gmail.com
Fri Aug 17 05:01:41 CEST 2012
On 08/16/2012 07:45 PM, Hitoshi Mitake wrote:
> From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
>
> Hi sheepdog list, nice to meet you.
>
> This patch implements writeback cache semantics in the backend
> store of sheep. The current backend store, farm, calls open() with
> O_DSYNC, so every object write causes a slow disk access. This
> overhead is unnecessary, because the current qemu block driver
> already invokes SD_OP_FLUSH_VDI explicitly for its object cache.
> Flushing the disk cache when SD_OP_FLUSH_VDI arrives, instead of on
> every object write, is enough for the current sheep.
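>
> In code terms the change is tiny; roughly like this (the base flags
> here are an assumption for illustration, only the O_DSYNC handling
> is what this patch actually does in farm_init()):
>
>     /* writethrough (current farm behaviour): O_DSYNC forces every
>      * object write down to the disk synchronously */
>     int def_open_flags = O_RDWR | O_CREAT | O_DSYNC;
>
>     /* writeback (this patch): drop O_DSYNC and flush only when
>      * SD_OP_FLUSH_VDI tells us to */
>     if (sys->store_writeback)
>             def_open_flags &= ~O_DSYNC;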
>
> To improve performance by reducing needless disk access, this patch
> adds a new inter-sheep operation, SD_OP_FLUSH_PEER. It is used in a
> flow like this (see the sketch below):
> qemu sends SD_OP_FLUSH_VDI -> the gateway sheep sends
> SD_OP_FLUSH_PEER -> the other sheep
> The sheep that receive SD_OP_FLUSH_PEER then flush their disk
> caches with the syncfs() system call.
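>
> As a sketch (the handler names match the patch below; the bodies
> are simplified):
>
>     /* gateway side: SD_OP_FLUSH_VDI arrives from qemu */
>     static int local_flush_vdi(struct request *req)
>     {
>             if (sys->store_writeback)
>                     /* fan out as SD_OP_FLUSH_PEER to the peers */
>                     gateway_forward_request(req);
>             ...
>     }
>
>     /* peer side: SD_OP_FLUSH_PEER handler; one syncfs() call on an
>      * object fd flushes the whole filesystem backing the object
>      * store, so all cached object writes reach the disk */
>     int peer_flush_dcache(struct request *req);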
>
>
> Below is the evaluation result with dbench:
>
> Before applying this patch, without -s (O_SYNC) option:
> Throughput 13.9269 MB/sec  1 clients  1 procs  max_latency=818.428 ms
> Before applying this patch, with -s option:
> Throughput 2.76792 MB/sec (sync open)  1 clients  1 procs  max_latency=291.670 ms
>
> After applying this patch, without -s option:
> Throughput 29.7306 MB/sec  1 clients  1 procs  max_latency=1344.463 ms
> After applying this patch, with -s option:
> Throughput 4.29357 MB/sec (sync open)  1 clients  1 procs  max_latency=450.045 ms
>
>
> This patch adds a new command line option, -W, to sheep. With -W,
> sheep uses writeback cache semantics in the backend store. I added
> this option mainly for easy testing and evaluation. If writeback
> semantics turn out to be a better default than the previous
> writethrough semantics, I'll delete the option again.
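>
> For example (assuming the usual way of starting sheep with a store
> directory):
>
>     $ sheep -W /var/lib/sheepdog   # writeback backend store
>     $ sheep /var/lib/sheepdog      # writethrough, as before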
>
> This patch may contain lots of bad code because I'm new to sheepdog.
> I'd like to hear your comments.
>
> Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> Signed-off-by: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
>
> ---
> include/internal_proto.h | 1 +
> sheep/farm/farm.c | 3 +++
> sheep/gateway.c | 2 +-
> sheep/ops.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> sheep/sheep.c | 7 ++++++-
> sheep/sheep_priv.h | 5 +++++
> 6 files changed, 60 insertions(+), 2 deletions(-)
>
> diff --git a/include/internal_proto.h b/include/internal_proto.h
> index 83d98f1..f34ef92 100644
> --- a/include/internal_proto.h
> +++ b/include/internal_proto.h
> @@ -63,6 +63,7 @@
> #define SD_OP_ENABLE_RECOVER 0xA8
> #define SD_OP_DISABLE_RECOVER 0xA9
> #define SD_OP_INFO_RECOVER 0xAA
> +#define SD_OP_FLUSH_PEER 0xAB
>
> /* internal flags for hdr.flags, must be above 0x80 */
> #define SD_FLAG_CMD_RECOVERY 0x0080
> diff --git a/sheep/farm/farm.c b/sheep/farm/farm.c
> index 7eeae9a..991c009 100644
> --- a/sheep/farm/farm.c
> +++ b/sheep/farm/farm.c
> @@ -362,6 +362,9 @@ static int farm_init(char *p)
> iocb.epoch = sys->epoch ? sys->epoch - 1 : 0;
> farm_cleanup_sys_obj(&iocb);
>
> + if (sys->store_writeback)
> + def_open_flags &= ~O_DSYNC;
> +
> return SD_RES_SUCCESS;
> err:
> return SD_RES_EIO;
> diff --git a/sheep/gateway.c b/sheep/gateway.c
> index bdbd08c..79fdd07 100644
> --- a/sheep/gateway.c
> +++ b/sheep/gateway.c
> @@ -225,7 +225,7 @@ static inline void gateway_init_fwd_hdr(struct sd_req *fwd, struct sd_req *hdr)
> fwd->proto_ver = SD_SHEEP_PROTO_VER;
> }
>
> -static int gateway_forward_request(struct request *req)
> +int gateway_forward_request(struct request *req)
> {
> int i, err_ret = SD_RES_SUCCESS, ret, local = -1;
> unsigned wlen;
> diff --git a/sheep/ops.c b/sheep/ops.c
> index 8ca8748..0ec4b63 100644
> --- a/sheep/ops.c
> +++ b/sheep/ops.c
> @@ -22,6 +22,11 @@
> #include <sys/stat.h>
> #include <pthread.h>
>
> +#include <asm/unistd.h> /* for __NR_syncfs */
> +#ifndef __NR_syncfs
> +#define __NR_syncfs 306
> +#endif
> +
> #include "sheep_priv.h"
> #include "strbuf.h"
> #include "trace/trace.h"
> @@ -584,6 +589,9 @@ static int local_get_snap_file(struct request *req)
>
> static int local_flush_vdi(struct request *req)
> {
> + if (sys->store_writeback)
> + gateway_forward_request(req);
> +
> if (!sys->enable_write_cache)
> return SD_RES_SUCCESS;
> return object_cache_flush_vdi(req);
> @@ -837,6 +845,35 @@ out:
> return ret;
> }
>
> +static int syncfs(int fd)
> +{
> + return syscall(__NR_syncfs, fd);
> +}
> +
syncfs() only appeared in Linux 2.6.39, which is rather new; I think
most systems won't support it.
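
A runtime fallback might be safer than relying on the __NR_syncfs
define alone; a rough, untested sketch (needs <unistd.h> and
<errno.h>):

	static int syncfs(int fd)
	{
		int ret = syscall(__NR_syncfs, fd);

		if (ret < 0 && errno == ENOSYS) {
			/* kernel older than 2.6.39: fall back to
			 * sync(), which flushes every mounted
			 * filesystem; coarser, but correct */
			sync();
			ret = 0;
		}
		return ret;
	}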
> +int peer_flush_dcache(struct request *req)
> +{
> + int fd;
> + struct sd_req *hdr = &req->rq;
> + uint64_t oid = hdr->obj.oid;
> + char path[PATH_MAX];
> +
> + sprintf(path, "%s%016"PRIx64, obj_path, oid);
> + fd = open(path, O_RDONLY);
> + if (fd < 0) {
> + eprintf("error at open() %s, %s\n", path, strerror(errno));
> + return SD_RES_NO_OBJ;
> + }
> +
> + if (syncfs(fd)) {
> + eprintf("error at syncfs(), %s\n", strerror(errno));
> + return SD_RES_EIO;
> + }
> +
> + close(fd);
> +
> + return SD_RES_SUCCESS;
> +}
> +
With the current implementation, you only flush the inode object, not
all the data objects belonging to the targeted VDI. I think this is the
hardest part to implement. Probably you need to call
exec_local_request() to take advantage of the retry mechanism.
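
Roughly something like this (illustrative only; how you issue the
per-object flush, and whether you reuse SD_OP_FLUSH_PEER for it, is up
to you):

	/* walk the inode and flush every allocated data object, so the
	 * request reaches the nodes that actually hold the data */
	for (i = 0; i < MAX_DATA_OBJS; i++) {
		uint32_t vid = inode->data_vdi_id[i];

		if (!vid)
			continue;	/* not allocated */

		oid = vid_to_data_oid(vid, i);
		/* issue the flush for oid through the local request
		 * path so a failed request is retried */
		...
	}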
Thanks,
Yuan