[sheepdog] [PATCH v4] sheepdog: selectable object size support

Liu Yuan namei.unix at gmail.com
Thu Feb 12 08:42:17 CET 2015


On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote:
> At Thu, 12 Feb 2015 15:00:49 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote:
> > > At Tue, 10 Feb 2015 18:35:58 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote:
> > > > > (2015/02/10 17:58), Liu Yuan wrote:
> > > > > >On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote:
> > > > > >>(2015/02/10 12:10), Liu Yuan wrote:
> > > > > >>>On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:
> > > > > >>>>Previously, qemu block driver of sheepdog used hard-coded VDI object size.
> > > > > >>>>This patch enables users to handle "block_size_shift" value for
> > > > > >>>>calculating VDI object size.
> > > > > >>>>
> > > > > >>>>When you start qemu, you don't need to specify additional command option.
> > > > > >>>>
> > > > > >>>>But when you create the VDI which doesn't have default object size
> > > > > >>>>with qemu-img command, you specify block_size_shift option.
> > > > > >>>>
> > > > > >>>>If you want to create a VDI of 8MB(1 << 23) object size,
> > > > > >>>>you need to specify following command option.
> > > > > >>>>
> > > > > >>>>  # qemu-img create -o block_size_shift=23 sheepdog:test1 100M
> > > > > >>>>
> > > > > >>>>In addition, when you don't specify qemu-img command option,
> > > > > >>>>a default value of sheepdog cluster is used for creating VDI.
> > > > > >>>>
> > > > > >>>>  # qemu-img create sheepdog:test2 100M
> > > > > >>>>
> > > > > >>>>Signed-off-by: Teruaki Ishizaki <ishizaki.teruaki at lab.ntt.co.jp>
> > > > > >>>>---
> > > > > >>>>V4:
> > > > > >>>>  - Limit a read/write buffer size for creating a preallocated VDI.
> > > > > >>>>  - Replace a parse function for the block_size_shift option.
> > > > > >>>>  - Fix an error message.
> > > > > >>>>
> > > > > >>>>V3:
> > > > > >>>>  - Delete the needless operation of buffer.
> > > > > >>>>  - Delete the needless operations of request header.
> > > > > >>>>    for SD_OP_GET_CLUSTER_DEFAULT.
> > > > > >>>>  - Fix coding style problems.
> > > > > >>>>
> > > > > >>>>V2:
> > > > > >>>>  - Fix coding style problem (white space).
> > > > > >>>>  - Add members, store_policy and block_size_shift to struct SheepdogVdiReq.
> > > > > >>>>  - Initialize request header to use block_size_shift specified by user.
> > > > > >>>>---
> > > > > >>>>  block/sheepdog.c          |  138 ++++++++++++++++++++++++++++++++++++++-------
> > > > > >>>>  include/block/block_int.h |    1 +
> > > > > >>>>  2 files changed, 119 insertions(+), 20 deletions(-)
> > > > > >>>>
> > > > > >>>>diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > > > >>>>index be3176f..a43b947 100644
> > > > > >>>>--- a/block/sheepdog.c
> > > > > >>>>+++ b/block/sheepdog.c
> > > > > >>>>@@ -37,6 +37,7 @@
> > > > > >>>>  #define SD_OP_READ_VDIS      0x15
> > > > > >>>>  #define SD_OP_FLUSH_VDI      0x16
> > > > > >>>>  #define SD_OP_DEL_VDI        0x17
> > > > > >>>>+#define SD_OP_GET_CLUSTER_DEFAULT   0x18
> > > > > >>>>
> > > > > >>>>  #define SD_FLAG_CMD_WRITE    0x01
> > > > > >>>>  #define SD_FLAG_CMD_COW      0x02
> > > > > >>>>@@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq {
> > > > > >>>>      uint32_t base_vdi_id;
> > > > > >>>>      uint8_t copies;
> > > > > >>>>      uint8_t copy_policy;
> > > > > >>>>-    uint8_t reserved[2];
> > > > > >>>>+    uint8_t store_policy;
> > > > > >>>>+    uint8_t block_size_shift;
> > > > > >>>>      uint32_t snapid;
> > > > > >>>>      uint32_t type;
> > > > > >>>>      uint32_t pad[2];
> > > > > >>>>@@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp {
> > > > > >>>>      uint32_t pad[5];
> > > > > >>>>  } SheepdogVdiRsp;
> > > > > >>>>
> > > > > >>>>+typedef struct SheepdogClusterRsp {
> > > > > >>>>+    uint8_t proto_ver;
> > > > > >>>>+    uint8_t opcode;
> > > > > >>>>+    uint16_t flags;
> > > > > >>>>+    uint32_t epoch;
> > > > > >>>>+    uint32_t id;
> > > > > >>>>+    uint32_t data_length;
> > > > > >>>>+    uint32_t result;
> > > > > >>>>+    uint8_t nr_copies;
> > > > > >>>>+    uint8_t copy_policy;
> > > > > >>>>+    uint8_t block_size_shift;
> > > > > >>>>+    uint8_t __pad1;
> > > > > >>>>+    uint32_t __pad2[6];
> > > > > >>>>+} SheepdogClusterRsp;
> > > > > >>>>+
> > > > > >>>>  typedef struct SheepdogInode {
> > > > > >>>>      char name[SD_MAX_VDI_LEN];
> > > > > >>>>      char tag[SD_MAX_VDI_TAG_LEN];
> > > > > >>>>@@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> > > > > >>>>      hdr.vdi_size = s->inode.vdi_size;
> > > > > >>>>      hdr.copy_policy = s->inode.copy_policy;
> > > > > >>>>      hdr.copies = s->inode.nr_copies;
> > > > > >>>>+    hdr.block_size_shift = s->inode.block_size_shift;
> > > > > >>>>
> > > > > >>>>      ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
> > > > > >>>>
> > > > > >>>>@@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> > > > > >>>>  static int sd_prealloc(const char *filename, Error **errp)
> > > > > >>>>  {
> > > > > >>>>      BlockDriverState *bs = NULL;
> > > > > >>>>+    BDRVSheepdogState *base = NULL;
> > > > > >>>>+    unsigned long buf_size;
> > > > > >>>>      uint32_t idx, max_idx;
> > > > > >>>>+    uint32_t object_size;
> > > > > >>>>      int64_t vdi_size;
> > > > > >>>>-    void *buf = g_malloc0(SD_DATA_OBJ_SIZE);
> > > > > >>>>+    void *buf = NULL;
> > > > > >>>>      int ret;
> > > > > >>>>
> > > > > >>>>      ret = bdrv_open(&bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
> > > > > >>>>@@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp)
> > > > > >>>>          ret = vdi_size;
> > > > > >>>>          goto out;
> > > > > >>>>      }
> > > > > >>>>-    max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE);
> > > > > >>>>+
> > > > > >>>>+    base = bs->opaque;
> > > > > >>>>+    object_size = (UINT32_C(1) << base->inode.block_size_shift);
> > > > > >>>>+    buf_size = MIN(object_size, SD_DATA_OBJ_SIZE);
> > > > > >>>>+    buf = g_malloc0(buf_size);
> > > > > >>>>+
> > > > > >>>>+    max_idx = DIV_ROUND_UP(vdi_size, buf_size);
> > > > > >>>>
> > > > > >>>>      for (idx = 0; idx < max_idx; idx++) {
> > > > > >>>>          /*
> > > > > >>>>           * The created image can be a cloned image, so we need to read
> > > > > >>>>           * a data from the source image.
> > > > > >>>>           */
> > > > > >>>>-        ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> > > > > >>>>+        ret = bdrv_pread(bs, idx * buf_size, buf, buf_size);
> > > > > >>>>          if (ret < 0) {
> > > > > >>>>              goto out;
> > > > > >>>>          }
> > > > > >>>>-        ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> > > > > >>>>+        ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size);
> > > > > >>>>          if (ret < 0) {
> > > > > >>>>              goto out;
> > > > > >>>>          }
> > > > > >>>>@@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt)
> > > > > >>>>      return 0;
> > > > > >>>>  }
> > > > > >>>>
> > > > > >>>>+static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt)
> > > > > >>>>+{
> > > > > >>>>+    struct SheepdogInode *inode = &s->inode;
> > > > > >>>>+    inode->block_size_shift =
> > > > > >>>>+        (uint8_t)qemu_opt_get_number_del(opt, "block_size_shift", 0);
> > > > > >>>>+    if (inode->block_size_shift == 0) {
> > > > > >>>>+        /* block_size_shift is set for cluster default value by sheepdog */
> > > > > >>>>+        return 0;
> > > > > >>>>+    } else if (inode->block_size_shift < 20 || inode->block_size_shift > 31) {
> > > > > >>>>+        return -EINVAL;
> > > > > >>>>+    }
> > > > > >>>>+
> > > > > >>>>+    return 0;
> > > > > >>>>+}
> > > > > >>>>+
> > > > > >>>>  static int sd_create(const char *filename, QemuOpts *opts,
> > > > > >>>>                       Error **errp)
> > > > > >>>>  {
> > > > > >>>>@@ -1679,6 +1721,7 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > > >>>>      BDRVSheepdogState *s;
> > > > > >>>>      char tag[SD_MAX_VDI_TAG_LEN];
> > > > > >>>>      uint32_t snapid;
> > > > > >>>>+    uint64_t max_vdi_size;
> > > > > >>>>      bool prealloc = false;
> > > > > >>>>
> > > > > >>>>      s = g_new0(BDRVSheepdogState, 1);
> > > > > >>>>@@ -1718,9 +1761,10 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > > >>>>          }
> > > > > >>>>      }
> > > > > >>>>
> > > > > >>>>-    if (s->inode.vdi_size > SD_MAX_VDI_SIZE) {
> > > > > >>>>-        error_setg(errp, "too big image size");
> > > > > >>>>-        ret = -EINVAL;
> > > > > >>>>+    ret = parse_block_size_shift(s, opts);
> > > > > >>>>+    if (ret < 0) {
> > > > > >>>>+        error_setg(errp, "Invalid block_size_shift:"
> > > > > >>>>+                         " '%s'. please use 20-31", buf);
> > > > > >>>>          goto out;
> > > > > >>>>      }
> > > > > >>>>
> > > > > >>>>@@ -1757,6 +1801,47 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > > >>>>      }
> > > > > >>>>
> > > > > >>>>      s->aio_context = qemu_get_aio_context();
> > > > > >>>>+
> > > > > >>>>+    /* if block_size_shift is not specified, get cluster default value */
> > > > > >>>>+    if (s->inode.block_size_shift == 0) {
> > > > > >>>
> > > > > >>>Seems that we don't keep backward compatibility for new QEMU and old sheep that
> > > > > >>>doesn't support SD_OP_GET_CLUSTER_DEFAULT.
> > > > > >>Then, I'll add the operation that if block_size_shift from
> > > > > >>SD_OP_GET_CLUSTER_DEFAULT is zero, block_size_shift is set to 22.
> > > > > >>Is it OK?
> > > > > >
> > > > > >If sheep doesn't support SD_OP_GET_CLUSTER_DEFAULT, sheep will return
> > > > > >SD_RES_INVALID_PARMS. So to keep backward compatibility, we shouldn't issue
> > > > > >SD_OP_GET_CLUSTER_DEFAULT to sheep daemon.
> > > > > OK, I think that the way is better.
> > > > > >
> > > > > >My point is that, after user upgrading the sheep and still stick with old QEMU,
> > > > > >(I guess many users will), any operations in the past sholudn't fail right after
> > > > > >upgrading.
> > > > > >
> > > > > >>
> > > > > >>>
> > > > > >>>What will happen if old QEMU with new sheep that support block_size_shift? Most
> > > > > >>>distributions will ship the old stable qemu that wouldn't be aware of
> > > > > >>>block_size_shift.
> > > > > >>If old QEMU with new sheep is used, VDI is created with cluster default
> > > > > >>block_size_shift and accessed as 4MB object_size.
> > > > > >>So I think that for backward compatibility, users must do cluster
> > > > > >>format command with default block_size_shift 22 equal to
> > > > > >>4MB object_size.
> > > > > >
> > > > > >how old QEMU know about block_size_shift? For old QEMU, block_size_shift is
> > > > > >encoded as 0 and then send the create request to sheep. Does sheep can handle
> > > > > >block_size_shift = 0? You know, we can't pass any value to old QEMU for
> > > > > >block_size_shift.
> > > > > Sheep can handle block_size_shift = 0, when users create new VDI.
> > > > > Old QEMU does do_sd_create() without setting hdr.block_size_shift and
> > > > > sends a request SD_OP_NEW_VDI.
> > > > > If block_size_shift is set to zero, new sheep sets cluster default value
> > > > > in cluster_new_vdi() like copies and copy_policy.
> > > > 
> > > > Okay, so new sheep can handle old QEMU request. By the way, how about the
> > > > suggestion in my last email? (x + 1) * 4MB stuff...
> > > 
> > > I think specifying object size in multiple of 4MB is overkill. Because
> > > we can specify both of block_size_shift and a number of objects which
> > > belong to VDI. More precisely,
> > > 2 ^ block_size_shift * #objects = VDI size
> > > we can choose the block_size_shift between 20 and 30, and #objects
> > > from 1 < 2 ^ 20.
> > > # #objects is specified via VDI size implicitly
> > 
> > I can't understand this paragraph. If object size is fixed, we can calculate
> > # of object easily. It has nothing to do with block_size_shift.
> 
> I wanted to say that we can choose both of block_size_shift and # of
> objects from wide range, so granualarity is flexible and steps between
> VDI sizes are reasonably small.
> 
> > 
> > > The granualarity of VDI sizes seems to be really fine (even
> > > block_size_shift == 30, step is 1GB). So supporting block size with
> > > multiple of 4MB will not provide practical benefit.
> > 
> > Prabably yes, but finer granularity doesn't cause trouble and gives more
> > flexibility. We can allow 12MB at user's will, for example. Even it doesn't
> > provide practical benefits, what benefit block_size_shift will provide over
> > (x + 1) * 4MB scheme? My point is, give a wider range won't hurt and will
> > pose less constaint in the future.
> > 
> > The main reason I suggest (x + 1) * 4MB, is that we can easily keep backward
> > compatibility because x = 0 means 4MB, but with this patch proposal, x = 22
> > means 4MB.
> 
> Utilizing block_size_shift of inode doesn't break
> compatibility. Because older versions of sheepdog initialize with
> inode->block_size_shift with 22.

Older version? What about the old sheep that doesn't support block_size_shift?
If I remember well, block_size_shift is a new thing introduced by
fd9e4a28fad7c16b2237f4f48c9d264af192e40e, at Dec 12, 2014, very new. This means
all our stable sheep won't have any idea of what block_size_shift is.

The main trouble is old QEMU, which will always set inode->block_size_shift as 0
and expect the object size is 4MB.

Think about following case:

1 Old qemu and old sheep runs a long time with many vdis that has 4MB object.
2 Users upgrade sheep into new sheep but doesn't upgrade QEMU. This is quit
  normal case because many users use stable QEMU branch.
3 He still use qemu-img to create new vdi and expect the object size is 4MB
  If new sheep doesn't set object size as 4MB, this qemu block driver will
  malfunction.

Thanks
Yuan



More information about the sheepdog mailing list