[sheepdog] [PATCH v4] sheepdog: selectable object size support

Hitoshi Mitake mitake.hitoshi at lab.ntt.co.jp
Thu Feb 12 08:28:01 CET 2015


At Thu, 12 Feb 2015 15:00:49 +0800,
Liu Yuan wrote:
> 
> On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote:
> > At Tue, 10 Feb 2015 18:35:58 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote:
> > > > (2015/02/10 17:58), Liu Yuan wrote:
> > > > >On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote:
> > > > >>(2015/02/10 12:10), Liu Yuan wrote:
> > > > >>>On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:
> > > > >>>>Previously, qemu block driver of sheepdog used hard-coded VDI object size.
> > > > >>>>This patch enables users to handle "block_size_shift" value for
> > > > >>>>calculating VDI object size.
> > > > >>>>
> > > > >>>>When you start qemu, you don't need to specify additional command option.
> > > > >>>>
> > > > >>>>But when you create the VDI which doesn't have default object size
> > > > >>>>with qemu-img command, you specify block_size_shift option.
> > > > >>>>
> > > > >>>>If you want to create a VDI of 8MB(1 << 23) object size,
> > > > >>>>you need to specify following command option.
> > > > >>>>
> > > > >>>>  # qemu-img create -o block_size_shift=23 sheepdog:test1 100M
> > > > >>>>
> > > > >>>>In addition, when you don't specify qemu-img command option,
> > > > >>>>a default value of sheepdog cluster is used for creating VDI.
> > > > >>>>
> > > > >>>>  # qemu-img create sheepdog:test2 100M
> > > > >>>>
> > > > >>>>Signed-off-by: Teruaki Ishizaki <ishizaki.teruaki at lab.ntt.co.jp>
> > > > >>>>---
> > > > >>>>V4:
> > > > >>>>  - Limit a read/write buffer size for creating a preallocated VDI.
> > > > >>>>  - Replace a parse function for the block_size_shift option.
> > > > >>>>  - Fix an error message.
> > > > >>>>
> > > > >>>>V3:
> > > > >>>>  - Delete the needless operation of buffer.
> > > > >>>>  - Delete the needless operations of request header.
> > > > >>>>    for SD_OP_GET_CLUSTER_DEFAULT.
> > > > >>>>  - Fix coding style problems.
> > > > >>>>
> > > > >>>>V2:
> > > > >>>>  - Fix coding style problem (white space).
> > > > >>>>  - Add members, store_policy and block_size_shift to struct SheepdogVdiReq.
> > > > >>>>  - Initialize request header to use block_size_shift specified by user.
> > > > >>>>---
> > > > >>>>  block/sheepdog.c          |  138 ++++++++++++++++++++++++++++++++++++++-------
> > > > >>>>  include/block/block_int.h |    1 +
> > > > >>>>  2 files changed, 119 insertions(+), 20 deletions(-)
> > > > >>>>
> > > > >>>>diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > > >>>>index be3176f..a43b947 100644
> > > > >>>>--- a/block/sheepdog.c
> > > > >>>>+++ b/block/sheepdog.c
> > > > >>>>@@ -37,6 +37,7 @@
> > > > >>>>  #define SD_OP_READ_VDIS      0x15
> > > > >>>>  #define SD_OP_FLUSH_VDI      0x16
> > > > >>>>  #define SD_OP_DEL_VDI        0x17
> > > > >>>>+#define SD_OP_GET_CLUSTER_DEFAULT   0x18
> > > > >>>>
> > > > >>>>  #define SD_FLAG_CMD_WRITE    0x01
> > > > >>>>  #define SD_FLAG_CMD_COW      0x02
> > > > >>>>@@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq {
> > > > >>>>      uint32_t base_vdi_id;
> > > > >>>>      uint8_t copies;
> > > > >>>>      uint8_t copy_policy;
> > > > >>>>-    uint8_t reserved[2];
> > > > >>>>+    uint8_t store_policy;
> > > > >>>>+    uint8_t block_size_shift;
> > > > >>>>      uint32_t snapid;
> > > > >>>>      uint32_t type;
> > > > >>>>      uint32_t pad[2];
> > > > >>>>@@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp {
> > > > >>>>      uint32_t pad[5];
> > > > >>>>  } SheepdogVdiRsp;
> > > > >>>>
> > > > >>>>+typedef struct SheepdogClusterRsp {
> > > > >>>>+    uint8_t proto_ver;
> > > > >>>>+    uint8_t opcode;
> > > > >>>>+    uint16_t flags;
> > > > >>>>+    uint32_t epoch;
> > > > >>>>+    uint32_t id;
> > > > >>>>+    uint32_t data_length;
> > > > >>>>+    uint32_t result;
> > > > >>>>+    uint8_t nr_copies;
> > > > >>>>+    uint8_t copy_policy;
> > > > >>>>+    uint8_t block_size_shift;
> > > > >>>>+    uint8_t __pad1;
> > > > >>>>+    uint32_t __pad2[6];
> > > > >>>>+} SheepdogClusterRsp;
> > > > >>>>+
> > > > >>>>  typedef struct SheepdogInode {
> > > > >>>>      char name[SD_MAX_VDI_LEN];
> > > > >>>>      char tag[SD_MAX_VDI_TAG_LEN];
> > > > >>>>@@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> > > > >>>>      hdr.vdi_size = s->inode.vdi_size;
> > > > >>>>      hdr.copy_policy = s->inode.copy_policy;
> > > > >>>>      hdr.copies = s->inode.nr_copies;
> > > > >>>>+    hdr.block_size_shift = s->inode.block_size_shift;
> > > > >>>>
> > > > >>>>      ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
> > > > >>>>
> > > > >>>>@@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> > > > >>>>  static int sd_prealloc(const char *filename, Error **errp)
> > > > >>>>  {
> > > > >>>>      BlockDriverState *bs = NULL;
> > > > >>>>+    BDRVSheepdogState *base = NULL;
> > > > >>>>+    unsigned long buf_size;
> > > > >>>>      uint32_t idx, max_idx;
> > > > >>>>+    uint32_t object_size;
> > > > >>>>      int64_t vdi_size;
> > > > >>>>-    void *buf = g_malloc0(SD_DATA_OBJ_SIZE);
> > > > >>>>+    void *buf = NULL;
> > > > >>>>      int ret;
> > > > >>>>
> > > > >>>>      ret = bdrv_open(&bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
> > > > >>>>@@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp)
> > > > >>>>          ret = vdi_size;
> > > > >>>>          goto out;
> > > > >>>>      }
> > > > >>>>-    max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE);
> > > > >>>>+
> > > > >>>>+    base = bs->opaque;
> > > > >>>>+    object_size = (UINT32_C(1) << base->inode.block_size_shift);
> > > > >>>>+    buf_size = MIN(object_size, SD_DATA_OBJ_SIZE);
> > > > >>>>+    buf = g_malloc0(buf_size);
> > > > >>>>+
> > > > >>>>+    max_idx = DIV_ROUND_UP(vdi_size, buf_size);
> > > > >>>>
> > > > >>>>      for (idx = 0; idx < max_idx; idx++) {
> > > > >>>>          /*
> > > > >>>>           * The created image can be a cloned image, so we need to read
> > > > >>>>           * a data from the source image.
> > > > >>>>           */
> > > > >>>>-        ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> > > > >>>>+        ret = bdrv_pread(bs, idx * buf_size, buf, buf_size);
> > > > >>>>          if (ret < 0) {
> > > > >>>>              goto out;
> > > > >>>>          }
> > > > >>>>-        ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> > > > >>>>+        ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size);
> > > > >>>>          if (ret < 0) {
> > > > >>>>              goto out;
> > > > >>>>          }
> > > > >>>>@@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt)
> > > > >>>>      return 0;
> > > > >>>>  }
> > > > >>>>
> > > > >>>>+static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt)
> > > > >>>>+{
> > > > >>>>+    struct SheepdogInode *inode = &s->inode;
> > > > >>>>+    inode->block_size_shift =
> > > > >>>>+        (uint8_t)qemu_opt_get_number_del(opt, "block_size_shift", 0);
> > > > >>>>+    if (inode->block_size_shift == 0) {
> > > > >>>>+        /* block_size_shift is set for cluster default value by sheepdog */
> > > > >>>>+        return 0;
> > > > >>>>+    } else if (inode->block_size_shift < 20 || inode->block_size_shift > 31) {
> > > > >>>>+        return -EINVAL;
> > > > >>>>+    }
> > > > >>>>+
> > > > >>>>+    return 0;
> > > > >>>>+}
> > > > >>>>+
> > > > >>>>  static int sd_create(const char *filename, QemuOpts *opts,
> > > > >>>>                       Error **errp)
> > > > >>>>  {
> > > > >>>>@@ -1679,6 +1721,7 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > >>>>      BDRVSheepdogState *s;
> > > > >>>>      char tag[SD_MAX_VDI_TAG_LEN];
> > > > >>>>      uint32_t snapid;
> > > > >>>>+    uint64_t max_vdi_size;
> > > > >>>>      bool prealloc = false;
> > > > >>>>
> > > > >>>>      s = g_new0(BDRVSheepdogState, 1);
> > > > >>>>@@ -1718,9 +1761,10 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > >>>>          }
> > > > >>>>      }
> > > > >>>>
> > > > >>>>-    if (s->inode.vdi_size > SD_MAX_VDI_SIZE) {
> > > > >>>>-        error_setg(errp, "too big image size");
> > > > >>>>-        ret = -EINVAL;
> > > > >>>>+    ret = parse_block_size_shift(s, opts);
> > > > >>>>+    if (ret < 0) {
> > > > >>>>+        error_setg(errp, "Invalid block_size_shift:"
> > > > >>>>+                         " '%s'. please use 20-31", buf);
> > > > >>>>          goto out;
> > > > >>>>      }
> > > > >>>>
> > > > >>>>@@ -1757,6 +1801,47 @@ static int sd_create(const char *filename, QemuOpts *opts,
> > > > >>>>      }
> > > > >>>>
> > > > >>>>      s->aio_context = qemu_get_aio_context();
> > > > >>>>+
> > > > >>>>+    /* if block_size_shift is not specified, get cluster default value */
> > > > >>>>+    if (s->inode.block_size_shift == 0) {
> > > > >>>
> > > > >>>Seems that we don't keep backward compatibility for new QEMU and old sheep that
> > > > >>>doesn't support SD_OP_GET_CLUSTER_DEFAULT.
> > > > >>Then, I'll add the operation that if block_size_shift from
> > > > >>SD_OP_GET_CLUSTER_DEFAULT is zero, block_size_shift is set to 22.
> > > > >>Is it OK?
> > > > >
> > > > >If sheep doesn't support SD_OP_GET_CLUSTER_DEFAULT, sheep will return
> > > > >SD_RES_INVALID_PARMS. So to keep backward compatibility, we shouldn't issue
> > > > >SD_OP_GET_CLUSTER_DEFAULT to sheep daemon.
> > > > OK, I think that the way is better.
> > > > >
> > > > >My point is that, after user upgrading the sheep and still stick with old QEMU,
> > > > >(I guess many users will), any operations in the past sholudn't fail right after
> > > > >upgrading.
> > > > >
> > > > >>
> > > > >>>
> > > > >>>What will happen if old QEMU with new sheep that support block_size_shift? Most
> > > > >>>distributions will ship the old stable qemu that wouldn't be aware of
> > > > >>>block_size_shift.
> > > > >>If old QEMU with new sheep is used, VDI is created with cluster default
> > > > >>block_size_shift and accessed as 4MB object_size.
> > > > >>So I think that for backward compatibility, users must do cluster
> > > > >>format command with default block_size_shift 22 equal to
> > > > >>4MB object_size.
> > > > >
> > > > >how old QEMU know about block_size_shift? For old QEMU, block_size_shift is
> > > > >encoded as 0 and then send the create request to sheep. Does sheep can handle
> > > > >block_size_shift = 0? You know, we can't pass any value to old QEMU for
> > > > >block_size_shift.
> > > > Sheep can handle block_size_shift = 0, when users create new VDI.
> > > > Old QEMU does do_sd_create() without setting hdr.block_size_shift and
> > > > sends a request SD_OP_NEW_VDI.
> > > > If block_size_shift is set to zero, new sheep sets cluster default value
> > > > in cluster_new_vdi() like copies and copy_policy.
> > > 
> > > Okay, so new sheep can handle old QEMU request. By the way, how about the
> > > suggestion in my last email? (x + 1) * 4MB stuff...
> > 
> > I think specifying object size in multiple of 4MB is overkill. Because
> > we can specify both of block_size_shift and a number of objects which
> > belong to VDI. More precisely,
> > 2 ^ block_size_shift * #objects = VDI size
> > we can choose the block_size_shift between 20 and 30, and #objects
> > from 1 < 2 ^ 20.
> > # #objects is specified via VDI size implicitly
> 
> I can't understand this paragraph. If object size is fixed, we can calculate
> # of object easily. It has nothing to do with block_size_shift.

I wanted to say that we can choose both of block_size_shift and # of
objects from wide range, so granualarity is flexible and steps between
VDI sizes are reasonably small.

> 
> > The granualarity of VDI sizes seems to be really fine (even
> > block_size_shift == 30, step is 1GB). So supporting block size with
> > multiple of 4MB will not provide practical benefit.
> 
> Prabably yes, but finer granularity doesn't cause trouble and gives more
> flexibility. We can allow 12MB at user's will, for example. Even it doesn't
> provide practical benefits, what benefit block_size_shift will provide over
> (x + 1) * 4MB scheme? My point is, give a wider range won't hurt and will
> pose less constaint in the future.
> 
> The main reason I suggest (x + 1) * 4MB, is that we can easily keep backward
> compatibility because x = 0 means 4MB, but with this patch proposal, x = 22
> means 4MB.

Utilizing block_size_shift of inode doesn't break
compatibility. Because older versions of sheepdog initialize with
inode->block_size_shift with 22.

Thanks,
Hitoshi

> 
> > Expression of the block size is another topic. I don't have strong
> > opinion about this.
> 
> I'm fine with whatever way we choose to represent object size internally,
> instead I'm more concerned about the user option form to specify object size.
> Even if we choose block_size_shift, we should make the option user friendly as
> much as possible. Expose block_size_shift to users is not a good idea.
> 
> Thanks
> Yuan
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> https://lists.wpkg.org/mailman/listinfo/sheepdog



More information about the sheepdog mailing list