[sheepdog] [PATCH v4] sheepdog: selectable object size support
Liu Yuan
namei.unix at gmail.com
Thu Feb 12 09:11:10 CET 2015
On Thu, Feb 12, 2015 at 05:01:05PM +0900, Teruaki Ishizaki wrote:
> (2015/02/12 16:42), Liu Yuan wrote:
> >On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote:
> >>At Thu, 12 Feb 2015 15:00:49 +0800,
> >>Liu Yuan wrote:
> >>>
> >>>On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote:
> >>>>At Tue, 10 Feb 2015 18:35:58 +0800,
> >>>>Liu Yuan wrote:
> >>>>>
> >>>>>On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote:
> >>>>>>(2015/02/10 17:58), Liu Yuan wrote:
> >>>>>>>On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote:
> >>>>>>>>(2015/02/10 12:10), Liu Yuan wrote:
> >>>>>>>>>On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:
> >>>>>>>>>>Previously, qemu block driver of sheepdog used hard-coded VDI object size.
> >>>>>>>>>>This patch enables users to handle "block_size_shift" value for
> >>>>>>>>>>calculating VDI object size.
> >>>>>>>>>>
> >>>>>>>>>>When you start qemu, you don't need to specify additional command option.
> >>>>>>>>>>
> >>>>>>>>>>But when you create the VDI which doesn't have default object size
> >>>>>>>>>>with qemu-img command, you specify block_size_shift option.
> >>>>>>>>>>
> >>>>>>>>>>If you want to create a VDI of 8MB(1 << 23) object size,
> >>>>>>>>>>you need to specify following command option.
> >>>>>>>>>>
> >>>>>>>>>> # qemu-img create -o block_size_shift=23 sheepdog:test1 100M
> >>>>>>>>>>
> >>>>>>>>>>In addition, when you don't specify qemu-img command option,
> >>>>>>>>>>a default value of sheepdog cluster is used for creating VDI.
> >>>>>>>>>>
> >>>>>>>>>> # qemu-img create sheepdog:test2 100M
> >>>>>>>>>>
> >>>>>>>>>>Signed-off-by: Teruaki Ishizaki <ishizaki.teruaki at lab.ntt.co.jp>
> >>>>>>>>>>---
> >>>>>>>>>>V4:
> >>>>>>>>>> - Limit a read/write buffer size for creating a preallocated VDI.
> >>>>>>>>>> - Replace a parse function for the block_size_shift option.
> >>>>>>>>>> - Fix an error message.
> >>>>>>>>>>
> >>>>>>>>>>V3:
> >>>>>>>>>> - Delete the needless operation of buffer.
> >>>>>>>>>> - Delete the needless operations of request header.
> >>>>>>>>>> for SD_OP_GET_CLUSTER_DEFAULT.
> >>>>>>>>>> - Fix coding style problems.
> >>>>>>>>>>
> >>>>>>>>>>V2:
> >>>>>>>>>> - Fix coding style problem (white space).
> >>>>>>>>>> - Add members, store_policy and block_size_shift to struct SheepdogVdiReq.
> >>>>>>>>>> - Initialize request header to use block_size_shift specified by user.
> >>>>>>>>>>---
> >>>>>>>>>> block/sheepdog.c | 138 ++++++++++++++++++++++++++++++++++++++-------
> >>>>>>>>>> include/block/block_int.h | 1 +
> >>>>>>>>>> 2 files changed, 119 insertions(+), 20 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>>diff --git a/block/sheepdog.c b/block/sheepdog.c
> >>>>>>>>>>index be3176f..a43b947 100644
> >>>>>>>>>>--- a/block/sheepdog.c
> >>>>>>>>>>+++ b/block/sheepdog.c
> >>>>>>>>>>@@ -37,6 +37,7 @@
> >>>>>>>>>> #define SD_OP_READ_VDIS 0x15
> >>>>>>>>>> #define SD_OP_FLUSH_VDI 0x16
> >>>>>>>>>> #define SD_OP_DEL_VDI 0x17
> >>>>>>>>>>+#define SD_OP_GET_CLUSTER_DEFAULT 0x18
> >>>>>>>>>>
> >>>>>>>>>> #define SD_FLAG_CMD_WRITE 0x01
> >>>>>>>>>> #define SD_FLAG_CMD_COW 0x02
> >>>>>>>>>>@@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq {
> >>>>>>>>>> uint32_t base_vdi_id;
> >>>>>>>>>> uint8_t copies;
> >>>>>>>>>> uint8_t copy_policy;
> >>>>>>>>>>- uint8_t reserved[2];
> >>>>>>>>>>+ uint8_t store_policy;
> >>>>>>>>>>+ uint8_t block_size_shift;
> >>>>>>>>>> uint32_t snapid;
> >>>>>>>>>> uint32_t type;
> >>>>>>>>>> uint32_t pad[2];
> >>>>>>>>>>@@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp {
> >>>>>>>>>> uint32_t pad[5];
> >>>>>>>>>> } SheepdogVdiRsp;
> >>>>>>>>>>
> >>>>>>>>>>+typedef struct SheepdogClusterRsp {
> >>>>>>>>>>+ uint8_t proto_ver;
> >>>>>>>>>>+ uint8_t opcode;
> >>>>>>>>>>+ uint16_t flags;
> >>>>>>>>>>+ uint32_t epoch;
> >>>>>>>>>>+ uint32_t id;
> >>>>>>>>>>+ uint32_t data_length;
> >>>>>>>>>>+ uint32_t result;
> >>>>>>>>>>+ uint8_t nr_copies;
> >>>>>>>>>>+ uint8_t copy_policy;
> >>>>>>>>>>+ uint8_t block_size_shift;
> >>>>>>>>>>+ uint8_t __pad1;
> >>>>>>>>>>+ uint32_t __pad2[6];
> >>>>>>>>>>+} SheepdogClusterRsp;
> >>>>>>>>>>+
> >>>>>>>>>> typedef struct SheepdogInode {
> >>>>>>>>>> char name[SD_MAX_VDI_LEN];
> >>>>>>>>>> char tag[SD_MAX_VDI_TAG_LEN];
> >>>>>>>>>>@@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> >>>>>>>>>> hdr.vdi_size = s->inode.vdi_size;
> >>>>>>>>>> hdr.copy_policy = s->inode.copy_policy;
> >>>>>>>>>> hdr.copies = s->inode.nr_copies;
> >>>>>>>>>>+ hdr.block_size_shift = s->inode.block_size_shift;
> >>>>>>>>>>
> >>>>>>>>>> ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
> >>>>>>>>>>
> >>>>>>>>>>@@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot,
> >>>>>>>>>> static int sd_prealloc(const char *filename, Error **errp)
> >>>>>>>>>> {
> >>>>>>>>>> BlockDriverState *bs = NULL;
> >>>>>>>>>>+ BDRVSheepdogState *base = NULL;
> >>>>>>>>>>+ unsigned long buf_size;
> >>>>>>>>>> uint32_t idx, max_idx;
> >>>>>>>>>>+ uint32_t object_size;
> >>>>>>>>>> int64_t vdi_size;
> >>>>>>>>>>- void *buf = g_malloc0(SD_DATA_OBJ_SIZE);
> >>>>>>>>>>+ void *buf = NULL;
> >>>>>>>>>> int ret;
> >>>>>>>>>>
> >>>>>>>>>> ret = bdrv_open(&bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
> >>>>>>>>>>@@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp)
> >>>>>>>>>> ret = vdi_size;
> >>>>>>>>>> goto out;
> >>>>>>>>>> }
> >>>>>>>>>>- max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE);
> >>>>>>>>>>+
> >>>>>>>>>>+ base = bs->opaque;
> >>>>>>>>>>+ object_size = (UINT32_C(1) << base->inode.block_size_shift);
> >>>>>>>>>>+ buf_size = MIN(object_size, SD_DATA_OBJ_SIZE);
> >>>>>>>>>>+ buf = g_malloc0(buf_size);
> >>>>>>>>>>+
> >>>>>>>>>>+ max_idx = DIV_ROUND_UP(vdi_size, buf_size);
> >>>>>>>>>>
> >>>>>>>>>> for (idx = 0; idx < max_idx; idx++) {
> >>>>>>>>>> /*
> >>>>>>>>>> * The created image can be a cloned image, so we need to read
> >>>>>>>>>> * a data from the source image.
> >>>>>>>>>> */
> >>>>>>>>>>- ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> >>>>>>>>>>+ ret = bdrv_pread(bs, idx * buf_size, buf, buf_size);
> >>>>>>>>>> if (ret < 0) {
> >>>>>>>>>> goto out;
> >>>>>>>>>> }
> >>>>>>>>>>- ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE);
> >>>>>>>>>>+ ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size);
> >>>>>>>>>> if (ret < 0) {
> >>>>>>>>>> goto out;
> >>>>>>>>>> }
> >>>>>>>>>>@@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt)
> >>>>>>>>>> return 0;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>+static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt)
> >>>>>>>>>>+{
> >>>>>>>>>>+ struct SheepdogInode *inode = &s->inode;
> >>>>>>>>>>+ inode->block_size_shift =
> >>>>>>>>>>+ (uint8_t)qemu_opt_get_number_del(opt, "block_size_shift", 0);
> >>>>>>>>>>+ if (inode->block_size_shift == 0) {
> >>>>>>>>>>+ /* block_size_shift is set for cluster default value by sheepdog */
> >>>>>>>>>>+ return 0;
> >>>>>>>>>>+ } else if (inode->block_size_shift < 20 || inode->block_size_shift > 31) {
> >>>>>>>>>>+ return -EINVAL;
> >>>>>>>>>>+ }
> >>>>>>>>>>+
> >>>>>>>>>>+ return 0;
> >>>>>>>>>>+}
> >>>>>>>>>>+
> >>>>>>>>>> static int sd_create(const char *filename, QemuOpts *opts,
> >>>>>>>>>> Error **errp)
> >>>>>>>>>> {
> >>>>>>>>>>@@ -1679,6 +1721,7 @@ static int sd_create(const char *filename, QemuOpts *opts,
> >>>>>>>>>> BDRVSheepdogState *s;
> >>>>>>>>>> char tag[SD_MAX_VDI_TAG_LEN];
> >>>>>>>>>> uint32_t snapid;
> >>>>>>>>>>+ uint64_t max_vdi_size;
> >>>>>>>>>> bool prealloc = false;
> >>>>>>>>>>
> >>>>>>>>>> s = g_new0(BDRVSheepdogState, 1);
> >>>>>>>>>>@@ -1718,9 +1761,10 @@ static int sd_create(const char *filename, QemuOpts *opts,
> >>>>>>>>>> }
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>- if (s->inode.vdi_size > SD_MAX_VDI_SIZE) {
> >>>>>>>>>>- error_setg(errp, "too big image size");
> >>>>>>>>>>- ret = -EINVAL;
> >>>>>>>>>>+ ret = parse_block_size_shift(s, opts);
> >>>>>>>>>>+ if (ret < 0) {
> >>>>>>>>>>+ error_setg(errp, "Invalid block_size_shift:"
> >>>>>>>>>>+ " '%s'. please use 20-31", buf);
> >>>>>>>>>> goto out;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>@@ -1757,6 +1801,47 @@ static int sd_create(const char *filename, QemuOpts *opts,
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> s->aio_context = qemu_get_aio_context();
> >>>>>>>>>>+
> >>>>>>>>>>+ /* if block_size_shift is not specified, get cluster default value */
> >>>>>>>>>>+ if (s->inode.block_size_shift == 0) {
> >>>>>>>>>
> >>>>>>>>>Seems that we don't keep backward compatibility for new QEMU and old sheep that
> >>>>>>>>>doesn't support SD_OP_GET_CLUSTER_DEFAULT.
> >>>>>>>>Then, I'll add the operation that if block_size_shift from
> >>>>>>>>SD_OP_GET_CLUSTER_DEFAULT is zero, block_size_shift is set to 22.
> >>>>>>>>Is it OK?
> >>>>>>>
> >>>>>>>If sheep doesn't support SD_OP_GET_CLUSTER_DEFAULT, sheep will return
> >>>>>>>SD_RES_INVALID_PARMS. So to keep backward compatibility, we shouldn't issue
> >>>>>>>SD_OP_GET_CLUSTER_DEFAULT to sheep daemon.
> >>>>>>OK, I think that the way is better.
> >>>>>>>
> >>>>>>>My point is that, after user upgrading the sheep and still stick with old QEMU,
> >>>>>>>(I guess many users will), any operations in the past sholudn't fail right after
> >>>>>>>upgrading.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>What will happen if old QEMU with new sheep that support block_size_shift? Most
> >>>>>>>>>distributions will ship the old stable qemu that wouldn't be aware of
> >>>>>>>>>block_size_shift.
> >>>>>>>>If old QEMU with new sheep is used, VDI is created with cluster default
> >>>>>>>>block_size_shift and accessed as 4MB object_size.
> >>>>>>>>So I think that for backward compatibility, users must do cluster
> >>>>>>>>format command with default block_size_shift 22 equal to
> >>>>>>>>4MB object_size.
> >>>>>>>
> >>>>>>>how old QEMU know about block_size_shift? For old QEMU, block_size_shift is
> >>>>>>>encoded as 0 and then send the create request to sheep. Does sheep can handle
> >>>>>>>block_size_shift = 0? You know, we can't pass any value to old QEMU for
> >>>>>>>block_size_shift.
> >>>>>>Sheep can handle block_size_shift = 0, when users create new VDI.
> >>>>>>Old QEMU does do_sd_create() without setting hdr.block_size_shift and
> >>>>>>sends a request SD_OP_NEW_VDI.
> >>>>>>If block_size_shift is set to zero, new sheep sets cluster default value
> >>>>>>in cluster_new_vdi() like copies and copy_policy.
> >>>>>
> >>>>>Okay, so new sheep can handle old QEMU request. By the way, how about the
> >>>>>suggestion in my last email? (x + 1) * 4MB stuff...
> >>>>
> >>>>I think specifying object size in multiple of 4MB is overkill. Because
> >>>>we can specify both of block_size_shift and a number of objects which
> >>>>belong to VDI. More precisely,
> >>>>2 ^ block_size_shift * #objects = VDI size
> >>>>we can choose the block_size_shift between 20 and 30, and #objects
> >>>>from 1 < 2 ^ 20.
> >>>># #objects is specified via VDI size implicitly
> >>>
> >>>I can't understand this paragraph. If object size is fixed, we can calculate
> >>># of object easily. It has nothing to do with block_size_shift.
> >>
> >>I wanted to say that we can choose both of block_size_shift and # of
> >>objects from wide range, so granualarity is flexible and steps between
> >>VDI sizes are reasonably small.
> >>
> >>>
> >>>>The granualarity of VDI sizes seems to be really fine (even
> >>>>block_size_shift == 30, step is 1GB). So supporting block size with
> >>>>multiple of 4MB will not provide practical benefit.
> >>>
> >>>Prabably yes, but finer granularity doesn't cause trouble and gives more
> >>>flexibility. We can allow 12MB at user's will, for example. Even it doesn't
> >>>provide practical benefits, what benefit block_size_shift will provide over
> >>>(x + 1) * 4MB scheme? My point is, give a wider range won't hurt and will
> >>>pose less constaint in the future.
> >>>
> >>>The main reason I suggest (x + 1) * 4MB, is that we can easily keep backward
> >>>compatibility because x = 0 means 4MB, but with this patch proposal, x = 22
> >>>means 4MB.
> >>
> >>Utilizing block_size_shift of inode doesn't break
> >>compatibility. Because older versions of sheepdog initialize with
> >>inode->block_size_shift with 22.
> >
> >Older version? What about the old sheep that doesn't support block_size_shift?
> >If I remember well, block_size_shift is a new thing introduced by
> >fd9e4a28fad7c16b2237f4f48c9d264af192e40e, at Dec 12, 2014, very new. This means
> >all our stable sheep won't have any idea of what block_size_shift is.
>
> Old sheepdog initialize with inode->block_size_shift with 22.
> The implementation of Sheepdog 0.9 is following
>
> =======
> static struct sd_inode *alloc_inode(const struct vdi_iocb *iocb,
> uint32_t new_snapid, uint32_t new_vid,
> uint32_t *data_vdi_id,
> struct generation_reference *gref)
> {
> struct sd_inode *new = xzalloc(sizeof(*new));
> unsigned long block_size = SD_DATA_OBJ_SIZE;
>
> ...(snip)
> new->block_size_shift = find_next_bit(&block_size,
> BITS_PER_LONG, 0);
> =======
>
> It means that block_size_shift is initialized with 22.
Thanks for your code snappet, it works as you said.
> >
> >The main trouble is old QEMU, which will always set inode->block_size_shift as 0
> >and expect the object size is 4MB.
> >
> >Think about following case:
> >
> >1 Old qemu and old sheep runs a long time with many vdis that has 4MB object.
> >2 Users upgrade sheep into new sheep but doesn't upgrade QEMU. This is quit
> > normal case because many users use stable QEMU branch.
> >3 He still use qemu-img to create new vdi and expect the object size is 4MB
> > If new sheep doesn't set object size as 4MB, this qemu block driver will
> > malfunction.
>
> As I said,
>
> Qe/sd new old
> new CDS Ign
> old CDS NULL
>
> So, when we use old Qemu with new Sheepdog, we should do cluster format
> with default block_size_shift 22.
Okay, it looks our sheep can handle it. It is fine to me to use block_size_shift
though it pose some constraint to the size range. But considering sheepdog
already have the block_size_shift in the inode, it is okay to make use of it for
QEMU.
But for user option, I stick with the 'object_size' and translate it into
block_size_shift internally in the driver.
Thanks
Yuan
More information about the sheepdog
mailing list