[Sheepdog] [PATCH v2] sheep: update inode cache first

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Fri May 11 16:59:49 CEST 2012


At Fri, 11 May 2012 20:37:16 +0800,
Liu Yuan wrote:
> 
> On 05/11/2012 02:43 AM, MORITA Kazutaka wrote:
> 
> > Note that this doesn't still pass the testcase I posted before:
> > 
> > ==
> >  $ qemu-img convert ~/linux.img sheepdog:linux
> >  $ collie vdi read linux 0 512 | hd | head -1
> >  00000000  fa eb 7c 6c 62 61 4c 49  4c 4f 01 00 15 04 09 00  |..|lbaLILO......|
> >  $ qemu-img snapshot -c snap sheepdog:linux
> >  $ collie vdi read linux 0 512 | hd | head -1
> >  00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> > ==
> > 
> > I guess it is because read_object() doesn't read the cached data.
> > 
> > This is a wrong behavior as a block storage because we need to read
> > the latest data even if it is not flushed, so I hope it would be fixed
> > too.
> 
> 
> No, read_object() will try to read cache first. In above example, the
> inode object in the cache is NOT correct due to unsafe mode converting.
> So behavior of object cache looks right even to this case.

The below patch fixes the problem on my environment.  I think it is
because read_object() reads the cached data.  Am I wrong something?

==
diff --git a/sheep/store.c b/sheep/store.c
index 8ca71dc..745dbfd 100644
--- a/sheep/store.c
+++ b/sheep/store.c
@@ -1034,6 +1034,42 @@ static int read_object_local(uint64_t oid, char *data, unsigned int datalen,
 	free(req);
 	return ret;
 }
+
+static int read_object_cache(uint64_t oid, char *data, unsigned int datalen,
+			     uint64_t offset, int copies, uint32_t epoch)
+{
+	int ret;
+	struct request *req;
+	struct sd_obj_req *hdr;
+	uint32_t vid = oid_to_vid(oid);
+	uint32_t idx = data_oid_to_idx(oid);
+	struct object_cache *cache;
+
+	if (is_vdi_obj(oid))
+		idx |= 1 << CACHE_VDI_SHIFT;
+
+	cache = find_object_cache(vid, 0);
+
+	req = zalloc(sizeof(*req));
+	if (!req)
+		return SD_RES_NO_MEM;
+	hdr = (struct sd_obj_req *)&req->rq;
+
+	hdr->oid = oid;
+	hdr->opcode = SD_OP_READ_OBJ;
+	hdr->copies = copies;
+	hdr->offset = offset;
+	hdr->data_length = datalen;
+	req->data = data;
+	req->op = get_sd_op(hdr->opcode);
+
+	ret = object_cache_rw(cache, idx, req);
+
+	free(req);
+
+	return ret;
+}
+
 int read_object(struct vnode_info *vnodes, uint32_t node_version,
 		uint64_t oid, char *data, unsigned int datalen,
 		uint64_t offset, int nr_copies)
@@ -1044,6 +1080,15 @@ int read_object(struct vnode_info *vnodes, uint32_t node_version,
 	char name[128];
 	int i = 0, fd, ret, last_error = SD_RES_SUCCESS;
 
+	if (object_is_cached(oid)) {
+		ret = read_object_cache(oid, data, datalen, offset,
+					nr_copies, node_version);
+		if (ret != SD_RES_SUCCESS)
+			eprintf("fail %"PRIx64" %"PRIx32"\n", oid, ret);
+
+		return ret;
+	}
+
 	/* search a local object first */
 	for (i = 0; i < nr_copies; i++) {
 		v = oid_to_vnode(vnodes, oid, i);

==


> 
> I spent some time on this problem, the real problem is 'qemu-img
> convert' use UNSAFE mode as default, which doesn't flush at all (for
> writeback mode, qemu-img will issue a flush at the close(), but for
> unsafe, no flush at all).

Any read requests should read the latest data even if it is not
flushed because Sheepdog is a block storage system.  If we cannot
guarantee it, we should be careful so that users don't use Sheepdog
wrongly.

> 
> I have tried 'qemu-img convert -t writeback' (NOT writethrough), it
> works well with qemu-img snapshot.
> 
> So what is the point to use unsafe mode to 'qemu-img convert' for Sheepdog?
> 
> I'd suggest to patch our README to use 'qemu-img -t writethrough
> convert'. How about it?

I believe we can solve the above problem without it.

Note that there still exists a problem; if the cached data is on
another node, it is really difficult to read the latest data.  You may
think this is a corner case, but should pay attention as long as we
call Sheepdog a block storage.  My suggestion is to document the
restriction (e.g. we shouldn't access the same image from the
different nodes even if it is not the same time) instead of solving
this difficult problem.

Thanks,

Kazutaka



More information about the sheepdog mailing list