[Sheepdog] [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Thu Jun 3 18:23:38 CEST 2010


At Wed, 02 Jun 2010 15:55:42 +0200,
Kevin Wolf wrote:
> 
> Am 28.05.2010 04:44, schrieb MORITA Kazutaka:
> > Sheepdog is a distributed storage system for QEMU. It provides highly
> > available block level storage volumes to VMs like Amazon EBS.  This
> > patch adds a qemu block driver for Sheepdog.
> > 
> > Sheepdog features are:
> > - No node in the cluster is special (no metadata node, no control
> >   node, etc)
> > - Linear scalability in performance and capacity
> > - No single point of failure
> > - Autonomous management (zero configuration)
> > - Useful volume management support such as snapshot and cloning
> > - Thin provisioning
> > - Autonomous load balancing
> > 
> > More details are available at the project site:
> >     http://www.osrg.net/sheepdog/
> > 
> > Signed-off-by: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > ---
> >  Makefile.objs    |    2 +-
> >  block/sheepdog.c | 1835 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 1836 insertions(+), 1 deletions(-)
> >  create mode 100644 block/sheepdog.c
> 
> One general thing: The code uses some mix of spaces and tabs for
> indentation, with the greatest part using tabs. According to
> CODING_STYLE it should consistently use four spaces instead.
> 

OK.  I'll fix the indentation according to CODING_STYLE.


> > +
> > +typedef struct SheepdogInode {
> > +	char name[SD_MAX_VDI_LEN];
> > +	uint64_t ctime;
> > +	uint64_t snap_ctime;
> > +	uint64_t vm_clock_nsec;
> > +	uint64_t vdi_size;
> > +	uint64_t vm_state_size;
> > +	uint16_t copy_policy;
> > +	uint8_t  nr_copies;
> > +	uint8_t  block_size_shift;
> > +	uint32_t snap_id;
> > +	uint32_t vdi_id;
> > +	uint32_t parent_vdi_id;
> > +	uint32_t child_vdi_id[MAX_CHILDREN];
> > +	uint32_t data_vdi_id[MAX_DATA_OBJS];
> 
> Wow, this is a huge array. :-)
> 
> So Sheepdog has a fixed limit of 16 TB, right?
> 

MAX_DATA_OBJS is (1 << 20), and the size of an object is 4 MB.  So the
limit of the Sheepdog image size is 4 TB.

These values are hard-coded, and I guess they should be configurable.
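For reference, a minimal sketch of that arithmetic (MAX_DATA_OBJS is the
macro from the patch; SD_OBJ_SIZE is my shorthand for the 4 MB object
size, not a name from the code):

```c
#include <assert.h>
#include <stdint.h>

/* The limits as currently hard-coded (values from the discussion above):
 * 2^20 data objects of 4 MB each give a 4 TB maximum image size. */
#define MAX_DATA_OBJS   (1ULL << 20)
#define SD_OBJ_SIZE     (4ULL << 20)    /* 4 MB */

static uint64_t sd_max_image_size(void)
{
    return MAX_DATA_OBJS * SD_OBJ_SIZE;
}
```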


> 
> > +} SheepdogInode;
> > +

> > +
> > +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> > +{
> > +	SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
> > +
> > +	acb->canceled = 1;
> > +}
> 
> Does this provide the right semantics? You haven't really cancelled the
> request, but you pretend to. So you actually complete the request in the
> background and then throw the return code away.
> 
> I seem to remember that posix-aio-compat.c waits at this point for
> completion of the requests, calls the callbacks and only afterwards
> returns from aio_cancel when no more requests are in flight.
> 
> Or if you can really cancel requests, it would be the best option, of
> course.
> 

Sheepdog cannot cancel requests which have already been sent to the
servers.  So, as you say, we pretend to cancel the requests without
waiting for their completion.  However, is there any situation where
pretending to cancel causes problems in practice?

To wait for completion of the requests here, we may need to create
another thread for processing I/O like posix-aio-compat.c.


> > +
> > +static int do_send_recv(int sockfd, struct iovec *iov, int len, int offset,
> > +			int write)
> 
> I've spent at least 15 minutes figuring out what this function does. I
> think I've got it now more or less, but I've come to the conclusion that
> this code needs more comments.
> 
> I'd suggest to add a header comment to all non-trivial functions and
> maybe somewhere on the top a general description of how things work.
> 
> As far as I understood now, there are basically two parts of request
> handling:
> 
> 1. The request is sent to the server. Its AIOCB is saved in a list in
> the BDRVSheepdogState. It doesn't pass a callback or anything for the
> completion.
> 
> 2. aio_read_response is registered as a fd handler to the sheepdog
> connection. When the server responds, it searches the right AIOCB in the
> list and the second part of request handling starts.
> 
> do_send_recv is the function that is used to do all communication with
> the server. The iov stuff looks like it's only used for some data, but
> seems this is not true - it's also used for the metadata of the protocol.
> 
> Did I understand it right so far?
> 

Yes, exactly.  I'll add comments to make the code more readable.


> > +{
> > +	struct msghdr msg;
> > +	int ret, diff;
> > +
> > +	memset(&msg, 0, sizeof(msg));
> > +	msg.msg_iov = iov;
> > +	msg.msg_iovlen = 1;
> > +
> > +	len += offset;
> > +
> > +	while (iov->iov_len < len) {
> > +		len -= iov->iov_len;
> > +
> > +		iov++;
> > +		msg.msg_iovlen++;
> > +	}
> 
> You're counting the number of elements in the iov here. qemu_iovec would
> already have these (and also len), wouldn't it make sense to use it as
> the abstraction? Though I'm not sure where these iovecs come from, so
> the answer might be no.
> 

We use struct msghdr for sendmsg/recvmsg here, so using iovecs
directly looks simpler.
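To illustrate the windowing that do_send_recv performs on the iovec
array, here is a self-contained analogue that copies the selected byte
range instead of handing it to sendmsg/recvmsg (iov_window_copy is an
illustrative helper for this mail, not part of the patch):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Copy `len` bytes starting at byte `offset` of the concatenated iovec
 * array into dst.  This mirrors the logic of do_send_recv: skip whole
 * entries that lie before `offset`, then consume entries until `len`
 * bytes have been handled, trimming the first and last entry. */
static void iov_window_copy(char *dst, struct iovec *iov, int len, int offset)
{
    int copied = 0;

    /* skip the bytes before offset */
    while (offset >= (int)iov->iov_len) {
        offset -= iov->iov_len;
        iov++;
    }

    while (copied < len) {
        int n = iov->iov_len - offset;
        if (n > len - copied) {
            n = len - copied;       /* trim the last entry */
        }
        memcpy(dst + copied, (char *)iov->iov_base + offset, n);
        copied += n;
        offset = 0;                 /* only the first entry is trimmed */
        iov++;
    }
}
```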


> > +
> > +	diff = iov->iov_len - len;
> > +	iov->iov_len -= diff;
> > +
> > +	while (msg.msg_iov->iov_len <= offset) {
> > +		offset -= msg.msg_iov->iov_len;
> > +
> > +		msg.msg_iov++;
> > +		msg.msg_iovlen--;
> > +	}
> > +
> > +	msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base + offset;
> > +	msg.msg_iov->iov_len -= offset;
> > +
> > +	if (write) {
> > +		ret = sendmsg(sockfd, &msg, 0);
> > +	} else {
> > +		ret = recvmsg(sockfd, &msg, MSG_WAITALL);
> > +	}
> > +
> > +	msg.msg_iov->iov_base = (char *) msg.msg_iov->iov_base - offset;
> > +	msg.msg_iov->iov_len += offset;
> > +
> > +	iov->iov_len += diff;
> > +	return ret;
> > +}
> > +
> > +static int connect_to_sdog(const char *addr)
> > +{
> > +	char buf[64];
> > +	char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];
> > +	char name[256], *p;
> > +	int fd, ret;
> > +	struct addrinfo hints, *res, *res0;
> > +	int port = 0;
> > +
> > +	if (!addr) {
> > +		addr = SD_DEFAULT_ADDR;
> > +	}
> > +
> > +	strcpy(name, addr);
> 
> This smells like buffer overflows. In practice it's s->addr for all
> callers and I think this values comes indirectly from filename in
> sd_open - for which I didn't find a length check, so it could overflow
> indeed.
> 

Yes, this could cause overflows.  I'll fix it in the next post.
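One possible shape for the fix, bounding the copy to the destination
buffer (QEMU's pstrcpy() would be the natural choice in-tree; plain
snprintf() is used here so the example stands alone):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy at most size-1 bytes of addr into name, always NUL-terminating.
 * Unlike the strcpy() in connect_to_sdog(), this cannot overflow the
 * fixed-size name[] buffer, whatever length the caller passes in. */
static void safe_copy_addr(char *name, size_t size, const char *addr)
{
    snprintf(name, size, "%s", addr);
}
```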


> > +
> > +static void aio_read_response(void *opaque)
> > +{
> > +	SheepdogObjReq hdr;
> > +	SheepdogObjRsp *rsp = (SheepdogObjRsp *)&hdr;
> 
> Why do you declare an otherwise unused variable hdr, take a pointer to
> it and cast it to an incompatible type? This is scary.
> 

We don't need hdr, so we should remove it.


> > +	BDRVSheepdogState *s = (BDRVSheepdogState *)opaque;
> > +	int fd = s->fd;
> > +	int ret;
> > +	AIOReq *aio_req = NULL;
> > +	SheepdogAIOCB *acb;
> > +	int rest;
> > +	unsigned long idx;
> > +
> > +	if (QLIST_EMPTY(&s->outstanding_aio_head)) {
> > +		return;
> > +	}
> > +
> > +	ret = do_read(fd, (void *)rsp, sizeof(*rsp));
> 
> This cast looks scary, too. But do_read wants a void* anyway, so it's
> not even necessary. Please drop it.
> 

OK. I'll do it.


> > +	if (ret) {
> > +		error_report("failed to get the header, %m\n");
> > +		return;
> > +	}
> > +
> > +	QLIST_FOREACH(aio_req, &s->outstanding_aio_head, outstanding_aio_siblings) {
> > +		if (aio_req->id == rsp->id) {
> > +			break;
> > +		}
> > +	}
> > +	if (!aio_req) {
> > +		error_report("cannot find aio_req %x\n", rsp->id);
> > +		return;
> > +	}
> > +
> > +	acb = aio_req->aiocb;
> > +
> > +	switch (acb->aiocb_type) {
> > +	case AIOCB_WRITE_UDATA:
> > +		if (!is_data_obj(aio_req->oid)) {
> > +			break;
> > +		}
> > +		idx = data_oid_to_idx(aio_req->oid);
> > +
> > +		if (s->inode.data_vdi_id[idx] != s->inode.vdi_id) {
> > +			s->inode.data_vdi_id[idx] = s->inode.vdi_id;
> > +			s->max_dirty_data_idx = max_t(uint32_t, idx,
> > +						      s->max_dirty_data_idx);
> > +			s->min_dirty_data_idx = min_t(uint32_t, idx,
> > +						      s->min_dirty_data_idx);
> > +
> > +			send_pending_req(s, vid_to_data_oid(s->inode.vdi_id, idx),
> > +					 rsp->id);
> > +		}
> > +		break;
> > +	case AIOCB_READ_UDATA:
> > +		ret = do_readv(fd, acb->qiov->iov, rsp->data_length,
> > +			       aio_req->iov_offset);
> > +		if (ret) {
> > +			error_report("failed to get the data, %m\n");
> > +			return;
> > +		}
> > +		break;
> > +	}
> > +
> > +	if (rsp->result != SD_RES_SUCCESS) {
> > +		acb->ret = -EIO;
> > +		error_report("%s\n", sd_strerror(rsp->result));
> > +	}
> > +
> > +	rest = free_aio_req(s, aio_req);
> > +	if (!rest) {
> > +		acb->aio_done_func(acb);
> > +	}
> > +}
> > +
> > +static int aio_flush_request(void *opaque)
> > +{
> > +	BDRVSheepdogState *s = (BDRVSheepdogState *)opaque;
> 
> Unnecessary cast.
> 

I'll drop it.

> > +
> > +	return !QLIST_EMPTY(&s->outstanding_aio_head);
> > +}
> > +
> > +static int set_nonblocking(int fd)
> > +{
> > +	int ret;
> > +
> > +	ret = fcntl(fd, F_GETFL);
> > +	if (ret < 0) {
> > +		error_report("can't fcntl (F_GETFL), %m\n");
> > +		close(fd);
> > +	} else {
> > +		ret = fcntl(fd, F_SETFL, ret | O_NONBLOCK);
> > +		if (ret < 0) {
> > +			error_report("can't fcntl (O_NONBLOCK), %m\n");
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int set_nodelay(int fd)
> > +{
> > +	int ret, opt;
> > +
> > +	opt = 1;
> > +	ret = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Return a socket descriptor to read/write objects.
> > + * We cannot use this descriptor for other operations because
> > + * the block driver may be waiting for a response from the server.
> > + */
> > +static int get_sheep_fd(BDRVSheepdogState *s)
> > +{
> > +	int ret, fd;
> > +
> > +	fd = connect_to_sdog(s->addr);
> > +	if (fd < 0) {
> > +		error_report("%m\n");
> 
> %m is Linux specific, as far as I know.
> 

Yes, we'll use strerror(errno) instead of it.
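The portable spelling would look like this (describe_errno is just an
illustrative wrapper; in the driver the result would be passed to
error_report()):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* %m is a glibc extension to printf-style formatting; the portable
 * equivalent is to format strerror(errno) explicitly, e.g.
 * error_report("failed to connect, %s\n", strerror(errno)). */
static const char *describe_errno(int err)
{
    return strerror(err);
}
```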


> > +
> > +static int sd_create(const char *filename, QEMUOptionParameter *options)
> > +{
> > +	int ret;
> > +	uint32_t vid = 0;
> > +	int64_t total_sectors = 0;
> > +	char *backing_file = NULL;
> > +
> > +	strstart(filename, "sheepdog:", (const char **)&filename);
> > +
> > +	while (options && options->name) {
> > +		if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
> > +			total_sectors = options->value.n / 512;
> > +		} else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
> > +			backing_file = options->value.s;
> > +		}
> > +		options++;
> > +	}
> > +
> > +	if (backing_file) {
> > +		BlockDriverState bs;
> > +		char vdi[SD_MAX_VDI_LEN];
> > +		uint32_t snapid;
> > +
> > +		strstart(backing_file, "sheepdog:", (const char **)&backing_file);
> > +		memset(&bs, 0, sizeof(bs));
> > +
> > +		bs.opaque = qemu_malloc(sizeof(BDRVSheepdogState));
> > +
> > +		ret = sd_open(&bs, backing_file, 0);
> > +		if (ret < 0) {
> > +			return -1;
> > +		}
> > +
> > +		if (parse_vdiname(bs.opaque, backing_file, vdi, sizeof(vdi), &snapid) < 0) {
> > +			return -1;
> > +		}
> > +
> > +		/* cannot clone from a current inode */
> > +		if (snapid == CURRENT_VDI_ID) {
> > +			return -1;
> > +		}
> > +
> > +		ret = find_vdi_name(bs.opaque, vdi, snapid, &vid, 1);
> > +		if (ret) {
> > +			return -1;
> > +		}
> 
> Is it really necessary to do Sheepdog specific stuff here? With other
> formats I can use any format I like for a backing file. VMDK did assume
> that backing files could only be VMDK initially, but it was considered a
> bug.
> 
> If Sheepdog can't handle other image formats (e.g. because it can't
> communicate that a request touches unallocated space), you need to check
> this at least an return an error if something else is used.
> 

Sheepdog cannot handle other formats because Sheepdog's copy-on-write
assumes that the base image is also a Sheepdog image.  So we'll add a
check here to reject other formats.
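A minimal sketch of such a check (assuming the backing image can be
recognized by its "sheepdog:" protocol prefix; the real check might
instead inspect the backing file's BlockDriver):

```c
#include <assert.h>
#include <string.h>

/* Reject any backing file that is not a Sheepdog image, since
 * Sheepdog's copy-on-write requires a Sheepdog base image. */
static int sd_check_backing_format(const char *backing_file)
{
    const char *prefix = "sheepdog:";

    if (strncmp(backing_file, prefix, strlen(prefix)) != 0) {
        return -1;  /* only Sheepdog backing images are supported */
    }
    return 0;
}
```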

> 
> One more thing, I'll ask the same as for Ceph here: Do you have any
> specific tests for Sheepdog that could be used or should we try to use
> generic qemu-iotests cases?
> 

No, we don't have such tests, and it would be great if qemu-iotests
could support protocol tests.


Thanks,

Kazutaka


