[sheepdog] [PATCH v5 04/14] sheep: introduce generational reference counting for object reclaim

Hitoshi Mitake mitake.hitoshi at gmail.com
Wed Mar 5 06:50:08 CET 2014


At Wed, 5 Mar 2014 13:34:59 +0800,
Liu Yuan wrote:
> 
> On Wed, Mar 05, 2014 at 02:20:00PM +0900, Hitoshi Mitake wrote:
> > At Tue, 4 Mar 2014 21:28:07 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Tue, Mar 04, 2014 at 02:42:48PM +0900, Hitoshi Mitake wrote:
> > > > From: Hitoshi Mitake <mitake.hitoshi at gmail.com>
> > > > 
> > > > Generational reference counting is an algorithm to reclaim data
> > > > efficiently without race conditions on a distributed system.  This
> > > > extends the vdi object structure to store generational reference counts,
> > > > and increments the counts when creating snapshots.
> > > > 
> > > > Cc: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > > > Cc: Valerio Pachera <sirio81 at gmail.com>
> > > > Cc: Alessandro Bolgia <alessandro at extensys.it>
> > > > Signed-off-by: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > ---
> > > > 
> > > > v5:
> > > >  - update store version and break compatibility explicitly
> > > >  - rename data_ref -> gref
> > > > 
> > > > v4:
> > > >  - remove a bug in snapshot_vdi(), storing an invalid number of references
> > > > 
> > > >  include/sheepdog_proto.h |    6 +++++
> > > >  sheep/config.c           |    2 +-
> > > >  sheep/migrate.c          |    8 +++++++
> > > >  sheep/vdi.c              |   58 +++++++++++++++++++++++++++++++++++-----------
> > > >  4 files changed, 59 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/include/sheepdog_proto.h b/include/sheepdog_proto.h
> > > > index 9361bad..9937497 100644
> > > > --- a/include/sheepdog_proto.h
> > > > +++ b/include/sheepdog_proto.h
> > > > @@ -212,6 +212,11 @@ struct sd_rsp {
> > > >  	};
> > > >  };
> > > >  
> > > > +struct generation_reference {
> > > > +	int32_t generation;
> > > > +	int32_t count;
> > > > +};
> > > > +
> > > >  struct sd_inode {
> > > >  	char name[SD_MAX_VDI_LEN];
> > > >  	char tag[SD_MAX_VDI_TAG_LEN];
> > > > @@ -230,6 +235,7 @@ struct sd_inode {
> > > >  	uint32_t child_vdi_id[MAX_CHILDREN];
> > > >  	uint32_t data_vdi_id[SD_INODE_DATA_INDEX];
> > > >  	uint32_t btree_counter;
> > > > +	struct generation_reference gref[SD_INODE_DATA_INDEX];
> > > >  };
> > > 
> > > This patch set passes tests on my box, great!
> > > 
> > > For better compatibility, I'd suggest
> > > 
> > > making the gref array a special object, like the btree intermediate node object,
> > > instead of embedding it into the inode, and putting 'btree_counter' in the unused
> > > field (child_vdi_id).
> > > 
> > > Then we can keep the current inode layout without modification of QEMU and TGT 
> > > backend code to support hyper volume later.
> > > 
> > > This way 
> > > 
> > > - the inode won't become cumbersome and too big as more and more fields are added.
> > > - the upper layer won't be aware of inode layout changes and stays consistent with sd_inode
> > 
> > Sorry, I forgot to mention this point. The current gref scheme
> > doesn't break the protocol between qemu, tgt and sheep, because gref is
> > only appended, so qemu and tgt don't have to care about it.
> 
> Yes, it doesn't break compatibility, but it will indeed make people frown
> when they find inode objects are much bigger than the ones that QEMU or TGT read
> in.

The new object reclaim scheme introduces a very big change, so some
surprise for users is unavoidable and should be acceptable.
# I'll update CHANGELOG.md in the next version.

> 
> Appending more fields to the inode isn't a scalable approach, and it makes the inode
> size inconsistent between client and server. I think we need a flag to add xattr-like
> objects to the inode, so it works as
> 
>  [ sd_inode ] [ iobj1 ] [ iobj2 ] ... [ iobjN]
> 
> which will make inode more extensible.
> 
> How about storing gref arrays in our xattr objects? Then we can easily make use
> of current code to store it and we can even easily dump this information by
> 'vdi attr get' for debugging purposes.

Using attribute objects for managing gref is not a good idea. It would
mean we have "reserved" attribute objects, and we would have to prepare
special rules to prevent users from accidentally adding, deleting, or
modifying them.

If we need a new area for storing special information, let's use a
remaining bit for the object type and implement an internally used
attribute when the time comes. But gref is very fundamental, so I think
appending it to the inode itself is a suitable design.
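
To illustrate the point about appending, here is a minimal sketch of why a
client built against the old layout is unaffected. The struct names,
DEMO_DATA_INDEX and the helper function are hypothetical and heavily
abridged; they are not the real sd_inode definition, only a model of
"same prefix, gref appended at the end".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, tiny index size; the real constant is SD_INODE_DATA_INDEX. */
#define DEMO_DATA_INDEX 4

struct generation_reference {
	int32_t generation;
	int32_t count;
};

/* Abridged layout as an unmodified client would know it (assumption). */
struct demo_inode_old {
	uint32_t data_vdi_id[DEMO_DATA_INDEX];
	uint32_t btree_counter;
};

/* Abridged layout after the patch: identical prefix, gref appended. */
struct demo_inode_new {
	uint32_t data_vdi_id[DEMO_DATA_INDEX];
	uint32_t btree_counter;
	struct generation_reference gref[DEMO_DATA_INDEX];
};

/* The appended array starts exactly where the old layout ends. */
_Static_assert(offsetof(struct demo_inode_new, gref) ==
	       sizeof(struct demo_inode_old),
	       "gref must not shift any existing field");

/*
 * Returns 1 if a reader that copies only the old-sized prefix of a
 * new-layout object sees the same field values, 0 otherwise.
 */
int demo_old_reader_sees_same_prefix(void)
{
	struct demo_inode_new obj;
	struct demo_inode_old view;

	memset(&obj, 0, sizeof(obj));
	obj.data_vdi_id[0] = 0xabcdef;
	obj.btree_counter = 7;
	obj.gref[0].generation = 1;
	obj.gref[0].count = 2;

	/* An old client reads sizeof(old) bytes and never touches gref. */
	memcpy(&view, &obj, sizeof(view));

	return view.data_vdi_id[0] == 0xabcdef && view.btree_counter == 7;
}
```

In this model the old reader simply ignores the trailing bytes, which is the
property the appended gref array relies on. The sketch does not model the
on-disk store version bump mentioned in the v5 changelog.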

Thanks,
Hitoshi


