[Stgt-devel] Data deduplication, stgt backing store & SSC

Fri Aug 3 04:44:56 CEST 2007

Re: deduplication

Just so we are both on the same page...

My understanding of the term deduplication is that there is some sort
of fingerprint for each block of data. A block of data is only
stored if its fingerprint is unique. Blocks whose fingerprint is not
unique are then 'referenced' back to the original block of data.

i.e. very similar to Single Instance Store (SIS), but at the block layer.
SIS is more commonly used at the application layer (e.g. MS Exchange
does SIS on each email).
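
As a minimal sketch of the block-level scheme described above (not stgt
code): each block is fingerprinted, stored only if the fingerprint is
new, and otherwise recorded as a reference to the stored copy. The names
dedup_store and dedup_put, the tiny block size, and the 64-bit FNV-1a
hash are all assumptions for illustration; a real store would more
likely use a stronger digest such as MD5 or SHA-1.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 64      /* hypothetical capacity of this toy store */
#define BLOCK_SIZE 16      /* tiny block size, for illustration only */

struct dedup_store {
	uint64_t fp[MAX_BLOCKS];                     /* fingerprint per stored block */
	unsigned char data[MAX_BLOCKS][BLOCK_SIZE];  /* the unique blocks themselves */
	size_t nstored;                              /* number of unique blocks kept */
};

/* 64-bit FNV-1a: a stand-in fingerprint, not a cryptographic digest. */
static uint64_t fingerprint(const unsigned char *buf, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Store one block. Returns the index of the stored copy: a new index if
 * the data was unique, the existing index if it was a duplicate, or -1
 * if the store is full.
 */
static int dedup_put(struct dedup_store *s, const unsigned char *blk)
{
	uint64_t fp = fingerprint(blk, BLOCK_SIZE);

	for (size_t i = 0; i < s->nstored; i++) {
		/* Compare the data too, so a hash collision cannot alias blocks. */
		if (s->fp[i] == fp && memcmp(s->data[i], blk, BLOCK_SIZE) == 0)
			return (int)i;
	}
	if (s->nstored == MAX_BLOCKS)
		return -1;
	s->fp[s->nstored] = fp;
	memcpy(s->data[s->nstored], blk, BLOCK_SIZE);
	return (int)s->nstored++;
}
```

Writing the same block twice returns the same index both times, so the
caller only needs to record that index as the 'reference'. The linear
scan is just for brevity; a hashed index would be the obvious next step.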

Maybe adding some sort of field (to fit an md5sum?) in the header would
make indexing virtual media quicker?

64 bytes for each block of data seems a little extravagant at this point
in time, although unlike disk, tape blocks are typically 64k, 128k or larger.
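
To put rough numbers on the header cost, here is a small sketch. The
blk_header_fp layout and its field names are hypothetical; the 16-byte
field is simply the size of a raw MD5 digest (128 bits). What matters is
header bytes relative to block size.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-block header carrying a fingerprint field. */
struct blk_header_fp {
	uint32_t blk_type;       /* kind of block (data, filemark, ...) */
	uint32_t blk_size;       /* size of the original data */
	uint32_t disk_blk_size;  /* size of the data as stored */
	uint8_t  digest[16];     /* raw MD5 digest: 128 bits = 16 bytes */
};

/* Header overhead in parts per million of the block size (rounded down). */
static unsigned long overhead_ppm(size_t hdr_bytes, size_t blk_bytes)
{
	return (unsigned long)((hdr_bytes * 1000000ULL) / blk_bytes);
}
```

With a 64-byte header, overhead_ppm(64, 65536) is 976 (about 0.1%) for a
64k tape block, against 125000 (12.5%) for a 512-byte disk sector.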

Regards
Mark

On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> On Fri, 2007-08-03 at 11:07 +1000, Mark Harvey wrote:
> > On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> > > On Fri, 2007-08-03 at 10:53 +1000, Mark Harvey wrote:
> > > > On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> > > > > On Thu, 2007-08-02 at 14:58 -0400, Pete Wyckoff wrote:
> > > > > > markh794 at gmail.com wrote on Tue, 31 Jul 2007 19:57 +1000:
> > > > > > > For Variable block SSC device, the block size written needs to be
> > > > > > > tracked.
> > > > > > [..]
> > > > > > > My current thoughts of a solution:
> > > > > > > ========================
> > > > > > > A block header describes each block written -> Analogy to the 'tar'
> > > > > > > format where a header is written, followed by the 'data' followed by
> > > > > > > another header, followed by more data...repeat...until blank header...
> > > > > > [..]
> > > > > > > However the current implementation for iSCSI -> bs_sync uses a
> > > > > > > pread64()/pwrite64() and writes data based on information stored in
> > > > > > > scsi_cmd -
> > > > > > >  pwrite64(fd, cmd->uaddr, cmd->len, cmd->offset)
> > > > > > >  pread64(fd, cmd->uaddr, cmd->len, cmd->offset)
> > > > > > >
> > > > > > >
> > > > > > > Would it be OK to add a 'blk_header' structure to struct scsi_cmd and if
> > > > > > > blk_header is set, write this blk header as well ?
> > > > > > > I will attempt to put the above idea into code and submit for comment...
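
A rough sketch of the tar-like layout described above: header, data,
header, data, until a blank header. The struct blk_header fields, the
BLK_DATA value and both helper names are assumptions for illustration.
A memory buffer stands in for the backing file here; a real backing
store would presumably do the same thing with pread64()/pwrite64().

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical on-media header; a zeroed header marks end of data. */
struct blk_header {
	uint32_t blk_type;   /* 0 means "blank header": stop here */
	uint32_t blk_size;   /* bytes of data following this header */
};

#define BLK_DATA 1       /* hypothetical type value for a data block */

/*
 * Append header + data at offset `off` in `media` (capacity `cap`).
 * Returns the offset just past the data, or 0 if it does not fit
 * (room is always kept for a terminating blank header).
 */
static size_t blk_append(unsigned char *media, size_t cap, size_t off,
			 const void *data, uint32_t len)
{
	struct blk_header h = { BLK_DATA, len };
	struct blk_header blank = { 0, 0 };

	if (off > cap || cap - off < 2 * sizeof(h) + len)
		return 0;
	memcpy(media + off, &h, sizeof(h));
	memcpy(media + off + sizeof(h), data, len);
	off += sizeof(h) + len;
	memcpy(media + off, &blank, sizeof(blank));  /* new end marker */
	return off;
}

/*
 * Walk the headers from the start and return the size of block number
 * `n` (0-based), or -1 if the blank header is reached first.
 */
static long blk_size_at(const unsigned char *media, size_t cap, unsigned n)
{
	size_t off = 0;
	struct blk_header h;

	while (cap - off >= sizeof(h)) {
		memcpy(&h, media + off, sizeof(h));
		if (h.blk_type == 0)
			return -1;
		if (n-- == 0)
			return (long)h.blk_size;
		off += sizeof(h) + h.blk_size;
		if (off > cap)
			return -1;
	}
	return -1;
}
```

Because each header records the length of the data that follows it,
variable-sized blocks need no separate index: a reader skips from header
to header, at the cost of a linear walk to reach block n.
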
> > > > > >
> > > > > > I think rather than trying to add stuff to existing backing stores
> > > > > > that you should consider writing your own.  You need to store both
> > > > > > "metadata" (block descriptors) and data, and none of the BSes are
> > > > > > set up for that.
> > > > > >
> > > > > > The complexity of trying to glue in the blk_header so that all BSes
> > > > > > know how to tack that on top seems big.  Then you have to tell them
> > > > > > to read the header, and consider fields in that when determining how
> > > > > > much further data to read.  It gets messy fast.
> > > > >
> > > > > also considering tape can have compression and encryption, then each
> > > > > block is variable size even in fixed size mode. so support variable size
> > > > > is a must eventually.
> > > >
> > > > Yep. The SSC metadata (header for each tape block) contains a
> > > > structure with a 'block type' , size of original data & size of data
> > > > stored.
> > > >
> > > > All accounted for. Just no code for compression or encryption - as yet..
> > > >
> > > > enum {
> > > >         BLK_NOOP,
> > > >         BLK_UNCOMPRESS_DATA,
> > > >         BLK_COMPRESSED_DATA,
> > > >         BLK_ENCRYPTED_DATA,
> > > >         BLK_FILEMARK,
> > > >         BLK_BOT,
> > > >         BLK_EOD,
> > > > };
> > >
> > > are these bits? can i have multiple bits set like compressed before
> > > encryption? also leave room for data deduplication for example storing a
> > > token. ;)
> > >
> >
> > Arr.. no.
> >
> > Good catch.
> >
> > I'll re-do it so they are bits.
>
> thx. bit might now be good enough as well. ;)
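
One way the enum could look once re-done as bits, so that "compressed,
then encrypted" can be expressed in a single header field. The bit
assignments, the BLK_DEDUP_TOKEN flag and the two helper functions are
assumptions for illustration, not the layout that was actually adopted.

```c
#include <stdint.h>

/* Hypothetical flag layout: each property gets its own bit. */
enum {
	BLK_NOOP            = 0,
	BLK_UNCOMPRESS_DATA = 1 << 0,
	BLK_COMPRESSED_DATA = 1 << 1,
	BLK_ENCRYPTED_DATA  = 1 << 2,
	BLK_FILEMARK        = 1 << 3,
	BLK_BOT             = 1 << 4,
	BLK_EOD             = 1 << 5,
	BLK_DEDUP_TOKEN     = 1 << 6,  /* assumed addition: payload is a dedup token */
};

/* True if the block carries user data, compressed or not. */
static int blk_has_data(uint32_t flags)
{
	return (flags & (BLK_UNCOMPRESS_DATA | BLK_COMPRESSED_DATA)) != 0;
}

/* Compressed and uncompressed are mutually exclusive; reject both set. */
static int blk_flags_valid(uint32_t flags)
{
	uint32_t both = BLK_UNCOMPRESS_DATA | BLK_COMPRESSED_DATA;

	return (flags & both) != both;
}
```

A compressed and encrypted block is then
BLK_COMPRESSED_DATA | BLK_ENCRYPTED_DATA. Bits alone do not record which
algorithm was used or in what order the transformations were applied, so
a separate field would still be needed for that.
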
>
> >
> > Re: Data deduplication
> > I was thinking a separate 'cache' file will be required and 'hashed'
> > some way for this.
> >
> > I can not really see any advantage of storing a fixed deduplication
> > method in the SSC block header.
>
> dedup is just another compression method. even with compression u can
> have different algorithms and with encryption u have different
> algorithms or ways or all these just kinds of data transformation.
>
> just need a type to mark what kind of data transformation it uses and
> each transform provide a pluggable way to fetch/store data.
>
> for example like de-dup. u can store a token here and then when need to
> read this data, u check the header to see if overread or under read
> because of the block size issue. (forgot how SSC call these...:-()
>
> then with token u can contact another data service to find data and
> return.
>
> so once u define the infrastructure, these can be added later...
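
The "type plus pluggable fetch/store" idea above could be sketched as a
table of function pointers indexed by the type stored in the block
header. Everything here (xform_ops, xform_lookup, the XFORM_* values) is
hypothetical, and the XOR routine is only a toy stand-in for a real
compression, encryption or dedup handler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical transform IDs recorded in the block header. */
enum { XFORM_NONE, XFORM_XOR, XFORM_MAX };

/*
 * Each transform supplies a store and a fetch routine. Both return the
 * number of bytes produced, or -1 if `out` is too small.
 */
struct xform_ops {
	long (*store)(const unsigned char *in, size_t len,
		      unsigned char *out, size_t cap);
	long (*fetch)(const unsigned char *in, size_t len,
		      unsigned char *out, size_t cap);
};

/* Identity transform: data is stored as-is. */
static long copy_xform(const unsigned char *in, size_t len,
		       unsigned char *out, size_t cap)
{
	if (len > cap)
		return -1;
	memcpy(out, in, len);
	return (long)len;
}

/* Toy transform: XOR every byte; applying it twice restores the data. */
static long xor_xform(const unsigned char *in, size_t len,
		      unsigned char *out, size_t cap)
{
	if (len > cap)
		return -1;
	for (size_t i = 0; i < len; i++)
		out[i] = in[i] ^ 0x5a;
	return (long)len;
}

static const struct xform_ops xforms[XFORM_MAX] = {
	[XFORM_NONE] = { copy_xform, copy_xform },
	[XFORM_XOR]  = { xor_xform,  xor_xform  },
};

/* Look up the handlers for the type found in a block header. */
static const struct xform_ops *xform_lookup(unsigned type)
{
	return type < XFORM_MAX ? &xforms[type] : NULL;
}
```

A dedup handler would fit the same shape: its store routine would emit a
token instead of the data, and its fetch routine would resolve the token
against whatever service holds the real block.
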
>
>
> >
> > I have not thought this thru... So any input on this will be welcome.
> >
> > Cheers
> > Mark
> >
> > >
> > > --
> > > Ming Zhang
> > >
> > >
> > > @#$%^ purging memory... (*!%
> > > http://blackmagic02881.wordpress.com/
> > > http://www.linkedin.com/in/blackmagic02881
> > > --------------------------------------------
> > >
> > >
> --
> Ming Zhang
>
>