On Fri, 2007-08-03 at 12:44 +1000, Mark Harvey wrote:
> Re: deduplication
>
> Just so we are both on the same page...
>
> My understanding of the term deduplication is where there is some sort
> of fingerprint for each block of data. The block of data is only
> stored if its fingerprint is unique. Blocks with non-unique fingerprints
> are then 'referenced' back to the original fingerprinted block of data.

yes, exactly. it is just that different solutions have different pros
and cons: fixed block, variable block, algorithm, ... i do not expect we
can have an easy OSS de-dup here, but at least our infrastructure should
enable this possibility.

>
> i.e. Very similar to Single Instance Store but at a block layer.
> Single Instance Store (SIS) is more used at an application layer (e.g.
> MS Exchange does SIS on each email)

yes

>
> Maybe adding some sort of field (fit md5sum ??) in the header so
> indexing virtual media will be quicker ?
>
> 64 bytes for each block of data seems a little extravagant at this point in time.
> Although unlike disk, tape blocks are typically 64k, 128k or larger.
>
> Regards
> Mark
>
> On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> > On Fri, 2007-08-03 at 11:07 +1000, Mark Harvey wrote:
> > > On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> > > > On Fri, 2007-08-03 at 10:53 +1000, Mark Harvey wrote:
> > > > > On 8/3/07, Ming Zhang <blackmagic02881 at gmail.com> wrote:
> > > > > > On Thu, 2007-08-02 at 14:58 -0400, Pete Wyckoff wrote:
> > > > > > > markh794 at gmail.com wrote on Tue, 31 Jul 2007 19:57 +1000:
> > > > > > > > For Variable block SSC device, the block size written needs to be
> > > > > > > > tracked.
> > > > > > > [..]
> > > > > > > > My current thoughts of a solution:
> > > > > > > > ========================
> > > > > > > > A block header describes each block written -> Analogy to the 'tar'
> > > > > > > > format where a header is written, followed by the 'data' followed by
> > > > > > > > another header, followed by more data...repeat...until blank header...
> > > > > > > [..]
> > > > > > > > However the current implementation for iSCSI -> bs_sync uses a
> > > > > > > > pread64()/pwrite64() and writes data based on information stored in
> > > > > > > > scsi_cmd -
> > > > > > > > pwrite64(fd, cmd->uaddr, cmd->len, cmd->offset)
> > > > > > > > pread64(fd, cmd->uaddr, cmd->len, cmd->offset)
> > > > > > > >
> > > > > > > > Would it be OK to add a 'blk_header' structure to struct scsi_cmd and if
> > > > > > > > blk_header is set, write this blk header as well ?
> > > > > > > > I will attempt to put the above idea into code and submit for comment...
> > > > > > > >
> > > > > > > I think rather than trying to add stuff to existing backing stores
> > > > > > > that you should consider writing your own. You need to store both
> > > > > > > "metadata" (block descriptors) and data, and none of the BSes are
> > > > > > > set up for that.
> > > > > > >
> > > > > > > The complexity of trying to glue in the blk_header so that all BSes
> > > > > > > know how to tack that on top seems big. Then you have to tell them
> > > > > > > to read the header, and consider fields in that when determining how
> > > > > > > much further data to read. It gets messy fast.
> > > > > >
> > > > > > also considering tape can have compression and encryption, then each
> > > > > > block is variable size even in fixed size mode. so support variable size
> > > > > > is a must eventually.
> > > > >
> > > > > Yep. The SSC metadata (header for each tape block) contains a
> > > > > structure with a 'block type' , size of original data & size of data
> > > > > stored.
> > > > >
> > > > > All accounted for.
Just no code for compression or encryption - as yet..
> > > > >
> > > > > enum {
> > > > >         BLK_NOOP,
> > > > >         BLK_UNCOMPRESS_DATA,
> > > > >         BLK_COMPRESSED_DATA,
> > > > >         BLK_ENCRYPTED_DATA,
> > > > >         BLK_FILEMARK,
> > > > >         BLK_BOT,
> > > > >         BLK_EOD,
> > > > > };
> > > >
> > > > are these bits? can i have multiple bits set like compressed before
> > > > encryption? also leave room for data deduplication for example storing a
> > > > token. ;)
> > > >
> > >
> > > Arr.. no.
> > >
> > > Good catch.
> > >
> > > I'll re-do it so they are bits.
> >
> > thx. bit might now be good enough as well. ;)
> >
> > >
> > > Re: Data deduplication
> > > I was thinking a separate 'cache' file will be required and 'hashed'
> > > some way for this.
> > >
> > > I can not really see any advantage of storing a fixed deduplication
> > > method in the SSC block header.
> >
> > dedup is just another compression method. even with compression u can
> > have different algorithms and with encryption u have different
> > algorithms or ways or all these just kinds of data transformation.
> >
> > just need a type to mark what kind of data transformation it uses and
> > each transform provide a pluggable way to fetch/store data.
> >
> > for example like de-dup. u can store a token here and then when need to
> > read this data, u check the header to see if overread or under read
> > because of the block size issue. (forgot how SSC call these...:-()
> >
> > then with token u can contact another data service to find data and
> > return.
> >
> > so once u define the infrastructure, these can be added later...
> >
> >
> > >
> > > I have not thought this thru... So any input on this will be welcome.
> > >
> > > Cheers
> > > Mark
> > >
> > > >
> > > > --
> > > > Ming Zhang
> > > >
> > > >
> > > > @#$%^ purging memory...
(*!%
> > > > http://blackmagic02881.wordpress.com/
> > > > http://www.linkedin.com/in/blackmagic02881
> > > > --------------------------------------------

--
Ming Zhang

@#$%^ purging memory... (*!%
http://blackmagic02881.wordpress.com/
http://www.linkedin.com/in/blackmagic02881
--------------------------------------------
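
P.S. To make the header discussion above a bit more concrete, here is a rough sketch of a tar-like layout (header, data, header, data, ..., end marker) where the transforms are bit flags, so "compressed then encrypted" and "dedup token" can all be expressed. This is only an illustration, not tgt code: struct blk_header, the BLK_XF_* / BLK_REC_* names, blk_write(), blk_read() and blk_roundtrip_demo() are all made up here, the header is written in host byte order, and plain pread()/pwrite() stand in for pread64()/pwrite64().

```c
/* Hypothetical sketch: names and on-disk layout are assumptions, not tgt code. */
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Transforms are bits, so several can apply to the same block. */
enum {
	BLK_XF_NONE       = 0,
	BLK_XF_COMPRESSED = 1u << 0,
	BLK_XF_ENCRYPTED  = 1u << 1,
	BLK_XF_DEDUP      = 1u << 2,	/* payload is a token, not the data */
};

/* Record kind stays a plain enumeration: a record is exactly one of these. */
enum { BLK_REC_DATA = 1, BLK_REC_FILEMARK, BLK_REC_EOD };

struct blk_header {
	uint32_t rec_type;	/* BLK_REC_* */
	uint32_t xform;		/* OR of BLK_XF_* */
	uint32_t orig_size;	/* size the initiator wrote */
	uint32_t stored_size;	/* bytes that follow this header */
};

/* Write header then payload at 'off'; return offset of next record or -1. */
static off_t blk_write(int fd, off_t off, const struct blk_header *h,
		       const void *payload)
{
	assert(h != NULL);
	if (pwrite(fd, h, sizeof(*h), off) != (ssize_t)sizeof(*h))
		return -1;
	off += (off_t)sizeof(*h);
	if (h->stored_size &&
	    pwrite(fd, payload, h->stored_size, off) != (ssize_t)h->stored_size)
		return -1;
	return off + (off_t)h->stored_size;
}

/* Read header, then as many payload bytes as the header says are stored. */
static off_t blk_read(int fd, off_t off, struct blk_header *h,
		      void *buf, size_t buflen)
{
	assert(h != NULL);
	if (pread(fd, h, sizeof(*h), off) != (ssize_t)sizeof(*h))
		return -1;
	off += (off_t)sizeof(*h);
	if (h->stored_size > buflen)
		return -1;
	if (h->stored_size &&
	    pread(fd, buf, h->stored_size, off) != (ssize_t)h->stored_size)
		return -1;
	return off + (off_t)h->stored_size;
}

/* Round trip: a compressed+encrypted block, a dedup token, then EOD. */
int blk_roundtrip_demo(void)
{
	FILE *fp = tmpfile();
	if (!fp)
		return -1;
	int fd = fileno(fp);
	const char data[] = "pretend this is compressed+encrypted";
	const char token[] = "0123456789abcdef";	/* stand-in fingerprint */
	struct blk_header h = { BLK_REC_DATA,
		BLK_XF_COMPRESSED | BLK_XF_ENCRYPTED, 65536, sizeof(data) };
	struct blk_header t = { BLK_REC_DATA, BLK_XF_DEDUP, 65536, sizeof(token) };
	struct blk_header eod = { BLK_REC_EOD, BLK_XF_NONE, 0, 0 };

	off_t end = blk_write(fd, 0, &h, data);
	end = (end < 0) ? end : blk_write(fd, end, &t, token);
	end = (end < 0) ? end : blk_write(fd, end, &eod, NULL);

	char buf[64];
	struct blk_header r;
	int ok = end > 0;
	off_t pos = ok ? blk_read(fd, 0, &r, buf, sizeof(buf)) : -1;
	ok = ok && pos > 0 && r.orig_size == 65536 &&
	     r.xform == (BLK_XF_COMPRESSED | BLK_XF_ENCRYPTED) &&
	     memcmp(buf, data, sizeof(data)) == 0;
	pos = ok ? blk_read(fd, pos, &r, buf, sizeof(buf)) : -1;
	ok = ok && pos > 0 && (r.xform & BLK_XF_DEDUP) &&
	     memcmp(buf, token, sizeof(token)) == 0;
	pos = ok ? blk_read(fd, pos, &r, buf, sizeof(buf)) : -1;
	ok = ok && pos == end && r.rec_type == BLK_REC_EOD;
	fclose(fp);
	return ok ? 0 : -1;
}
```

The reader only ever trusts stored_size from the header to decide how much to fetch, which is the part that gets messy if it has to be glued into every existing backing store. For a dedup block the payload here is just a token; looking the token up in some other data service is left out completely.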