[Sheepdog] object sizes

Liu Yuan namei.unix at gmail.com
Fri Nov 11 17:04:47 CET 2011


On 11/11/2011 11:22 PM, MORITA Kazutaka wrote:

> At Fri, 11 Nov 2011 07:59:54 -0500,
> Christoph Hellwig wrote:
>>
>> Currently sheepdog uses a fixed object size of 4 megabytes.  For getting
>> good performance out of spinning media, this relatively small size has
>> a few issues:
>>
>>  - disks need 64 to 128 contiguous megabytes to make the seek penalty
>>    invisible during normal streaming workloads.  While at least XFS
>>    is fairly good at laying out multiple sheepdog objects contiguously
>>    for a single writer, we still get occasional metadata allocations
>>    in between.  The situation is much worse if we have multiple parallel
>>    writers and they hit the same allocation group.
>>  - there is a non-trivial metadata overhead: e.g. for a 1 GB streaming
>>    write to a sheepdog volume we need to allocate 256 inodes and flush
>>    them out before returning to the caller under the current
>>    write-through cache model, all of which causes seeks.
>>
>> To fix this I'd love to see the option of an object size ~128MB.  There
>> are two issues with that:
>>
>>  - we will use up more space if the volume is written randomly.
>>    For most cloud setups this is entirely acceptable, though.
>>  - we need to copy a lot of data when processing copy on write requests.
>>
>> The latter is more of a concern to me. I can think of two mitigation
>> strategies:
>>
>>  - make the COW block size smaller than the object size.  This is
>>    similar to the subclusters under development for qcow2.  This
>>    could e.g. be implemented by an extended attribute on the file
>>    containing a bitmap of regions in an object that haven't been
>>    copied up yet.
>>  - make use of features in some filesystems to create a new file that
>>    shares its data with an existing file, aka reflinks in ocfs2
>>    and btrfs.
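[Editor's note: the first mitigation above, a per-object bitmap of copied-up regions, can be sketched as follows. All names and the 32 KB block size are illustrative assumptions, not sheepdog code; in a real implementation the bitmap would be persisted as an extended attribute on the object file.]

```python
# Illustrative sketch of sub-object copy-on-write tracking: a bitmap
# records which fixed-size blocks of an object have already been copied
# up from the parent, so a COW request copies one block instead of the
# whole (possibly 128 MB) object.  Names and sizes are hypothetical.
COW_BLOCK_SIZE = 32 * 1024  # assumed sub-object COW granularity

class CowObject:
    def __init__(self, object_size, parent_read):
        self.nblocks = object_size // COW_BLOCK_SIZE
        self.copied = bytearray((self.nblocks + 7) // 8)  # the bitmap
        self.blocks = {}                # block index -> bytes; stands in for the file
        self.parent_read = parent_read  # callback reading a block of the parent

    def _is_copied(self, i):
        return self.copied[i // 8] & (1 << (i % 8)) != 0

    def _mark_copied(self, i):
        self.copied[i // 8] |= 1 << (i % 8)

    def write_block(self, i, data):
        # A full-block write never needs to read the parent object.
        self.blocks[i] = data
        self._mark_copied(i)

    def read_block(self, i):
        if not self._is_copied(i):
            # Copy up lazily: only this block, not the whole object.
            self.blocks[i] = self.parent_read(i)
            self._mark_copied(i)
        return self.blocks[i]
```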
> 
> Currently, Yuan is working on adding some features to the Sheepdog
> store, and I heard that he'll send the patch next week.
> 
> Yuan, will something like a cluster-wide reflink be supported in the
> framework you are implementing?
> 

I am not sure what a reflink is; is it a synonym for snapshot? If I
understand correctly, yes: the new storage infrastructure, named 'farm',
will support cluster-wide snapshots.

Simply put, it resembles git quite a lot, at both the code and the
design level. There are three object types, named 'data', 'trunk', and
'snapshot', which correspond to git's 'blob', 'tree', and 'commit'.

A 'data' object is just sheepdog's data object, named solely by the
SHA-1 of its content. Data objects with the same content therefore map
to a single SHA-1 name, which achieves node-wide data sharing.
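Content addressing of this kind can be sketched in a few lines; the store below is a plain dict standing in for the disk, an illustration of the naming scheme only, not farm's actual code:

```python
import hashlib

# Minimal sketch of content-addressed 'data' objects: an object's name
# is the SHA-1 of its content, so identical contents collapse to a
# single stored copy (node-wide sharing / deduplication), like git blobs.
store = {}

def put_data(content):
    name = hashlib.sha1(content).hexdigest()
    store[name] = content  # storing the same content twice is a no-op
    return name

def get_data(name):
    return store[name]
```

Two writes of the same object contents yield the same name and occupy a single slot in the store.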

A 'snapshot' object supports snapshots: it contains the snapshotted
trunk, which is a 'directory' of the data objects that existed on each
node at snapshot time. The trunk object provides the means to find the
data objects. Together these support cluster-wide snapshots.
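Under that description, a trunk is just a mapping from object IDs to data-object names, and a snapshot wraps a trunk with a parent pointer. A hypothetical sketch in the same style (none of these names are farm's actual API, and the dict again stands in for the disk):

```python
import hashlib
import json

# Hypothetical sketch of farm's git-like layering: a 'trunk' object is
# a directory mapping sheepdog object IDs to data-object SHA-1 names
# (like a git tree), and a 'snapshot' object records a trunk plus its
# parent snapshot (like a git commit).  All names are illustrative.
farm = {}

def farm_put(payload):
    # Store any object under the SHA-1 of its serialized content.
    name = hashlib.sha1(payload).hexdigest()
    farm[name] = payload
    return name

def farm_put_trunk(oid_to_data):
    # Serialize deterministically so identical trunks share one name.
    return farm_put(json.dumps(oid_to_data, sort_keys=True).encode())

def farm_put_snapshot(trunk_name, parent_name):
    return farm_put(json.dumps({"trunk": trunk_name,
                                "parent": parent_name}).encode())

def farm_lookup(snapshot_name, oid):
    # Follow snapshot -> trunk -> data, as described above.
    snap = json.loads(farm[snapshot_name])
    trunk = json.loads(farm[snap["trunk"]])
    return farm[trunk[oid]]
```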

'farm' doesn't impose any constraint on the object data size. Hope this helps.

The prototype will probably be released at the end of next week; then we
can discuss things concretely.

Thanks,
Yuan
