Fri Nov 11 13:59:54 CET 2011

Currently sheepdog uses a fixed object size of 4 Megabytes.  For getting
good performance out of spinning media this relatively small size has
a few issues:

 - disks need 64 to 128 contiguous megabytes to make the seek penalty
   invisible during normal streaming workloads.  While at least XFS
   is fairly good at laying out multiple sheepdog objects contiguously
   for a single writer, we still get occasional metadata allocations
   in between.  The situation is much worse if we have multiple parallel
   writers and they hit the same allocation group.
 - there is a non-trivial metadata overhead, e.g. for a 1GB streaming
   write to a sheepdog volume we need to allocate 256 inodes, and with
   the current write-through cache model we have to flush them out
   before returning to the caller, all of which causes seeks.
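
The inode count above is just the write size divided by the object
size.  A minimal sketch of that arithmetic, assuming 1GB means 2^30
bytes and one backing file per object; objects_touched() is a
hypothetical helper, not an existing sheepdog function:

```c
#include <stdint.h>

/*
 * Number of objects (and so backing files/inodes) that a write of
 * `len` bytes starting at byte `offset` of the volume touches, for a
 * given object size.  Hypothetical helper for illustration only.
 */
static uint64_t objects_touched(uint64_t offset, uint64_t len,
                                uint64_t obj_size)
{
    uint64_t first, last;

    if (len == 0 || obj_size == 0)
        return 0;

    first = offset / obj_size;             /* index of first object hit */
    last = (offset + len - 1) / obj_size;  /* index of last object hit */
    return last - first + 1;
}
```

With these assumptions an aligned 1GB write touches 256 objects at a
4MB object size and 8 objects at 128MB.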

To fix this I'd love to see an option for an object size of ~128MB.
There are two issues with that:

 - we will use up more space if the volume is written to randomly.
   For most cloud setups this is entirely acceptable, though.
 - we need to copy a lot of data when processing copy on write requests.

The latter is more of a concern to me. I can think of two mitigations:

 - make the COW block size smaller than the object size.  This is
   similar to the subclusters under development for qcow2.  This
   could e.g. be implemented by an extended attribute on the file
   containing a bitmap of regions in an object that haven't been
   copied up yet.
 - make use of features in some filesystems to create a new file that
   shares its data with an existing file, aka reflinks in ocfs2
   and btrfs.
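
To make the first idea a bit more concrete, here is a rough sketch of
the bitmap bookkeeping.  Everything in it is an assumption for
illustration: the 128MB object size, the 4MB COW block size, and the
names cow_map, cow_map_init() and cow_prepare_write().  The bitmap
lives in memory here; loading it from and storing it to an extended
attribute, and the actual data copy, are left out.

```c
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE   (128ULL << 20)          /* assumed object size */
#define COW_BLOCK  (4ULL << 20)            /* assumed COW granularity */
#define NR_REGIONS (OBJ_SIZE / COW_BLOCK)  /* 32 regions per object */

/* One bit per region; a set bit means "not copied up yet". */
struct cow_map {
    uint8_t bits[NR_REGIONS / 8];
};

/* A freshly cloned object has no region copied up yet. */
static void cow_map_init(struct cow_map *m)
{
    memset(m->bits, 0xff, sizeof(m->bits));
}

static int region_pending(const struct cow_map *m, unsigned int r)
{
    return (m->bits[r / 8] >> (r % 8)) & 1;
}

static void region_done(struct cow_map *m, unsigned int r)
{
    m->bits[r / 8] &= (uint8_t)~(1u << (r % 8));
}

/*
 * Update the bitmap for a write of `len` bytes at byte `off` inside
 * the object and return how many bytes would have to be copied from
 * the parent object first.  Regions the write covers completely need
 * no copy, only partially overwritten pending regions do.
 */
static uint64_t cow_prepare_write(struct cow_map *m, uint64_t off,
                                  uint64_t len)
{
    uint64_t to_copy = 0;
    unsigned int r, first, last;

    if (len == 0 || off >= OBJ_SIZE || len > OBJ_SIZE - off)
        return 0;  /* empty or out-of-range request: nothing to do */

    first = (unsigned int)(off / COW_BLOCK);
    last = (unsigned int)((off + len - 1) / COW_BLOCK);

    for (r = first; r <= last; r++) {
        uint64_t start = (uint64_t)r * COW_BLOCK;
        uint64_t end = start + COW_BLOCK;

        if (!region_pending(m, r))
            continue;  /* already private to this object */
        if (off > start || off + len < end)
            to_copy += COW_BLOCK;  /* real code would copy the region here */
        region_done(m, r);
    }
    return to_copy;
}
```

Under these assumptions the first small write into a region costs a
4MB copy instead of a 128MB one, and later writes to the same region
cost no copy at all.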

Did anyone look into larger object sizes?

