Currently sheepdog uses a fixed object size of 4 megabytes. For getting good
performance out of spinning media this relatively small size has a few issues:

 - disks need 64 to 128 contiguous megabytes to make the seek penalty
   invisible during normal streaming workloads. While at least XFS is fairly
   good at laying out multiple sheepdog objects contiguously for a single
   writer, we still get occasional metadata allocations in between. The
   situation is much worse if we have multiple parallel writers and they hit
   the same allocation group.
 - there is a non-trivial metadata overhead, e.g. for a 1GB streaming write
   to a sheepdog volume we need to allocate 256 inodes and flush them out
   before returning to the caller with the current write-through cache model,
   all of which causes seeks.

To fix this I'd love to see the option of an object size of ~128MB. There are
two issues with that:

 - we will use up more space if the volume is written to randomly. For most
   cloud setups this is entirely acceptable, though.
 - we need to copy a lot of data when processing copy-on-write requests.

The latter is more of a concern to me. I can think of two mitigation
strategies:

 - make the COW block size smaller than the object size. This is similar to
   the subclusters under development for qcow2. It could e.g. be implemented
   by an extended attribute on the file containing a bitmap of the regions in
   an object that haven't been copied up yet (see the first sketch at the end
   of this mail).
 - make use of features in some filesystems to create a new file that shares
   its data with an existing file, aka reflinks in ocfs2 and btrfs (see the
   second sketch).

Did anyone look into larger object sizes?
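
To make the bitmap idea a bit more concrete, here is a rough sketch. The
xattr name, the 1MB COW granularity and the helper names are all made up for
illustration; none of this is existing sheepdog code.

/*
 * Sketch: track copied-up regions of a large object in an xattr-backed
 * bitmap.  The xattr name and COW block size are illustrative only.
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

#define OBJECT_SIZE	(128ULL << 20)	/* proposed larger object size */
#define COW_BLOCK_SIZE	(1ULL << 20)	/* hypothetical COW granularity */
#define COW_NR_BLOCKS	(OBJECT_SIZE / COW_BLOCK_SIZE)
#define COW_BITMAP_SIZE	(COW_NR_BLOCKS / 8)

#define COW_XATTR	"user.sheepdog.cow_bitmap"

/* Load the copy-up bitmap; a missing xattr means nothing is copied yet. */
static int cow_bitmap_read(int fd, uint8_t *bitmap)
{
	ssize_t ret;

	memset(bitmap, 0, COW_BITMAP_SIZE);
	ret = fgetxattr(fd, COW_XATTR, bitmap, COW_BITMAP_SIZE);
	if (ret < 0 && errno != ENODATA)
		return -1;
	return 0;
}

/* Mark one COW block as copied up and persist the bitmap. */
static int cow_bitmap_set(int fd, uint8_t *bitmap, uint64_t offset)
{
	uint64_t idx = offset / COW_BLOCK_SIZE;

	bitmap[idx / 8] |= 1 << (idx % 8);
	return fsetxattr(fd, COW_XATTR, bitmap, COW_BITMAP_SIZE, 0);
}

/* Has this COW block already been copied from the parent object? */
static int cow_block_present(const uint8_t *bitmap, uint64_t offset)
{
	uint64_t idx = offset / COW_BLOCK_SIZE;

	return bitmap[idx / 8] & (1 << (idx % 8));
}

On a write the gateway would check cow_block_present() and only copy the
affected 1MB block from the parent object instead of the full 128MB, at the
cost of one extra xattr update per newly touched block.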
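For the reflink variant, something along these lines should work on btrfs via
its clone ioctl (exposed as FICLONE on current kernels); ocfs2 has its own
reflink interface. Again this is just a sketch, not code from the sheepdog
tree, and error handling is minimal.

/*
 * Sketch: create a child object file that shares all data blocks with the
 * parent object via the clone ioctl.  No data is copied up front; the
 * filesystem breaks the sharing block by block on overwrite.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */

static int clone_object(const char *parent_path, const char *child_path)
{
	int src_fd, dst_fd, ret = -1;

	src_fd = open(parent_path, O_RDONLY);
	if (src_fd < 0)
		return -1;

	dst_fd = open(child_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (dst_fd < 0)
		goto out_src;

	/* Share extents with the parent instead of copying 128MB. */
	ret = ioctl(dst_fd, FICLONE, src_fd);

	close(dst_fd);
out_src:
	close(src_fd);
	return ret;
}

The obvious downside is that this ties the larger object size to filesystems
that support reflinks, while the bitmap approach works on anything with
xattrs.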