[sheepdog] [PATCH 09/17] block: Refactor bdrv_has_zero_init{, _truncate}
Eric Blake
eblake at redhat.com
Wed Feb 5 19:39:48 CET 2020
On 2/5/20 11:22 AM, Max Reitz wrote:
>
>>> And thus callers which just want the trivially obtainable
>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>> even though they don’t care about that flag.
>>
>> True, but only to a minor extent; and the documentation mentions that
>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>> block_status loop.
>
> So it must be less expensive than an arbitrarily complex loop. I think
> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?
If I recall, the tmpfs bug was that a single lseek(SEEK_DATA/SEEK_HOLE)
call was O(n), where n is the number of extents prior to the starting
offset. The probe at offset 0 is O(1) (because there are no prior
extents), whether it reaches the end of the file (the entire image is a
hole) or hits data beforehand. It is only probes at later offsets where
the speed penalty sets in, and where what should be an O(n) loop over
all extents turns into O(n^2) traversal time because each later lookup
is itself O(n).
>
> What I’m trying to say is that this is not a good limit and can mean
> anything.
>
> I do think this limit definition makes sense for callers that want to
> know about ZERO_OPEN. But I don’t know why we would have to let other
> callers wait, too.
Keeping separate functions may still be the right approach for v2,
although I'd still like to name things better ('has_zero_init' vs.
'has_zero_init_truncate' did not work well for me). And if I'm renaming
things, then I'm touching just as much code whether I rename and keep
separate functions or rename and consolidate into one.
>
>> Meanwhile, callers tend to only care about
>> bdrv_known_zeroes() right after opening an image or right before
>> resizing (not repeatedly during runtime);
>
> Hm, yes. I was thinking of parallels, but that only checks once in
> parallels_open(), so it’s OK.
>
>> and you also argued elsewhere
>> in this thread that it may be worth having the block layer cache
>> BDRV_ZERO_OPEN and update the cache on any write,
>
> I didn’t say the block layer, but if it makes sense.
>
>> at which point, the
>> expense in the driver callback really is a one-time call during
>> bdrv_co_open().
>
> It definitely doesn’t make sense to me to do that call unconditionally
> in bdrv_co_open().
Okay, you have a point there - while 'qemu-img convert' cares, not all
clients of bdrv_co_open() are worried about whether the existing
contents are zero; so unconditionally priming a cache during
bdrv_co_open() is not as wise as computing the information only when it
will actually be useful. On the other hand, if it is something that
clients only consult when first opening an image, caching the data
doesn't make much sense either.
So we know that bdrv_has_zero_init() is only viable on a just-created
image, and bdrv_has_zero_init_truncate() is only viable if you are about
to resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).
Hmm - thinking aloud: our ultimate goal is to make life easier for
algorithms that can be sped up IF the image is currently known to be
all zero. Maybe what that means is that we really want to teach
bdrv_make_zero() to do all the work, something along the lines of the
following (rough sketch further below):
- if the image is known to already be all zeroes using an O(1) probe
(this includes if the image was freshly created and creation sees all
zeroes, or if a block_status at offset 0 shows a hole for the entire
image, or if an NBD extension advertises all zero at connection
time...), return success
- if the image has a FAST truncate, and resizing reads zeroes, we can
truncate to size 0 and back to the desired size, then return success;
determining if truncate is fast should be similar to how
BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can
request a write zeroes over the whole image, which will either succeed
(the image is now quickly zero) or fail (writing zeroes as we go is the
best we can do)
- if the image could report that it is all zeroes, but only at the cost
of O(n) work such as a loop over block_status (or even O(n^2) with the
tmpfs lseek bug), it's easier to report failure than to worry about
making the image read all zeroes
qemu-img would then only ever need to consult --target-is-zero and
bdrv_make_zero(), and not worry about any other function calls, while
the block layer takes care of coordinating whatever call sequence makes
the most sense and reports success or failure at getting the image into
an all-zero state if it was not already there.
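Roughly, as a control-flow sketch in pseudocode-flavored C (every helper
name here is made up for illustration; only
bdrv_has_zero_init_truncate(), PREALLOC_MODE_OFF, and the
BDRV_REQ_NO_FALLBACK flag are existing pieces):

/* Sketch only: is_all_zeroes_fast(), truncate_is_fast(),
 * truncate_to_zero_and_back(), and write_zeroes_no_fallback() are
 * hypothetical helpers, not existing QEMU functions. */
static int make_zero_sketch(BlockDriverState *bs, int64_t size)
{
    if (is_all_zeroes_fast(bs)) {
        /* O(1) knowledge: freshly created with zero contents, a single
         * block_status probe at offset 0 covering the whole image, or
         * an NBD server advertising all-zero at connection time. */
        return 0;
    }

    if (truncate_is_fast(bs) && bdrv_has_zero_init_truncate(bs)) {
        /* Shrink to size 0 and regrow with PREALLOC_MODE_OFF; the
         * regrown bytes read as zero. */
        return truncate_to_zero_and_back(bs, size);
    }

    if (write_zeroes_no_fallback(bs, 0, size) == 0) {
        /* A BDRV_REQ_NO_FALLBACK write-zeroes over the whole image
         * succeeded quickly. */
        return 0;
    }

    /* An O(n) block_status loop (or worse, with the tmpfs lseek bug)
     * is not worth it: report failure and let the caller write zeroes
     * as it copies. */
    return -ENOTSUP;
}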
>
>> And in that case, whether the one-time expense is done
>> via a single function call or via three driver callbacks, the amount of
>> work is the same; but the driver callback interface is easier if there
>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>> though BlockDriverInfo tracks much more than that boolean).
>>
>> In fact, it may be worth consolidating known zeroes support into
>> BlockDriverInfo.
>
> I’m very skeptical of that. BDI already has the problem that it doesn’t
> know which of the information the caller actually wants and that it is
> sometimes used in a quasi-hot path.
>
> Maybe that means it is indeed time to incorporate it into BDI, but the
> caller should have a way of specifying what parts of BDI it actually
> needs and then drivers can skip anything that isn’t trivially obtainable
> that the caller doesn’t need.
I'm reminded of the recent kernel addition of statx(); the traditional
stat()/fstat() interfaces really don't know which bits of information
you care about, so you get everything, but with statx() you can request
only what you plan to use, which may indeed result in speedups.
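For reference (outside of qemu, and assuming Linux 4.11+ with glibc
2.28+): statx() takes a mask of the fields the caller wants, so the
kernel can skip anything expensive that was not requested - something
like:

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_SIZE */

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    /* Ask only for the size; the kernel may leave other fields unset. */
    if (statx(AT_FDCWD, argv[1], 0, STATX_SIZE, &stx) < 0) {
        perror("statx");
        return 1;
    }
    printf("%s: %llu bytes\n", argv[1], (unsigned long long)stx.stx_size);
    return 0;
}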
>> Those are still viable options, but before I repaint the bikeshed along
>> those lines, I'd at least like a review of whether the overall idea of
>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>> regardless of how we implement it as one vs. three driver callbacks.
>
> I’m as hesitant as ever to give a review that this notion is useful,
> because I haven’t seen a practical example yet where the problem isn’t
> the fact that NBD doesn’t have 64-bit write_zeroes support.
Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers
will be compliant with that extension. This will still speed up
operations when talking to older servers that do not support 64-bit
write_zeroes, even if newer qemu is never such a server.
>
> So far, it looks to me like this notion is only really useful for cases
> where we expect a management layer on top of qemu anyway. And then I’m
> not sure that this new feature works reliably enough for such a
> management layer.
>
> (I’m not saying it isn’t useful. Again, intuitively it does seem
> useful. Intuition can be enough to merge a sufficiently simple series
> that doesn’t increase code complexity too much. But I’m still asking
> for actual practical examples, because that would make a better
> argument, of course.)
I'm hoping when I post my NBD patches that I can also provide some
benchmark timing numbers to make the case a bit more concrete.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org