[sheepdog] [PATCH 09/17] block: Refactor bdrv_has_zero_init{, _truncate}
Eric Blake
eblake at redhat.com
Wed Feb 5 19:39:48 CET 2020
On 2/5/20 11:22 AM, Max Reitz wrote:
>
>>> And thus callers which just want the trivially obtainable
>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>> even though they don’t care about that flag.
>>
>> True, but only to a minor extent; and the documentation mentions that
>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>> block_status loop.
>
> So it must be less expensive than an arbitrarily complex loop. I think
> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?
If I recall, the tmpfs bug was that a single lseek(SEEK_DATA/SEEK_HOLE)
call was O(n), where n is the number of extents prior to the starting
offset. The probe at offset 0 is O(1) (because there are no prior
extents), whether it reaches the end of the file (the entire image is a
hole) or hits data beforehand. It is only probes at later offsets where
the speed penalty sets in, and where what should be an O(n) loop over
all extents turns into O(n^2) traversal time because each later lookup
is itself O(n).
>
> What I’m trying to say is that this is not a good limit and can mean
> anything.
>
> I do think this limit definition makes sense for callers that want to
> know about ZERO_OPEN. But I don’t know why we would have to let other
> callers wait, too.
Keeping separate functions may still be the right approach for v2,
although I'd still like to name things better ('has_zero_init' vs.
'has_zero_init_truncate' did not work well for me). And if I'm renaming
things, then I'm touching just as much code whether I rename and keep
separate functions or rename and consolidate into one.
>
>> Meanwhile, callers tend to only care about
>> bdrv_known_zeroes() right after opening an image or right before
>> resizing (not repeatedly during runtime);
>
> Hm, yes. I was thinking of parallels, but that only checks once in
> parallels_open(), so it’s OK.
>
>> and you also argued elsewhere
>> in this thread that it may be worth having the block layer cache
>> BDRV_ZERO_OPEN and update the cache on any write,
>
> I didn’t say the block layer, but if it makes sense.
>
>> at which point, the
>> expense in the driver callback really is a one-time call during
>> bdrv_co_open().
>
> It definitely doesn’t make sense to me to do that call unconditionally
> in bdrv_co_open().
Okay, you have a point there - while 'qemu-img convert' cares, not all
clients of bdrv_co_open() are worried about whether the existing
contents are zero; so unconditionally priming a cache during
bdrv_co_open() is not as wise as computing the information only when it
will actually be useful. On the other hand, if it is something that
clients only consult when first opening an image, caching the data
doesn't make much sense either.
So we know that bdrv_has_zero_init() is only viable on a just-created
image, and bdrv_has_zero_init_truncate() is only viable if you are about
to resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).
Hmm - thinking aloud: our ultimate goal is to make life easier for
algorithms that can be sped up IF the image is currently known to be
all zero. Maybe what that means is that we really want to teach
bdrv_make_zero() to do all the work, something along the lines of the
following (rough sketch further below):
- if the image is known to already be all zeroes using an O(1) probe
(this includes if the image was freshly created and creation sees all
zeroes, or if a block_status at offset 0 shows a hole for the entire
image, or if an NBD extension advertises all zero at connection
time...), return success
- if the image has a FAST truncate, and resizing reads zeroes, we can
truncate to size 0 and back to the desired size, then return success;
determining if truncate is fast should be similar to how
BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can
request a write zeroes over the whole image, which will either succeed
(the image is now quickly zero) or fail (writing zeroes as we go is the
best we can do)
- if the image could report that it is all zeroes, but only at the cost
of O(n) work such as a loop over block_status (or even O(n^2) with the
tmpfs lseek bug), it's easier to report failure than to worry about
making the image read all zeroes
qemu-img would then only ever need to consult --target-is-zero and
bdrv_make_zero(), and not worry about any other function calls, while
the block layer takes care of coordinating whatever call sequence makes
the most sense and reports success or failure at getting the image into
an all-zero state if it was not already there.
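Roughly, as a control-flow sketch in pseudocode-flavored C (every helper
name here is made up for illustration; only
bdrv_has_zero_init_truncate(), PREALLOC_MODE_OFF, and the
BDRV_REQ_NO_FALLBACK flag are existing pieces):

/* Sketch only: is_all_zeroes_fast(), truncate_is_fast(),
 * truncate_to_zero_and_back(), and write_zeroes_no_fallback() are
 * hypothetical helpers, not existing QEMU functions. */
static int make_zero_sketch(BlockDriverState *bs, int64_t size)
{
    if (is_all_zeroes_fast(bs)) {
        /* O(1) knowledge: freshly created with zero contents, a single
         * block_status probe at offset 0 covering the whole image, or
         * an NBD server advertising all-zero at connection time. */
        return 0;
    }

    if (truncate_is_fast(bs) && bdrv_has_zero_init_truncate(bs)) {
        /* Shrink to size 0 and regrow with PREALLOC_MODE_OFF; the
         * regrown bytes read as zero. */
        return truncate_to_zero_and_back(bs, size);
    }

    if (write_zeroes_no_fallback(bs, 0, size) == 0) {
        /* A BDRV_REQ_NO_FALLBACK write-zeroes over the whole image
         * succeeded quickly. */
        return 0;
    }

    /* An O(n) block_status loop (or worse, with the tmpfs lseek bug)
     * is not worth it: report failure and let the caller write zeroes
     * as it copies. */
    return -ENOTSUP;
}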
>
>> And in that case, whether the one-time expense is done
>> via a single function call or via three driver callbacks, the amount of
>> work is the same; but the driver callback interface is easier if there
>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>> though BlockDriverInfo tracks much more than that boolean).
>>
>> In fact, it may be worth consolidating known zeroes support into
>> BlockDriverInfo.
>
> I’m very skeptical of that. BDI already has the problem that it doesn’t
> know which of the information the caller actually wants and that it is
> sometimes used in a quasi-hot path.
>
> Maybe that means it is indeed time to incorporate it into BDI, but the
> caller should have a way of specifying what parts of BDI it actually
> needs and then drivers can skip anything that isn’t trivially obtainable
> that the caller doesn’t need.
I'm reminded of the recent kernel addition of statx(); the traditional
stat()/fstat() interfaces really don't know which bits of information
you care about, so you get everything, but with statx() you can request
only what you plan to use, which may indeed result in speedups.
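For reference (outside of qemu, and assuming Linux 4.11+ with glibc
2.28+): statx() takes a mask of the fields the caller wants, so the
kernel can skip anything expensive that was not requested - something
like:

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_SIZE */

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    /* Ask only for the size; the kernel may leave other fields unset. */
    if (statx(AT_FDCWD, argv[1], 0, STATX_SIZE, &stx) < 0) {
        perror("statx");
        return 1;
    }
    printf("%s: %llu bytes\n", argv[1], (unsigned long long)stx.stx_size);
    return 0;
}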
>> Those are still viable options, but before I repaint the bikeshed along
>> those lines, I'd at least like a review of whether the overall idea of
>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>> regardless of how we implement it as one vs. three driver callbacks.
>
> I’m as hesitant as ever to give a review that this notion is useful,
> because I haven’t seen a practical example yet where the problem isn’t
> the fact that NBD doesn’t have 64-bit write_zeroes support.
Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers
will be compliant with that extension. This will still speed up
operations when talking to older servers that do not support 64-bit
write_zeroes, even if newer qemu is never such a server.
>
> So far, it looks to me like this notion is only really useful for cases
> where we expect a management layer on top of qemu anyway. And then I’m
> not sure that this new feature works reliably enough for such a
> management layer.
>
> (I’m not saying it isn’t useful. Again, intuitively it does seem
> useful. Intuition can be enough to merge a sufficiently simple series
> that doesn’t increase code complexity too much. But I’m still asking
> for actual practical examples, because that would make a better
> argument, of course.)
I'm hoping when I post my NBD patches that I can also provide some
benchmark timing numbers to make the case a bit more concrete.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org