[sheepdog] [PATCH 09/17] block: Refactor bdrv_has_zero_init{, _truncate}
Max Reitz
mreitz at redhat.com
Thu Feb 6 10:18:38 CET 2020
On 05.02.20 19:39, Eric Blake wrote:
> On 2/5/20 11:22 AM, Max Reitz wrote:
>
>>
>>>> And thus callers which just want the trivially obtainable
>>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>>> even though they don’t care about that flag.
>>>
>>> True, but only to a minor extent; and the documentation mentions that
>>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>>> block_status loop.
>>
>> So it must be less expensive than an arbitrarily complex loop. I think
>> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?
>
> If I recall, the tmpfs bug was that it was O(n) where n was based on the
> initial offset and the number of extents prior to that offset. The
> probe at offset 0 is O(1) (because there are no prior extents), whether
> it reaches the end of the file (the entire image is a hole) or hits data
> beforehand. It is only probes at later offsets where the speed penalty
> sets in, and where an O(n) loop over all extents turned into an O(n^2)
> traversal time due to the O(n) nature of each later lookup.
So it’s O(n/2) for each lookup on average, which is O(n). O:-)
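(Aside: the O(1) probe at offset 0 is essentially the following, in
plain POSIX terms; file_is_all_hole() is made up for illustration, the
real check goes through the driver’s block_status callback:)

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Is the file one big hole (and therefore guaranteed to read as
 * zeroes)?  Seeking for data from offset 0 never has to walk over
 * earlier extents, so even the tmpfs implementation discussed above
 * answers this in O(1). */
static bool file_is_all_hole(int fd)
{
    if (lseek(fd, 0, SEEK_DATA) < 0) {
        /* ENXIO: no data at or after offset 0, i.e. entirely sparse */
        return errno == ENXIO;
    }
    return false;   /* at least one data extent exists */
}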
>> What I’m trying to say is that this is not a good limit and can mean
>> anything.
>>
>> I do think this limit definition makes sense for callers that want to
>> know about ZERO_OPEN. But I don’t know why we would have to let other
>> callers wait, too.
>
> Keeping separate functions may still be the right approach for v2,
> although I'd still like to name things better ('has_zero_init' vs.
> 'has_zero_init_truncate' did not work well for me). And if I'm renaming
> things, then I'm touching just as much code whether I rename and keep
> separate functions or rename and consolidate into one.
I definitely don’t disagree about renaming, and if you think that
consolidating the functions is worth it, then it probably makes sense
(because you have the experience there, given this series).
But I’d still like to point out that a rename is an easier change to
make and to review than a consolidation, even if you end up with the
same number of hunks.
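(For readers following along, the two shapes being weighed look roughly
like this; BDRV_ZERO_TRUNCATE and BDRV_ZERO_OPEN are the names used in
this thread, the creation flag name below is only a placeholder:)

/* Status quo, modulo the rename: one predicate per situation. */
int bdrv_has_zero_init(BlockDriverState *bs);          /* just-created image */
int bdrv_has_zero_init_truncate(BlockDriverState *bs); /* bytes added by truncation */

/* Consolidated variant: one call returning a bitmask.  This is what
 * forces a caller that only wants the cheap truncate bit to also pay
 * for whatever the BDRV_ZERO_OPEN probe costs. */
#define BDRV_ZERO_CREATE   (1 << 0)  /* placeholder: reads zero right after creation */
#define BDRV_ZERO_TRUNCATE (1 << 1)  /* reads zero in bytes added by truncation */
#define BDRV_ZERO_OPEN     (1 << 2)  /* reads zero right now */

int bdrv_known_zeroes(BlockDriverState *bs);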
>>> Meanwhile, callers tend to only care about
>>> bdrv_known_zeroes() right after opening an image or right before
>>> resizing (not repeatedly during runtime);
>>
>> Hm, yes. I was thinking of parallels, but that only checks once in
>> parallels_open(), so it’s OK.
>>
>>> and you also argued elsewhere
>>> in this thread that it may be worth having the block layer cache
>>> BDRV_ZERO_OPEN and update the cache on any write,
>>
>> I didn’t say the block layer, but sure, if it makes sense.
>>
>>> at which point, the
>>> expense in the driver callback really is a one-time call during
>>> bdrv_co_open().
>>
>> It definitely doesn’t make sense to me to do that call unconditionally
>> in bdrv_co_open().
>
> Okay, you have a point there - while 'qemu-img convert' cares, not all
> clients of bdrv_co_open() are worried about whether the existing
> contents are zero; so unconditionally priming a cache during
> bdrv_co_open is not as wise as doing things when it will actually be
> useful information. On the other hand, if it is something that clients
> only use when first opening an image, caching data doesn't make much
> sense either.
>
> So, we know that bdrv_has_zero_init() is only viable on a just-created
> image, and bdrv_has_zero_init_truncate() is only viable if you are about
> to resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).
>
> Hmm - thinking aloud: our ultimate goal is that we want to make it
> easier for algorithms that can be sped up IF the image is currently
> known to be all zero. Maybe what this means is that we really want to
> be tweaking bdrv_make_zero() to do all the work, something along the
> lines of:
> - if the image is known to already be all zeroes using an O(1) probe
> (this includes if the image was freshly created and creation sees all
> zeroes, or if a block_status at offset 0 shows a hole for the entire
> image, or if an NBD extension advertises all zero at connection
> time...), return success
[Insert case here: If the image has a custom make_zeroes implementation,
use it, and return success]
> - if the image has a FAST truncate, and resizing reads zeroes, we can
> truncate to size 0 and back to the desired size, then return success;
> determining if truncate is fast should be similar to how
> BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
> - if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can
> request a write zeroes over the whole image, which will either succeed
> (the image is now quickly zero) or fail (writing zeroes as we go is the
> best we can do)
> - if the image could report that it is all zeroes, but only at the cost
> of O(n) work such as a loop over block_status (or even O(n^2) with the
> tmpfs lseek bug), it's easier to report failure than to worry about
> making the image read all zeroes
>
> qemu-img would then only ever need to consult --target-is-zero and
> bdrv_make_zero(), and not worry about any other function calls; while
> the block layer would take care of coordinating whatever other call
> sequences make the most sense in reporting success or failure in getting
> the image into an all-zero state if it was not already there.
(As I just wrote on the cover letter thread:) Sounds good to me.
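To make that flow a bit more concrete, here is a sketch (not actual QEMU
code; the lowercase helpers are made up and stand in for the existing
block_status, bdrv_co_truncate(... PREALLOC_MODE_OFF ...) and
bdrv_co_pwrite_zeroes(... BDRV_REQ_NO_FALLBACK ...) interfaces):

static int coroutine_fn make_zero_sketch(BdrvChild *child)
{
    BlockDriverState *bs = child->bs;
    int64_t size = bdrv_getlength(bs);

    if (size < 0) {
        return size;
    }

    /* 1. O(1) knowledge that everything already reads as zero: freshly
     *    created image, block_status reporting one hole from offset 0
     *    to EOF, or a protocol-level "image is zero" hint. */
    if (image_known_all_zero(bs)) {
        return 0;
    }

    /* 2. (The case inserted above) the driver has its own make-zeroes
     *    implementation: use it and be done. */
    if (driver_has_make_zero(bs)) {
        return driver_make_zero(bs);
    }

    /* 3. Truncation is fast and truncated-in bytes read as zero: shrink
     *    to 0 and grow back to the original size. */
    if (truncate_is_fast_and_zero(bs)) {
        if (truncate_to(child, 0) == 0 && truncate_to(child, size) == 0) {
            return 0;
        }
        return -EIO;
    }

    /* 4. Whole-image write zeroes, refusing any slow fallback to
     *    explicit zero buffers. */
    if (write_zeroes_no_fallback(child, 0, size) == 0) {
        return 0;
    }

    /* 5. Anything beyond this costs O(n) or worse (block_status loops,
     *    the tmpfs lseek issue): report failure and let the caller zero
     *    as it copies. */
    return -ENOTSUP;
}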
>>> And in that case, whether the one-time expense is done
>>> via a single function call or via three driver callbacks, the amount of
>>> work is the same; but the driver callback interface is easier if there
>>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>>> though BlockDriverInfo tracks much more than that boolean).
>>>
>>> In fact, it may be worth consolidating known zeroes support into
>>> BlockDriverInfo.
>>
>> I’m very skeptical of that.  BDI already has the problem that it
>> doesn’t know which pieces of information the caller actually wants,
>> and that it is sometimes used in a quasi-hot path.
>>
>> Maybe that means it is indeed time to incorporate it into BDI, but the
>> caller should have a way of specifying what parts of BDI it actually
>> needs and then drivers can skip anything that isn’t trivially obtainable
>> that the caller doesn’t need.
>
> I'm reminded of the recent kernel addition of statx(); the traditional
> stat/fstat interfaces really don't know which bits of information you
> care about, so you get everything, but with statx(), you can request
> only what you plan to use, which may indeed result in speedups.
I hope we can put off thinking about it if the known-zeroes check can
simply be put into make_zero(). O:-)
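(For what it’s worth, the statx() analogy in code; the caller states up
front which fields it needs, and the kernel may skip whatever would be
expensive to obtain for the rest:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct statx stx;

    /* Request only the size; fields not in the mask may be skipped. */
    if (statx(AT_FDCWD, "disk.img", 0, STATX_SIZE, &stx) == 0) {
        printf("size: %llu bytes\n", (unsigned long long)stx.stx_size);
    }
    return 0;
}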
>>> Those are still viable options, but before I repaint the bikeshed along
>>> those lines, I'd at least like a review of whether the overall idea of
>>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>>> regardless of how we implement it as one vs. three driver callbacks.
>>
>> I’m as hesitant as ever to confirm in review that this notion is
>> useful, because I haven’t seen a practical example yet where the
>> problem isn’t the fact that NBD doesn’t have 64-bit write_zeroes
>> support.
>
> Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers
> will be compliant with that extension. This will speed up operations
> when talking to older servers which do not support 64-bit writes, even
> if newer qemu is never such a server.
The same applies to reads-all-zeroes, though. Only if both server and
client provide/understand this flag will it do something.
Max