[Stgt-devel] iSER multiple readers

Pete Wyckoff pw
Sat Feb 9 19:18:58 CET 2008


robin.humble+stgt at anu.edu.au wrote on Sat, 09 Feb 2008 11:39 -0500:
> a few of these
>   lmdd of=internal if=/dev/sdc bs=1M count=7000 ipat=1 mismatch=1
> gives:
>  off=116000000 want=6f80000 got=6fa1000
>  off=518000000 want=1eec0000 got=1eee1000
>  off=12000000 want=c40000 got=c5d000
>  off=627000000 want=256e0000 got=256ee000
>  off=344000000 want=148b6000 got=148c0000
>  off=163000000 want=9c40000 got=9c5b000
>  off=11000000 want=b40000 got=b47000
>  off=514000000 want=1eb20000 got=1eb21000
>  off=28000000 want=1b80000 got=1b93000
>  off=78000000 want=4b3d000 got=4b41000
>  off=70000000 want=4360000 got=4381000
>  off=0 want=e0000 got=fb000
>  off=20000000 want=13e0000 got=13fa000
> so always on MB boundaries?

Note that 1M means 1.0e6 here.  Not on MB boundaries.

We know lmdd produces the pattern in ints:  0,4,8,...
It reports the offset of the beginning of the block, 1M long in
this case.  So the first complaint says that the word at byte offset
0x6f80000 in the stream, read as part of the block spanning
0x6ea0500 to 0x6f94740, actually contained the data that belongs at
0x6fa1000, a spot 132 kB (33 pages) further along in the data file.
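
In other words the check amounts to something like this (only a
sketch of how I read it, not lmdd's actual code):

    /*
     * Sketch, not lmdd's code: with ipat=1 the 32-bit word at stream
     * byte offset N is expected to contain N, so a bad "got" value
     * tells you where the data actually came from.
     */
    #include <stdint.h>
    #include <stdio.h>

    static void check_block(const void *buf, size_t len, uint64_t block_off)
    {
        const uint32_t *w = buf;
        size_t i;

        for (i = 0; i < len / sizeof(*w); i++) {
            uint32_t want = block_off + i * sizeof(*w);

            if (w[i] != want)
                printf("off=%llu want=%x got=%x\n",
                       (unsigned long long) block_off, want, w[i]);
        }
    }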

I was going to complain about the non-power-of-2 reads, but in this
case, I think it confirms what we suspected, that the problem is in
page mapping somewhere, either initiator or target, and not in say
bs_rdwr.  That code just does pread, which is pretty unlikely to be
broken, and it does it to non-page boundaries in the mempool buffer
provided by iser.

Here's the list of (decimal) page offsets for the above:

    33 33 29 14 10 27 7 1 19 4 33 27 26
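
(Each of those is just (got - want) / 4096, assuming 4 kB pages;
e.g. 0x6fa1000 - 0x6f80000 = 0x21000 = 33 pages.)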

Doesn't tell me much.

Did you modify your lmdd to show only the first error per transfer?
The code here looks like it would print a want/got line for each
word that differed.  I would be surprised if only the first word of
the page was wrong.

> a few tests show that it's pretty hard to get mismatches with ~ bs=384
> and below.
> 
> with bs=512
>   lmdd of=internal if=/dev/sdc bs=512 count=7000000 ipat=1 mismatch=1
> I get
>  off=1010024448 want=3c33c000 got=3c350000
>  off=1693302784 want=64edc000 got=64eea000
>  off=45203456 want=2b1c000 got=2b27000
>  off=289783808 want=1145c000 got=11460000
>  off=507494400 want=1e3fc000 got=1e40f000
>  off=282181632 want=10d1c000 got=10d30000
>  off=334217216 want=13ebc000 got=13ebe000

Again, page offsets:  20 14 11 4 19 20 2

The offsets are always positive, which fits in with the theory that
future RDMAs are overwriting earlier ones.  This goes against the
theory in your (my) patch, which guesses that the SCSI response
message is sneaking ahead of RDMA operations.

lmdd just does reads of 1.0e6 bytes each into a valloc-ed (so page
aligned) buffer; the read() simply memcpys from the page cache into
that buffer.  We don't see any non-4k-aligned errors there, so the
problem has to be in the page cache or below.  Likewise it can't be
in the mempool handling, or we'd see non-4k-aligned issues.
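
Roughly, that side of things is just the following (again only a
sketch under those assumptions, reusing check_block from the sketch
above, with bs = 1.0e6 here):

    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void read_and_check(int fd, size_t bs)
    {
        char *buf = valloc(bs);        /* page-aligned user buffer */
        uint64_t off = 0;
        ssize_t n;

        /* each read() just copies from the page cache into buf */
        while ((n = read(fd, buf, bs)) > 0) {
            check_block(buf, n, off);  /* sketch from earlier */
            off += n;
        }
        free(buf);
    }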

> so with ye olde
>   [PATCH 20/20] iser wait for rdma completion
> applied, now single and multiple readers with stock centos5.1 kernels
> and userland work ok. odd.

Ugh.  I hate that patch.  All it does, effectively, is slow things
down.

> is there any way to check more definitively whether the ordering is
> getting messed up with my hardware/OS/OFED combo? perhaps some sort of
> a micro-verbs/rdma benchmark that would convince the IB guys one way or
> the other?

I'm pretty sold on the idea that ye olde 20/20 is not the problem.
I'm leaning towards something with FMR in the iser initiator.  It's
the only place we get page-sized operations.  The target-side mapping
is one contiguous 192 * 512 kB chunk with a single MR.
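(192 * 512 kB works out to 96 MB, all behind that single MR.)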

We could write a verbs benchmark that just sends data, but I fear
the interaction is with memory registration handling either on the
initiator or the target, so we may miss the problem.

You're using oldish 2.6.18 and 2.6.22, both of which now show this
issue.  I don't suppose you'd be willing to test a more recent
kernel initiator?  In the meantime, I'll go take a look at the
iser fmr code, comparing it to srp's fmr code.

Another possibility: change /sys/block/.../max_sectors_kb to 8 or
less so that each initiator request fits in a single page, disabling
FMR.  I'm not sure exactly how this works; it may need some
debugging to verify, and it could be too slow to provoke the
problem.

		-- Pete


