On Sat, Feb 09, 2008 at 01:18:58PM -0500, Pete Wyckoff wrote:
>robin.humble+stgt at anu.edu.au wrote on Sat, 09 Feb 2008 11:39 -0500:
>> a few of these
>>   lmdd of=internal if=/dev/sdc bs=1M count=7000 ipat=1 mismatch=1
>> gives:
>> off=116000000 want=6f80000 got=6fa1000
>> off=518000000 want=1eec0000 got=1eee1000
>> off=12000000 want=c40000 got=c5d000
>> off=627000000 want=256e0000 got=256ee000
>> off=344000000 want=148b6000 got=148c0000
>> off=163000000 want=9c40000 got=9c5b000
>> off=11000000 want=b40000 got=b47000
>> off=514000000 want=1eb20000 got=1eb21000
>> off=28000000 want=1b80000 got=1b93000
>> off=78000000 want=4b3d000 got=4b41000
>> off=70000000 want=4360000 got=4381000
>> off=0 want=e0000 got=fb000
>> off=20000000 want=13e0000 got=13fa000
>> so always on MB boundaries?
>....
>
>I was going to complain about the non-power-of-2 reads, but in this
>case, I think it confirms what we suspected, that the problem is in
>page mapping somewhere, either initiator or target, and not in say
>bs_rdwr.  That code just does pread, which is pretty unlikely to be
>broken, and it does it to non-page boundaries in the mempool buffer
>provided by iser.
>
>Here's the list of (decimal) page offsets for the above:
>
>    33 33 29 14 10 27 7 1 19 4 33 27 26
>
>Doesn't tell me much.
>
>Did you modify your lmdd only to show the first error on a transfer?

'mismatch=1' tells lmdd to just print the first error.

>> a few tests show that it's pretty hard to get mismatches with ~ bs=384
>> and below.
>>
>> with bs=512
>>   lmdd of=internal if=/dev/sdc bs=512 count=7000000 ipat=1 mismatch=1
>> I get
>> off=1010024448 want=3c33c000 got=3c350000
>> off=1693302784 want=64edc000 got=64eea000
>> off=45203456 want=2b1c000 got=2b27000
>> off=289783808 want=1145c000 got=11460000
>> off=507494400 want=1e3fc000 got=1e40f000
>> off=282181632 want=10d1c000 got=10d30000
>> off=334217216 want=13ebc000 got=13ebe000
>
>Again, page offsets: 20 14 11 4 19 20 2
>
>The offsets are always positive, which fits in with the theory that
>future RDMAs are overwriting earlier ones.  This goes against the
>theory in your (my) patch, which guesses that the SCSI response
>message is sneaking ahead of RDMA operations.

ok.

>lmdd is just doing reads of 1.0e6 byte size, into a valloc-ed (so
>page aligned) buffer.  It just memcpys from the page cache into this
>buf.  We don't see non-4k-aligned errors there, so the problem has
>to be in the page cache or below.  Just like it can't be in the
>mempool handling, or we'd see non-4k-aligned issues.

ok.

>> so with ye olde
>>   [PATCH 20/20] iser wait for rdma completion
>> applied, now single and multiple readers with stock centos5.1 kernels
>> and userland work ok.  odd.
>
>Ugh.  I hate that patch.  All it does is to slow things down,
>effectively.  :-/

>> is there any way to check more definitively whether the ordering is
>> getting messed up with my hardware/OS/OFED combo? perhaps some sort of
>> a micro-verbs/rdma benchmark that would convince the IB guys one way
>> or the other?
>
>I'm pretty sold on the idea that ye olde 20/20 is not the problem.
>I'm leaning towards something with FMR in the iser initiator.  It's
>the only place we get page size operations.  The target-side mapping
>is one contiguous 192 * 512 kB chunk with a single MR.
>
>We could write a verbs benchmark that just sends data, but I fear
>the interaction is with memory registration handling either on the
>initiator or the target, so we may miss the problem.
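as an aside, the "(decimal) page offsets" you quote are just
(got - want) / 4096, right? a quick standalone sketch (not part of
lmdd; the pairs are copied from the bs=512 run above) reproduces the
list:

```python
# Reproduce the quoted "page offsets": the shift between the expected
# and received pattern, expressed in 4 kB pages.
PAGE = 0x1000  # 4 kB page size

# (want, got) pairs copied from the bs=512 mismatches above
mismatches = [
    (0x3c33c000, 0x3c350000), (0x64edc000, 0x64eea000),
    (0x02b1c000, 0x02b27000), (0x1145c000, 0x11460000),
    (0x1e3fc000, 0x1e40f000), (0x10d1c000, 0x10d30000),
    (0x13ebc000, 0x13ebe000),
]

page_offsets = [(got - want) // PAGE for want, got in mismatches]
print(page_offsets)  # -> [20, 14, 11, 4, 19, 20, 2]
```

matches your "20 14 11 4 19 20 2", and every shift is whole pages
forward, never backward.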
>
>You're using oldish 2.6.18 and 2.6.22, both of which now show this
>issue.  I don't suppose you'd be willing to test a more recent
>kernel initiator?  In the mean time, I'll go take a look at the
>iser fmr code, comparing it to srp's fmr code.

how about 2.6.24 at both ends?

below are several runs of:
  lmdd of=internal if=/dev/sdc bs=1M count=7000 ipat=1 mismatch=10

off=608000000 want=2445ca00 got=2447ca00
off=608000000 want=2445ca04 got=2447ca04
off=608000000 want=2445ca08 got=2447ca08
off=608000000 want=2445ca0c got=2447ca0c
off=608000000 want=2445ca10 got=2447ca10
off=608000000 want=2445ca14 got=2447ca14
off=608000000 want=2445ca18 got=2447ca18
off=608000000 want=2445ca1c got=2447ca1c
off=608000000 want=2445ca20 got=2447ca20
off=608000000 want=2445ca24 got=2447ca24
608.0000 MB in 1.5089 secs, 402.9511 MB/sec

off=9000000 want=89d000 got=8a1000
off=9000000 want=89d004 got=8a1004
off=9000000 want=89d008 got=8a1008
off=9000000 want=89d00c got=8a100c
off=9000000 want=89d010 got=8a1010
off=9000000 want=89d014 got=8a1014
off=9000000 want=89d018 got=8a1018
off=9000000 want=89d01c got=8a101c
off=9000000 want=89d020 got=8a1020
off=9000000 want=89d024 got=8a1024
9.0000 MB in 0.0272 secs, 331.2477 MB/sec

off=355000000 want=15296200 got=152b6200
off=355000000 want=15296204 got=152b6204
off=355000000 want=15296208 got=152b6208
off=355000000 want=1529620c got=152b620c
off=355000000 want=15296210 got=152b6210
off=355000000 want=15296214 got=152b6214
off=355000000 want=15296218 got=152b6218
off=355000000 want=1529621c got=152b621c
off=355000000 want=15296220 got=152b6220
off=355000000 want=15296224 got=152b6224
355.0000 MB in 0.8903 secs, 398.7272 MB/sec

>Another possibility.  Change /sys/block/.../max_sectors_kb to 8 or
>less so that each initiator request will fit in a page, disabling
>FMR.  Not sure exactly how this works.  May need some debugging to
>verify.  Could be too slow to provoke the problem.
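one thing that stands out in the 2.6.24 runs above: within a single
failing transfer, all ten reported mismatches show the same got - want
gap, and it is page-aligned. a quick standalone check (pairs copied
from the first and last mismatch line of each run):

```python
# Within each failing transfer the got-want gap looks constant and
# 4 kB-aligned; check using the first and last mismatch of each run.
PAGE = 0x1000  # 4 kB page size

runs = {  # first and last (want, got) pairs of each run above
    "bs=1M run 1": [(0x2445ca00, 0x2447ca00), (0x2445ca24, 0x2447ca24)],
    "bs=1M run 2": [(0x0089d000, 0x008a1000), (0x0089d024, 0x008a1024)],
    "bs=1M run 3": [(0x15296200, 0x152b6200), (0x15296224, 0x152b6224)],
}

gaps = {}
for name, pairs in runs.items():
    run_gaps = {got - want for want, got in pairs}
    assert len(run_gaps) == 1          # one constant gap per transfer
    gaps[name] = run_gaps.pop()
    assert gaps[name] % PAGE == 0      # always a whole number of pages
    print(f"{name}: gap = {gaps[name] // PAGE} pages")
```

i.e. 32, 4 and 32 pages respectively - so it looks like one shifted
region per transfer rather than scattered corruption, for what that's
worth.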
ok - I changed max_sectors_kb to 8 on the initiator side (was 512) and
it seemed to work slightly better, but repeated runs still fail, e.g.

off=264032704 want=fbe0000 got=fbe8000
off=264032704 want=fbe0004 got=fbe8004
off=264032704 want=fbe0008 got=fbe8008
off=264032704 want=fbe000c got=fbe800c
off=264032704 want=fbe0010 got=fbe8010
off=264032704 want=fbe0014 got=fbe8014
off=264032704 want=fbe0018 got=fbe8018
off=264032704 want=fbe001c got=fbe801c
off=264032704 want=fbe0020 got=fbe8020
off=264032704 want=fbe0024 got=fbe8024
4559.0000 MB in 16.8547 secs, 270.4877 MB/sec

changing max_sectors_kb to 4:

off=4113000000 want=f5284c00 got=f5288c00
off=4113000000 want=f5284c04 got=f5288c04
off=4113000000 want=f5284c08 got=f5288c08
off=4113000000 want=f5284c0c got=f5288c0c
off=4113000000 want=f5284c10 got=f5288c10
off=4113000000 want=f5284c14 got=f5288c14
off=4113000000 want=f5284c18 got=f5288c18
off=4113000000 want=f5284c1c got=f5288c1c
off=4113000000 want=f5284c20 got=f5288c20
off=4113000000 want=f5284c24 got=f5288c24
4113.0000 MB in 27.6913 secs, 148.5305 MB/sec

I'll be travelling the next few days so will only be in contact
intermittently - apologies in advance for any slow replies.

an account for you on the cluster would be possible if that would help.

cheers,
robin
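PS: possibly a coincidence, but in both failing runs above the
got - want gap works out to exactly four requests at the configured
max_sectors_kb (0x8000 bytes with 8 kB requests, 0x4000 with 4 kB).
the arithmetic, with the values copied from the runs above:

```python
# Express the got-want gap in units of the request size implied by
# max_sectors_kb, for the two runs above.
results = []
for max_kb, want, got in [(8, 0x0fbe0000, 0x0fbe8000),
                          (4, 0xf5284c00, 0xf5288c00)]:
    gap = got - want
    req_bytes = max_kb * 1024  # max_sectors_kb is in kB
    results.append(gap // req_bytes)
    print(f"max_sectors_kb={max_kb}: gap={gap:#x} = {gap // req_bytes} requests")
```

no idea if "four requests ahead" means anything, but it scales with
the request size, which maybe points away from a fixed-size FMR issue.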