On Sat, Feb 09, 2008 at 01:18:58PM -0500, Pete Wyckoff wrote:
>robin.humble+stgt at anu.edu.au wrote on Sat, 09 Feb 2008 11:39 -0500:
>> a few of these
>>   lmdd of=internal if=/dev/sdc bs=1M count=7000 ipat=1 mismatch=1
>> gives:
>> off=116000000 want=6f80000 got=6fa1000
>> off=518000000 want=1eec0000 got=1eee1000
>> off=12000000 want=c40000 got=c5d000
>> off=627000000 want=256e0000 got=256ee000
>> off=344000000 want=148b6000 got=148c0000
>> off=163000000 want=9c40000 got=9c5b000
>> off=11000000 want=b40000 got=b47000
>> off=514000000 want=1eb20000 got=1eb21000
>> off=28000000 want=1b80000 got=1b93000
>> off=78000000 want=4b3d000 got=4b41000
>> off=70000000 want=4360000 got=4381000
>> off=0 want=e0000 got=fb000
>> off=20000000 want=13e0000 got=13fa000
>> so always on MB boundaries?
>....
>
>I was going to complain about the non-power-of-2 reads, but in this
>case, I think it confirms what we suspected, that the problem is in
>page mapping somewhere, either initiator or target, and not in say
>bs_rdwr.  That code just does pread, which is pretty unlikely to be
>broken, and it does it to non-page boundaries in the mempool buffer
>provided by iser.
>
>Here's the list of (decimal) page offsets for the above:
>
>    33 33 29 14 10 27 7 1 19 4 33 27 26
>
>Doesn't tell me much.
>
>Did you modify your lmdd only to show the first error on a transfer?

'mismatch=1' tells lmdd to just print the first error.

>> a few tests show that it's pretty hard to get mismatches with ~ bs=384
>> and below.
>>
>> with bs=512
>>   lmdd of=internal if=/dev/sdc bs=512 count=7000000 ipat=1 mismatch=1
>> I get
>> off=1010024448 want=3c33c000 got=3c350000
>> off=1693302784 want=64edc000 got=64eea000
>> off=45203456 want=2b1c000 got=2b27000
>> off=289783808 want=1145c000 got=11460000
>> off=507494400 want=1e3fc000 got=1e40f000
>> off=282181632 want=10d1c000 got=10d30000
>> off=334217216 want=13ebc000 got=13ebe000
>
>Again, page offsets: 20 14 11 4 19 20 2
>
>The offsets are always positive, which fits in with the theory that
>future RDMAs are overwriting earlier ones.  This goes against the
>theory in your (my) patch, which guesses that the SCSI response
>message is sneaking ahead of RDMA operations.

ok.

>lmdd is just doing reads of 1.0e6 byte size, into a valloc-ed (so
>page aligned) buffer.  It just memcpys from the page cache into this
>buf.  We don't see non-4k-aligned errors there, so the problem has
>to be in the page cache or below.  Just like it can't be in the
>mempool handling, or we'd see non-4k-aligned issues.

ok.

>> so with ye olde
>>   [PATCH 20/20] iser wait for rdma completion
>> applied, now single and multiple readers with stock centos5.1 kernels
>> and userland work ok.  odd.
>
>Ugh.  I hate that patch.  All it does is to slow things down,
>effectively.  :-/

>> is there any way to check more definitively whether the ordering is
>> getting messed up with my hardware/OS/OFED combo? perhaps some sort of
>> a micro-verbs/rdma benchmark that would convince the IB guys one way
>> or the other?
>
>I'm pretty sold on the idea that ye olde 20/20 is not the problem.
>I'm leaning towards something with FMR in the iser initiator.  It's
>the only place we get page size operations.  The target-side mapping
>is one contiguous 192 * 512 kB chunk with a single MR.
>
>We could write a verbs benchmark that just sends data, but I fear
>the interaction is with memory registration handling either on the
>initiator or the target, so we may miss the problem.
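as an aside, the "(decimal) page offsets" you quote are just
(got - want) / 4096, right? a quick standalone sketch (not part of
lmdd; the pairs are copied from the bs=512 run above) reproduces the
list:

```python
# Reproduce the quoted "page offsets": the shift between the expected
# and received pattern, expressed in 4 kB pages.
PAGE = 0x1000  # 4 kB page size

# (want, got) pairs copied from the bs=512 mismatches above
mismatches = [
    (0x3c33c000, 0x3c350000), (0x64edc000, 0x64eea000),
    (0x02b1c000, 0x02b27000), (0x1145c000, 0x11460000),
    (0x1e3fc000, 0x1e40f000), (0x10d1c000, 0x10d30000),
    (0x13ebc000, 0x13ebe000),
]

page_offsets = [(got - want) // PAGE for want, got in mismatches]
print(page_offsets)  # -> [20, 14, 11, 4, 19, 20, 2]
```

matches your "20 14 11 4 19 20 2", and every shift is whole pages
forward, never backward.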
>
>You're using oldish 2.6.18 and 2.6.22, both of which now show this
>issue.  I don't suppose you'd be willing to test a more recent
>kernel initiator?  In the mean time, I'll go take a look at the
>iser fmr code, comparing it to srp's fmr code.

how about 2.6.24 at both ends?

below are several runs of:
  lmdd of=internal if=/dev/sdc bs=1M count=7000 ipat=1 mismatch=10

off=608000000 want=2445ca00 got=2447ca00
off=608000000 want=2445ca04 got=2447ca04
off=608000000 want=2445ca08 got=2447ca08
off=608000000 want=2445ca0c got=2447ca0c
off=608000000 want=2445ca10 got=2447ca10
off=608000000 want=2445ca14 got=2447ca14
off=608000000 want=2445ca18 got=2447ca18
off=608000000 want=2445ca1c got=2447ca1c
off=608000000 want=2445ca20 got=2447ca20
off=608000000 want=2445ca24 got=2447ca24
608.0000 MB in 1.5089 secs, 402.9511 MB/sec

off=9000000 want=89d000 got=8a1000
off=9000000 want=89d004 got=8a1004
off=9000000 want=89d008 got=8a1008
off=9000000 want=89d00c got=8a100c
off=9000000 want=89d010 got=8a1010
off=9000000 want=89d014 got=8a1014
off=9000000 want=89d018 got=8a1018
off=9000000 want=89d01c got=8a101c
off=9000000 want=89d020 got=8a1020
off=9000000 want=89d024 got=8a1024
9.0000 MB in 0.0272 secs, 331.2477 MB/sec

off=355000000 want=15296200 got=152b6200
off=355000000 want=15296204 got=152b6204
off=355000000 want=15296208 got=152b6208
off=355000000 want=1529620c got=152b620c
off=355000000 want=15296210 got=152b6210
off=355000000 want=15296214 got=152b6214
off=355000000 want=15296218 got=152b6218
off=355000000 want=1529621c got=152b621c
off=355000000 want=15296220 got=152b6220
off=355000000 want=15296224 got=152b6224
355.0000 MB in 0.8903 secs, 398.7272 MB/sec

>Another possibility.  Change /sys/block/.../max_sectors_kb to 8 or
>less so that each initiator request will fit in a page, disabling
>FMR.  Not sure exactly how this works.  May need some debugging to
>verify.  Could be too slow to provoke the problem.
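one thing that stands out in the 2.6.24 runs above: within a single
failing transfer, all ten reported mismatches show the same got - want
gap, and it is page-aligned. a quick standalone check (pairs copied
from the first and last mismatch line of each run):

```python
# Within each failing transfer the got-want gap looks constant and
# 4 kB-aligned; check using the first and last mismatch of each run.
PAGE = 0x1000  # 4 kB page size

runs = {  # first and last (want, got) pairs of each run above
    "bs=1M run 1": [(0x2445ca00, 0x2447ca00), (0x2445ca24, 0x2447ca24)],
    "bs=1M run 2": [(0x0089d000, 0x008a1000), (0x0089d024, 0x008a1024)],
    "bs=1M run 3": [(0x15296200, 0x152b6200), (0x15296224, 0x152b6224)],
}

gaps = {}
for name, pairs in runs.items():
    run_gaps = {got - want for want, got in pairs}
    assert len(run_gaps) == 1          # one constant gap per transfer
    gaps[name] = run_gaps.pop()
    assert gaps[name] % PAGE == 0      # always a whole number of pages
    print(f"{name}: gap = {gaps[name] // PAGE} pages")
```

i.e. 32, 4 and 32 pages respectively - so it looks like one shifted
region per transfer rather than scattered corruption, for what that's
worth.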
ok - I changed max_sectors_kb to 8 on the initiator side (was 512) and
it seemed to work slightly better, but repeated runs still fail, e.g.

off=264032704 want=fbe0000 got=fbe8000
off=264032704 want=fbe0004 got=fbe8004
off=264032704 want=fbe0008 got=fbe8008
off=264032704 want=fbe000c got=fbe800c
off=264032704 want=fbe0010 got=fbe8010
off=264032704 want=fbe0014 got=fbe8014
off=264032704 want=fbe0018 got=fbe8018
off=264032704 want=fbe001c got=fbe801c
off=264032704 want=fbe0020 got=fbe8020
off=264032704 want=fbe0024 got=fbe8024
4559.0000 MB in 16.8547 secs, 270.4877 MB/sec

changing max_sectors_kb to 4:

off=4113000000 want=f5284c00 got=f5288c00
off=4113000000 want=f5284c04 got=f5288c04
off=4113000000 want=f5284c08 got=f5288c08
off=4113000000 want=f5284c0c got=f5288c0c
off=4113000000 want=f5284c10 got=f5288c10
off=4113000000 want=f5284c14 got=f5288c14
off=4113000000 want=f5284c18 got=f5288c18
off=4113000000 want=f5284c1c got=f5288c1c
off=4113000000 want=f5284c20 got=f5288c20
off=4113000000 want=f5284c24 got=f5288c24
4113.0000 MB in 27.6913 secs, 148.5305 MB/sec

I'll be travelling the next few days so will only be in contact
intermittently - apologies in advance for any slow replies.

an account for you on the cluster would be possible if that would help.

cheers,
robin
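PS: possibly a coincidence, but in both failing runs above the
got - want gap works out to exactly four requests at the configured
max_sectors_kb (0x8000 bytes with 8 kB requests, 0x4000 with 4 kB).
the arithmetic, with the values copied from the runs above:

```python
# Express the got-want gap in units of the request size implied by
# max_sectors_kb, for the two runs above.
results = []
for max_kb, want, got in [(8, 0x0fbe0000, 0x0fbe8000),
                          (4, 0xf5284c00, 0xf5288c00)]:
    gap = got - want
    req_bytes = max_kb * 1024  # max_sectors_kb is in kB
    results.append(gap // req_bytes)
    print(f"max_sectors_kb={max_kb}: gap={gap:#x} = {gap // req_bytes} requests")
```

no idea if "four requests ahead" means anything, but it scales with
the request size, which maybe points away from a fixed-size FMR issue.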