Dear Mr. Tomonori,

We are getting read errors when using the iSER transport (over InfiniBand) with stgt's tgtd 0.9.3. I first discussed this on the open-iscsi mailing list. After reviewing our tests, I found that restarting tgtd cures the read errors, but only for the next access to the target.

Here is what we did.

Writing on the initiator:

    ares:~# lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000 mismatch=1
    1000.0000 MB in 6.3606 secs, 157.2190 MB/sec

The check on the target is fine:

    athene:~# lmdd of=internal if=/dev/vg0/test ipat=1 bs=1M count=1000 mismatch=1
    1000.0000 MB in 0.8849 secs, 1130.0176 MB/sec

Reading on the initiator fails:

    ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
    off=1000000 want=1a0000 got=1b3000
    off=1000000 want=1a0004 got=1b3004
    off=1000000 want=1a0008 got=1b3008
    off=1000000 want=1a000c got=1b300c
    off=1000000 want=1a0010 got=1b3010
    off=1000000 want=1a0014 got=1b3014
    off=1000000 want=1a0018 got=1b3018
    off=1000000 want=1a001c got=1b301c
    off=1000000 want=1a0020 got=1b3020
    off=1000000 want=1a0024 got=1b3024
    1.0000 MB in 0.0064 secs, 157.2822 MB/sec

But if I restart the tgt daemon on the target side, everything is OK:

    ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
    1000.0000 MB in 22.2695 secs, 44.9045 MB/sec

But only for the first run of lmdd! After that, the error is reproducible every time:
    ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
    off=0 want=8ae00 got=a9e00
    off=0 want=8ae04 got=a9e04
    off=0 want=8ae08 got=a9e08
    off=0 want=8ae0c got=a9e0c
    off=0 want=8ae10 got=a9e10
    off=0 want=8ae14 got=a9e14
    off=0 want=8ae18 got=a9e18
    off=0 want=8ae1c got=a9e1c
    off=0 want=8ae20 got=a9e20
    off=0 want=8ae24 got=a9e24
    0.0000 MB in 0.0029 secs, 0.0000 MB/sec

    ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
    off=51000000 want=3129e00 got=3147e00
    off=51000000 want=3129e04 got=3147e04
    off=51000000 want=3129e08 got=3147e08
    off=51000000 want=3129e0c got=3147e0c
    off=51000000 want=3129e10 got=3147e10
    off=51000000 want=3129e14 got=3147e14
    off=51000000 want=3129e18 got=3147e18
    off=51000000 want=3129e1c got=3147e1c
    off=51000000 want=3129e20 got=3147e20
    off=51000000 want=3129e24 got=3147e24
    51.0000 MB in 0.1463 secs, 348.5702 MB/sec

How can I debug this further? Some observations:

* I have never seen a single write corruption. Only reading is a problem.
* Switching from the iSER transport to TCP over IPoIB works with no problem at all. Since writing works, I do not think the problem is related to the InfiniBand layer or to RDMA itself. But is the problem on the initiator side or on the target side?
* I tried an experimental Debian kernel 2.6.28, with no new findings.
* I swapped the roles of initiator and target: same result.
* The amount of RAM, which influenced the TioTest runs, does NOT affect the behavior of lmdd.
* The read corruption occurs with 256 MB as well as with 32 GB of RAM.
* The number of CPUs does not matter either. I tried everything from one core to 8 cores.
* The servers' BIOS is set to fail-safe defaults.
* The firmware of the Mellanox cards is the current version 1.2.0 and was left unchanged.
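One thing that may help narrow this down is how far each "got" value lies from the corresponding "want" value. The minimal Python sketch below parses a few mismatch lines copied verbatim from the failing reads and prints got - want for each run. It assumes that lmdd prints these values in hexadecimal (they contain the digits a-f); the run names and the 4096-byte unit are only labels I chose for illustration. In all three failing reads the difference is constant within a run (0x13000, 0x1f000 and 0x1e000 bytes), and each is an exact multiple of 4096. If the ipat pattern encodes the position of each word, which is my assumption and not something I have checked in the lmdd source, this would point to whole 4 KiB blocks being returned from a shifted location rather than to random bit errors.

```python
import re

# Mismatch lines copied verbatim from the three failing reads in this report.
# Assumption: lmdd prints "want" and "got" as hexadecimal words.
RUNS = {
    "first failing read": [
        "off=1000000 want=1a0000 got=1b3000",
        "off=1000000 want=1a0004 got=1b3004",
        "off=1000000 want=1a0024 got=1b3024",
    ],
    "second failing read": [
        "off=0 want=8ae00 got=a9e00",
        "off=0 want=8ae04 got=a9e04",
        "off=0 want=8ae24 got=a9e24",
    ],
    "third failing read": [
        "off=51000000 want=3129e00 got=3147e00",
        "off=51000000 want=3129e04 got=3147e04",
        "off=51000000 want=3129e24 got=3147e24",
    ],
}

LINE_RE = re.compile(r"want=([0-9a-fA-F]+)\s+got=([0-9a-fA-F]+)")
PAGE = 4096  # unit used only to express the shift; the usual x86 page size


def deltas(lines):
    """Return got - want for every parsable mismatch line, reading both as hex."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            want, got = (int(x, 16) for x in m.groups())
            out.append(got - want)
    return out


def summarize(runs):
    """Map each run name to the set of distinct got - want differences."""
    return {name: set(deltas(lines)) for name, lines in runs.items()}


if __name__ == "__main__":
    for name, ds in summarize(RUNS).items():
        for d in sorted(ds):
            # A single value per run means the whole run is shifted by the same amount.
            print(f"{name}: got - want = {d:#x} bytes = {d / PAGE:g} x {PAGE}")
```

Feeding it all ten lines of each run instead of three gives the same single difference per run.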
Maybe I used the wrong versions of the software packages. I used:

Debian Lenny packages:
- open-iscsi 2.0.870~rc3-0.4
- libibverbs1 1.1.2-1
- librdmacm1 1.0.7-1

From OFED-1.3, self-compiled:
- libibcommon 1.1.1-1
- libibumad 1.2.1-1
- opensm 3.2.2

STGT, self-compiled:
- tgtd 0.9.3, built against the Debian -dev packages libibverbs-dev 1.1.2-1 and librdmacm-dev 1.0.7-1

Any help is welcome.

Best regards
Volker

--
====================================================
inqbus it-consulting          +49 (341) 5643800
Dr. Volker Jaenisch           http://www.inqbus.de
Herloßsohnstr. 12             04155 Leipzig
Emergencies (NOT-FÄLLE):      +49 (170) 3113748
====================================================