Hi Or!

I tracked it down further. The cofactor seems to be SMP. When booting the server (8 cores) with maxcpus=1, the read error vanishes completely. This is consistent with the findings of others:

http://lists.berlios.de/pipermail/stgt-devel/2008-February/001367.html

In that case, too, multiple simultaneous reads were the problem. Please note also that the git commit "iscsi: improve iser scheduling" from September 2008 did not yet exist in February 2008. So the iser scheduling improvement may have made things worse, but the original error is older.

Or Gerlitz wrote:
>> I found that subsequent reads on a small timescale often succeed. Also,
>> after a pause of a couple of seconds, reads probably succeed. The
>> timescale for failure lies in between. Please try a more random time
>> distribution. How big is your sample?

I used 1 GB. When testing by hand, 1 out of 6 reads fails.

>> Same behavior on stgt 0.8 and 0.9.0.
>
> I use 1GB as well. It would be helpful if you provide me with a script
> that does these random timings between reads. Also, I noted that after
> one read, no I/O is going on anymore on the target side, as this 1GB
> probably gets cached. My backing store is an sdb block device and I
> wasn't sure what yours is, and whether you have caching at all; maybe
> this influences something.

As test target I use a 100 GB LVM2 logical volume:

athene:~/tgt-0.9.0/usr# lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg0/test
  VG Name                vg0
  LV UUID                85CXvj-YLoE-DrRt-mAVu-Ujaz-cLW4-RwQveP
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                100,00 GB
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

This sequence is used to start the target:

athene:~/tgt-0.9.0/usr# ./tgtd
athene:~/tgt-0.9.0/usr# tgtadm --lld iscsi --op new --mode target --tid 1 -T de.inqbus.athene:test
athene:~/tgt-0.9.0/usr# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vg0/test
athene:~/tgt-0.9.0/usr# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 10.1.3.0/24
athene:~/tgt-0.9.0/usr# ./tgtadm --m target --op show
Target 1: de.inqbus.athene:test
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
        I_T nexus: 4
            Initiator: iqn.1993-08.org.debian:01:cb2c5d33d1f8
            Connection: 0
                RDMA IP Address: 10.1.3.33
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: deadbeaf1:0
            SCSI SN: beaf10
            Size: 0 MB
            Online: Yes
            Removable media: No
            Backing store: No backing store
        LUN: 1
            Type: disk
            SCSI ID: deadbeaf1:1
            SCSI SN: beaf11
            Size: 107374 MB
            Online: Yes
            Removable media: No
            Backing store: /dev/vg0/test
    Account information:
    ACL information:
        10.1.3.0/24

On the initiator I do:

ares:~# iscsiadm -m session
iscsiadm: No active sessions.
ares:~# iscsi_discovery 10.1.3.32 -tiser -l
iscsiadm: No active sessions.
Set target de.inqbus.athene:test to automatic login over iser to portal 10.1.3.32:3260
discovered 1 targets at 10.1.3.32

The test script:

import random
import os
import time

# write a 1 GB known pattern once (opat=1), then read it back in a loop;
# ipat=1 verifies the pattern, mismatch=10 stops after ten mismatches
writeCmd = 'lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000'
readCmd = 'lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10'

def tee(cmd):
    print cmd
    os.system(cmd)

tee(writeCmd)
while True:
    # sleepTime = random.randrange(0, 10)
    sleepTime = 1
    print 'sleeping %s seconds ..' % sleepTime
    time.sleep(sleepTime)
    tee(readCmd)
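On the caching question: I boot the initiator with mem=256M so that the 1 GB sample cannot survive in its page cache. An alternative, if one does not want the boot parameter, would be to drop the caches explicitly before each read, for example via an os.system() call in the loop above. This is only a sketch; it assumes a kernel >= 2.6.16, which provides /proc/sys/vm/drop_caches, and root privileges:

sync; echo 3 > /proc/sys/vm/drop_caches

Writing 3 drops the page cache plus dentries and inodes after the sync, so every read has to go over the wire again. The same command on the target (athene) would rule out caching of the /dev/vg0/test backing store there.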
Output with maxcpus=1 mem=256M (to prevent caching):

ares:~# python rndTest.py
lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000
1000.0000 MB in 2.1589 secs, 463.2016 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7827 secs, 359.3686 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8148 secs, 355.2689 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7981 secs, 357.3900 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7960 secs, 357.6489 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7885 secs, 358.6165 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8016 secs, 356.9349 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8066 secs, 356.3076 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8142 secs, 355.3358 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8174 secs, 354.9415 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8349 secs, 352.7521 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8078 secs, 356.1559 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8028 secs, 356.7876 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7893 secs, 358.5088 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7898 secs, 358.4447 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7887 secs, 358.5943 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7846 secs, 359.1125 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7951 secs, 357.7686 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7817 secs, 359.4879 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.8016 secs, 356.9339 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7750 secs, 360.3620 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.7953 secs, 357.7491 MB/sec
sleeping 1 seconds ..

This seems to be OK. Now with two cores (maxcpus=2 mem=256M):

ares:~# python rndTest.py
lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000
1000.0000 MB in 2.3562 secs, 424.4106 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 2.2888 secs, 436.9008 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=119000000 want=71ff000 got=721f000
off=119000000 want=71ff004 got=721f004
off=119000000 want=71ff008 got=721f008
off=119000000 want=71ff00c got=721f00c
off=119000000 want=71ff010 got=721f010
off=119000000 want=71ff014 got=721f014
off=119000000 want=71ff018 got=721f018
off=119000000 want=71ff01c got=721f01c
off=119000000 want=71ff020 got=721f020
off=119000000 want=71ff024 got=721f024
119.0000 MB in 0.2676 secs, 444.6105 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=309000000 want=1273d600 got=1275d600
off=309000000 want=1273d604 got=1275d604
off=309000000 want=1273d608 got=1275d608
off=309000000 want=1273d60c got=1275d60c
off=309000000 want=1273d610 got=1275d610
off=309000000 want=1273d614 got=1275d614
off=309000000 want=1273d618 got=1275d618
off=309000000 want=1273d61c got=1275d61c
off=309000000 want=1273d620 got=1275d620
off=309000000 want=1273d624 got=1275d624
309.0000 MB in 0.7249 secs, 426.2704 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=126000000 want=7900000 got=7901000
off=126000000 want=7900004 got=7901004
off=126000000 want=7900008 got=7901008
off=126000000 want=790000c got=790100c
off=126000000 want=7900010 got=7901010
off=126000000 want=7900014 got=7901014
off=126000000 want=7900018 got=7901018
off=126000000 want=790001c got=790101c
off=126000000 want=7900020 got=7901020
off=126000000 want=7900024 got=7901024
126.0000 MB in 0.2968 secs, 424.4868 MB/sec
sleeping 1 seconds ..
lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=80000000 want=4cdec00 got=4cfec00
off=80000000 want=4cdec04 got=4cfec04
off=80000000 want=4cdec08 got=4cfec08
off=80000000 want=4cdec0c got=4cfec0c
off=80000000 want=4cdec10 got=4cfec10
off=80000000 want=4cdec14 got=4cfec14
off=80000000 want=4cdec18 got=4cfec18
off=80000000 want=4cdec1c got=4cfec1c
off=80000000 want=4cdec20 got=4cfec20
off=80000000 want=4cdec24 got=4cfec24
80.0000 MB in 0.1790 secs, 447.0273 MB/sec

This behavior is completely reproducible. Interestingly, the want/got differences are page-aligned shifts: 0x1000 in the third failing read, 0x20000 in the other three.

My guess is that the AMD HyperTransport may interfere with the FMRs. But I am no Linux memory-management specialist, so please correct me if I am wrong. Maybe the following happens: booted with one CPU, all FMR requests go to the 16 GB of RAM that this single CPU addresses directly via its memory controller. With more than one active CPU, memory is fetched from both CPUs' memory controllers, with a preference for local memory. In rare cases the memory manager fetches memory for the FMR process running on CPU0 from CPU1 via the HyperTransport link, and something weird happens. This is sheer guessing; I have no hard facts for this.
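One way to test this guess without rebooting with maxcpus=1 might be to pin tgtd and its memory to a single NUMA node; if the corruption then disappears, the locality theory gains weight. A sketch, assuming numactl is installed (node 0 is just an example):

numactl --cpunodebind=0 --membind=0 ./tgtd

--cpunodebind restricts tgtd to the CPUs of node 0 and --membind makes its allocations come from node 0 only. Kernel-side allocations are not fully covered by this, so it is only an approximation.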
Cheers,

Volker

--
====================================================
   inqbus it-consulting      +49 ( 341 )  5643800
   Dr.  Volker Jaenisch      http://www.inqbus.de
   Herloßsohnstr.    12      0 4 1 5 5    Leipzig
   N  O  T -  F Ä L L E      +49 ( 170 )  3113748
====================================================