[stgt] stgtd 0.9.3 : Read-Errors using iser transport

Thu Feb 12 11:29:45 CET 2009

Dear Mr. Tomonori!

We are getting read errors when using the iSER (over InfiniBand) transport with stgtd (0.9.3).
I first discussed this on the open-iscsi mailing list.

After reviewing our tests I found that restarting stgt
cures the read errors for the next access to the target.

Here is what we have done:

Writing on the initiator:
ares:~# lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000 mismatch=1
1000.0000 MB in 6.3606 secs, 157.2190 MB/sec

The check on the target is fine:
athene:~# lmdd of=internal if=/dev/vg0/test ipat=1 bs=1M count=1000 mismatch=1
1000.0000 MB in 0.8849 secs, 1130.0176 MB/sec

Reading on the initiator:
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=1000000 want=1a0000 got=1b3000
off=1000000 want=1a0004 got=1b3004
off=1000000 want=1a0008 got=1b3008
off=1000000 want=1a000c got=1b300c
off=1000000 want=1a0010 got=1b3010
off=1000000 want=1a0014 got=1b3014
off=1000000 want=1a0018 got=1b3018
off=1000000 want=1a001c got=1b301c
off=1000000 want=1a0020 got=1b3020
off=1000000 want=1a0024 got=1b3024
1.0000 MB in 0.0064 secs, 157.2822 MB/sec

But if I restart the tgt daemon on the target side, everything is OK:
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 22.2695 secs, 44.9045 MB/sec
But only for the first run of lmdd! After that the error occurs
reproducibly on every run.

ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=0 want=8ae00 got=a9e00
off=0 want=8ae04 got=a9e04
off=0 want=8ae08 got=a9e08
off=0 want=8ae0c got=a9e0c
off=0 want=8ae10 got=a9e10
off=0 want=8ae14 got=a9e14
off=0 want=8ae18 got=a9e18
off=0 want=8ae1c got=a9e1c
off=0 want=8ae20 got=a9e20
off=0 want=8ae24 got=a9e24
0.0000 MB in 0.0029 secs, 0.0000 MB/sec
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=51000000 want=3129e00 got=3147e00
off=51000000 want=3129e04 got=3147e04
off=51000000 want=3129e08 got=3147e08
off=51000000 want=3129e0c got=3147e0c
off=51000000 want=3129e10 got=3147e10
off=51000000 want=3129e14 got=3147e14
off=51000000 want=3129e18 got=3147e18
off=51000000 want=3129e1c got=3147e1c
off=51000000 want=3129e20 got=3147e20
off=51000000 want=3129e24 got=3147e24
51.0000 MB in 0.1463 secs, 348.5702 MB/sec
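
The mismatch lines above can be summarized with a small script. This is only
a sketch: it assumes that the want/got values are hexadecimal byte offsets
taken from the lmdd pattern and that the page size is 4096 bytes. The helper
name shift_summary is made up for illustration. For the three failing runs
shown above the shift (got minus want) is constant within each run: 0x13000,
0x1f000 and 0x1e000, which is 19, 31 and 30 whole pages. That might suggest
that complete pages end up in the wrong place, but I cannot say more than that.

```python
import re

# Matches lmdd mismatch lines such as "off=1000000 want=1a0000 got=1b3000".
MISMATCH = re.compile(r"off=(\d+) want=([0-9a-f]+) got=([0-9a-f]+)")
PAGE = 4096  # assumed page size in bytes


def shift_summary(lines):
    """Return the set of (got - want) deltas found in lmdd mismatch lines."""
    deltas = set()
    for line in lines:
        m = MISMATCH.search(line)
        if m:
            want = int(m.group(2), 16)  # values are parsed as hex (assumption)
            got = int(m.group(3), 16)
            deltas.add(got - want)
    return deltas


# First and last mismatch line of each failing run quoted above.
runs = {
    "first read": [
        "off=1000000 want=1a0000 got=1b3000",
        "off=1000000 want=1a0024 got=1b3024",
    ],
    "second read after restart": [
        "off=0 want=8ae00 got=a9e00",
        "off=0 want=8ae24 got=a9e24",
    ],
    "third read after restart": [
        "off=51000000 want=3129e00 got=3147e00",
        "off=51000000 want=3129e24 got=3147e24",
    ],
}

for name, lines in runs.items():
    for delta in sorted(shift_summary(lines)):
        pages, rest = divmod(delta, PAGE)
        print(f"{name}: shift {delta:#x} = {pages} pages, remainder {rest}")
```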

How can I debug this further?

* I have never seen a single write corruption. Only reading is the problem.
* After switching from the iSER transport to TCP over IPoIB there is no problem at all.

Since writing is no problem I do not think that the problem is related
to the infiniband layer or the RDMA itself. But is the problem on the 
initiator or on the target side?
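
One way to narrow this down might be to compare per-block checksums of the
backing volume on the target with checksums of the same blocks read through
the initiator. The following is only a sketch, not something I have run
against the setup above: the function names block_digests and
differing_blocks and the 1 MiB block size are chosen for illustration. Note
that buffered reads may be served from the page cache, so a repeated read on
the same host does not necessarily go over the wire again.

```python
import hashlib

BLOCK = 1024 * 1024  # block size for hashing, chosen for illustration


def block_digests(path, count, block=BLOCK):
    """Return one SHA-256 hex digest per block for the first `count` blocks."""
    digests = []
    with open(path, "rb") as f:
        for _ in range(count):
            data = f.read(block)
            if not data:
                break  # end of file or device reached
            digests.append(hashlib.sha256(data).hexdigest())
    return digests


def differing_blocks(a, b):
    """Indices at which two digest lists disagree (up to the shorter list)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical usage, with the device paths from the commands above:
#   on the target:    good = block_digests("/dev/vg0/test", 1000)
#   on the initiator: seen = block_digests("/dev/sdc", 1000)
#   after copying one list to the other host:
#                     print(differing_blocks(good, seen))
```

If the same block indices differ on every read, the problem looks more like
stale data; if they change from run to run, as in the lmdd output above, that
would point towards the read path itself.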

* I tried an experimental Debian kernel 2.6.28, with no different result.
* I swapped the roles of initiator and target: same result.
* The amount of RAM that influenced the TioTest runs does NOT affect the
behavior of lmdd.
* The read corruption occurs with 256M as well as with 32GB RAM.
* The number of CPUs does not matter either. I tried from one core to 8 cores.
* The BIOS of the servers is set to failsafe.
* The firmware of the Mellanox cards is the current version 1.2.0 and was left
unchanged.

Maybe I used the wrong versions of the software packages:

I used:
Debian Lenny packages:
- open-iscsi                         2.0.870~rc3-0.4
- libibverbs1                        1.1.2-1
- librdmacm1                         1.0.7-1

From OFED-1.3, self-compiled:
- libibcommon                        1.1.1-1
- libibumad                          1.2.1-1
- opensm                             3.2.2

STGT, self-compiled:
- tgtd                               0.9.3
against Debian -dev packages:
- libibverbs-dev                     1.1.2-1
- librdmacm-dev                      1.0.7-1

Any help welcome

Best regards

Volker

-- 
====================================================
   inqbus it-consulting      +49 ( 341 )  5643800
   Dr.  Volker Jaenisch      http://www.inqbus.de
   Herloßsohnstr.    12      0 4 1 5 5    Leipzig
   N  O  T -  F Ä L L E      +49 ( 170 )  3113748
====================================================
