[Stgt-devel] iSER

Pete Wyckoff pw
Sun Sep 9 20:12:42 CEST 2007


robin.humble+stgt at anu.edu.au wrote on Sun, 09 Sep 2007 11:30 -0400:
> Summary:
>  - 2.6.21 seems to be a good kernel. 2.6.22 or newer, or RedHat's OFED 1.2
>    patched kernels all seem to have iSER bugs that make them unusable.
>  - as everything works in 2.6.21 presumably this means there's nothing
>    wrong with the iSER implementation in tgtd. well done! :)

Well, that's good and bad news.  Nice to know that things do work at times,
but we have to figure out what happened in the initiator now.  Or maybe tgt
is making some bad assumptions.

> with the 2.6.22.6 kernel and iSER I couldn't find any corruption
> issues using dd to /dev/sdc. however (as reported previously) if I put
> an ext3 filesystem on the iSER device and then dd to a file in the ext3
> filesystem then pretty much immediately I get:
>   Sep  9 21:46:22 x11 kernel: EXT3-fs error (device sdc): ext3_new_block: Allocating block in system zone - blocks from 196611, length 1
>   Sep  9 21:46:22 x11 kernel: EXT3-fs error (device sdc): ext3_new_block: Allocating block in system zone - blocks from 196612, length 1
>   Sep  9 21:46:22 x11 kernel: EXT3-fs error (device sdc): ext3_new_block: Allocating block in system zone - blocks from 196613, length 1
>   ...
> 
> I get the same type of errors with 2.6.23-rc5 too.

I still haven't been able to reproduce this, at least on my
2.6.22-rc5.  One of these days we'll move to newer kernels here,
but we've been waiting for the bidi approaches to stabilize
somewhat.

The only issue I've found is a slight race condition when the
initiator unexpectedly hangs up.  The target would exit if it saw
a work request flush before seeing the CM disconnect event.  I've
added a new patch to the git tree to fix this, but it doesn't
explain your corruption issues.
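
For the curious, the fix amounts to treating a flushed work
completion as a normal part of teardown rather than a fatal error.
A rough sketch of the idea, using plain libibverbs and made-up
helper names (conn_mark_closing, process_completion) rather than
tgt's actual functions:

/* Sketch only: a CQ poll loop that tolerates flush errors arriving
 * before the CM disconnect event.  The helpers below are hypothetical
 * stand-ins for tgt's own connection handling. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

static void conn_mark_closing(uint64_t wr_id);      /* hypothetical */
static void process_completion(struct ibv_wc *wc);  /* hypothetical */

static void handle_completions(struct ibv_cq *cq)
{
	struct ibv_wc wc;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status == IBV_WC_WR_FLUSH_ERR) {
			/* QP has gone into the error state; the
			 * RDMA_CM_EVENT_DISCONNECTED event may not have
			 * been delivered yet.  Defer teardown instead
			 * of exiting the target. */
			conn_mark_closing(wc.wr_id);
			continue;
		}
		if (wc.status != IBV_WC_SUCCESS) {
			fprintf(stderr, "wc error %d on wr_id %llu\n",
				wc.status, (unsigned long long) wc.wr_id);
			continue;
		}
		process_completion(&wc);
	}
}

The point is simply that IBV_WC_WR_FLUSH_ERR can race ahead of the
CM disconnect event, so the poll loop defers teardown until that
event actually arrives.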

> with 2.6.21 (mem=512M) on the initiator side and 2.6.21 or 2.6.22.6
> (7.1g ramdisk as backing store) then everything seems to work fine.
> eg. bonnie++
> 
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> x11            512M 80329  99 521771 99 224506 44 85983  95 525440 49 +++++ +++
> x11              1G 80649  99 484939 92 207655 43 59377  98 488031 41 13703  14
> x11              2G 79976  99 461833 94 208618 42 74189  97 467245 39 10060  13
> x11              4G 79873  99 487361 97 210199 43 87312  98 484341 42  8459  13
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
>                  64 80318  99 +++++ +++ 86949  99 80277  99 +++++ +++ 83630 100
>                 256 68904  97 436942 98 61886  83 67777  95 +++++ +++ 48291  69
>                 512 40226  62 34164  25 37500  65 44426  67 22325  18 28473  53

You're getting in the neighborhood of 500 MB/s for block reads _and_
writes through ext3.  This is different from your earlier results with dd:

robin.humble+stgt at anu.edu.au wrote on Wed, 05 Sep 2007 10:46 -0400:
> bypassing the page cache (and readahead?) with O_DIRECT:
>  eg. dd if=/dev/zero of=/dev/sdc bs=1k count=8000 oflag=direct
>    bs   write MB/s read MB/s
>   10M     1200      520
>    1M      790      460
>  200k      480      350
>    4k       40       34
>    1k       11        9
> large writes look fabulous, but reads seem to be limited by something
> other than IB bandwidth.
> 
> in the more usual usage case via the page cache:
>  eg. dd if=/dev/zero of=/dev/sdc bs=1k count=8000000 
>    bs   write MB/s read MB/s
>   10M     1100      260
>    1M     1100      270
>    4k      960      270
>    1k       30      240
> so maybe extra copies to/from page cache are getting in the way of the
> read bandwidth and are lowering it by a factor of 2.
> I'm guessing the good small block read performance here is due to
> readahead, and the mostly better writes are from aggregation.

We see behavior similar to your earlier dd tests, where reads are
too slow, and we've started trying to figure out what's taking so
long in the code.  One would expect reads to be the fast case, as
they map to RDMA write operations.
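
To make that concrete: for a SCSI READ the target already holds the
data in a registered buffer and just pushes it into the initiator's
advertised buffer with an RDMA Write, so the data itself needs no
extra round trip.  A hedged sketch in plain libibverbs (the buffer,
MR, and the remote address/rkey from the iSER header are assumed to
be set up already; this is not tgt's actual code path):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: returning READ data to the initiator as an RDMA Write.
 * A separate send carrying the SCSI response still follows it. */
static int post_read_data(struct ibv_qp *qp, struct ibv_mr *mr,
			  void *buf, uint32_t len,
			  uint64_t remote_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t) buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.opcode              = IBV_WR_RDMA_WRITE;
	wr.sg_list             = &sge;
	wr.num_sge             = 1;
	wr.send_flags          = IBV_SEND_SIGNALED;
	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey        = rkey;

	return ibv_post_send(qp, &wr, &bad_wr);
}

So the read data path is a single posted write, which is why we want
to find out where the time is actually going.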

Thanks for doing all this testing.

		-- Pete
