[Stgt-devel] Tuning iSER for performance

FUJITA Tomonori tomof
Mon Mar 3 15:14:37 CET 2008


On Sun, 02 Mar 2008 15:17:16 +0200
Erez Zilber <erezz at Voltaire.COM> wrote:

> Pete Wyckoff wrote:
> 
> > erezz at Voltaire.COM wrote on Sun, 10 Feb 2008 14:54 +0200:
> >   
> >> I'm trying to run some performance tests on iSER. I'm using the latest
> >> stgt build (tgt-20071227) against open-iscsi r865.15. When I run sgp_dd
> >> on a single LUN directly from the target machine, I get ~400 MB/sec
> >> (which is ok because it's the max throughput of that device). When I run
> >> it from the initiator, I get only ~240 MB/sec.
> >>
> >> Did you use any specific configuration when you did your measurements?
> >> Pete - according to your numbers, you got wire speed. I wonder why do I
> >> get 40% less than the physical storage throughput. BTW - when I use
> >> another iSCSI over iSER target, I get 400 MB/sec from the initiator.
> >>     
> >
> > Here's a patch that will log timestamps.  I use these to pinpoint
> > where the time is going.  First check for anomalies:  plot it up
> > with line number on the x axis, time stamp on y, make sure it is
> > smooth.
> 
> It is almost smooth. From time to time, you can see that it takes time
> before iser_rx_progress is called. I guess that the cq is empty for a
> short time. I don't think that this is the real problem.
> 
> >   Then pick one operation:  zoom in with the plot and your
> > editor to look at the times for a single read/write, from when the
> > request first appears to when the response goes out.
> >   
> 
> Here's an example of a single command execution:
> 
> 27892 tgtd: [16:40:38.907547] iser_rx_progress(1101) entry
> 27893 tgtd: [16:40:38.907558] handle_wc(897) incoming cmd, len 76
> 27894 tgtd: [16:40:38.907568] iscsi_iser_read(1301) buf 0x529fc8 nbytes 48
> 27895 tgtd: [16:40:38.907578] iser_parse_hdr(1265) control type PDU
> 27896 tgtd: [16:40:38.907587] iser_parse_hdr(1271) rstag 9b320086 va
> 3f1fc000
> 27897 tgtd: [16:40:38.907597] iscsi_scsi_cmd_rx_start(1466) 1 28 0 0
> 524288 1 b
> 27898 tgtd: [16:40:38.907607] iscsi_rdma_alloc_data_buf(1659) malloc
> 0x2acf61554000 sz 524288
> 27899 tgtd: [16:40:38.907617] iscsi_task_queue(1411) 388b 388b 1
> 27900 tgtd: [16:40:38.907627] target_cmd_queue(763) 0x535de8 28 1
> 27901 tgtd: [16:40:38.907637] target_cmd_queue(783) 0x535de8 28 1 1
> 
> this is where bs work starts:
> 27902 tgtd: [16:40:38.907649] tgtd: [16:40:38.907650]
> bs_thread_worker_fn(151) got 0x535de8
> 27903 target_cmd_queue(794) b 28 (nil) 0x2acf61554000 262144 0 524288 0 4
> 27904 tgtd: [16:40:38.907676] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27905 tgtd: [16:40:38.907686] handle_wc(922) incoming cmd proc done, repost
> 27906 tgtd: [16:40:38.907697] iser_tx_progress(1017) entry
> 27907 tgtd: [16:40:38.907706] iser_tx_progress(1028) trying tx
> 27908 tgtd: [16:40:38.907716] iscsi_task_tx_start(1784) no more data
> 27909 tgtd: [16:40:38.907725] iscsi_rdma_event_modify(1633) tx ready
> removing 0x529eb0
> 
> this is where bs work ends (130 us):
> 27910 tgtd: [16:40:38.907779] bs_rdwr_request(114) io done 0x535de8 28
> 524288 524288
> 27911 tgtd: [16:40:38.907796] bs_thread_ack_fn(82) found 0x535de8
> 27912 tgtd: [16:40:38.907813] bs_thread_request_done(122) back to tgtd,
> 0x535de8
> 27913 tgtd: [16:40:38.907825] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27914 tgtd: [16:40:38.907837] iser_tx_progress(1017) entry
> 27915 tgtd: [16:40:38.907846] iser_tx_progress(1028) trying tx
> 27916 tgtd: [16:40:38.907856] iscsi_task_tx_start(1759) found a task b
> 524288 0 0
> 27917 tgtd: [16:40:38.907865] iscsi_data_rsp_build(974) 524288 524288
> 524288 524288 b
> 27918 tgtd: [16:40:38.907875] iscsi_rdma_rdma_write(1491) offset 0 len
> 524288 stag 9b320086 va 3f1fc000
> 27919 tgtd: [16:40:38.907885] iscsi_rdma_event_modify(1633) tx ready
> removing 0x529eb0
> 27920 tgtd: [16:40:38.907895] iscsi_scsi_cmd_tx_done(1707) more data or
> sense or bidir b
> 27921 tgtd: [16:40:38.908269] iser_rx_progress(1101) entry
> 27922 tgtd: [16:40:38.908280] handle_wc(932) RDMA write done
> 27923 tgtd: [16:40:38.908289] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27924 tgtd: [16:40:38.908298] iser_tx_progress(1017) entry
> 27925 tgtd: [16:40:38.908308] iser_tx_progress(1028) trying tx
> 27926 tgtd: [16:40:38.908317] iscsi_task_tx_start(1759) found a task b
> 524288 524288 0
> 27927 tgtd: [16:40:38.908327] iscsi_cmd_rsp_build(901) 0x535d28 28
> 27928 tgtd: [16:40:38.908336] iscsi_iser_write_begin(1352) new sendl
> 0x2aaab45f7090 len 48
> 27929 tgtd: [16:40:38.908347] iscsi_iser_write_end(1387) sendl
> 0x2aaab45f7090 len 76
> 27930 tgtd: [16:40:38.908356] iscsi_iser_write_end(1402) inc progress to
> finish cmd
> 27931 tgtd: [16:40:38.908366] __cmd_done(855) 8 (nil) 0x2acf61554000 0
> 524288 0
> 27932 tgtd: [16:40:38.908376] iscsi_rdma_free_data_buf(1678) free
> 0x2acf61554000
> 
> It seems that bs work takes a lot of time. For example: if I run
> disktest on the same device directly from the target machine (i.e. tgt
> is not involved) in the following way:
> 
> disktest -PT -T100 -h1 -K64 -B256k -ID /dev/sdc -r
> 
> I get ~13000 iops -> each IO takes ~77 us (compared to 130 us only for
> bs work with tgt). I'm not familiar with the bs code in tgt, so I'm not
> sure what should be done here.

The problem is that Linux lacks a nice event notification
mechanism. The main thread uses epoll to wait on events. I/O threads
(pthread) use pipe to notify the main thread of I/O completion. It's
not effective at all.


> Another thing that looks strange - from looking at the logs, I don't see
> that work is done in parallel. A command is received, iSCSI processing
> is done and it is sent to bs processing (which seems to be idle until
> that point in time). Now, nothing happens until bs processing is
> complete. I will try to look more into the code and try to understand
> it, but can someone try to elaborate on this?

sgp_dd issues only one SCSI command at a time?



More information about the stgt mailing list