[Stgt-devel] Tuning iSER for performance

Thu Mar 6 17:57:44 CET 2008

erezz at Voltaire.COM wrote on Thu, 06 Mar 2008 12:10 +0200:
> Here's a patch that works with the current version of stgt. 

Thanks for fixing it up; I'll hang onto it for debugging.  Hopefully
the new sync-range code you added isn't actually getting used in
your performance tests.  I doubt it is.

> Now, the performance is even lower (~460 MB/sec with rdwr_sync
> compared to ~670 MB/sec with rdwr). I've noticed that it takes a
> lot of time between target_cmd_queue (time = 663673) and
> iscsi_task_tx_start (669209).

5.5 ms is impossibly slow, unless you're writing to disk and/or
syncing to disk.

You need to decide what you want to test.  Total throughput needs
threading on the target for best performance, and multiple
outstanding commands on the initiator.  Latency tests are best for
understanding which part of the system is being "slow":  network,
disk, context switch, etc.  To test latency effectively you need to
ensure that only 1 command is outstanding.
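
As a rough illustration of the latency style of test, here is a
minimal sketch that issues reads strictly one at a time and averages
the per-command time.  The function names and the path argument are
made up for this example; for a real measurement you would point it
at the iSCSI block device on the initiator and open it with O_DIRECT
and an aligned buffer so the page cache does not hide the target.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic time in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/*
 * Read `len` bytes from offset 0 of `path`, `count` times, waiting for
 * each read to finish before issuing the next, so that only one command
 * is ever outstanding.  Returns the mean latency in microseconds, or
 * -1.0 on error.  (Hypothetical helper, not part of stgt.)
 */
static double mean_read_latency_us(const char *path, size_t len, int count)
{
	char buf[4096];
	uint64_t total = 0;
	int fd, i;

	assert(count > 0 && len <= sizeof(buf));
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1.0;
	for (i = 0; i < count; i++) {
		uint64_t t0 = now_us();

		if (pread(fd, buf, len, 0) < 0) {
			close(fd);
			return -1.0;
		}
		total += now_us() - t0;
	}
	close(fd);
	return (double)total / count;
}
```

A throughput test would instead keep many such reads in flight at
once, which is why the two kinds of test answer different questions.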

> I don't understand something in the behavior of
> iscsi_task_tx_start (this may be related to the long time
> mentioned above): when it is called, it handles only the 1st task
> in conn->tx_clist.

This would only matter in the multiple-command case, just to point
out the difference again.

> Why doesn't it try to handle all tasks on the
> list? What happens is that after bs completes its work, it takes a
> lot of time until iscsi_task_tx_start is called for that task.

That definitely sounds like a problem.  So just getting into
iscsi_task_tx_start is an issue, even if you only need to be there
for a single task.

> iscsi_task_tx_start *is* called immediately, but it handles the
> 1st task only (so the current task has to wait for this thread to
> wake up multiple times until it will be handled). Can anyone
> explain this design?

After it handles that task, it goes back to the main loop of
iser_tx_progress.  This function will continue to be called as long
as num_tx_ready is non-zero.  Various points increment that:
conn->tp->event_modify(.., ..|EPOLLOUT) and some completion events
from the NIC.

This is just like how TCP works.  We let the top-level epoll() drive
all the events for all the connections, with this added counter so
that non-pollable RDMA events can be tracked too.
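
The idea can be sketched roughly as follows.  This is not the tgtd
code, only a minimal illustration under assumptions: tx_progress(),
event_loop_once() and tasks_sent are names invented here, and the
only pollable descriptor is assumed to be an eventfd registered with
its fd stored in data.fd.  The point is that a non-zero counter
turns the epoll wait into a non-blocking poll, and each pass sends
at most one task.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-ins for the counter and the work it tracks. */
static int num_tx_ready;
static int tasks_sent;

/* Handle at most one ready task per call, then return to the loop. */
static void tx_progress(void)
{
	if (num_tx_ready > 0) {
		num_tx_ready--;
		tasks_sent++;
	}
}

/*
 * One pass of the event loop.  If software-tracked work is pending,
 * poll epoll without blocking so pollable descriptors are still
 * serviced; otherwise sleep for up to timeout_ms.  Returns the number
 * of descriptor events seen, or -1 on error.
 */
static int event_loop_once(int epfd, int timeout_ms)
{
	struct epoll_event ev[8];
	int i, n;

	assert(epfd >= 0);
	n = epoll_wait(epfd, ev, 8, num_tx_ready ? 0 : timeout_ms);
	for (i = 0; i < n; i++) {
		uint64_t val;

		/* Drain the eventfd; a real loop would dispatch a handler. */
		if (read(ev[i].data.fd, &val, sizeof(val)) < 0)
			return -1;
	}
	tx_progress();
	return n;
}
```

With three tasks queued, three passes through event_loop_once() are
needed to send them all, which is the behavior being asked about.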

If you narrow down big delays, like the 5.5 ms, to exactly two
points, then look at the code and figure out what has to happen to
get from one to the other, that will help us figure out what to fix.
The previous mail was an example: it looked like getting into the RX
progress function was slow, which suggests either a problem with
notifications from the NIC or a bug on that relatively short path.
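
One way to do the narrowing, sketched here under the assumption that
your trace gives a label and a microsecond timestamp per point (the
struct and function names are invented for this example), is to scan
the records for the largest gap between adjacent points:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One trace record: a label and a timestamp in microseconds. */
struct trace_point {
	const char *name;
	uint64_t usec;
};

/*
 * Return the index i such that the gap between records i and i + 1 is
 * the largest in the trace, and store that gap in *gap_usec.  Records
 * must be in time order and n must be at least 2.
 */
static size_t largest_gap(const struct trace_point *tp, size_t n,
			  uint64_t *gap_usec)
{
	size_t i, worst = 0;

	assert(n >= 2);
	*gap_usec = 0;
	for (i = 0; i + 1 < n; i++) {
		uint64_t d = tp[i + 1].usec - tp[i].usec;

		if (d > *gap_usec) {
			*gap_usec = d;
			worst = i;
		}
	}
	return worst;
}
```

Fed the two timestamps quoted above, 663673 at target_cmd_queue and
669209 at iscsi_task_tx_start, this reports a gap of 5536 us, which
is the 5.5 ms in question.  Adding more trace points between those
two and rerunning shrinks the suspect region each time.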

		-- Pete