[stgt] [PATCH] use pthread per target

Or Gerlitz ogerlitz at voltaire.com
Thu May 27 12:14:21 CEST 2010


FUJITA Tomonori wrote:
> [...] this improves the performance. One target box exports four Intel SSD drivers as four logical units to one initiator boxes with 10GbE. The initiator runs disktest against each logical unit. The initiator gets about 500 MB/s in total. Running four tgt process (with "-C" option) produces about 850 MB/s in total. This patch also gives about 850MB/s. Seems that one process can handle 10GbE I/Os, however, it's not enough to handle 10GbE _and_ signalfd load generated by fast disk I/Os.
Hi Tomo,

Reading your email, I wasn't sure whether, in the first test, each of 
the LUNs was exported through a different iSCSI target (with all four 
targets served by the same process) or all LUNs through the same iSCSI 
target.

Since a session is established per target, and per-session bottlenecks 
may exist on both the initiator and the target side, I think the 
performance evaluation of this change should be based on at least three 
numbers, e.g. <one process / one target / four LUNs> vs. <one process / 
four targets / one LUN each> vs., with the patch applied, <four tgt 
pthreads / one target+LUN associated with each of them>. If you want to 
dig further, you could also test with only two target pthreads.
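For reference, the first two configurations can be sketched with tgtadm against a single tgtd instance (the IQNs and backing-store paths below are placeholders, not from your setup):

```shell
# Setup A: one target (tid 1) exporting four LUNs.
tgtadm --lld iscsi --mode target --op new --tid 1 \
       --targetname iqn.2010-05.example:tgt-a
for i in 1 2 3 4; do
    tgtadm --lld iscsi --mode logicalunit --op new \
           --tid 1 --lun $i --backing-store /dev/ssd$i
done

# Setup B: four targets, one LUN each.
for i in 1 2 3 4; do
    tgtadm --lld iscsi --mode target --op new --tid $i \
           --targetname iqn.2010-05.example:tgt-b$i
    tgtadm --lld iscsi --mode logicalunit --op new \
           --tid $i --lun 1 --backing-store /dev/ssd$i
done
```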

From some runs I made with multiple tgt processes, I saw that the 
performance difference is notable when going from one to two target 
processes; adding further target processes has a less notable effect. 
More complexities come into play once you hit a bottleneck on the 
initiator side and have to throw another initiator in ...

One more aspect of adding more and more target processes or pthreads is 
the CPU contention caused by the per-target/main tgt thread, i.e. the 
one that interacts with the network and reaps the completions from the 
backing store. When SSDs are used, I believe the SSD vendor's software 
stack typically, or at least in many cases, uses kernel threads which 
maintain the look-aside tables for the flash, etc. As you add more tgt 
processes, at some point the CPU consumption might become suboptimal 
for the system.

Another point to consider: as the number of target processes grows, 
protocols whose per-process hardware endpoint, such as an RDMA 
Completion Queue (CQ), is associated with a dedicated logical interrupt 
handler will give the NIC a harder time coalescing interrupts caused by 
completions from different sessions. Under your patch, with the model 
being thread based, it might be possible to have multiple 
pthreads/targets share the same CQ, but then they would need to access 
it through a lock, which may cause other issues.

> I plan to clean up and merge this.
Maybe this could be made optional at this point?

Or.

--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
