FUJITA Tomonori wrote:
> [...] this improves the performance. One target box exports four Intel
> SSD drives as four logical units to one initiator box with 10GbE. The
> initiator runs disktest against each logical unit. The initiator gets
> about 500 MB/s in total. Running four tgt processes (with the "-C"
> option) produces about 850 MB/s in total. This patch also gives about
> 850 MB/s. It seems that one process can handle 10GbE I/Os; however,
> it's not enough to handle 10GbE _and_ the signalfd load generated by
> fast disk I/Os.

Hi Tomo,

Reading your email, I wasn't sure whether under the first test each one
of the LUNs was exported through a different iscsi target (with all four
targets served by the same process) or all LUNs by the same iscsi
target?

With a session being set up per target, and possible per-session
bottlenecks on both the initiator and the target side, I think the
performance aspect of the change should be based on at least three
numbers, e.g.:

  <one process / one target / four LUNs> vs.
  <one process / four targets / one LUN each> vs.
  applying the patch and then <four tgt pthreads / one target with one
  LUN associated with each of them>

If you want to dig further, you can also test with only two target
pthreads. From some runs I made with multiple tgt processes, I saw that
the performance difference is notable when going from one to two target
processes, while adding further target processes has a less notable
effect. More complexity comes into play when you hit a bottleneck on
the initiator side and then have to throw another initiator in...

One more aspect of adding more and more target processes or pthreads is
the CPU contention caused by the per-target main tgt thread, i.e. the
one that interacts with the network and reaps the completions from the
backing store. When SSDs are used, typically (or at least in many
cases, I believe) the SSD vendor's software stack uses kernel threads
which maintain the look-aside tables for the flash, etc. As you add
more tgt processes, at some point your CPU consumption might become
non-optimal for the system.

Another point to consider: as the number of target processes gets
larger, protocols whose per-process hardware end-point, such as an RDMA
Completion Queue (CQ), is associated with a dedicated logical interrupt
handler will give the NIC a harder time coalescing interrupts caused by
completions from different sessions. Under your patch, with the model
being thread based, it might be possible to have multiple
pthreads/targets use the same CQ, but then they would need to access it
through a lock, which may cause other issues (see the sketch at the end
of this mail).

> I plan to clean up and merge this.

Maybe this can be optional at this point?

Or.
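To make the signalfd cost you describe a bit more concrete, below is a
minimal, untested sketch of a single main loop that reaps per-I/O
completion notifications through a signalfd. It is not the actual tgt
code, and the use of SIGUSR2 as the notification signal is only an
assumption for the example; the point is that every completed
backing-store request costs at least one extra read() in the same
thread that also has to drive the 10GbE socket:

#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void)
{
        sigset_t mask;
        struct signalfd_siginfo fdsi;
        int sfd;

        sigemptyset(&mask);
        sigaddset(&mask, SIGUSR2);

        /* block normal delivery so the signal is only seen via the fd */
        sigprocmask(SIG_BLOCK, &mask, NULL);

        sfd = signalfd(-1, &mask, 0);
        if (sfd < 0) {
                perror("signalfd");
                return 1;
        }

        for (;;) {
                /* one read() per queued notification; with fast SSDs
                 * this competes with the network processing done in
                 * the very same thread */
                if (read(sfd, &fdsi, sizeof(fdsi)) != sizeof(fdsi))
                        break;
                /* reap the completed backing-store request(s) here */
        }

        close(sfd);
        return 0;
}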
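And to spell out what I mean by accessing the CQ through a lock, here
is a rough libibverbs sketch (untested, and not meant as a patch
against tgt's rdma code) of several target pthreads polling one shared
CQ:

#include <pthread.h>
#include <infiniband/verbs.h>

static pthread_mutex_t cq_lock = PTHREAD_MUTEX_INITIALIZER;

/* called from each target pthread; cq is the one shared CQ */
static int poll_shared_cq(struct ibv_cq *cq, struct ibv_wc *wc, int nwc)
{
        int n;

        pthread_mutex_lock(&cq_lock);
        n = ibv_poll_cq(cq, nwc, wc);
        pthread_mutex_unlock(&cq_lock);

        /* completions are handled outside the lock */
        return n;
}

Every completion from every session then funnels through cq_lock, so
the poll paths of the different pthreads serialize; that is the kind of
contention I would want to measure before going that way.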