This performance report uses iSER/IB as the iSCSI transport. It also makes use of the "null" backing store patch (that I sent recently) in various combinations.

The general setup consisted of 3 machines. All of them ran RHEL-5, kernel 2.6.18-8.el5, with IB OFED 1.3.1. The only hardware difference was the CPU:

  A, B: Intel quad-core, 2.66 GHz, 6 MB CPU cache, reported 5300 BogoMIPS
  C:    Intel quad-core, 1.6 GHz,  4 MB CPU cache, reported 3200 BogoMIPS

1) Initiator to Null I/O Target measurements, unidirectional

The first round of measurements evaluated the performance of stgt with the null backing store by itself. sgp_dd was executed in a series of runs with varying parameters from one machine, acting as the iSER initiator, performing I/O against the target devices (backed by null I/O). In all runs direct I/O was employed with sgp_dd (dio=1 flag).
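The exact command lines are not reproduced in this report, but the runs were of roughly the following form; the /dev/sgX device name and the count value are placeholders, and the command data size is bs * bpt (64 KB in this sketch):

  # READ run: 64 KB commands, 10 threads, direct I/O on the sg device
  sgp_dd if=/dev/sgX of=/dev/null bs=512 bpt=128 thr=10 dio=1 time=1 count=1048576

  # WRITE run: 64 KB commands, 2 threads, direct I/O on the sg device
  sgp_dd if=/dev/zero of=/dev/sgX bs=512 bpt=128 thr=2 dio=1 time=1 count=1048576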
There were 3 pairs of machines, representing different combinations of faster and slower CPUs. In the tables below, the leftmost column is the command data size in KB, and the CPU columns give CPU utilization on the initiator (Ini) and target (Tgt) sides.

     |  C -> B  (slow -> fast)
 ----+-----------------------------------+-----------------------------------
 cmd |               READ                |               WRITE
  KB |    10 threads   |    2 threads    |    10 threads   |    2 threads
     | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU
     | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt
 ----+-----------------+-----------------+-----------------+-----------------
   1 |   40.23  48  26 |   33.10  31  20 |   17.71  39   8 |   34.17  29  25
   2 |   78.92  46  20 |   66.81  27  18 |   39.12  40  10 |   74.59  29  24
   4 |  152.78  61  20 |  136.36  27  21 |   69.12  39  12 |  131.58  27  19
   8 |  300.29  49  22 |  253.43  20  20 |  139.62  39  11 |  215.88  27  18
  16 |  578.87  50  21 |  448.47  27  18 |  286.63  40  13 |  401.30  25  15
  32 | 1102.55  48  21 |  785.83  20  15 |  534.02  40  11 |  719.35  20  15
  64 | 1545.97  62  12 | 1276.97  22  10 | 1078.57  43  11 | 1018.94  13   7
 128 | 1554.66  18   9 | 1552.43  12   9 | 1372.32  17   7 | 1322.41  15   8
 256 | 1558.88  13   4 | 1558.74  12   4 | 1407.28  12   3 | 1408.01   9   4

     |  B -> C  (fast -> slow)
 ----+-----------------------------------+-----------------------------------
 cmd |               READ                |               WRITE
  KB |    10 threads   |    2 threads    |    10 threads   |    2 threads
     | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU
     | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt
 ----+-----------------+-----------------+-----------------+-----------------
   1 |   49.21  45  33 |   28.17  11  32 |   29.79  32  29 |   30.48  15  24
   2 |  108.77  36  31 |   51.95  14  30 |   62.47  34  29 |   59.67  11  29
   4 |  193.70  37  34 |  110.64  16  28 |  112.23  20  28 |  115.61  12  31
   8 |  395.79  38  32 |  243.20  12  30 |  212.86  30  26 |  200.38  13  24
  16 |  828.61  41  30 |  413.87  12  29 |  410.61  31  29 |  340.67  12  24
  32 | 1404.48  38  37 |  719.95  12  26 | 1058.36  33  32 |  628.80  12  26
  64 | 1441.04  17  28 | 1177.18  13  22 | 1382.02  17  22 |  998.00  12  18
 128 | 1462.53   8  12 | 1443.83   7  11 | 1433.16  11  11 | 1329.54   8  16
 256 | 1474.06   5   6 | 1463.56   4   6 | 1472.01   7   7 | 1468.33   6   9

     |  B -> A  (fast -> fast)
 ----+-----------------------------------+-----------------------------------
 cmd |               READ                |               WRITE
  KB |    10 threads   |    2 threads    |    10 threads   |    2 threads
     | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU | bandw.  CPU CPU
     | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt | MBPS    Ini Tgt
 ----+-----------------+-----------------+-----------------+-----------------
   1 |   63.32  49  33 |   48.13  19  32 |   35.94  35  15 |   46.80  19  27
   2 |  129.74  47  30 |  102.34  24  30 |   71.61  36  17 |   86.90  23  29
   4 |  258.35  47  33 |  201.42  22  20 |  136.61  36  18 |  168.67  22  21
   8 |  535.39  50  31 |  359.67  22  18 |  273.17  33  15 |  277.22  18  19
  16 |  940.08  47  31 |  673.61  18  20 |  496.28  38  20 |  481.95  12  22
  32 | 1466.89  43  24 | 1073.30  16  18 | 1085.22  36  19 |  791.71  15  16
  64 | 1514.24  19  15 | 1444.92  11  10 | 1386.33  15  12 | 1159.58  12  11
 128 | 1537.99  12   8 | 1526.96  10   6 | 1434.68  11   7 | 1414.49   8   6
 256 | 1550.57   5   4 | 1539.24   5   4 | 1472.73   6   3 | 1470.35   5   4

1-a) Bandwidth

The InfiniBand link is DDR, with a theoretical limit of 2000 MB/s and a practical limit of about 1550 MB/s on PCIe Gen1 for RDMA-bound communication, as measured with the standard test tool ib_rdma_bw. As a side note, about 1900 MB/s can be achieved with PCIe Gen2 cards.

Thus, when doing null I/O, stgt is capable of utilizing the full IB unidirectional bandwidth in a sustained manner for large command data sizes (128 KB and up).

The CPU utilization, which is markedly shifted towards the initiator, is slightly misleading, because the target is performing null I/O while the initiator is busy with the real I/O operations. For bandwidth, the target's computing power has almost no influence, while a slower initiator CPU degrades the figures a bit.

1-b) IOPS

I/O latency, as measured by IOPS (with 1 KB commands), depends on many factors, such as the processing time of a single command and the effectiveness of command pipelining, which are difficult to bound beforehand. A rough theoretical limit may be evaluated from the Send and RDMA latencies on the IB link. On our system, as measured by the test tools ib_write_lat and ib_rdma_lat with 1 KB messages, the average write latency is approximately 4-5 us and the RDMA latency 3-4 us, which gives a rough limit of 250 kIOPS (about 4 us per command, i.e. roughly 1,000,000 / 4 = 250,000 commands per second).

-----------+-----------------------------+-----------------------------
           |            READ             |            WRITE
           |  10 threads  |   2 threads  |  10 threads  |   2 threads
           |  KIOPS/ CPU  |  KIOPS/ CPU  |  KIOPS/ CPU  |  KIOPS/ CPU
           |  MBPS   I T  |  MBPS   I T  |  MBPS   I T  |  MBPS   I T
-----------+--------------+--------------+--------------+--------------
slow->fast |  40.23 48 26 |  33.10 31 20 |  17.71 39  8 |  34.17 29 25
fast->slow |  49.21 45 33 |  28.17 11 32 |  29.79 32 29 |  30.48 15 24
fast->fast |  63.32 49 33 |  48.13 19 32 |  35.94 35 15 |  46.80 19 27

The table above restates the IOPS data for 1 KB commands from the previous tables. A single initiator achieves only tens of kIOPS when accessing an stgt target. This is far from the limit, but when we ran the same test in parallel from two initiator machines (B->A + C->A), each one was able to reach the same IOPS mark as when operating separately (same format as above):

     |     READ     |    WRITE
     |  10 threads  |  2 threads
-----+--------------+--------------
slow |  38.07 49 32 |  32.51 26 33
fast |  64.65 49 32 |  44.36 15 33

Similar results are obtained when the same initiator runs two instances of sgp_dd, either against separate devices exported by the target or against the same device (a sketch of such a run is given at the end of this subsection):

       |    READ, 10 threads    |    WRITE, 2 threads
-------+------------------------+------------------------
1 dev  | (48.13 + 50.13) 76 35  | (33.04 + 34.11) 33 30
2 devs | (50.43 + 52.89) 73 37  | (31.22 + 29.31) 35 31

This apparently means that stgt can sustain higher IOPS rates, and that the bottlenecks are on the initiator side: either in the iSCSI/iSER initiator modules, in the sgp_dd application, or in both.
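The two-instance runs were simply two concurrent sgp_dd invocations; a minimal sketch, again with the sg device name and the count value as placeholders (bs=512 bpt=2 gives the 1 KB commands used for the IOPS tests):

  # two sgp_dd readers hitting the same exported LUN in parallel
  sgp_dd if=/dev/sgX of=/dev/null bs=512 bpt=2 thr=10 dio=1 time=1 count=1048576 &
  sgp_dd if=/dev/sgX of=/dev/null bs=512 bpt=2 thr=10 dio=1 time=1 count=1048576 &
  wait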
The same conclusion is suggested by the performance data as a whole: the initiator CPU speed has a high impact on throughput with smaller command sizes.

Another peculiar observation is that the measured throughput depends on the number of threads employed by sgp_dd in a non-intuitive manner: peak throughput for write commands is achieved with 2 threads, while reads peak at 10 threads. This is the reason that all measurements above were taken with these two thread counts.

2) Initiator to Null I/O Target measurements, bi-directional

Because IB is full-duplex, it is theoretically possible to achieve an aggregate throughput of 2 x 1500 MB/s if an equal mix of read and write commands is executed. The bi-directional test ib_rdma_bw achieves an aggregate throughput of 2900 MB/s in our setup. In practice, however, there are various bottlenecks along the entire I/O path, where command processing and data flow are coupled, so this limit may not be reached.

             | READ  | WRITE
             | MBPS  | MBPS
-------------+-------+-------
single dir   | 1550  | 1450
-------------+-------+-------
bi-dir 1 ini | 1540  |  795
bi-dir 2 ini | 1550  | 1300

When driven by two different initiators, stgt does quite well with a bi-directional mix, while two streams started from the same initiator show a drop in write performance.

3) Full-fledged Target backed by a Null I/O Target, measurements

Another round of measurements used the machines A -> B -> C and evaluated real stgt performance, with the null I/O target simulating high-speed storage that is currently unavailable to us. The sgp_dd batch was executed from machine A, while the stgt target on machine B re-exported devices originally exported by the null I/O backed stgt running on C.

  [A, (iser initiator)]
        <-- IB -->
  [B, (stgt: iscsi_rdma + rdwr_bs) - (iser initiator)]
        <-- IB -->
  [C, (stgt: iscsi_rdma + null_bs)]

This scheme puts the "real" (middle) target in a position where both links (the back-end and front-end iSCSI/iSER connections) run over the same medium and are subject to the same latency and throughput limitations. Two separate IB ports were used for these links.

Because the stgt instance on B is doing real I/O from its perspective, caching effects are introduced when bs_rdwr is used. We therefore used two large devices (each larger than the memory size of B) and, before taking measurements on one of them, read the other device in its entirety to displace cached data.

----+------+-------
cmd | READ | WRITE
 KB | MBPS | MBPS
----+------+-------
  1 |   31 |   35
  2 |   55 |   67
  4 |   63 |   83
  8 |  123 |  153
 16 |  179 |  300
 32 |  258 |  549
 64 |  308 |  662
128 |  437 |  710
256 |  737 |  753
512 |  800 |  820

These observations suggest that the main bottleneck of stgt is in the I/O layer, currently implemented as bs_rdwr. This theme is not new, but I hope that the above setup provides a good benchmark for measuring further improvements over the current state.

I did not perform more extensive measurements (swapping initiator and target machines, changing parameters, etc.), because the qualitative conclusion is quite clear, and I want to do that in a separate session comparing different backing store types.

Alexander Nezhinsky