[Stgt-devel] Tuning iSER for performance
Pete Wyckoff
pw
Mon Mar 3 19:08:01 CET 2008
erezz at Voltaire.COM wrote on Sun, 02 Mar 2008 15:17 +0200:
> Here's an example of a single command execution:
>
> 27892 tgtd: [16:40:38.907547] iser_rx_progress(1101) entry
> 27893 tgtd: [16:40:38.907558] handle_wc(897) incoming cmd, len 76
> 27894 tgtd: [16:40:38.907568] iscsi_iser_read(1301) buf 0x529fc8 nbytes 48
> 27895 tgtd: [16:40:38.907578] iser_parse_hdr(1265) control type PDU
> 27896 tgtd: [16:40:38.907587] iser_parse_hdr(1271) rstag 9b320086 va
> 3f1fc000
> 27897 tgtd: [16:40:38.907597] iscsi_scsi_cmd_rx_start(1466) 1 28 0 0
> 524288 1 b
> 27898 tgtd: [16:40:38.907607] iscsi_rdma_alloc_data_buf(1659) malloc
> 0x2acf61554000 sz 524288
> 27899 tgtd: [16:40:38.907617] iscsi_task_queue(1411) 388b 388b 1
> 27900 tgtd: [16:40:38.907627] target_cmd_queue(763) 0x535de8 28 1
> 27901 tgtd: [16:40:38.907637] target_cmd_queue(783) 0x535de8 28 1 1
>
> this is where bs work starts:
> 27902 tgtd: [16:40:38.907649] tgtd: [16:40:38.907650]
> bs_thread_worker_fn(151) got 0x535de8
> 27903 target_cmd_queue(794) b 28 (nil) 0x2acf61554000 262144 0 524288 0 4
> 27904 tgtd: [16:40:38.907676] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27905 tgtd: [16:40:38.907686] handle_wc(922) incoming cmd proc done, repost
> 27906 tgtd: [16:40:38.907697] iser_tx_progress(1017) entry
> 27907 tgtd: [16:40:38.907706] iser_tx_progress(1028) trying tx
> 27908 tgtd: [16:40:38.907716] iscsi_task_tx_start(1784) no more data
> 27909 tgtd: [16:40:38.907725] iscsi_rdma_event_modify(1633) tx ready
> removing 0x529eb0
>
> this is where bs work ends (130 us):
> 27910 tgtd: [16:40:38.907779] bs_rdwr_request(114) io done 0x535de8 28
> 524288 524288
> 27911 tgtd: [16:40:38.907796] bs_thread_ack_fn(82) found 0x535de8
> 27912 tgtd: [16:40:38.907813] bs_thread_request_done(122) back to tgtd,
> 0x535de8
> 27913 tgtd: [16:40:38.907825] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27914 tgtd: [16:40:38.907837] iser_tx_progress(1017) entry
> 27915 tgtd: [16:40:38.907846] iser_tx_progress(1028) trying tx
> 27916 tgtd: [16:40:38.907856] iscsi_task_tx_start(1759) found a task b
> 524288 0 0
> 27917 tgtd: [16:40:38.907865] iscsi_data_rsp_build(974) 524288 524288
> 524288 524288 b
> 27918 tgtd: [16:40:38.907875] iscsi_rdma_rdma_write(1491) offset 0 len
> 524288 stag 9b320086 va 3f1fc000
> 27919 tgtd: [16:40:38.907885] iscsi_rdma_event_modify(1633) tx ready
> removing 0x529eb0
> 27920 tgtd: [16:40:38.907895] iscsi_scsi_cmd_tx_done(1707) more data or
> sense or bidir b
> 27921 tgtd: [16:40:38.908269] iser_rx_progress(1101) entry
> 27922 tgtd: [16:40:38.908280] handle_wc(932) RDMA write done
> 27923 tgtd: [16:40:38.908289] iscsi_rdma_event_modify(1628) tx ready
> adding 0x529eb0
> 27924 tgtd: [16:40:38.908298] iser_tx_progress(1017) entry
> 27925 tgtd: [16:40:38.908308] iser_tx_progress(1028) trying tx
> 27926 tgtd: [16:40:38.908317] iscsi_task_tx_start(1759) found a task b
> 524288 524288 0
> 27927 tgtd: [16:40:38.908327] iscsi_cmd_rsp_build(901) 0x535d28 28
This is rather slow in here. It does the RDMA write, gets the local
completion that the write went out, then sends the command response
packet. Does it make sense that it takes 374 us for the adapter to
say that it finished the write?
Another comment: there is 10 us between consecutive printfs in this
trace. You're probably not using TSC for GTOD. It's not a big
deal, but be aware that your "with debugging" runs may be
significantly slower than without. Just watch to make sure that the
profiling output makes sense. We hack the kernel in a bad way to
get around this on our machines.
http://git.osc.edu/?p=linux.git;a=commitdiff;h=e7ff54222523a2ae6816c9ff17ad5a8fa86cd5e1
> 27928 tgtd: [16:40:38.908336] iscsi_iser_write_begin(1352) new sendl
> 0x2aaab45f7090 len 48
> 27929 tgtd: [16:40:38.908347] iscsi_iser_write_end(1387) sendl
> 0x2aaab45f7090 len 76
> 27930 tgtd: [16:40:38.908356] iscsi_iser_write_end(1402) inc progress to
> finish cmd
> 27931 tgtd: [16:40:38.908366] __cmd_done(855) 8 (nil) 0x2acf61554000 0
> 524288 0
> 27932 tgtd: [16:40:38.908376] iscsi_rdma_free_data_buf(1678) free
> 0x2acf61554000
>
> It seems that bs work takes a lot of time. For example: if I run
> disktest on the same device directly from the target machine (i.e. tgt
> is not involved) in the following way:
>
> disktest -PT -T100 -h1 -K64 -B256k -ID /dev/sdc -r
>
> I get ~13000 iops -> each IO takes ~77 us (compared to 130 us only for
> bs work with tgt). I'm not familiar with the bs code in tgt, so I'm not
> sure what should be done here.
It would be interesting to isolate the IO effects from the thread
switching to see where it all goes. I recall a lot of it goes to
non-IO overhead. A really-synchronous of bs_rdwr follows. It may
not build against the current tree; it's old.
> Another thing that looks strange - from looking at the logs, I don't see
> that work is done in parallel. A command is received, iSCSI processing
> is done and it is sent to bs processing (which seems to be idle until
> that point in time). Now, nothing happens until bs processing is
> complete. I will try to look more into the code and try to understand
> it, but can someone try to elaborate on this?
You need to give it some parallel work to do, if that's what you
intend, like Tomo hints. Of course, your logs will look much
messier!
-- Pete
commit d9b9a70ee808f63f05c81e368fe9125dafdc8254
Author: Pete Wyckoff <pw at osc.edu>
Date: Tue Oct 16 15:21:41 2007 -0400
bs-rdwr-sync
New file bs_rdwr_sync.c is similar in spirit to bs_rdwr.c but without all
the threading. For simpler debugging of core iscsi.
diff --git a/usr/Makefile b/usr/Makefile
index a16416b..7ee6596 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -17,7 +17,7 @@ CFLAGS += -DISCSI
TGTD_OBJS += $(addprefix iscsi/, conn.o param.o session.o \
iscsid.o target.o chap.o transport.o iscsi_tcp.o \
isns.o libcrc32c.o)
-TGTD_OBJS += bs_rdwr.o bs_aio.o
+TGTD_OBJS += bs_rdwr.o bs_rdwr_sync.o bs_aio.o
LIBS += -lcrypto
ifneq ($(ISCSI_RDMA),)
diff --git a/usr/bs_rdwr_sync.c b/usr/bs_rdwr_sync.c
new file mode 100644
index 0000000..af6a76f
--- /dev/null
+++ b/usr/bs_rdwr_sync.c
@@ -0,0 +1,118 @@
+/*
+ * Synchronous I/O file backing store routine, based on an old bs_rdwr.
+ *
+ * Copyright (C) 2006-2007 FUJITA Tomonori <tomof at acm.org>
+ * Copyright (C) 2006-2007 Mike Christie <michaelc at cs.wisc.edu>
+ * Copyright (C) 2006-2007 Pete Wyckoff <pw at osc.edu>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+#define _XOPEN_SOURCE 500
+
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+
+#include <linux/fs.h>
+#include <sys/epoll.h>
+
+#include "list.h"
+#include "util.h"
+#include "tgtd.h"
+#include "scsi.h"
+
+static int bs_rdwr_sync_open(struct scsi_lu *lu, char *path, int *fd,
+ uint64_t *size)
+{
+ *fd = backed_file_open(path, O_RDWR | O_LARGEFILE, size);
+ if (*fd < 0)
+ return *fd;
+ return 0;
+}
+
+static void bs_rdwr_sync_close(struct scsi_lu *lu)
+{
+ close(lu->fd);
+}
+
+static int bs_rdwr_sync_cmd_submit(struct scsi_cmd *cmd)
+{
+ struct scsi_lu *lu = cmd->dev;
+ int ret = 0, fd = lu->fd;
+ uint32_t length = 0;
+
+ dprintf("rw %d len %u off %llu cdb %02x\n", cmd->rw, cmd->len,
+ (unsigned long long) cmd->offset, cmd->scb[0]);
+
+ switch (cmd->scb[0]) {
+ case SYNCHRONIZE_CACHE:
+ case SYNCHRONIZE_CACHE_16:
+ ret = fsync(fd);
+ break;
+ case WRITE_6:
+ case WRITE_10:
+ case WRITE_12:
+ case WRITE_16:
+ length = scsi_get_out_length(cmd);
+ ret = pwrite64(fd, scsi_get_out_buffer(cmd), length,
+ cmd->offset);
+ break;
+ case READ_6:
+ case READ_10:
+ case READ_12:
+ case READ_16:
+ length = scsi_get_in_length(cmd);
+ ret = pread64(fd, scsi_get_in_buffer(cmd), length, cmd->offset);
+ break;
+ default:
+ break;
+ }
+
+ if (ret == length) {
+ scsi_set_result(cmd, SAM_STAT_GOOD);
+ } else {
+ eprintf("io error %p %x %d %d %" PRIu64 ", %m\n",
+ cmd, cmd->scb[0], ret, length, cmd->offset);
+ scsi_set_result(cmd, SAM_STAT_CHECK_CONDITION);
+ sense_data_build(cmd, MEDIUM_ERROR, ASC_READ_ERROR);
+ }
+
+ return 0;
+}
+
+static int bs_rdwr_sync_cmd_done(struct scsi_cmd *cmd)
+{
+ return 0;
+}
+
+static struct backingstore_template rdwr_sync_bst = {
+ .bs_name = "rdwr_sync",
+ .bs_datasize = 0,
+ .bs_open = bs_rdwr_sync_open,
+ .bs_close = bs_rdwr_sync_close,
+ .bs_cmd_submit = bs_rdwr_sync_cmd_submit,
+ .bs_cmd_done = bs_rdwr_sync_cmd_done,
+};
+
+__attribute__((constructor)) static void bs_rdwr_sync_constructor(void)
+{
+ register_backingstore_template(&rdwr_sync_bst);
+}
diff --git a/usr/iscsi/iscsid.c b/usr/iscsi/iscsid.c
index 16cd400..f8ce4e5 100644
--- a/usr/iscsi/iscsid.c
+++ b/usr/iscsi/iscsid.c
@@ -2180,7 +2180,7 @@ static struct tgt_driver iscsi = {
.show = iscsi_target_show,
.cmd_end_notify = iscsi_scsi_cmd_done,
.mgmt_end_notify = iscsi_tm_done,
- .default_bst = "rdwr",
+ .default_bst = "rdwr_sync",
};
__attribute__((constructor)) static void iscsi_driver_constructor(void)
More information about the stgt
mailing list