[Stgt-devel] Tuning iSER for performance
Erez Zilber
erezz
Wed Mar 5 15:08:10 CET 2008
Pete Wyckoff wrote:
> erezz at Voltaire.COM wrote on Sun, 02 Mar 2008 15:17 +0200:
>
>> Here's an example of a single command execution:
>>
>> 27892 tgtd: [16:40:38.907547] iser_rx_progress(1101) entry
>> 27893 tgtd: [16:40:38.907558] handle_wc(897) incoming cmd, len 76
>> 27894 tgtd: [16:40:38.907568] iscsi_iser_read(1301) buf 0x529fc8 nbytes 48
>> 27895 tgtd: [16:40:38.907578] iser_parse_hdr(1265) control type PDU
>> 27896 tgtd: [16:40:38.907587] iser_parse_hdr(1271) rstag 9b320086 va
>> 3f1fc000
>> 27897 tgtd: [16:40:38.907597] iscsi_scsi_cmd_rx_start(1466) 1 28 0 0
>> 524288 1 b
>> 27898 tgtd: [16:40:38.907607] iscsi_rdma_alloc_data_buf(1659) malloc
>> 0x2acf61554000 sz 524288
>> 27899 tgtd: [16:40:38.907617] iscsi_task_queue(1411) 388b 388b 1
>> 27900 tgtd: [16:40:38.907627] target_cmd_queue(763) 0x535de8 28 1
>> 27901 tgtd: [16:40:38.907637] target_cmd_queue(783) 0x535de8 28 1 1
>>
>> this is where bs work starts:
>> 27902 tgtd: [16:40:38.907649] tgtd: [16:40:38.907650]
>> bs_thread_worker_fn(151) got 0x535de8
>> 27903 target_cmd_queue(794) b 28 (nil) 0x2acf61554000 262144 0 524288 0 4
>> 27904 tgtd: [16:40:38.907676] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27905 tgtd: [16:40:38.907686] handle_wc(922) incoming cmd proc done, repost
>> 27906 tgtd: [16:40:38.907697] iser_tx_progress(1017) entry
>> 27907 tgtd: [16:40:38.907706] iser_tx_progress(1028) trying tx
>> 27908 tgtd: [16:40:38.907716] iscsi_task_tx_start(1784) no more data
>> 27909 tgtd: [16:40:38.907725] iscsi_rdma_event_modify(1633) tx ready
>> removing 0x529eb0
>>
>> this is where bs work ends (130 us):
>> 27910 tgtd: [16:40:38.907779] bs_rdwr_request(114) io done 0x535de8 28
>> 524288 524288
>> 27911 tgtd: [16:40:38.907796] bs_thread_ack_fn(82) found 0x535de8
>> 27912 tgtd: [16:40:38.907813] bs_thread_request_done(122) back to tgtd,
>> 0x535de8
>> 27913 tgtd: [16:40:38.907825] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27914 tgtd: [16:40:38.907837] iser_tx_progress(1017) entry
>> 27915 tgtd: [16:40:38.907846] iser_tx_progress(1028) trying tx
>> 27916 tgtd: [16:40:38.907856] iscsi_task_tx_start(1759) found a task b
>> 524288 0 0
>>
>
>
>> 27917 tgtd: [16:40:38.907865] iscsi_data_rsp_build(974) 524288 524288
>> 524288 524288 b
>> 27918 tgtd: [16:40:38.907875] iscsi_rdma_rdma_write(1491) offset 0 len
>> 524288 stag 9b320086 va 3f1fc000
>> 27919 tgtd: [16:40:38.907885] iscsi_rdma_event_modify(1633) tx ready
>> removing 0x529eb0
>> 27920 tgtd: [16:40:38.907895] iscsi_scsi_cmd_tx_done(1707) more data or
>> sense or bidir b
>> 27921 tgtd: [16:40:38.908269] iser_rx_progress(1101) entry
>> 27922 tgtd: [16:40:38.908280] handle_wc(932) RDMA write done
>> 27923 tgtd: [16:40:38.908289] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27924 tgtd: [16:40:38.908298] iser_tx_progress(1017) entry
>> 27925 tgtd: [16:40:38.908308] iser_tx_progress(1028) trying tx
>> 27926 tgtd: [16:40:38.908317] iscsi_task_tx_start(1759) found a task b
>> 524288 524288 0
>> 27927 tgtd: [16:40:38.908327] iscsi_cmd_rsp_build(901) 0x535d28 28
>>
>
> This is rather slow in here. It does the RDMA write, gets the local
> completion that the write went out, then sends the command response
> packet. Does it make sense that it takes 374 us for the adapter to
> say that it finished the write?
>
Where did you get the 374 us from?
> Another comment: there is 10 us between consecutive printfs in this
> trace. You're probably not using TSC for GTOD. It's not a big
> deal, but be aware that your "with debugging" runs may be
> significantly slower than without.
That's for sure. However, I tried to run without debugging, and
host_busy was <= 1.
> Just watch to make sure that the
> profiling output makes sense. We hack the kernel in a bad way to
> get around this on our machines.
> http://git.osc.edu/?p=linux.git;a=commitdiff;h=e7ff54222523a2ae6816c9ff17ad5a8fa86cd5e1
>
What does this patch do? I'm not familiar with that.
>
>> 27928 tgtd: [16:40:38.908336] iscsi_iser_write_begin(1352) new sendl
>> 0x2aaab45f7090 len 48
>> 27929 tgtd: [16:40:38.908347] iscsi_iser_write_end(1387) sendl
>> 0x2aaab45f7090 len 76
>> 27930 tgtd: [16:40:38.908356] iscsi_iser_write_end(1402) inc progress to
>> finish cmd
>> 27931 tgtd: [16:40:38.908366] __cmd_done(855) 8 (nil) 0x2acf61554000 0
>> 524288 0
>> 27932 tgtd: [16:40:38.908376] iscsi_rdma_free_data_buf(1678) free
>> 0x2acf61554000
>>
>> It seems that bs work takes a lot of time. For example: if I run
>> disktest on the same device directly from the target machine (i.e. tgt
>> is not involved) in the following way:
>>
>> disktest -PT -T100 -h1 -K64 -B256k -ID /dev/sdc -r
>>
>> I get ~13000 iops -> each IO takes ~77 us (compared to 130 us only for
>> bs work with tgt). I'm not familiar with the bs code in tgt, so I'm not
>> sure what should be done here.
>>
>
> It would be interesting to isolate the IO effects from the thread
> switching to see where it all goes. I recall a lot of it goes to
> non-IO overhead. A really-synchronous of bs_rdwr follows. It may
> not build against the current tree; it's old.
>
According to this patch, bs_rdwr_request will be called from the context
of iser_rx_progress (sounds good). Why doesn't the current stgt code do
that? Why do we need the extra context switch?
>
>> Another thing that looks strange - from looking at the logs, I don't see
>> that work is done in parallel. A command is received, iSCSI processing
>> is done and it is sent to bs processing (which seems to be idle until
>> that point in time). Now, nothing happens until bs processing is
>> complete. I will try to look more into the code and try to understand
>> it, but can someone try to elaborate on this?
>>
>
> You need to give it some parallel work to do, if that's what you
> intend, like Tomo hints. Of course, your logs will look much
> messier!
>
> -- Pete
>
>
> commit d9b9a70ee808f63f05c81e368fe9125dafdc8254
> Author: Pete Wyckoff <pw at osc.edu>
> Date: Tue Oct 16 15:21:41 2007 -0400
>
> bs-rdwr-sync
>
> New file bs_rdwr_sync.c is similar in spirit to bs_rdwr.c but without all
> the threading. For simpler debugging of core iscsi.
>
> diff --git a/usr/Makefile b/usr/Makefile
> index a16416b..7ee6596 100644
> --- a/usr/Makefile
> +++ b/usr/Makefile
> @@ -17,7 +17,7 @@ CFLAGS += -DISCSI
> TGTD_OBJS += $(addprefix iscsi/, conn.o param.o session.o \
> iscsid.o target.o chap.o transport.o iscsi_tcp.o \
> isns.o libcrc32c.o)
> -TGTD_OBJS += bs_rdwr.o bs_aio.o
> +TGTD_OBJS += bs_rdwr.o bs_rdwr_sync.o bs_aio.o
>
> LIBS += -lcrypto
> ifneq ($(ISCSI_RDMA),)
> diff --git a/usr/bs_rdwr_sync.c b/usr/bs_rdwr_sync.c
> new file mode 100644
> index 0000000..af6a76f
> --- /dev/null
> +++ b/usr/bs_rdwr_sync.c
> @@ -0,0 +1,118 @@
> +/*
> + * Synchronous I/O file backing store routine, based on an old bs_rdwr.
> + *
> + * Copyright (C) 2006-2007 FUJITA Tomonori <tomof at acm.org>
> + * Copyright (C) 2006-2007 Mike Christie <michaelc at cs.wisc.edu>
> + * Copyright (C) 2006-2007 Pete Wyckoff <pw at osc.edu>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation, version 2 of the
> + * License.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
> + * 02110-1301 USA
> + */
> +#define _XOPEN_SOURCE 500
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +
> +#include <linux/fs.h>
> +#include <sys/epoll.h>
> +
> +#include "list.h"
> +#include "util.h"
> +#include "tgtd.h"
> +#include "scsi.h"
> +
> +static int bs_rdwr_sync_open(struct scsi_lu *lu, char *path, int *fd,
> + uint64_t *size)
> +{
> + *fd = backed_file_open(path, O_RDWR | O_LARGEFILE, size);
> + if (*fd < 0)
> + return *fd;
> + return 0;
> +}
> +
> +static void bs_rdwr_sync_close(struct scsi_lu *lu)
> +{
> + close(lu->fd);
> +}
> +
> +static int bs_rdwr_sync_cmd_submit(struct scsi_cmd *cmd)
> +{
>
So, this function should replace bs_rdwr_request, correct? If yes, then
it's probably too old. Should I just replace the code in this function
with the current code from bs_rdwr_request?
> + struct scsi_lu *lu = cmd->dev;
> + int ret = 0, fd = lu->fd;
> + uint32_t length = 0;
> +
> + dprintf("rw %d len %u off %llu cdb %02x\n", cmd->rw, cmd->len,
> + (unsigned long long) cmd->offset, cmd->scb[0]);
> +
> + switch (cmd->scb[0]) {
> + case SYNCHRONIZE_CACHE:
> + case SYNCHRONIZE_CACHE_16:
> + ret = fsync(fd);
> + break;
> + case WRITE_6:
> + case WRITE_10:
> + case WRITE_12:
> + case WRITE_16:
> + length = scsi_get_out_length(cmd);
> + ret = pwrite64(fd, scsi_get_out_buffer(cmd), length,
> + cmd->offset);
> + break;
> + case READ_6:
> + case READ_10:
> + case READ_12:
> + case READ_16:
> + length = scsi_get_in_length(cmd);
> + ret = pread64(fd, scsi_get_in_buffer(cmd), length, cmd->offset);
> + break;
> + default:
> + break;
> + }
> +
> + if (ret == length) {
> + scsi_set_result(cmd, SAM_STAT_GOOD);
> + } else {
> + eprintf("io error %p %x %d %d %" PRIu64 ", %m\n",
> + cmd, cmd->scb[0], ret, length, cmd->offset);
> + scsi_set_result(cmd, SAM_STAT_CHECK_CONDITION);
> + sense_data_build(cmd, MEDIUM_ERROR, ASC_READ_ERROR);
> + }
> +
> + return 0;
> +}
> +
> +static int bs_rdwr_sync_cmd_done(struct scsi_cmd *cmd)
> +{
> + return 0;
> +}
> +
> +static struct backingstore_template rdwr_sync_bst = {
> + .bs_name = "rdwr_sync",
> + .bs_datasize = 0,
> + .bs_open = bs_rdwr_sync_open,
> + .bs_close = bs_rdwr_sync_close,
> + .bs_cmd_submit = bs_rdwr_sync_cmd_submit,
> + .bs_cmd_done = bs_rdwr_sync_cmd_done,
> +};
> +
> +__attribute__((constructor)) static void bs_rdwr_sync_constructor(void)
> +{
> + register_backingstore_template(&rdwr_sync_bst);
> +}
> diff --git a/usr/iscsi/iscsid.c b/usr/iscsi/iscsid.c
> index 16cd400..f8ce4e5 100644
> --- a/usr/iscsi/iscsid.c
> +++ b/usr/iscsi/iscsid.c
> @@ -2180,7 +2180,7 @@ static struct tgt_driver iscsi = {
> .show = iscsi_target_show,
> .cmd_end_notify = iscsi_scsi_cmd_done,
> .mgmt_end_notify = iscsi_tm_done,
> - .default_bst = "rdwr",
> + .default_bst = "rdwr_sync",
> };
>
> __attribute__((constructor)) static void iscsi_driver_constructor(void)
>
>
>
More information about the stgt
mailing list