[Stgt-devel] Tuning iSER for performance

Erez Zilber erezz
Wed Mar 5 15:08:10 CET 2008


Pete Wyckoff wrote:
> erezz at Voltaire.COM wrote on Sun, 02 Mar 2008 15:17 +0200:
>   
>> Here's an example of a single command execution:
>>
>> 27892 tgtd: [16:40:38.907547] iser_rx_progress(1101) entry
>> 27893 tgtd: [16:40:38.907558] handle_wc(897) incoming cmd, len 76
>> 27894 tgtd: [16:40:38.907568] iscsi_iser_read(1301) buf 0x529fc8 nbytes 48
>> 27895 tgtd: [16:40:38.907578] iser_parse_hdr(1265) control type PDU
>> 27896 tgtd: [16:40:38.907587] iser_parse_hdr(1271) rstag 9b320086 va
>> 3f1fc000
>> 27897 tgtd: [16:40:38.907597] iscsi_scsi_cmd_rx_start(1466) 1 28 0 0
>> 524288 1 b
>> 27898 tgtd: [16:40:38.907607] iscsi_rdma_alloc_data_buf(1659) malloc
>> 0x2acf61554000 sz 524288
>> 27899 tgtd: [16:40:38.907617] iscsi_task_queue(1411) 388b 388b 1
>> 27900 tgtd: [16:40:38.907627] target_cmd_queue(763) 0x535de8 28 1
>> 27901 tgtd: [16:40:38.907637] target_cmd_queue(783) 0x535de8 28 1 1
>>
>> this is where bs work starts:
>> 27902 tgtd: [16:40:38.907649] tgtd: [16:40:38.907650]
>> bs_thread_worker_fn(151) got 0x535de8
>> 27903 target_cmd_queue(794) b 28 (nil) 0x2acf61554000 262144 0 524288 0 4
>> 27904 tgtd: [16:40:38.907676] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27905 tgtd: [16:40:38.907686] handle_wc(922) incoming cmd proc done, repost
>> 27906 tgtd: [16:40:38.907697] iser_tx_progress(1017) entry
>> 27907 tgtd: [16:40:38.907706] iser_tx_progress(1028) trying tx
>> 27908 tgtd: [16:40:38.907716] iscsi_task_tx_start(1784) no more data
>> 27909 tgtd: [16:40:38.907725] iscsi_rdma_event_modify(1633) tx ready
>> removing 0x529eb0
>>
>> this is where bs work ends (130 us):
>> 27910 tgtd: [16:40:38.907779] bs_rdwr_request(114) io done 0x535de8 28
>> 524288 524288
>> 27911 tgtd: [16:40:38.907796] bs_thread_ack_fn(82) found 0x535de8
>> 27912 tgtd: [16:40:38.907813] bs_thread_request_done(122) back to tgtd,
>> 0x535de8
>> 27913 tgtd: [16:40:38.907825] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27914 tgtd: [16:40:38.907837] iser_tx_progress(1017) entry
>> 27915 tgtd: [16:40:38.907846] iser_tx_progress(1028) trying tx
>> 27916 tgtd: [16:40:38.907856] iscsi_task_tx_start(1759) found a task b
>> 524288 0 0
>>     
>
>   
>> 27917 tgtd: [16:40:38.907865] iscsi_data_rsp_build(974) 524288 524288
>> 524288 524288 b
>> 27918 tgtd: [16:40:38.907875] iscsi_rdma_rdma_write(1491) offset 0 len
>> 524288 stag 9b320086 va 3f1fc000
>> 27919 tgtd: [16:40:38.907885] iscsi_rdma_event_modify(1633) tx ready
>> removing 0x529eb0
>> 27920 tgtd: [16:40:38.907895] iscsi_scsi_cmd_tx_done(1707) more data or
>> sense or bidir b
>> 27921 tgtd: [16:40:38.908269] iser_rx_progress(1101) entry
>> 27922 tgtd: [16:40:38.908280] handle_wc(932) RDMA write done
>> 27923 tgtd: [16:40:38.908289] iscsi_rdma_event_modify(1628) tx ready
>> adding 0x529eb0
>> 27924 tgtd: [16:40:38.908298] iser_tx_progress(1017) entry
>> 27925 tgtd: [16:40:38.908308] iser_tx_progress(1028) trying tx
>> 27926 tgtd: [16:40:38.908317] iscsi_task_tx_start(1759) found a task b
>> 524288 524288 0
>> 27927 tgtd: [16:40:38.908327] iscsi_cmd_rsp_build(901) 0x535d28 28
>>     
>
> This is rather slow in here.  It does the RDMA write, gets the local
> completion that the write went out, then sends the command response
> packet.  Does it make sense that it takes 374 us for the adapter to
> say that it finished the write?
>   

Where did you get the 374 us from?

> Another comment:  there is 10 us between consecutive printfs in this
> trace.  You're probably not using TSC for GTOD.  It's not a big
> deal, but be aware that your "with debugging" runs may be
> significantly slower than without.

That's for sure. However, I tried to run without debugging, and
host_busy was <= 1.

>   Just watch to make sure that the
> profiling output makes sense.  We hack the kernel in a bad way to
> get around this on our machines.
> http://git.osc.edu/?p=linux.git;a=commitdiff;h=e7ff54222523a2ae6816c9ff17ad5a8fa86cd5e1
>   

What does this patch do? I'm not familiar with that.

>   
>> 27928 tgtd: [16:40:38.908336] iscsi_iser_write_begin(1352) new sendl
>> 0x2aaab45f7090 len 48
>> 27929 tgtd: [16:40:38.908347] iscsi_iser_write_end(1387) sendl
>> 0x2aaab45f7090 len 76
>> 27930 tgtd: [16:40:38.908356] iscsi_iser_write_end(1402) inc progress to
>> finish cmd
>> 27931 tgtd: [16:40:38.908366] __cmd_done(855) 8 (nil) 0x2acf61554000 0
>> 524288 0
>> 27932 tgtd: [16:40:38.908376] iscsi_rdma_free_data_buf(1678) free
>> 0x2acf61554000
>>
>> It seems that bs work takes a lot of time. For example: if I run
>> disktest on the same device directly from the target machine (i.e. tgt
>> is not involved) in the following way:
>>
>> disktest -PT -T100 -h1 -K64 -B256k -ID /dev/sdc -r
>>
>> I get ~13000 iops -> each IO takes ~77 us (compared to 130 us only for
>> bs work with tgt). I'm not familiar with the bs code in tgt, so I'm not
>> sure what should be done here.
>>     
>
> It would be interesting to isolate the IO effects from the thread
> switching to see where it all goes.  I recall a lot of it goes to
> non-IO overhead.  A really-synchronous of bs_rdwr follows.  It may
> not build against the current tree; it's old.
>   

According to this patch, bs_rdwr_request will be called from the context
of iser_rx_progress (sounds good). Why doesn't the current stgt code do
that? Why do we need the extra context switch?

>   
>> Another thing that looks strange - from looking at the logs, I don't see
>> that work is done in parallel. A command is received, iSCSI processing
>> is done and it is sent to bs processing (which seems to be idle until
>> that point in time). Now, nothing happens until bs processing is
>> complete. I will try to look more into the code and try to understand
>> it, but can someone try to elaborate on this?
>>     
>
> You need to give it some parallel work to do, if that's what you
> intend, like Tomo hints.  Of course, your logs will look much
> messier!
>
> 		-- Pete
>
>
> commit d9b9a70ee808f63f05c81e368fe9125dafdc8254
> Author: Pete Wyckoff <pw at osc.edu>
> Date:   Tue Oct 16 15:21:41 2007 -0400
>
>     bs-rdwr-sync
>     
>     New file bs_rdwr_sync.c is similar in spirit to bs_rdwr.c but without all
>     the threading.  For simpler debugging of core iscsi.
>
> diff --git a/usr/Makefile b/usr/Makefile
> index a16416b..7ee6596 100644
> --- a/usr/Makefile
> +++ b/usr/Makefile
> @@ -17,7 +17,7 @@ CFLAGS += -DISCSI
>  TGTD_OBJS += $(addprefix iscsi/, conn.o param.o session.o \
>  		iscsid.o target.o chap.o transport.o iscsi_tcp.o \
>  		isns.o libcrc32c.o)
> -TGTD_OBJS += bs_rdwr.o bs_aio.o
> +TGTD_OBJS += bs_rdwr.o bs_rdwr_sync.o bs_aio.o
>  
>  LIBS += -lcrypto
>  ifneq ($(ISCSI_RDMA),)
> diff --git a/usr/bs_rdwr_sync.c b/usr/bs_rdwr_sync.c
> new file mode 100644
> index 0000000..af6a76f
> --- /dev/null
> +++ b/usr/bs_rdwr_sync.c
> @@ -0,0 +1,118 @@
> +/*
> + * Synchronous I/O file backing store routine, based on an old bs_rdwr.
> + *
> + * Copyright (C) 2006-2007 FUJITA Tomonori <tomof at acm.org>
> + * Copyright (C) 2006-2007 Mike Christie <michaelc at cs.wisc.edu>
> + * Copyright (C) 2006-2007 Pete Wyckoff <pw at osc.edu>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation, version 2 of the
> + * License.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
> + * 02110-1301 USA
> + */
> +#define _XOPEN_SOURCE 500
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +
> +#include <linux/fs.h>
> +#include <sys/epoll.h>
> +
> +#include "list.h"
> +#include "util.h"
> +#include "tgtd.h"
> +#include "scsi.h"
> +
> +static int bs_rdwr_sync_open(struct scsi_lu *lu, char *path, int *fd,
> +			     uint64_t *size)
> +{
> +	*fd = backed_file_open(path, O_RDWR | O_LARGEFILE, size);
> +	if (*fd < 0)
> +		return *fd;
> +	return 0;
> +}
> +
> +static void bs_rdwr_sync_close(struct scsi_lu *lu)
> +{
> +	close(lu->fd);
> +}
> +
> +static int bs_rdwr_sync_cmd_submit(struct scsi_cmd *cmd)
> +{
>   

So, this function should replace bs_rdwr_request, correct? If yes, then
it's probably too old. Should I just replace the code in this function
with the current code from bs_rdwr_request?

> +	struct scsi_lu *lu = cmd->dev;
> +	int ret = 0, fd = lu->fd;
> +	uint32_t length = 0;
> +
> +	dprintf("rw %d len %u off %llu cdb %02x\n", cmd->rw, cmd->len,
> +		(unsigned long long) cmd->offset, cmd->scb[0]);
> +
> +	switch (cmd->scb[0]) {
> +	case SYNCHRONIZE_CACHE:
> +	case SYNCHRONIZE_CACHE_16:
> +		ret = fsync(fd);
> +		break;
> +	case WRITE_6:
> +	case WRITE_10:
> +	case WRITE_12:
> +	case WRITE_16:
> +		length = scsi_get_out_length(cmd);
> +		ret = pwrite64(fd, scsi_get_out_buffer(cmd), length,
> +			       cmd->offset);
> +		break;
> +	case READ_6:
> +	case READ_10:
> +	case READ_12:
> +	case READ_16:
> +		length = scsi_get_in_length(cmd);
> +		ret = pread64(fd, scsi_get_in_buffer(cmd), length, cmd->offset);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	if (ret == length) {
> +		scsi_set_result(cmd, SAM_STAT_GOOD);
> +	} else {
> +		eprintf("io error %p %x %d %d %" PRIu64 ", %m\n",
> +			cmd, cmd->scb[0], ret, length, cmd->offset);
> +		scsi_set_result(cmd, SAM_STAT_CHECK_CONDITION);
> +		sense_data_build(cmd, MEDIUM_ERROR, ASC_READ_ERROR);
> +	}
> +
> +	return 0;
> +}
> +
> +static int bs_rdwr_sync_cmd_done(struct scsi_cmd *cmd)
> +{
> +	return 0;
> +}
> +
> +static struct backingstore_template rdwr_sync_bst = {
> +	.bs_name		= "rdwr_sync",
> +	.bs_datasize		= 0,
> +	.bs_open		= bs_rdwr_sync_open,
> +	.bs_close		= bs_rdwr_sync_close,
> +	.bs_cmd_submit		= bs_rdwr_sync_cmd_submit,
> +	.bs_cmd_done		= bs_rdwr_sync_cmd_done,
> +};
> +
> +__attribute__((constructor)) static void bs_rdwr_sync_constructor(void)
> +{
> +	register_backingstore_template(&rdwr_sync_bst);
> +}
> diff --git a/usr/iscsi/iscsid.c b/usr/iscsi/iscsid.c
> index 16cd400..f8ce4e5 100644
> --- a/usr/iscsi/iscsid.c
> +++ b/usr/iscsi/iscsid.c
> @@ -2180,7 +2180,7 @@ static struct tgt_driver iscsi = {
>  	.show			= iscsi_target_show,
>  	.cmd_end_notify		= iscsi_scsi_cmd_done,
>  	.mgmt_end_notify	= iscsi_tm_done,
> -	.default_bst		= "rdwr",
> +	.default_bst		= "rdwr_sync",
>  };
>  
>  __attribute__((constructor)) static void iscsi_driver_constructor(void)
>
>
>   



More information about the stgt mailing list