[Stgt-devel] iSER patches, second release

Thu Sep 6 22:10:28 CEST 2007

nezhinsky at gmail.com wrote on Thu, 06 Sep 2007 15:22 +0300:
> I have a few questions regarding the code. I'll post them in different mails.

Thanks for looking this over.

> In handle_wc() in case of IBV_WC_RECV, iscsi_rx_handler() gets called and
> the received pdu is passed as an implicit parameter, through
> ci->rcv_comm_event.
> Only then it gets reposted.
> 
> First, the parameter passing method somewhat bothers me because it seems
> that on a SMP system two RECV completions may arrive through different
> CPUs and then we end up with garbage in ci->rcv_comm_event.
> Do I miss anything on this? Are all completions handled by the same CPU?
> Is there an implicit guard somewhere?

Yes, it bothers me a bit too.  There is some mismatch in how iscsi
wants to read bytes when it needs them:  48 first, then look for AHS
and possible read those, then maybe some immediate data.  Three
separate read(fd, ...) for one control PDU.  On the other hand, RDMA
gets all the data in one go.  So we compromise by feeding the bytes
into iscsi via memcpy for each of its small do_recv operations.

It would be possible to have a separate RX state machine just for
RDMA, but we didn't want to diverge the code so much.  I think this
way works, although Robin points out that there are bugs somewhere.

The iscsi code is not multi-threaded.  So only one handle_wc() will
ever be in progress.  And iscsi_rx_handler() always consumes the
bhs, ahs, and data completely.  See the "goto again" in there.  We don't
have to worry about partially consumed control PDU for different
tasks.  This would all break if we went multithreaded in the iscsi
core itself.  I'm not sure anyone needs multithreading at that
level---we already do support it at the backing store level, which
is where it probably makes most sense.

> Second, it looks like iscsi_rx_handler() will eventually copy the pdu data.
> Even if set to a small value, if the traffic consists only of such small writes
> it will be copied entirely.

Yes, up to TRDSL will be fully copied.  That's 8 kB now in linux
iser, but you could imagine wanting it bigger for better small
transfer performance.  If you enable unsolicited data, up to that
amount will be copied.  Ick.

> Third, reposting is postponed until iscsi handles the rx event.

Not so bothered about that.  We post receives to accommodate max
outstanding requests on the client.  It's hardcoded now, but should
be fixed (in the initiator) to be negotiated.  The receive slot
will not be reused until the initiator sees the response.

> I'd suggest:
> a. to add the rx descriptor to a list and have  iscsi_rx_handler() to feed upon
> this list which may be guarded if necessary
> b. to repost a different rx buffer - so it gets reposted as soon as possible -
> and meanwhile...
> c. not to copy the current one - but instead use it by reference all
> the way down
> the target, just as you do with the rdma buffers.
> 
> This may increase the number of resources to allocate but this may improve
> the performance, at least in some cases.

You may end up paying more overhead for the management, though.
Agreed saving the 8 kB memcpy (or maybe larger) may be a good idea,
but there is some complexity with keeping the data in the original
recvl buffer.  See how iscsi_scsi_cmd_rx_start() allocates the task
and ahs and data all in one go.  This can be modified, but I want to
avoid lots of "if (rdma)" cases.  And remember that TCP doesn't have
message boundaries, so things need to work when you get the 48-byte
BHS first then go back and read the rest, perhaps in multiple
1500-byte read statements across separate poll() iterations.

If you get interested in writing a slim queueing layer between TCP
or RDMA receive processing, and iscsi rx processing, I'll certainly
help and try to get it integrated.

		-- Pete