[Stgt-devel] iSER patches, second release

Sat Sep 8 13:49:54 CEST 2007

> There is some mismatch in how iscsi
> wants to read bytes when it needs them:  48 first, then look for AHS
> and possible read those, then maybe some immediate data.  Three
> separate read(fd, ...) for one control PDU.  On the other hand, RDMA
> gets all the data in one go.  So we compromise by feeding the bytes
> into iscsi via memcpy for each of its small do_recv operations.
>
> It would be possible to have a separate RX state machine just for
> RDMA, but we didn't want to diverge the code so much.  I think this
> way works, although Robin points out that there are bugs somewhere.

I believe that it is more efficient and flexible to separate the flows between
the transports on a slightly higher level than the basic operations like
polling, reading bytes from stream, malloc etc.

I suggest doing the separation by encapsulating some coarser grained ops,
that have application-level meaning. In case of iser/tcp iscsi, these
should be,
in principle: allocate/free iscsi and/or transport task; receive PDU; send
PDU etc.

For example, right now there are .ep_read per-transport function pointers.
They "know" nothing about PDUs, only about streams and bytes,
and as such are problematic for iser which is not stream based.

If there were per-transport .rx_pdu method, there would be no need to use
checks like "if(rdma)". TCP might use the code it uses today within its own
implementation, reading PDU by chunks, while iser might push the entire
pdu and tweak the RX state machine forcing it into a correct final state.

But here we run into another problem.
iscsi/tcp "pulls" PDUs from stream byte-by-byte, while iscsi/iser "pushes"
PDUs when they are received. So tcp would prefer a method that reads,
while iser would ask for a callback that notifies.

A common denominator can be found for these, but you are right,
encapsulating "the big thing" now will lead to virtually separate
state machines.
Thus, to minimize changes in the existing code, i'd suggest doing
something in the middle:
defining a per-transport method, which corresponds more or less to
what the function do_recv() does today.

That is, the RX state machine code still asks to read the PDUs by chunks,
but these chunks are requested by type, not merely by size.
This way, tcp will read bytes from the stream, while iser will only
manipulate pointers, setting them to the correct locations within the
receive buffer. We retain the RX state-machine code almost as is.
The only changes necessary there are related to the pointers like
conn->rx_buffer, which  probably  should be moved into the tcp
realization of .do_read. Perhaps, bhs may be also referenced by pointer.
But this is it, more or less.

Another example of making a coarse-grain separation is allocation & release
of iscsi tasks. I wanted to write a separate mail about it but this fits the
current discussion.

Function conn->tp->ep_free() gets called only in iscsi_free_task(), separately
for the data and the descriptor. Because the data buffer has no identification
attached, it must be looked for through the list of the buffers, for iser.

If we make task_free() a transport-specific pointer, instead, no
overhead is incurred.
iscsi will use free(), while iser will return the buffer to the list
without searching.
Of course task_alloc() should also be turned into a per-transport method,
after it is called some basic initializations are done on the common level,
in iscsid.c. It looks that if done this way, there is no need for
ep_free/ep_maloc
anymore.

> This would all break if we went multithreaded in the iscsi
> core itself.  I'm not sure anyone needs multithreading at that
> level---we already do support it at the backing store level, which
> is where it probably makes most sense.

First, one could spread different connections' handling between CPUs,
which may make some sense if multiple initiators are working with the
target. And, second, an unsafe code can sometimes break in ways
quite unexpected at the time it was written, during the system's evolution.

> Yes, up to TRDSL will be fully copied.  That's 8 kB now in linux
> iser, but you could imagine wanting it bigger for better small
> transfer performance.  If you enable unsolicited data, up to that
> amount will be copied.

So it does pay to eliminate the unsolicited data copy.

> > Third, reposting is postponed until iscsi handles the rx event.
>
> Not so bothered about that.  We post receives to accommodate max
> outstanding requests on the client.  It's hardcoded now, but should
> be fixed (in the initiator) to be negotiated.  The receive slot
> will not be reused until the initiator sees the response.

Agree.
This is actually more related to the interplay between the
mechanism of resource allocation and the mechanism of generation
of MaxCmdSN sent to the initiator, as this is the real control of max
outstanding cmds.

> See how iscsi_scsi_cmd_rx_start() allocates the task
> and ahs and data all in one go.  This can be modified, but I want to
> avoid lots of "if (rdma)" cases.  And remember that TCP doesn't have
> message boundaries, so things need to work when you get the 48-byte
> BHS first then go back and read the rest, perhaps in multiple
> 1500-byte read statements across separate poll() iterations.

It seems that what i have suggested about turning do_read() into a
per-transport method solves this problem, doesn't it?

> If you get interested in writing a slim queueing layer between TCP
> or RDMA receive processing, and iscsi rx processing, I'll certainly
> help and try to get it integrated.

I do want to contribute, both with the PDU queuing and with implementing
my suggestion above, if we agree on it or on another alternative.

Alexander