[stgt] [PATCH RFC 1/3] - new iser code (after pthreads revert)

Tue Aug 3 13:20:54 CEST 2010

Pete Wyckoff wrote:
> Lots of iser.h is original code from OSC, please put back copyright
> headers on that file too.  Thanks for keeping us on all the other
> files.
I'll do this, sure!

> You should add a patch 4/3 to delete the old RDMA code, to avoid
> confusion with having two in the tree.  
I did not send this iscsi_rdma.c removing patch, as it was just the entire
file marked with "-". Because this was a RFC only, i allowed myself
just to remove a line from the Makefile. This patch will be necessary
of course, when we get to the real patches.

> And also excise all the
> hooks that old RDMA code put into what used to be generic iscsi
> code.  Transport abstraction changes too.  
I left them because i was not sure which direction the changes will
take. When we finalize the design, these hooks will go away or 
change appropriately. 

> Since you have copied everything, there should be no need to test 
> for ->rdma.

If you mean testing target->rdma, then it is a tricky bit. As of today
there is no way of marking a target as "iser" or "pure iscsi" towards
the rest of the world. This situation stems from the iWARP origins of 
iSER spec, which implied that the RDMA-capabilities are discovered 
during the login phase (and not by marking targets by a management
means whatsoever).

As you probably remember, the spec stipulated starting connections 
in TCP, performing login, and if the target agrees to support iser, 
then one can proceed to the full-featured phase, else one can 
fall back to TCP or disconnect. With IB this is not an option.

We have to export all targets (both tcp and iser) thru SendTargets/iSNS
just as iscsi targets. The initiator's only way to know whether it's 
iser or iscsi/tcp is to try to login over iser (a word from an oracle
will also do). The code managing sessions is common, and needs to stay
this way at least to a certain degree even if we separate transports.
Thus when a new session is being instantiated, we have to ask if the 
target is RDMA or not, and if the login request came over a wrong
transport, we should reject. That's where target->rdma checks come
from.

> I'm still unhappy how you duplicated all the text handling.  Seems
> like there should have been a way to abstract the buffer handling
> more cleanly.
We have already agreed that i'll start with submitting patches that 
aim at a common set of text processing functions.

>     iser_task_add_out_rdma_buf - also a good idea
This reminds me of the need to support multiple buffers per command
(scatter-gather) to be able to support properly all existing operational
modes of iscsi in iser. 

This subject was raised in context of the new SG code. 

Tomo, perhaps we should start looking at the implications of this?

>     sched_rdma_rd, iosubmit, etc.  I like how you broke up all the
>     parts involved in handling requests; we only had maybe half as
>     many phases.  
One benefit of breaking the processing into phases is queuing all the
tasks waiting for a certain event and processing them all in batches.

This allows issuing one RDMA request consisting of a few linked 
operations in the same direction, like a few RDMA-Writes and Sends 
(from target), or few RDMA-Reads (to target). This requests aggregation
saves user-kernel switches and interrupts.

This also means that we can tolerate a short accumulation of ready tasks
on such a queue, if the ongoing activity is likely to produce more 
tasks for this queue, so that a greater degree of request aggregation 
is achieved.

> But this schedule() interface worries me.  It
>     looks like it only gets called when the fds are quiet.  That
>     seems like a bad way to manage load.  As soon as you accept a
>     request, you want to push it all the way through.  We ran into
>     this problem with iscsi_rdma.
If the fds are not quiet then tgt is probably busy with processing i/o
(mainly) completions, which will queue more tasks ready for completion
(RDMA-W and Send), so that all these completions will be aggregated.

We could also limit the number of fd-related events effectively processed
during one epoll_wait() sweep so that a better sharing between fd/schedule
events is achieved.

>     Too bad you broke AHS and bidir.  That worked in iscsi_rdma.
>     I worry you'll never get around to fixing it.
AHS was broken mainly because all the troubles with the text functions.
Bidir is another story, but the code mainly lacks plugging bidir commands
into the buffer allocation sequence.
I believe that we'll be able to get around these :)

> Performance graphs would be useful to allay my concerns about your
> use of schedule(), and make sure the fine-grained approach is valid.
Well, i'll produce some, when we get to the real patches.

> Interesting how you changed MAX_POLL_WC to 32.  That wants to be
> tunable more dynamically.
I agree. Need to add a parameter to --iser commands set

> Testing?  
>     - long client runs, see if memory leaks and/or performance drops
done, quite a lot
>     - lots of short-lived clients, ditto
done, need to test more
>     - lots of clients that die at random areas of communication
>       (using netfilter to block particular client IPs makes this easy)
done, by periodically switching an IB port off and back on -
this method tests scenarios when *all* clients disconnect and reconnect 
at once, so in a sense it is "more" stress, but perhaps misses some 
other subtle scenarios.

> I'm not going to be working on tgt, not because of your changes,
> just because my interest lies elsewhere these days.  Will happily
> look at graphs or patches if that can help, though.
Thanx for the insightful comments! Hope you'll be able to continue 
reviewing in the future.

--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html