[stgt] [PATCH 0/6] iser new implementation

Alexander Nezhinsky alexandern at Voltaire.COM
Mon Jan 24 20:37:21 CET 2011


These patches contain a re-write of the iser code. Re-sending after
fixing NOP and bi-dir commands issues.

A separate transport lld (named "iser") is defined.

New header iser.h is created.
File iscsi_rdma.c is replaced by iser.c

File iser_text.c contains the iscsi-text processing code replicated
from iscsid.c. This is done because the functions there are
not general enough, and rely on specifics of iscsi/tcp structs.
This file will hopefully be removed after the code unification.

Additional portions of code are copied from iscsid.c and changed
appropriately. These changes do not always represent code duplication,
but rather simplify or optimize the flow for iSER.

This code fixes an occasional data corruption that happens
with the current version.

There are some unhandled error cases and rare conditions, left
until there is a solid common iscsi framework to rely upon. Most of
them are marked with ToDo comments.

The code is fairly RDMA-transport agnostic (ib/iwarp).

The code implies RDMA-only mode of work. This means the "first burst"
including immediate data should be disabled, so that the entire
data transfer is performed using RDMA.
This is achieved by setting:
       target->session_param[ISCSI_PARAM_INITIAL_R2T_EN].val = 1;
       target->session_param[ISCSI_PARAM_IMM_DATA_EN].val = 0;
in iser_target_create().

This mode is perhaps the most suitable one for iser in the majority
of work scenarios. The only concern (performanc-wise) may arise about
relatively small WRITE ios, which enjoy theoretically lower latencies
using IB SEND instead of RDMA-RD.

Implementing such "Immediate Data" mode (and any other mode that
implies non-empty FirstBurst) is meanwhile precluded because it would
lead to multiple buffers per iSER task (e.g. ImmediateData buffer
received with the command PDU, and the rest of the data retrieved
using RDMA-RD).

The code introduces some preparations for handling such general scenarios,
but as tgt has no framework for multi-buffer commands,
these extra code segments are either commented out or conditioned upon
events that should never take place. All such places have a ToDo
comment over them.

Apart of the known issues, there are feature regressions to be addressed
in the next patches. Most notably AHS and bi-dir handling.

This patch-set differs from the previously submitted RFC patches:

+ converted all functions in iser_text.c to receive iscsi_conn *
and iser_pdu * for requets and response instead of two tasks.

This may serve as an example for changes in iscsi text functions
(in iscsid.c), where iscsi_pdu should be passed instead.
iser_pdu differs from iscsi_pdu by having a pointer to BHS (received
with the entire data, while iscsi_pdu contains the BHS struct,
which is filled by a socket read operation), plus additional
iser-specific fields.

+ fixed chap invocation, this is a temporary shortcut, done by
allocating an iscsi_conn structure, copying the conn header
common to iser and iscsi, then faking iscsi-tcp buffer-related fields,
then caling auth function, then copying back the resulting size and
also the entire iscsi_conn header (because its state fields
have been changed).

+ bidirectional commands support: buffer allocation mechanism
was fitted to handle both directions in a single command;
AHS processing has been added.

+ fixed various issues related to connection state and login state.
For example when the conn. state is inappropriate, ignore received data,
stop posting rx buffers etc. In order to be able to discriminate
between some fine-grained states, necessary for iser, three new states:
STATE_INIT, STATE_START and STATE_READY are added.

+ upon tx completion check that it is full feature and then repost
or if it is still login, then a state swapping function (which
previously was lost in porting from tcp) is called:
iser_login_tx_complete() (equivalent to cmnd_finish() in iscsid.c)

+ some fields in iser-related structures renamed for easier discrimination
from tcp and automatic find/replace.

+ fixed a bug in disconnection flow with live i/o:
Because READ i/o initiates RDMA-WR and then SEND, chained one to another,
when the connection is broken, all tasks should be freed but only
when both of the above IB ops get flushed.

+ fixed logout req. processing: send all pending tx pdus, logout rsp.
being the last one, then set EXIT state.

+ fixed a bug related to possible reverse order between the conn-established
event and the first data pdu (login). This "race" is an IB "feature", to be
resolved in an application-specific way. The implemented solution is
updating conn state and saving the login task if necessary to delay
login-rx execution.

+ Merged flush and error handling

+ fixed QP async events handling: now processing all possible events and
reacting to all relevant events, print out their info neatly (dev name, event
type etc. using the rdma layer functions)

+ Because of a bug in the initiator InitiatorRDSL is sent 256KB irrespective
of the iscsi.MaxRDSL setting in the connection config file.
This caused allocation of 256K for tx buffer, one per task causing ~100MB
extra allocation per connection.

Now all tasks have one pdu (not req & rsp), and the pdu accepted from rx
is reused when sending responses. Actually we only send sense data which
is quite small, so there is no need to allocate huge tx buffers.

In login phase 2 tasks are used, one for request, another for response.
Both tasks contain a PDU with 8K data buffer.
In full featured phase a special text_tx task is allocated with the pdu
data according to the negotiated value. It is not not posted and used as
response when a text pdu is received.

+ fixed the iser_sched_consume_cq() path, previously it sometimes (very rarely)
failed to consume all completions until the CQ was empty.

+ fixed a bug: tasks were not marked with "in_scsi" flag before being submitted
to the scsi target. When their connection is broken, they should not be
"completed" or freed against the scsi target.

+ added sending NOP-IN. While not mandatory in iscsi, it is quite important
in iSER (specifically in IB case). When a connection is broken, IB does not
automatically generate error events if there is no operation to fail.
Thus it is quite possible to have stale connections which will never clear,
if the target has nothing to send (obviously, nothing will be received
from now on). Sending NOP-IN periodically guarantees that such connections
will be removed, as the send op will fail with an unrecoverable error.
NOPs are sent using the timer-based work scheduler.
Sending NOP-IN is enabled by default. To disable, invoke tgtd with:
 "--iser nop=off".

+ fixed conn. refcnt accounting for IB errors. If post_receive or post_send
fail, we should traverse the request list and discriminate between successful
and failed operations. Because the list is single-linked, the number of
posted requests is passed to the corresponding functions.

+ added iser_conn_getn() that accepts an increment parameter, to save calls
and spare debug messages when posting multiple ib requests, which
increment refcnt by a value greater than 1

+ added storing of initiator and target hostnames/ips in iser conn

+ eliminated all exit(1) calls in iscsi/iser_ib.c

+ enclosed conditions with likely()/unlikely() to minimize rare events
overhead in data-path

+ targets are marked with "rdma" flag. When a TCP is being established to a
RDMA target, it is being declined, and vice versa.

--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



More information about the stgt mailing list