[stgt] [PATCH 0/4] iser, new implementation

Alexander Nezhinsky alexandern at Voltaire.COM
Wed Oct 13 15:31:08 CEST 2010

These patches contain a re-write of the iser code.

A separate transport lld (named "iser") is defined.

New header iser.h is created.
File iscsi_rdma.c is replaced by iser.c 

File iser_text.c contains the iscsi-text processing code replicated
from iscsid.c. This is done because the functions there are
not general enough, and rely on specifics of iscsi/tcp structs.
This file will hopefully be removed after the code unification.

Additional portions of code are copied from iscsid.c and changed 
appropriately. These changes do not always represent code duplication,
but rather simplify or optimize the flow for iSER.

This code fixes an occasional data corruption that happens
with the current version.

There are some unhandled error cases and rare conditions, left
until there is a solid common iscsi framework to rely upon. Most of
them are marked with ToDo comments.

The code is fairly RDMA-transport agnostic (ib/iwarp).

The code implies RDMA-only mode of work. This means the "first burst"
including immediate data should be disabled, so that the entire 
data transfer is performed using RDMA. This mode is perhaps the most
suitable one for iser in the majority of work scenarios.
The only concern is about relatively small WRITE ios, which may enjoy
theoretically lower latencies using IB SEND instead of RDMA-RD.  
Implementing this mode is meanwhile precluded because it would lead 
to multiple buffers per iSER task (e.g. ImmediateData buffer received
with the command PDU, and the rest of the data retrieved using RDMA-RD).
This is achieved by setting: 
	target->session_param[ISCSI_PARAM_INITIAL_R2T_EN].val = 1;
	target->session_param[ISCSI_PARAM_IMM_DATA_EN].val = 0;
in iser_target_create().

The code introduces some preparations for handling such general scenarios,
but as tgt has no framework for multi-buffer commands,
these extra code segments are either commented out or conditioned upon 
events that should never take place. All such places have a ToDo
comment over them.

Apart of the known issues, there are feature regressions to be addressed
in the next patches. Most notably AHS and bi-dir handling.

This patch-set differs from the previously submitted RFC patches:

+ converted all functions in iser_text.c to receive iscsi_conn *
and iser_pdu * for requets and response instead of two tasks.

This may serve as an example for changes in iscsi text functions
(in iscsid.c), where iscsi_pdu should be passed instead. 
iser_pdu differs from iscsi_pdu by having a pointer to BHS (received
with the entire data, while iscsi_pdu contains the BHS struct, 
which is filled by a socket read operation), plus additional 
iser-specific fields. 

+ fixed chap invocation, this is a temporary shortcut, done by
allocating an iscsi_conn structure, copying the conn header
common to iser and iscsi, then faking iscsi-tcp buffer-related fields,
then caling auth function, then copying back the resulting size and
also the entire iscsi_conn header (because its state fields 
have been changed)

+ fixed various issues related to connection state and login state.
For example when the conn. state is inappropriate, ignore received data,
stop posting rx buffers etc. In order to be able to discriminate 
between some fine-grained states, necessary for iser, three new states:

+ upon tx completion check that it is full feature and then repost
or if it is still login, then a state swapping function (which
previously was lost in porting from tcp) is called:
iser_login_tx_complete() (equivalent to cmnd_finish() in iscsid.c)

+ some fields in iser-related structures renamed for easier discrimination
from tcp and automatic find/replace.

+ fixed a bug in disconnection flow with live i/o:
Because READ i/o initiates RDMA-WR and then SEND, chained one to another,
when the connection is broken, all tasks should be freed but only 
when both of the above IB ops get flushed.

+ fixed logout req. processing: send all pending tx pdus, logout rsp. 
being the last one, then set EXIT state

+ fixed a bug related to possible reverse order between the conn-established
event and the first data pdu (login). This "race" is an IB "feature", to be 
resolved in an application-specific way. The implemented solution is
updating conn state and saving the login task if necessary to delay 
login-rx execution.

+ Merged flush and error handling

+ fixed QP async events handling: now processing all possible events and
reacting to all relevant events, print out their info neatly (dev name, event 
type etc. using the rdma layer functions) 

+ Because of a bug in the initiator InitiatorRDSL is sent 256KB irrespective 
of the iscsi.MaxRDSL setting in the connection config file. 
This caused allocation of 256K for tx buffer, one per task causing ~100MB 
extra allocation per connection.

Now all tasks have one pdu (not req & rsp), and the pdu accepted from rx 
is reused when sending responses. Actually we only send sense data which 
is quite small, so there is no need to allocate huge tx buffers. 

In login phase 2 tasks are used, one for request, another for response.
Both tasks contain a PDU with 8K data buffer.
In full featured phase a special text_tx task is allocated with the pdu
data according to the negotiated value. It is not not posted and used as 
response when a text pdu is received.

+ fixed the iser_sched_consume_cq() path, previously it sometimes (very rarely)
failed to consume all completions until the CQ was empty.

+ fixed a bug: tasks were not marked with "in_scsi" flag before being submitted
to the scsi target. When their connection is broken, they should not be 
"completed" or freed against the scsi target.

+ added sending NOP-IN. While not mandatory in iscsi, it is quite important 
in iSER (specifically in IB case). When a connection is broken, IB does not 
automatically generate error events if there is no operation to fail.
Thus it is quite possible to have stale connections which will never clear,
if the target has nothing to send (obviously, nothing will be received
from now on). Sending NOP-IN periodically guarantees that such connections
will be removed, as the send op will fail with an unrecoverable error.
Sending NOP-IN is enabled by default. To disable, invoke tgtd with 
"--iser nop=off". 
Uses a timer: do we need a centralized timer management in tgtd? 

+ fixed conn. refcnt accounting for IB errors. If post_receive or post_send
fail, we should traverse the request list and discriminate between successful
and failed operations. Because the list is single-linked, the number of 
posted requests is passed to the corresponding functions.

+ added iser_conn_getn() that accepts an increment parameter, to save calls 
and spare debug messages when posting multiple ib requests, which
increment refcnt by a value greater than 1

+ added storing of initiator and target hostnames/ips in iser conn

+ eliminated all exit(1) calls in iscsi/iser_ib.c

+ enclosed conditions with likely()/unlikely() to minimize rare events 
overhead in data-path

+ targets are marked with "rdma" flag. When a TCP is being established to a
RDMA target, it is being declined, and vice versa.

To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

More information about the stgt mailing list