[stgt] [PATCH V1 01/13] transfer_len for scsi_cmd, set together with resid; unified resid handling for iscsi/iser

nezhinsky at gmail.com nezhinsky at gmail.com
Tue Feb 12 15:49:16 CET 2013


From: Alexander Nezhinsky <nezhinsky at gmail.com>

The current code is a bit messy about handling allocation length, actual
transfer length, and residual counts in underflow and overflow conditions.
This patch attempts to tune up the exiting APIs so that their semantics follows
the relevant specs through the entire course of a command execution.

It is important to distinguish here between two layers: Transport and SCSI.
Transport is implemented as LLDs (iSCSI, iSER, FCoE etc.) in tgt architecture.
SCSI is comprised of SPC, SBC etc., jointly covered by "Device Types" and
"Backing Stores" in tgt.

Residual counts for underflow and overflow conditions are relevant only for
the Transport layer. For example, iSCSI spec (10.4. "SCSI Response") defines
them relatively to "Expected Data Transfer Length" (EDTL):
* overflow:
  "initiator's EDTL was not sufficient for the number of bytes to be transfered"
* underflow:
  "bytes that were not transferred out of expected number of bytes (EDTL)".

EDTL is communicated by the Transport (e.g. in iSCSI Command PDU) and provides
a normalized indication (which does not require CDB parsing) about the buffer
length allocated by the initiator. This is important because all CDBs also
contain similar counters, but their format varies with the opcode and does
require CDB parsing.

While EDTL reflects the expectation of the initiator, the actual number of
bytes to be transferred is produced by SCSI layer. It is regulated by specs like
SPC, SBC etc. There is a fundametal difference between the method used by
SPC (plus some commands from other groups) vs. SBC for handling buffer lengths.

SBC directly specifies TRANSFER_LENGTH, counted in logical blocks.
Any event in which the actual number of blocks to be transferred differs from
the expected value is defined as underflow or overflow, both for READ and WRITE
commands.

SPC commands convey ALLOCATION_LENGTH, counted in bytes.
For READ SPC commands SCSI layer should generate DATA-IN buffer and communicate
available length to the initiator in an appropriate field of the response
data structure, contained in the generated DATA-IN.

If the available DATA-IN length is longer than the allocation length, it is
truncated, but this does not change those length-carrying fields,
as mandated by SPC-3 4.3.5.6:
 "If the information being transferred to the Data-In Buffer includes
  fields containing counts of the number of bytes in some or all of the data,
  then the contents of these fields shall not be altered to reflect the
  truncation, if any, that results from an insufficient ALLOCATION LENGTH value,
  unless the standard that describes the Data-In Buffer format states otherwise"

Such truncation is the main reason for having non-zero residual counts in SPC.
In SPC there is usually no residual counts for WRITE commands.

Below is an algorithm for handling READ-type SPC cmds (like REPORT_LUNS,
INQUIRY etc.) with clear separation between transport and scsi layers.

1. Transport layer receives a SCSI cmd PDU, carrying CDB and EDTL
2. Transport layer passes cmd to SCSI layer:
   * scsi_execute_spc_command (cdb, edtl)
   2.1. extract alloc_len from cdb
   2.2. if (edtl < alloc_len)
           then sense(ILLEGAL_REQUEST,ASC_INVALID_FIELD_IN_CDB)
   2.3. generate Data-IN buffer, avail_len bytes long
   2.4. actual_len = MIN (alloc_len, avail_len)
   /* Note that avail_len remains internal to SCSI SPC layer, although it is
      contained in DATA-IN; Transport receives only actual_len value */
   2.5. set actual_len for IN direction
3. Transport layer is requested to transfer data_in
   * transport_send_data_in (actual_len)
   3.1. if (actual_len == edtl)
           then { resid = 0; }
   3.2. if (actual_len < edtl)
           then { resid = edtl - actual_len; underflow = 1; }
   3.3. if (actual_len > edtl)
           then { resid = actual_len - edtl; overflow = 1; }

In SPC READ cmds overflow is impossible as it would imply edtl < alloc_len,
which is responded with sense. Still it is (at least in theory) possible with
SBC commands (like READ-10).

With SBC only step 2. of the above algorithm actually differs, as there
is no alloc_len, but transfer_len (in blocks) instead, and the actual_len is
normally received from the backing store device. With emulated devices some
scenarios producing residual counts are impossible. When working with
generic SCSI devices, as in the case of SG/BSG, all scenarios are possible.

Currently, iscsid.c always truncates the case of READ overflow (resid < 0)
and sets resid = 0.

As explained above this is correct behavior with SPC cmds (no overflows),
but overflows should not have been produced by SPC module in the first place,
while with READ SBC commands overflow MUST be reported.

Current iser.c implementation behaves according to the spec, but because
the SCSI layer sometimes does not, it may occasionally produce incorrect
responses. Linux initiator disregards them, but more pedantic initiators,
like VmWare's ESX, throws errors.

This patch basically forces SCSI layer to set only the actual transfer len
as described in section 2.5. and to implement residual+overflow calculations as
described in section 3. of the above algorithm entirely in the Transpor layer.

It unifies functions setting final residual counts and over/underflow
indicators for iscsi and iser LLDs.

New bi-dir field "transfer_len" is defined for scsi cmd descriptor.
By default it is assumed that EDTL will be actually transferred.
If actual_len differs from the expected then SCSI layer must call
scsi_set_in_resid_by_actual() or scsi_set_out_resid_by_actual().
These functions always existed but now they set both the transfer len
and residual count.

This patch relies on correct behavior of SCSI layer, as described above.
Thus it is followed by a series of patches correcting this and other related
aspects.

Signed-off-by: Alexander Nezhinsky <nezhinsky at gmail.com>
---
 usr/iscsi/iscsid.c |   91 +++++++++++++++++++++++++---------------------------
 usr/iscsi/iscsid.h |    1 +
 usr/iscsi/iser.c   |   46 +++-----------------------
 usr/scsi_cmnd.h    |   39 ++++++++++++++++------
 usr/target.c       |    7 ++++
 5 files changed, 86 insertions(+), 98 deletions(-)

diff --git a/usr/iscsi/iscsid.c b/usr/iscsi/iscsid.c
index eaa21b1..9614001 100644
--- a/usr/iscsi/iscsid.c
+++ b/usr/iscsi/iscsid.c
@@ -1008,36 +1008,48 @@ static void cmnd_finish(struct iscsi_connection *conn)
 	}
 }
 
-static void calc_residual(struct iscsi_cmd_rsp *rsp, struct iscsi_task *task)
+static inline void iscsi_rsp_set_resid(struct iscsi_cmd_rsp *rsp,
+				       int32_t resid)
 {
-	uint32_t residual = 0;
-	struct scsi_cmd *scmd = &task->scmd;
-	uint32_t read_len = scsi_get_in_length(scmd);
-
-	/* we never have write under/over flow, no way to signal that
-	 * back from the target currently. */
-	if (scsi_get_data_dir(scmd) == DATA_BIDIRECTIONAL) {
-		if (task->len < read_len) {
-			rsp->flags |= ISCSI_FLAG_CMD_BIDI_UNDERFLOW;
-			residual = read_len - task->len;
-		} else if (task->len > read_len) {
-			rsp->flags |= ISCSI_FLAG_CMD_BIDI_OVERFLOW;
-			residual = task->len - read_len;
-		}
-		rsp->bi_residual_count = cpu_to_be32(residual);
+	if (likely(!resid))
 		rsp->residual_count = 0;
+	else if (resid > 0) {
+		rsp->flags |= ISCSI_FLAG_CMD_UNDERFLOW;
+		rsp->residual_count = cpu_to_be32((uint32_t)resid);
 	} else {
-		if (task->len < read_len) {
-			rsp->flags |= ISCSI_FLAG_CMD_UNDERFLOW;
-			residual = read_len - task->len;
-		} else if (task->len > read_len) {
-			rsp->flags |= ISCSI_FLAG_CMD_OVERFLOW;
-			residual = task->len - read_len;
-		}
-		rsp->residual_count = cpu_to_be32(residual);
+		rsp->flags |= ISCSI_FLAG_CMD_OVERFLOW;
+		rsp->residual_count = cpu_to_be32((uint32_t)-resid);
+	}
+}
+
+static inline void iscsi_rsp_set_bidir_resid(struct iscsi_cmd_rsp *rsp,
+					     int32_t resid)
+{
+	if (likely(!resid))
+		rsp->bi_residual_count = 0;
+	else if (resid > 0) {
+		rsp->flags |= ISCSI_FLAG_CMD_BIDI_UNDERFLOW;
+		rsp->bi_residual_count = cpu_to_be32((uint32_t)resid);
+	} else {
+		rsp->flags |= ISCSI_FLAG_CMD_BIDI_OVERFLOW;
+		rsp->bi_residual_count = cpu_to_be32((uint32_t)-resid);
 	}
 }
 
+void iscsi_rsp_set_residual(struct iscsi_cmd_rsp *rsp, struct scsi_cmd *scmd)
+{
+	rsp->bi_residual_count = 0;
+	if (scsi_get_data_dir(scmd) == DATA_READ)
+		iscsi_rsp_set_resid(rsp, scsi_get_in_resid(scmd));
+	else if (scsi_get_data_dir(scmd) == DATA_WRITE)
+		iscsi_rsp_set_resid(rsp, scsi_get_out_resid(scmd));
+	else if (scsi_get_data_dir(scmd) == DATA_BIDIRECTIONAL) {
+		iscsi_rsp_set_bidir_resid(rsp, scsi_get_in_resid(scmd));
+		iscsi_rsp_set_resid(rsp, scsi_get_out_resid(scmd));
+	} else
+		rsp->residual_count = 0;
+}
+
 struct iscsi_sense_data {
 	uint16_t length;
 	uint8_t  data[0];
@@ -1060,7 +1072,7 @@ static int iscsi_cmd_rsp_build(struct iscsi_task *task)
 	rsp->exp_cmdsn = cpu_to_be32(conn->session->exp_cmd_sn);
 	rsp->max_cmdsn = cpu_to_be32(conn->session->exp_cmd_sn + MAX_QUEUE_CMD);
 
-	calc_residual(rsp, task);
+	iscsi_rsp_set_residual(rsp, &task->scmd);
 
 	sense_len = task->scmd.sense_len;
 	if (sense_len) {
@@ -1092,15 +1104,14 @@ static int iscsi_data_rsp_build(struct iscsi_task *task)
 	rsp->offset = cpu_to_be32(task->offset);
 	rsp->datasn = cpu_to_be32(task->exp_r2tsn++);
 
-	datalen = min_t(uint32_t, scsi_get_in_length(&task->scmd), task->len);
-	datalen -= task->offset;
+	datalen = scsi_get_in_transfer_len(&task->scmd) - task->offset;
 
 	maxdatalen = conn->tp->rdma ?
 		conn->session_param[ISCSI_PARAM_MAX_BURST].val :
 		conn->session_param[ISCSI_PARAM_MAX_XMIT_DLENGTH].val;
 
 	dprintf("%d %d %d %" PRIu32 "%x\n", datalen,
-		scsi_get_in_length(&task->scmd), task->len, maxdatalen,
+		scsi_get_in_transfer_len(&task->scmd), task->offset, maxdatalen,
 		rsp->itt);
 
 	if (datalen <= maxdatalen) {
@@ -1113,7 +1124,8 @@ static int iscsi_data_rsp_build(struct iscsi_task *task)
 			rsp->flags |= ISCSI_FLAG_DATA_STATUS;
 			rsp->cmd_status = result;
 			rsp->statsn = cpu_to_be32(conn->stat_sn++);
-			calc_residual((struct iscsi_cmd_rsp *) rsp, task);
+			iscsi_rsp_set_residual((struct iscsi_cmd_rsp *) rsp,
+					       &task->scmd);
 		}
 	} else
 		datalen = maxdatalen;
@@ -1214,7 +1226,6 @@ void iscsi_free_cmd_task(struct iscsi_task *task)
 static int iscsi_scsi_cmd_done(uint64_t nid, int result, struct scsi_cmd *scmd)
 {
 	struct iscsi_task *task = ITASK(scmd);
-	uint32_t read_len = scsi_get_in_length(scmd);
 
 	/*
 	 * Since the connection is closed we just free the task.
@@ -1227,19 +1238,6 @@ static int iscsi_scsi_cmd_done(uint64_t nid, int result, struct scsi_cmd *scmd)
 		return 0;
 	}
 
-	if (scsi_get_data_dir(scmd) == DATA_WRITE)
-		task->len = scsi_get_out_length(scmd) - scsi_get_out_resid(scmd);
-	else
-		task->len = scsi_get_in_length(scmd) - scsi_get_in_resid(scmd);
-
-	if (scsi_get_data_dir(scmd) == DATA_WRITE)
-		task->len = 0;  /* no read result */
-	else if (task->len > read_len) {
-		dprintf("shrunk too big device read len %d > %u\n",
-			task->len, read_len);
-		task->len = read_len;
-	}
-
 	list_add_tail(&task->c_list, &task->conn->tx_clist);
 	task->conn->tp->ep_event_modify(task->conn, EPOLLIN | EPOLLOUT);
 
@@ -1773,7 +1771,7 @@ static int iscsi_scsi_cmd_tx_start(struct iscsi_task *task)
 
 	if (task->r2t_count)
 		err = iscsi_r2t_build(task);
-	else if (task->offset < task->len)
+	else if (task->offset < scsi_get_in_transfer_len(&task->scmd))
 		err = iscsi_data_rsp_build(task);
 	else
 		err = iscsi_cmd_rsp_build(task);
@@ -1853,10 +1851,9 @@ static int iscsi_scsi_cmd_tx_done(struct iscsi_connection *conn)
 	case ISCSI_OP_R2T:
 		break;
 	case ISCSI_OP_SCSI_DATA_IN:
-		if (task->offset < task->len ||
+		if (task->offset < scsi_get_in_transfer_len(&task->scmd) ||
 		    scsi_get_result(&task->scmd) != SAM_STAT_GOOD ||
-		    scsi_get_data_dir(&task->scmd) == DATA_BIDIRECTIONAL ||
-		    conn->tp->rdma) {
+		    scsi_get_data_dir(&task->scmd) == DATA_BIDIRECTIONAL) {
 			dprintf("more data or sense or bidir %x\n", hdr->itt);
 			list_add(&task->c_list, &task->conn->tx_clist);
 			return 0;
diff --git a/usr/iscsi/iscsid.h b/usr/iscsi/iscsid.h
index 444661b..e7e08ae 100644
--- a/usr/iscsi/iscsid.h
+++ b/usr/iscsi/iscsid.h
@@ -319,6 +319,7 @@ extern int iscsi_delete_portal(char *addr, int port);
 extern int iscsi_param_parse_portals(char *p, int do_add, int do_delete);
 extern void iscsi_update_conn_stats_rx(struct iscsi_connection *conn, int size, int opcode);
 extern void iscsi_update_conn_stats_tx(struct iscsi_connection *conn, int size, int opcode);
+extern void iscsi_rsp_set_residual(struct iscsi_cmd_rsp *rsp, struct scsi_cmd *scmd);
 
 /* iscsid.c iscsi_task */
 extern void iscsi_free_task(struct iscsi_task *task);
diff --git a/usr/iscsi/iser.c b/usr/iscsi/iser.c
index 4d08ea9..e4fcd92 100644
--- a/usr/iscsi/iser.c
+++ b/usr/iscsi/iser.c
@@ -2000,7 +2000,6 @@ static void iser_scsi_cmd_iosubmit(struct iser_task *task, int not_last)
 	scsi_set_out_length(scmd, 0);
 
 	if (task->is_write) {
-		scsi_set_out_resid(scmd, 0);
 		/* It's either the last buffer, which is RDMA,
 		   or the only buffer */
 		data_buf = list_entry(task->out_buf_list.prev,
@@ -2023,7 +2022,6 @@ static void iser_scsi_cmd_iosubmit(struct iser_task *task, int not_last)
 		scsi_set_out_length(scmd, task->out_len);
 	}
 	if (task->is_read) {
-		scsi_set_in_resid(scmd, 0);
 		/* ToDo: multiple RDMA-Read buffers */
 		data_buf = list_entry(task->in_buf_list.next,
 				      struct iser_membuf, task_list);
@@ -2049,32 +2047,6 @@ static void iser_scsi_cmd_iosubmit(struct iser_task *task, int not_last)
 	target_cmd_queue(session->target->tid, scmd);
 }
 
-static void iser_rsp_set_read_resid(struct iscsi_cmd_rsp *rsp_bhs,
-				    int in_resid)
-{
-	if (in_resid > 0) {
-		rsp_bhs->flags |= ISCSI_FLAG_CMD_UNDERFLOW;
-		rsp_bhs->residual_count = cpu_to_be32((uint32_t)in_resid);
-	} else {
-		rsp_bhs->flags |= ISCSI_FLAG_CMD_OVERFLOW;
-		rsp_bhs->residual_count = cpu_to_be32(((uint32_t)-in_resid));
-	}
-	rsp_bhs->bi_residual_count = 0;
-}
-
-static void iser_rsp_set_bidir_resid(struct iscsi_cmd_rsp *rsp_bhs,
-				     int in_resid)
-{
-	if (in_resid > 0) {
-		rsp_bhs->flags |= ISCSI_FLAG_CMD_BIDI_UNDERFLOW;
-		rsp_bhs->bi_residual_count = cpu_to_be32((uint32_t)in_resid);
-	} else {
-		rsp_bhs->flags |= ISCSI_FLAG_CMD_BIDI_OVERFLOW;
-		rsp_bhs->bi_residual_count = cpu_to_be32(((uint32_t)-in_resid));
-	}
-	rsp_bhs->residual_count = 0;
-}
-
 static int iser_scsi_cmd_done(uint64_t nid, int result,
 			      struct scsi_cmd *scmd)
 {
@@ -2084,7 +2056,6 @@ static int iser_scsi_cmd_done(uint64_t nid, int result,
 	struct iser_conn *conn = task->conn;
 	struct iscsi_session *session = conn->h.session;
 	unsigned char sense_len = scmd->sense_len;
-	int in_resid;
 
 	assert(nid == scmd->cmd_itn_id);
 
@@ -2108,21 +2079,12 @@ static int iser_scsi_cmd_done(uint64_t nid, int result,
 	iser_set_rsp_stat_sn(session, task->pdu.bhs);
 	rsp_bhs->exp_datasn = 0;
 
+	iscsi_rsp_set_residual(rsp_bhs, scmd);
 	if (task->is_read) {
-		in_resid = scsi_get_in_resid(scmd);
-		if (in_resid != 0) {
-			if (!task->is_write)
-				iser_rsp_set_read_resid(rsp_bhs, in_resid);
-			else
-				iser_rsp_set_bidir_resid(rsp_bhs, in_resid);
-			if (in_resid > 0)
-				task->rdma_wr_remains = task->in_len - in_resid;
-		}
-		/* schedule_rdma_write(task, conn); // ToDo: need this? */
-	} else {
-		rsp_bhs->bi_residual_count = 0;
-		rsp_bhs->residual_count = 0;
+		task->rdma_wr_remains = scsi_get_in_transfer_len(scmd);
+		task->rdma_wr_sz = scsi_get_in_transfer_len(scmd);
 	}
+
 	task->pdu.ahssize = 0;
 	task->pdu.membuf.size = 0;
 
diff --git a/usr/scsi_cmnd.h b/usr/scsi_cmnd.h
index 83d733c..8bdcb19 100644
--- a/usr/scsi_cmnd.h
+++ b/usr/scsi_cmnd.h
@@ -12,9 +12,10 @@ enum data_direction {
 };
 
 struct scsi_data_buffer {
-	int resid;
-	uint32_t length;
 	uint64_t buffer;
+	uint32_t length;
+	uint32_t transfer_len;
+	int32_t resid;
 };
 
 struct scsi_cmd {
@@ -80,23 +81,43 @@ static inline type scsi_get_##dir##_##field(struct scsi_cmd *scmd)		\
 	return get_cast (scmd->dir##_sdb.field);				\
 }
 
-scsi_data_buffer_accessor(length, uint32_t,,);
-scsi_data_buffer_accessor(resid, int,,);
+scsi_data_buffer_accessor(length, uint32_t, ,);
+scsi_data_buffer_accessor(transfer_len, uint32_t, ,);
+scsi_data_buffer_accessor(resid, int32_t, ,);
 scsi_data_buffer_accessor(buffer, void *, (unsigned long), (void *)(unsigned long));
 
 static inline void scsi_set_in_resid_by_actual(struct scsi_cmd *scmd,
-					       uint32_t actual)
+					       uint32_t transfer_len)
 {
-	scsi_set_in_resid(scmd, scsi_get_in_length(scmd) - actual);
+	uint32_t expected_len = scsi_get_in_length(scmd);
+	int32_t resid;
+
+	if (transfer_len <= expected_len)
+		resid = expected_len - transfer_len;
+	else {
+		resid = -(int32_t)(transfer_len - expected_len);
+		transfer_len = expected_len;
+	}
+	scsi_set_in_transfer_len(scmd, transfer_len);
+	scsi_set_in_resid(scmd, resid);
 }
 
 static inline void scsi_set_out_resid_by_actual(struct scsi_cmd *scmd,
-					       uint32_t actual)
+						uint32_t transfer_len)
 {
-	scsi_set_out_resid(scmd, scsi_get_out_length(scmd) - actual);
+	uint32_t expected_len = scsi_get_out_length(scmd);
+	int32_t resid;
+
+	if (transfer_len <= expected_len)
+		resid = expected_len - transfer_len;
+	else {
+		resid = -(int32_t)(transfer_len - expected_len);
+		transfer_len = expected_len;
+	}
+	scsi_set_out_transfer_len(scmd, transfer_len);
+	scsi_set_out_resid(scmd, resid);
 }
 
-
 enum {
 	TGT_CMD_QUEUED,
 	TGT_CMD_PROCESSED,
diff --git a/usr/target.c b/usr/target.c
index 7b9ae12..9a8f2fd 100644
--- a/usr/target.c
+++ b/usr/target.c
@@ -1111,6 +1111,13 @@ int target_cmd_queue(int tid, struct scsi_cmd *cmd)
 	/* service delivery or target failure */
 	if (target->target_state != SCSI_TARGET_READY)
 		return -EBUSY;
+
+	/* by default assume zero residual counts */
+	scsi_set_in_resid(cmd, 0);
+	scsi_set_in_transfer_len(cmd, scsi_get_in_length(cmd));
+	scsi_set_out_resid(cmd, 0);
+	scsi_set_out_transfer_len(cmd, scsi_get_out_length(cmd));
+
 	/*
 	 * Call struct scsi_lu->cmd_perform() that will either be setup for
 	 * internal or passthrough CDB processing using 2 functions below.
-- 
1.7.9.6

--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



More information about the stgt mailing list