robin.humble+stgt at anu.edu.au wrote on Tue, 18 Dec 2007 05:48 -0500:
> with >=2 iSER clients I'm hitting these quite a lot:
>   tgtd: iscsi_rdma_malloc(1619) free list empty
>   tgtd: iscsi_rdma_alloc_data_buf(1647) free list empty
> (the fn name depends on the tgtd version)
> after which the initiator node is pretty much toast :-/
>
> the code for this is in iscsi/iscsi_rdma.c -->
>
> static void *iscsi_rdma_alloc_data_buf(struct iscsi_connection *conn, size_t sz)
> {
>     ...
>     if (list_empty(&dev->mempool_free)) {
>         /* XXX: take slow path: allocate and register */
>         eprintf("free list empty\n");
>         exit(1);
>     }
>     ...
>
> which looks like an OO(rdma)M fallback that's just unimplemented at
> the moment?
>
> as a workaround I boosted:
>     static int mempool_num = 192;
> to 1920, which let 2 clients survive, but not the 15 or 100 that I'd
> ideally like.
>
> is dynamically adding more entries to the mempool the solution, or a
> separate list of non-mempool rdma bufs, or just telling the initiator to
> back off for a while?

The core of the problem is that we should be flow controlling client
requests, but aren't.  Look for the parameter max_cmdsn in iscsid.c and
you'll see that it is always set to exp_cmdsn (the current number) plus a
constant MAX_QUEUE_CMD, 128.  What should happen is that those places ask
the transport how much room it actually has available for the biggest
command the client might send.

This is theoretically a problem for TCP too, in that each working command
allocates a new buffer, although the failure mode there is that malloc()
of the task data buffers would fail and close the connection to the
initiator.  Maybe we should get rid of the exit(1) in iSER and just
return -ENOMEM so it behaves the same as TCP.  The connection would still
drop, but maybe without killing the initiator as unpleasantly.

For iSER it is more severe because we use pre-pinned data buffers to get
good performance, and the limits on the amount of pinned memory can be
tighter, since it has to be physical memory.  (If you are doing small
transfers, a better mempool allocator that could divide chunks might help,
but it doesn't fix the general problem.)

You could hack around it in iSER by taking the slow path and allocating as
the XXX comment suggests, but you would soon hit the memory allocation
limit.  Each connection could have 128 commands * 512k each in flight.
That's gobs.  You can play with RDMA_TRANSFER_SIZE and mempool_size to
shrink this somewhat and see how things work; that is not a negotiated
parameter, but bigger is likely to be faster.  And yeah, cranking up
mempool_num will pin more memory, shared across all initiators.

A different but related problem is MAX_WQE.  It is 1800 to accommodate
current Linux iSER behavior.  That governs the number of potential work
entries on a QP, and is rather large.  Not sure when a NIC would run out
of QPs.  It also needs some smaller, but not insignificant, amount of
pinned memory for each of these entries.  We could negotiate this down
using the MaxOutstandingUnexpectedPDUs parameter, but Linux iSER does not
support that.

		-- Pete
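
A minimal sketch of the flow-control idea above: have iscsid.c ask the
transport for its buffer headroom before advertising max_cmdsn, instead of
always adding a fixed MAX_QUEUE_CMD.  Everything here except MAX_QUEUE_CMD
(the existing 128 constant) is hypothetical; buf_headroom,
iscsi_transport_window and iscsi_update_max_cmdsn are not tgtd API, just
illustration of the shape such a hook could take.

    #include <stddef.h>
    #include <stdint.h>

    struct iscsi_connection;            /* opaque here, as in tgtd */

    /* hypothetical per-transport hook: how many more full-sized commands
     * can this transport back with (pinned) buffers right now? */
    struct iscsi_transport_window {
        int (*buf_headroom)(struct iscsi_connection *conn,
                            size_t max_xfer_len);
    };

    #define MAX_QUEUE_CMD 128           /* current fixed window in iscsid.c */

    /* cap the advertised window by what the transport says it can hold,
     * rather than unconditionally exp_cmdsn + MAX_QUEUE_CMD */
    static uint32_t iscsi_update_max_cmdsn(struct iscsi_connection *conn,
                                           struct iscsi_transport_window *tw,
                                           uint32_t exp_cmdsn,
                                           size_t max_xfer_len)
    {
        int room = tw->buf_headroom ? tw->buf_headroom(conn, max_xfer_len)
                                    : MAX_QUEUE_CMD;

        if (room > MAX_QUEUE_CMD)
            room = MAX_QUEUE_CMD;
        if (room < 1)
            room = 1;                   /* always allow at least one command */

        return exp_cmdsn + room;
    }

With something like this, an iSER transport low on mempool entries would
shrink the window it advertises to the initiator instead of running the
pool dry and hitting the exit(1) path.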
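
And a self-contained sketch of the "fail like TCP" fallback: when the free
list is exhausted, return NULL (the moral equivalent of -ENOMEM for a
pointer-returning allocator) and let the caller drop the connection, rather
than exiting the daemon.  This does not use tgtd's real structures; the
pool_buf/pool_alloc names and sizes are assumptions for illustration only.

    #include <stdio.h>
    #include <stdlib.h>

    #define POOL_BUFS      192          /* like mempool_num */
    #define POOL_BUF_SIZE  (512 * 1024) /* like the 512k transfer size */

    struct pool_buf {
        struct pool_buf *next;
        char data[POOL_BUF_SIZE];
    };

    static struct pool_buf *free_list;

    static void pool_init(void)
    {
        for (int i = 0; i < POOL_BUFS; i++) {
            /* real iSER code would also register each buffer with
             * ibv_reg_mr() so it stays pinned for RDMA */
            struct pool_buf *b = malloc(sizeof(*b));
            if (!b)
                break;
            b->next = free_list;
            free_list = b;
        }
    }

    static void *pool_alloc(void)
    {
        struct pool_buf *b = free_list;

        if (!b) {
            /* pool exhausted: report failure like a failed malloc() so
             * the caller can close the connection, not exit(1) */
            fprintf(stderr, "free list empty\n");
            return NULL;
        }
        free_list = b->next;
        return b->data;
    }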