robin.humble+stgt at anu.edu.au wrote on Tue, 18 Dec 2007 05:48 -0500:
> with >=2 iSER clients I'm hitting these quite a lot:
>   tgtd: iscsi_rdma_malloc(1619) free list empty
>   tgtd: iscsi_rdma_alloc_data_buf(1647) free list empty
> (the fn name depends on the tgtd version)
> after which the initiator node is pretty much toast :-/
>
> the code for this is in iscsi/iscsi_rdma.c -->
>
> static void *iscsi_rdma_alloc_data_buf(struct iscsi_connection *conn, size_t sz)
> {
>     ...
>     if (list_empty(&dev->mempool_free)) {
>         /* XXX: take slow path: allocate and register */
>         eprintf("free list empty\n");
>         exit(1);
>     }
>     ...
>
> which looks like an OO(rdma)M fallback that's just unimplemented at
> the moment?
>
> as a workaround I boosted:
>     static int mempool_num = 192;
> to 1920, which let 2 clients survive, but not the 15 or 100 that I'd
> ideally like.
>
> is dynamically adding more entries to the mempool the solution, or a
> separate list of non-mempool rdma bufs, or just telling the initiator to
> back off for a while?

The core of the problem is that we should be flow controlling client
requests, but aren't.  Look for the parameter max_cmdsn in iscsid.c and
you'll see that it is always set to exp_cmdsn (the current number) plus a
constant MAX_QUEUE_CMD, 128.  What should happen is that those places ask
the transport how much room it actually has available for the biggest
command the client might send.

This is theoretically a problem for TCP too, in that each working command
allocates a new buffer, although the failure mode there is that malloc()
of the task data buffers would fail and close the connection to the
initiator.  Maybe we should get rid of the exit(1) in iSER and just
return -ENOMEM so it behaves the same as TCP.  The connection would still
drop, but maybe without killing the initiator as unpleasantly.

For iSER it is more severe because we use pre-pinned data buffers to get
good performance, and the limits on the amount of pinned memory can be
tighter, since it has to be physical memory.  (If you are doing small
transfers, a better mempool allocator that could divide chunks might help,
but it doesn't fix the general problem.)

You could hack around it in iSER by taking the slow path and allocating as
the XXX comment suggests, but you would soon hit the memory allocation
limit.  Each connection could have 128 commands * 512k each in flight.
That's gobs.  You can play with RDMA_TRANSFER_SIZE and mempool_size to
shrink this somewhat and see how things work; that is not a negotiated
parameter, but bigger is likely to be faster.  And yeah, cranking up
mempool_num will pin more memory, shared across all initiators.

A different but related problem is MAX_WQE.  It is 1800 to accommodate
current Linux iSER behavior.  That governs the number of potential work
entries on a QP, and is rather large.  Not sure when a NIC would run out
of QPs.  It also needs some smaller, but not insignificant, amount of
pinned memory for each of these entries.  We could negotiate this down
using the MaxOutstandingUnexpectedPDUs parameter, but Linux iSER does not
support that.

		-- Pete
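
A minimal sketch of the flow-control idea above: have iscsid.c ask the
transport for its buffer headroom before advertising max_cmdsn, instead of
always adding a fixed MAX_QUEUE_CMD.  Everything here except MAX_QUEUE_CMD
(the existing 128 constant) is hypothetical; buf_headroom,
iscsi_transport_window and iscsi_update_max_cmdsn are not tgtd API, just
illustration of the shape such a hook could take.

    #include <stddef.h>
    #include <stdint.h>

    struct iscsi_connection;            /* opaque here, as in tgtd */

    /* hypothetical per-transport hook: how many more full-sized commands
     * can this transport back with (pinned) buffers right now? */
    struct iscsi_transport_window {
        int (*buf_headroom)(struct iscsi_connection *conn,
                            size_t max_xfer_len);
    };

    #define MAX_QUEUE_CMD 128           /* current fixed window in iscsid.c */

    /* cap the advertised window by what the transport says it can hold,
     * rather than unconditionally exp_cmdsn + MAX_QUEUE_CMD */
    static uint32_t iscsi_update_max_cmdsn(struct iscsi_connection *conn,
                                           struct iscsi_transport_window *tw,
                                           uint32_t exp_cmdsn,
                                           size_t max_xfer_len)
    {
        int room = tw->buf_headroom ? tw->buf_headroom(conn, max_xfer_len)
                                    : MAX_QUEUE_CMD;

        if (room > MAX_QUEUE_CMD)
            room = MAX_QUEUE_CMD;
        if (room < 1)
            room = 1;                   /* always allow at least one command */

        return exp_cmdsn + room;
    }

With something like this, an iSER transport low on mempool entries would
shrink the window it advertises to the initiator instead of running the
pool dry and hitting the exit(1) path.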
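
And a self-contained sketch of the "fail like TCP" fallback: when the free
list is exhausted, return NULL (the moral equivalent of -ENOMEM for a
pointer-returning allocator) and let the caller drop the connection, rather
than exiting the daemon.  This does not use tgtd's real structures; the
pool_buf/pool_alloc names and sizes are assumptions for illustration only.

    #include <stdio.h>
    #include <stdlib.h>

    #define POOL_BUFS      192          /* like mempool_num */
    #define POOL_BUF_SIZE  (512 * 1024) /* like the 512k transfer size */

    struct pool_buf {
        struct pool_buf *next;
        char data[POOL_BUF_SIZE];
    };

    static struct pool_buf *free_list;

    static void pool_init(void)
    {
        for (int i = 0; i < POOL_BUFS; i++) {
            /* real iSER code would also register each buffer with
             * ibv_reg_mr() so it stays pinned for RDMA */
            struct pool_buf *b = malloc(sizeof(*b));
            if (!b)
                break;
            b->next = free_list;
            free_list = b;
        }
    }

    static void *pool_alloc(void)
    {
        struct pool_buf *b = free_list;

        if (!b) {
            /* pool exhausted: report failure like a failed malloc() so
             * the caller can close the connection, not exit(1) */
            fprintf(stderr, "free list empty\n");
            return NULL;
        }
        free_list = b->next;
        return b->data;
    }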