[stgt] stgtd 0.9.3 : Read-Errors using iser transport
Robin Humble
robin.humble+stgt at anu.edu.au
Sun Feb 22 14:41:37 CET 2009
On Sun, Feb 22, 2009 at 02:53:00PM +0200, Or Gerlitz wrote:
> Dr. Volker Jaenisch wrote:
>> every combination that I've tried when there are multiple
>> simultaneous readers
>> Reproduced that. On a single core, more than one simultaneous thread
>> accessing the LUN over iSER also gives read errors.
> OK, Thanks a lot for doing all this testing / bug hunting work.
>
> I read the Feb 2008 "iser multiple readers" thread and wasn't sure if /
> what was the conclusion.
just to chime in here - I don't think there was any conclusion from 12
months ago... as I was the only one seeing problems at that time, it
(quite rightly) couldn't be ruled out that there was something odd with
our machines/setup.
now that other people are seeing problems too, the chances that the
problem is real and of finding a fix are better.
however, it turns out that we don't need iSER in production any time
soon, so I haven't been spending any time on it. but let me know if you
want me to test a fix and I'll try to find time to break it :-)
cheers,
robin
> On the one hand, Robin reported that the patch that slows tgt down, so
> that it does not send the SCSI response before the RDMA write has
> completed, eliminated the error. OTOH Pete was doing some analysis of
> the errors, and at
> http://lists.berlios.de/pipermail/stgt-devel/2008-February/001379.html
> said
>> "The offsets are always positive, which fits in with the theory that
>> future RDMAs are overwriting earlier ones. This goes against the
>> theory in your (my) patch, which guesses that the SCSI response message
>> is sneaking ahead of RDMA operations."
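
One way to measure offsets like the ones Pete describes is to read back a self-describing test pattern and report, for every wrong block, how far away its contents really belong. The sketch below is only an illustration and not the tool used in that thread: the 512-byte block size, the pattern layout, and the names make_pattern and misplaced_blocks are all assumptions made for this example. A positive delta means the block holds data that belongs later in the file.

```python
import struct

BLOCK = 512  # assumed block size, chosen only for this illustration


def make_pattern(nblocks):
    """Build a buffer in which every block starts with its own byte offset."""
    buf = bytearray(nblocks * BLOCK)
    for i in range(nblocks):
        struct.pack_into("<Q", buf, i * BLOCK, i * BLOCK)
    return bytes(buf)


def misplaced_blocks(data):
    """Return (position, found_offset - position) for every wrong block."""
    bad = []
    usable = len(data) - len(data) % BLOCK  # ignore a trailing partial block
    for pos in range(0, usable, BLOCK):
        (found,) = struct.unpack_from("<Q", data, pos)
        if found != pos:
            bad.append((pos, found - pos))
    return bad


# Simulate a read in which the buffer for block 2 was overwritten
# with data that belongs to block 5 (a later transfer).
good = make_pattern(8)
corrupt = bytearray(good)
corrupt[2 * BLOCK:3 * BLOCK] = good[5 * BLOCK:6 * BLOCK]

print(misplaced_blocks(bytes(corrupt)))  # [(1024, 1536)]
```

In this simulated case the delta is positive, which is the signature one would expect if a later transfer landed in an earlier buffer.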
>
> and that is where the discussion of a possible relation between this
> error and FMRs starts. Pete suggested disabling FMRs to see whether the
> problem persists; I wasn't sure if you did that.
>
>> My guess is that the AMD hyper-transport may interfere with the FMR.
>> But I am no Linux memory management specialist .. so please correct me
>> if I am wrong. Maybe the following happens: Booted with one CPU, all
>> FMR requests go to the 16GB RAM this single CPU directly addresses
>> via its memory controller. In case of more than one active CPU the
>> memory is fetched from both CPUs' memory controllers, with preference
>> to local memory. In seldom cases the memory manager fetches memory for
>> the FMR process running on CPU0 from the CPU1 via the hyper-transport
>> channel and something weird happens.
> To make sure we are on the same page (...) here: FMR (Fast Memory
> Registration) is a means to register with the HCA a (say) arbitrary list
> of pages to be used for an I/O. This page SG (scatter-gather) list was
> allocated and provided by the SCSI midlayer to the iSER SCSI LLD
> (low-level driver) through the queuecommand interface. So I read your
> comment as saying that when using one CPU and/or a system with one
> memory controller, all I/Os are served with pages from the "same memory",
> whereas when this is not the case, something gets broken.
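
To make the registration idea above concrete, here is a toy model of what such a mapping provides: a list of scattered pages is presented as one contiguous I/O address range, and each address in that range resolves to a location in one of the pages. This is a minimal sketch under stated assumptions, not the kernel or HCA implementation: the class ToyFmrMapping, the 4096-byte page size, and all addresses are made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size, chosen only for this illustration


class ToyFmrMapping:
    """Hypothetical model: scattered pages seen as one contiguous I/O range."""

    def __init__(self, iova_base, page_addrs):
        self.iova_base = iova_base          # start of the contiguous I/O range
        self.page_addrs = list(page_addrs)  # page addresses, in SG-list order
        self.length = len(self.page_addrs) * PAGE_SIZE

    def translate(self, iova):
        """Resolve an I/O address to the address inside the backing page."""
        off = iova - self.iova_base
        if not 0 <= off < self.length:
            raise ValueError("address outside the mapped range")
        return self.page_addrs[off // PAGE_SIZE] + off % PAGE_SIZE


# Three non-contiguous pages mapped at I/O address 0x10000.
m = ToyFmrMapping(0x10000, [0x7000, 0x3000, 0x9000])
print(hex(m.translate(0x10000 + PAGE_SIZE + 8)))  # 0x3008
```

The point of the model is that the pages in the list may come from anywhere in memory, including memory attached to different controllers, while the I/O sees a single range.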
>
> I'm not sure I follow the sentence "In seldom cases the memory
> manager fetches memory for the FMR process running on CPU0 from the CPU1
> via the hyper-transport channel and something weird happens" - can you
> explain a bit what you were referring to?
>
> Or.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stgt" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html