[stgt] stgtd 0.9.3 : Read-Errors using iser transport

Sun Feb 22 14:41:37 CET 2009

On Sun, Feb 22, 2009 at 02:53:00PM +0200, Or Gerlitz wrote:
> Dr. Volker Jaenisch wrote:
>> every combination that I've tried when there are multiple
>> simultaneous readers Reproduced that. On a single core, more than one
>> simultaneous thread accessing the LUN over iSER also gives read
>> errors.
> OK, Thanks a lot for doing all this testing / bug hunting work.
>
> I read the Feb 2008 "iser multiple readers" thread and wasn't sure if /  
> what was the conclusion.

just to chime in here - I don't think there was any conclusion from 12
months ago... as I was the only one seeing problems at that time, it
(quite rightly) couldn't be ruled out that there was something odd with
our machines/setup.

now that other people are seeing problems too, the chances that the
problem is real and of finding a fix are better.

however, it turns out that we don't need iSER in production any time
soon, so I haven't been spending any time on it. but let me know if you
want me to test a fix and I'll try to find time to break it :-)
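
for anyone who wants to try reproducing this, here is a minimal sketch of
the kind of multi-reader check I mean. it is not the test we actually
ran: the block size, the pattern layout and all the function names are
assumptions made up for illustration. each block is stamped with its own
index, several threads read disjoint ranges at the same time, and any
block that comes back with the wrong stamp is reported together with the
index it really carries, so you can see which way the data moved.

```python
import struct
import threading

BLOCK = 4096  # bytes per block; an arbitrary choice for this sketch


def make_block(index):
    # Every 8 bytes of the block hold the block's own index, so a
    # misplaced block identifies where its data really came from.
    return struct.pack("<Q", index) * (BLOCK // 8)


def write_pattern(path, nblocks):
    # Fill the target with nblocks self-describing blocks.
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(make_block(i))


def verify_range(path, start, count, errors):
    # Read blocks [start, start + count) and record (expected, found)
    # for every block whose contents do not match its position.
    with open(path, "rb") as f:
        f.seek(start * BLOCK)
        for i in range(start, start + count):
            data = f.read(BLOCK)
            if data != make_block(i):
                found = struct.unpack_from("<Q", data)[0] if len(data) >= 8 else None
                errors.append((i, found))


def run_readers(path, nblocks, nthreads):
    # Start nthreads simultaneous readers over disjoint ranges.
    # Blocks left over when nblocks is not a multiple of nthreads
    # are not checked.
    errors = []
    per = nblocks // nthreads
    threads = [
        threading.Thread(target=verify_range, args=(path, t * per, per, errors))
        for t in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(errors)
```

on a real LUN you would point path at the block device and would want to
bypass the page cache so that the reads really go over the wire, which
this sketch does not do. a regular file is enough to check that the
script itself is sane.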

cheers,
robin

> OTOH Robin reported that the patch that slows down tgt so that it does
> not send the SCSI response before the RDMA write is completed
> eliminated the error, but OTOH Pete, who was doing some analysis of
> the errors @
> http://lists.berlios.de/pipermail/stgt-devel/2008-February/001379.html
> said
>> "The offsets are always positive, which fits in with the theory that
>> future RDMAs are overwriting earlier ones. This goes against the
>> theory in your (my) patch, which guesses that the SCSI response message
>> is sneaking ahead of RDMA operations."
>
> and here the discussion starts on a possible relation of this error to
> FMRs, where Pete suggested disabling FMRs to see if the problem
> persists. I wasn't sure if you did that.
>
>> My guess is that the AMD hyper-transport may interfere with the FMR.
>> But I am no Linux memory management specialist .. so please correct me
>> if I am wrong. Maybe the following happens: Booted with one CPU, all
>> FMR requests go to the 16GB RAM this single CPU directly addresses
>> via its memory controller. In case of more than one active CPU the
>> memory is fetched from both CPUs' memory controllers with preference
>> to local memory. In seldom cases the memory manager fetches memory for
>> the FMR process running on CPU0 from the CPU1 via the hyper-transport
>> channel and something weird happens.
> To make sure we are on the same page (...) here: FMR (Fast Memory
> Registration) is a means to register with the HCA a (say) arbitrary list
> of pages to be used for an I/O. This page SG (scatter-gather) list was
> allocated and provided by the SCSI midlayer to the iSER SCSI LLD
> (low-level driver) through the queuecommand interface. So I read your
> comment as saying that when using one CPU and/or a system with one
> memory controller all I/Os are served with pages from the "same memory",
> whereas when this doesn't happen, something gets broken.
>
> I wasn't sure I followed the sentence "In seldom cases the memory
> manager fetches memory for the FMR process running on CPU0 from the CPU1
> via the hyper-transport channel and something weird happens" - can you
> explain a bit what you were referring to?
>
> Or.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stgt" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html