[stgt] stgt: how to avoid WARNING at block/blk-core.c:1080

Thu Aug 5 21:28:09 CEST 2010

On 8/5/10 1:09 AM, FUJITA Tomonori wrote:
> On Tue, 03 Aug 2010 17:23:32 -0700
> Joe Eykholt<jeykholt at cisco.com>  wrote:
>
>> On 8/3/10 12:40 PM, Joe Eykholt wrote:
>>> On 7/31/10 6:49 PM, FUJITA Tomonori wrote:
>>>> On Fri, 30 Jul 2010 10:32:34 -0700
>>>> Joe Eykholt<jeykholt at cisco.com>  wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm working on a target module for scsi_tgt to work with libfc / fcoe
>>>>> in Linux 2.6.33.5. I'm missing some step in handling completion of
>>>>> a request.
>>>>>
>>>>> After each Inquiry or READ operation, in the transfer_response
>>>>> callback I send the data and response and then call the (*done)()
>>>>> routine.
>>>>>
>>>>> I figure I must be missing something, because I get this WARNING:
>>>>>
>>>>> [ 271.521650] ------------[ cut here ]------------
>>>>> [ 271.521658] WARNING: at block/blk-core.c:1080
>>>>> __blk_put_request+0x4f/0xbe()
>>>>> [ 271.521660] Hardware name:<snip>
>>>>> [ 271.521662] Modules linked in: libfc_tgt scsi_tgt nfs lockd nfs_acl
>>>>> auth_rpcgss autofs4 sunrpc ip6t_REJECT
>>>>> nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput
>>>>> fnic libfcoe libfc enic i2c_i801
>>>>> iTCO_wdt serio_raw scsi_transport_fc iTCO_vendor_support pcspkr
>>>>> e1000e shpchp radeon ttm drm_kms_helper drm
>>>>> i2c_algo_bit i2c_core
>>>>> [ 271.521705] Pid: 2769, comm: scsi_tgtd/2 Tainted: G W 2.6.33.5-ftgt #9
>>>>> [ 271.521708] Call Trace:
>>>>> [ 271.521715] [<ffffffff811696ae>] ? __blk_put_request+0x4f/0xbe
>>>>> [ 271.521720] [<ffffffff81041462>] warn_slowpath_common+0x77/0xa4
>>>>> [ 271.521724] [<ffffffff8104149e>] warn_slowpath_null+0xf/0x11
>>>>> [ 271.521728] [<ffffffff811696ae>] __blk_put_request+0x4f/0xbe
>>>>> [ 271.521735] [<ffffffffa02b14b7>] scsi_host_put_command+0x52/0x80
>>>>> [scsi_tgt]
>>>>> [ 271.521740] [<ffffffffa02b14e5>] ? scsi_tgt_cmd_destroy+0x0/0x3f
>>>>> [scsi_tgt]
>>>>> [ 271.521745] [<ffffffffa02b151f>] scsi_tgt_cmd_destroy+0x3a/0x3f
>>>>> [scsi_tgt]
>>>>> [ 271.521751] [<ffffffff810513a0>] worker_thread+0x131/0x1bd
>>>>> [ 271.521756] [<ffffffff81054bfc>] ? autoremove_wake_function+0x0/0x38
>>>>> [ 271.521760] [<ffffffff8105126f>] ? worker_thread+0x0/0x1bd
>>>>> [ 271.521764] [<ffffffff810547fe>] kthread+0x7d/0x85
>>>>> [ 271.521770] [<ffffffff810099a4>] kernel_thread_helper+0x4/0x10
>>>>> [ 271.521774] [<ffffffff81054781>] ? kthread+0x0/0x85
>>>>> [ 271.521779] [<ffffffff810099a0>] ? kernel_thread_helper+0x0/0x10
>>>>> [ 271.521782] ---[ end trace 096013af4d0a82a0 ]---
>>>>>
>>>>> It's this code in block/blk-core.c:
>>>>>
>>>>> void __blk_put_request(struct request_queue *q, struct request *req)
>>>>> {
>>>>> if (unlikely(!q))
>>>>> return;
>>>>> if (unlikely(--req->ref_count))
>>>>> return;
>>>>>
>>>>> elv_completed_request(q, req);
>>>>>
>>>>> /* this is a bio leak */
>>>>> WARN_ON(req->bio != NULL);
>>>>>
>>>>> ...
>>>>>
>>>>> So, what should I be doing to tell the block layer that the request
>>>>> is done?
>>>>> Or, is there something that scsi_tgt should be doing to disassociate
>>>>> the bio
>>>>> from the req?
>>>>
>>>> Maybe we should call blk_end_request_all().
>>>>
>>>> But I don't think that bio is leaked in our case. So setting req->bio
>>>> to NULL might be fine.
>>>>
>>>> I can't access to my IBM POWER box now but I'll try later.
>>>
>>> I tried it with my module (not with ibmvstgt) and it seems to work better.
>>> Here's the patch I used (cut-and-paste from stg so for illustration only).
>>> Feel free to do something different. If setting rq->bio = NULL is OK,
>>> it might be better. I haven't researched this enough to know what's best.
>>>
>>> Author: Joe Eykholt<jeykholt at cisco.com>
>>> Date: Tue Aug 3 12:33:07 2010 -0700
>>>
>>> stgt: fix warning from __blk_put_request()
>>>
>>> Every ftgt I/O ending was getting a WARNING and stack dump
>>> from block/blk-core:1080 __blk_put_request() because a request
>>> still had a bio associated with it.
>>>
>>> Call blk_end_request_all() after unmapping the user pages.
>>>
>>> Signed-off-by: Joe Eykholt<jeykholt at cisco.com>
>>>
>>> diff --git a/drivers/scsi/scsi_tgt_lib.c b/drivers/scsi/scsi_tgt_lib.c
>>> index 1030327..a454ed0 100644
>>> --- a/drivers/scsi/scsi_tgt_lib.c
>>> +++ b/drivers/scsi/scsi_tgt_lib.c
>>> @@ -151,6 +151,7 @@ void scsi_host_put_command(struct Scsi_Host *shost,
>>> struct scsi_cmnd *cmd)
>>> kmem_cache_free(scsi_tgt_cmd_cache, tcmd);
>>>
>>> spin_lock_irqsave(q->queue_lock, flags);
>>> + __blk_end_request_all(rq, 0);
>>> __blk_put_request(q, rq);
>>> spin_unlock_irqrestore(q->queue_lock, flags);
>>>
>>
>> Although that seemed to work on 2.6.35, when I tried it on
>> the fcoe-next tree, which is 2.6.35-rc3, it caused a crash due to
>> blk_update_request referencing a freed bio.  So, maybe just setting
>> rq->bio to NULL is correct.  If we freed the bio, we should do that
>> at the same place.   Just wanted to let you know that there's a
>> problem with the patch at this point.
>
> I think that we need to call blk_end_request_all before calling
> blk_rq_unmap_user.
>
> I've just updated the kernel version to the latest git and confirmed
> that this patch works for my IBM POWER box.
>
> =
> From: FUJITA Tomonori<fujita.tomonori at lab.ntt.co.jp>
> Date: Thu, 5 Aug 2010 17:02:42 +0900
> Subject: [PATCH] tgt: fix warning
>
> We don't set rq->bio to NULL so we get the following warnings:
>
> Badness at /home/fujita/git/linux-2.6/block/blk-core.c:1108
> NIP: c0000000001bfb58 LR: c0000000001bfb48 CTR: c0000000001bfb08
>
> We don't leak bios (blk_rq_unmap_user should free them). We don't need
> to call blk_end_request_all but let's finish a request like everyone.
>
> Signed-off-by: FUJITA Tomonori<fujita.tomonori at lab.ntt.co.jp>
> ---
>   drivers/scsi/scsi_tgt_lib.c |    1 +
>   1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/scsi/scsi_tgt_lib.c b/drivers/scsi/scsi_tgt_lib.c
> index 66241dd..fff1b61 100644
> --- a/drivers/scsi/scsi_tgt_lib.c
> +++ b/drivers/scsi/scsi_tgt_lib.c
> @@ -184,6 +184,7 @@ static void scsi_tgt_cmd_destroy(struct work_struct *work)
>
>   	dprintk("cmd %p %d %u\n", cmd, cmd->sc_data_direction,
>   		rq_data_dir(cmd->request));
> +	blk_end_request_all(cmd->request, 0);
>   	scsi_unmap_user_pages(tcmd);
>   	scsi_host_put_command(scsi_tgt_cmd_to_host(cmd), cmd);

Both blk_end_request_all() and scsi_host_put_command() do
a __blk_put_request().

I get a fault in the above call to scsi_tgt_cmd_to_host() where
it derefs the request, presumably because the request has been
freed by blk_end_request_all().

This happens once, early in my LUN discovery sequence (maybe
an Inquiry or Report LUNs) and then it seems OK.  My system
goes on after the fault.

So, is there a case where we might not have started a request?
I don't see that.  I thought maybe a special case like
an I/O with no buffer and dir == DMA_NONE, like a Test Unit Ready.
Or maybe the request refcnt is not correct in some case?

I tried merely getting the host pointer before blk_end_request_all(),
but that just lead to another problem in scsi_host_put_command()
since it also refers to the request to get the tcmd.

	Regards,
	Joe
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html