[stgt] tgtd segfault with software RAID, hard resetting link

Tue Apr 21 14:06:27 CEST 2009

FUJITA Tomonori wrote:

>> So I rebooted the initiator without logging it out of the target, with 
>> "echo b >/proc/sysrq-trigger" (it's a diskless initiator, so basically 
>> that's the only method when its disks are gone).
>>
>>
>> Initiator started to boot again and I think tgtd segfaulted when the 
>> initiator tried to log in to the target.
> 
> Duh, seems that we have another problem.
> 
> Can you reproduce this by just rebooting the initiator without logging
> out and starting the initiator again?

It seems to be harder to cause it on purpose... but yes, it's reproducible.
It may or may not be a different problem.

1) On initiator, do:

# echo 3 > /sys/block/sda/device/timeout
# echo 3 > /sys/block/sdd/device/timeout 
# dd if=/dev/zero of=/mnt/iscsi/bigfile bs=64k

2) On the target, do (the drive is a member of the RAID array):

# i=1; while [ $i -ne 100 ] ; do echo $i; hdparm -Y /dev/sdd;  i=$((i+1)); done

3) If I/O errors appear on the initiator (this seems important), reboot it without logging out of the target:

# echo b >/proc/sysrq-trigger

4) The initiator will start booting and will connect to the target.
It won't be able to finish booting (the hdparm loop is still running on the target; some data is still in cache/dirty/writeback).

Interrupt the loop; with some luck, tgtd _may_ segfault.
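
The target-side loop from step 2 could also be wrapped so that it stops by itself once tgtd is gone. This is only a rough sketch, not what I actually ran: DISK, MAX_ITER, DRY_RUN and sleep_disk_loop are names invented for this example, and it assumes pidof is available on the target. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: DISK, MAX_ITER and DRY_RUN are made-up names for this example.
DISK=${DISK:-/dev/sdd}      # RAID member to put to sleep (as in step 2)
MAX_ITER=${MAX_ITER:-99}    # the original loop runs for i = 1..99
DRY_RUN=${DRY_RUN:-1}       # 1 = only print the commands, 0 = really call hdparm

sleep_disk_loop() {
    i=1
    while [ "$i" -le "$MAX_ITER" ]; do
        # In a real run, stop as soon as tgtd is no longer running.
        if [ "$DRY_RUN" != 1 ] && ! pidof tgtd >/dev/null; then
            echo "tgtd gone after $((i - 1)) iterations"
            return 1
        fi
        echo "$i"
        if [ "$DRY_RUN" = 1 ]; then
            echo "would run: hdparm -Y $DISK"
        else
            hdparm -Y "$DISK"
        fi
        i=$((i + 1))
    done
    return 0
}
```

Source the file and call sleep_disk_loop; set DRY_RUN=0 (as root) only on a machine where putting the disk to sleep is acceptable.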

--------------------

While trying to reproduce it, I ran this on the initiator (both are iSCSI disks):

# echo 3 > /sys/block/sda/device/timeout
# echo 3 > /sys/block/sdd/device/timeout 

Then, on the target:

# i=1; while [ $i -ne 100 ] ; do echo $i; hdparm -Y /dev/sdd;  i=$((i+1)); done

And it segfaulted after ~30 iterations (happened only once; no initiator reboot needed):

Apr 21 13:26:07 megathecus tgtd: conn_close(100) connection closed, 0x81a791c 1
Apr 21 13:27:12 megathecus tgtd: abort_task_set(988) found 51 0
Apr 21 13:27:12 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:12 megathecus tgtd: abort_cmd(964) found 21 e
Apr 21 13:27:20 megathecus tgtd: abort_task_set(988) found 39 0
Apr 21 13:27:20 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:20 megathecus tgtd: abort_cmd(964) found 73 e
Apr 21 13:27:41 megathecus tgtd: conn_close(100) connection closed, 0x81a791c 3
Apr 21 13:27:41 megathecus tgtd: conn_close(106) sesson 0x81a7d70 1
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000051 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000041 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000050 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000043 e
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000040 e
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000045 e
Apr 21 13:27:58 megathecus tgtd: abort_task_set(988) found 10000072 0
Apr 21 13:27:58 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:58 megathecus tgtd: abort_cmd(964) found 1000007e e
Apr 21 13:28:07 megathecus tgtd: conn_close(100) connection closed, 0x81a700c 4
Apr 21 13:28:07 megathecus tgtd: conn_close(106) sesson 0x81a71f0 1
Apr 21 13:28:09 megathecus kernel: tgtd[21360]: segfault at 0 ip 080546d6 sp 6cc1c340 error 4 in tgtd[8048000+24000]
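
For reading the kernel line above: "segfault at 0" is a NULL pointer access, and "error 4" is a user-mode read of a page that is not mapped. The fields can be pulled apart with a small shell sketch like the one below; segv_fields and segv_offset are names made up for this example, and the sed pattern only assumes the line format shown in this log.

```shell
#!/bin/sh
# Sketch: decode a kernel "segfault at ..." line (function names are made up).

# Print "fault_address ip mapping_base" as three hex fields.
segv_fields() {
    printf '%s\n' "$1" | sed -n \
        's/.*segfault at \([0-9a-f]*\) ip \([0-9a-f]*\) .*\[\([0-9a-f]*\)+[0-9a-f]*]$/\1 \2 \3/p'
}

# Offset of the faulting instruction inside the mapped binary: ip - base.
segv_offset() {
    set -- $(segv_fields "$1")
    [ "$#" -eq 3 ] || return 1      # line did not match the expected format
    printf '0x%x\n' $(( 0x$2 - 0x$3 ))
}
```

For the line above this gives ip 080546d6 and base 8048000, so the offset is 0xc6d6. With an unstripped tgtd binary, something like "addr2line -f -e <path to tgtd> 0x080546d6" should then name the function; since the binary is mapped at 8048000 it is probably not position-independent, in which case the absolute ip is the address to look up.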

-- 
Tomasz Chmielewski
http://wpkg.org
