[stgt] tgtd segfault during heavy I/O

Mon Jul 4 17:36:39 CEST 2011

Dear Tomonori,

We are hitting a segfault in tgtd under heavy I/O. I hope you can offer some suggestions.

[Setting]
There are 7 machines. Each machine runs one VM, and each VM uses 10
targets on tgtd. Each machine is equipped with 1Gb network cards.
So there are at least 70 volumes on tgtd.

tgtd (1.0.16) is running on a machine with two bonded 10GbE cards.
LVM logical volumes are used as the backing store for the targets.
(The physical volume is on software RAID 5.)

Both initiator side and target side are running CentOS 5.4.

I tried to set up the system so that a core dump is generated when the
problem hits. The core dump file seems incomplete: its reported size is
over 8G, but it only uses about 30~50M of disk space.
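A note on the size mismatch: a file whose reported size is far larger than the space it occupies is usually a sparse file, and core dumps on Linux are often written that way, so the mismatch alone may not mean the dump was cut short. The sketch below is one way to compare the two numbers; the helper name file_sizes is made up for illustration, and the 512-byte unit for st_blocks is the Linux convention.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Report the apparent size (st_size) and the space actually allocated
 * on disk (st_blocks, counted in 512-byte units on Linux) for a file.
 * Returns 0 on success, -1 if stat() fails.
 * file_sizes is a hypothetical helper name, not part of tgtd.
 */
static int file_sizes(const char *path, long long *apparent,
                      long long *allocated)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    *apparent = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;
    return 0;
}
```

If the allocated value is much smaller than the apparent value, the file has holes. That would match a core file reported as 8G+ while using only 30~50M on disk.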

So I attached gdb to a debug build (make DEBUG=1) of tgtd instead.
(The problem is much easier to reproduce during a heavy I/O test with
an optimized build of tgtd (-O2).)
When the problem occurred, I got the following backtrace (only the last
part is pasted):
============
..
[New Thread 0x2aabbaa5d940 (LWP 20176)]
[New Thread 0x2aabbb45e940 (LWP 20177)]
[New Thread 0x2aabbbe5f940 (LWP 20227)]
[New Thread 0x2aabbc860940 (LWP 20228)]
[New Thread 0x2aabbd261940 (LWP 20229)]
[New Thread 0x2aabbdc62940 (LWP 20230)]
[New Thread 0x2aabbe663940 (LWP 20258)]
[New Thread 0x2aabbf064940 (LWP 20259)]
[New Thread 0x2aabbfa65940 (LWP 20265)]
[New Thread 0x2aabc0466940 (LWP 20266)]

Program received signal SIGSEGV, Segmentation fault.
0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at iscsi/iscsid.c:1524
1524                    if (task->tag == req->itt)
(gdb) bt
#0  0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at iscsi/iscsid.c:1524
#1  0x0000000000409360 in iscsi_task_rx_start (conn=0x10f26028) at iscsi/iscsid.c:1729
#2  0x0000000000409d42 in iscsi_rx_handler (conn=0x10f26028) at iscsi/iscsid.c:1986
#3  0x0000000000411ba6 in iscsi_tcp_event_handler (fd=445, events=5, data=0x10f26028) at iscsi/iscsi_tcp.c:158
#4  0x0000000000417365 in event_loop () at tgtd.c:454
#5  0x0000000000417a16 in main (argc=1, argv=0x7fffd5eb9a98) at tgtd.c:640
(gdb)

(gdb) print task
$5 = (struct iscsi_task *) 0xffffffffffffff90
(gdb) print req
$6 = (struct iscsi_data *) 0x10f26148
(gdb)

(gdb) p task->req
Cannot access memory at address 0xffffffffffffff90
(gdb) p task->rsp
Cannot access memory at address 0xffffffffffffffc0
(gdb) p task->tag
Cannot access memory at address 0xfffffffffffffff0

(gdb) p req->opcode
$30 = 5 '\005'
(gdb) p req->flags
$31 = 128 '\200'
(gdb) p req->rsvd2
$32 =   "\000"

============
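One observation on the pointer values in the gdb session: 0xffffffffffffff90 is -0x70 as a 64-bit value, and the failing accesses (req at ...ff90, rsp at ...ffc0, tag at ...fff0) put the fields at offsets 0x00, 0x30 and 0x60. That pattern is what a container_of-style list walk produces when the list node pointer itself is NULL and the node sits at offset 0x70 in the structure. The sketch below only demonstrates the arithmetic. struct demo_task, its field sizes and entry_addr_from_node are assumptions chosen to match the printed addresses, not the real tgtd definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal doubly linked list node, as used by intrusive list helpers. */
struct list_node {
    struct list_node *next, *prev;
};

/*
 * HYPOTHETICAL stand-in for struct iscsi_task. Only the offsets matter:
 * they are picked so that tag lands at 0x60 and the list node at 0x70,
 * matching the addresses gdb printed. The real layout may differ.
 */
struct demo_task {
    char req[0x30];          /* offset 0x00 */
    char rsp[0x30];          /* offset 0x30 */
    uint64_t tag;            /* offset 0x60 */
    uint64_t pad;            /* offset 0x68 */
    struct list_node link;   /* offset 0x70 */
};

_Static_assert(offsetof(struct demo_task, tag) == 0x60, "tag offset");
_Static_assert(offsetof(struct demo_task, link) == 0x70, "link offset");

/*
 * Address of the enclosing entry for a given node address. Done in
 * unsigned 64-bit integer arithmetic so the demo never performs
 * pointer arithmetic on NULL; the subtraction wraps around.
 */
static uint64_t entry_addr_from_node(uint64_t node_addr)
{
    return node_addr - (uint64_t)offsetof(struct demo_task, link);
}

/* Address that a read of entry->tag would touch. */
static uint64_t tag_addr_from_node(uint64_t node_addr)
{
    return entry_addr_from_node(node_addr)
           + (uint64_t)offsetof(struct demo_task, tag);
}
```

With a node address of 0, entry_addr_from_node gives 0xffffffffffffff90 and tag_addr_from_node gives 0xfffffffffffffff0, the same addresses gdb refused to read. So the loop may have followed a NULL next pointer, for example from an entry that had already been unlinked. This is only a guess from the numbers and has not been checked against the tgtd source.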

The system log can be downloaded from here:
http://dl.dropbox.com/u/8354750/tgtd/20110704/messages

It seems that *task* is freed and then referenced again.
I hope to get some feedback.
Thanks a lot.

--
Kiefer Chang