[stgt] tgtd segfault during heavy I/O

Sun Jul 10 10:59:07 CEST 2011

On Mon, 4 Jul 2011 23:36:39 +0800
Kiefer Chang <zapchang at gmail.com> wrote:

> Dear Tomonori,
> 
> We got segfault error on heavy I/O. Hope you can give some suggestion.
> 
> [Setting]
> 7 machines, each machine runs a VM and each VM uses 10 targets on
> tgtd. Machine equips 1GB cards.
> So there will be at least 70+ volumes on tgtd.
> 
> The tgtd (1.0.16) is running on a machine with two 10GBe cards bonded.
> For setting up backing store of target, LVM logical volumes are used.
> (Physical volume is on software RAID 5)
> 
> Both initiator side and target side are running CentOS 5.4.
> 
> I tried to setting up the system so core-dump can be generated when
> problem hit. The core dump file seems incomplete, file is 8G+ bigger,
> but only use about 30~50M disk capacity.
> 
> So I try to use gdb to attach to a debug build (make DEBUG=1) of tgtd.
> (The symptom is much easier to be reproduced during heavy I/O test and
> with optimized build of tgtd (-o2).)
> When symptom shows, I got the following backtraces: (only the latest
> part is pasted)
> ============
> ..
> [New Thread 0x2aabbaa5d940 (LWP 20176)]
> [New Thread 0x2aabbb45e940 (LWP 20177)]
> [New Thread 0x2aabbbe5f940 (LWP 20227)]
> [New Thread 0x2aabbc860940 (LWP 20228)]
> [New Thread 0x2aabbd261940 (LWP 20229)]
> [New Thread 0x2aabbdc62940 (LWP 20230)]
> [New Thread 0x2aabbe663940 (LWP 20258)]
> [New Thread 0x2aabbf064940 (LWP 20259)]
> [New Thread 0x2aabbfa65940 (LWP 20265)]
> [New Thread 0x2aabc0466940 (LWP 20266)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1524
> 1524                    if (task->tag == req->itt)
> (gdb) bt
> #0  0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1524
> #1  0x0000000000409360 in iscsi_task_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1729
> #2  0x0000000000409d42 in iscsi_rx_handler (conn=0x10f26028) at
> iscsi/iscsid.c:1986
> #3  0x0000000000411ba6 in iscsi_tcp_event_handler (fd=445, events=5,
> data=0x10f26028) at iscsi/iscsi_tcp.c:158
> #4  0x0000000000417365 in event_loop () at tgtd.c:454
> #5  0x0000000000417a16 in main (argc=1, argv=0x7fffd5eb9a98) at tgtd.c:640
> (gdb)
> 
> (gdb) print task
> $5 = (struct iscsi_task *) 0xffffffffffffff90
> (gdb) print req
> $6 = (struct iscsi_data *) 0x10f26148
> (gdb)
> 
> (gdb) p task->req
> Cannot access memory at address 0xffffffffffffff90
> (gdb) p task->rsp
> Cannot access memory at address 0xffffffffffffffc0
> (gdb) p task->tag
> Cannot access memory at address 0xfffffffffffffff0
> 
> 
> (gdb) p req->opcode
> $30 = 5 '\005'
> (gdb) p req->flags
> $31 = 128 '\200'
> (gdb) p req->rsvd2
> $32 =   "\000"
> 
> ============
> 
> The system log can be downloaded from here:
> http://dl.dropbox.com/u/8354750/tgtd/20110704/messages
> 
> Seems *task* is freed and referenced again.

This is related with tmf (aborting task, etc)? Your next report is.
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html