Tomasz Chmielewski wrote:
> ronnie sahlberg wrote:
>> Since it is reproducible, can you run it under gdb and get a backtrace?
>
> Unfortunately not.
>
> The disk has now failed completely - some fifteen minutes ago I was
> still able to read from it for a couple of minutes after plugging it in.
>
> Now I can't read from it any more after plugging it in, so I can't add
> it to the RAID.

OK, I found a way to reproduce it without using any broken disks.

You will need the following to build a RAID-1 device:

- losetup, or a separate partition/disk
- 3 machines (1 target, 1 target+initiator, 1 initiator)

If something is not clear enough, please just ask.

1. Create a ~100 MB empty file (or bigger, if you want):

   dd if=/dev/zero of=raid-a bs=1k count=104857

2. Make a block device out of it:

   losetup /dev/loop0 raid-a

3. Log in to another target - let's say the drive will be called /dev/sdh
   (substitute the target address etc.):

   iscsiadm -m node -T iqn......test -p 10.1.1.1 -l

4. Build the RAID-1 array (the target logged in to above is now /dev/sdh):

   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/loop0 /dev/sdh

5. Wait a bit until it's synced (see /proc/mdstat), then make this array
   available to initiators (substitute the values to match your own):

   tgtadm --lld iscsi --op new --mode target --tid 3 -T iqn.2008-06.net.syneticon:superthecus.test
   tgtadm --lld iscsi --op new --mode logicalunit --tid 3 --lun 1 -b /dev/md1
   tgtadm --lld iscsi --op bind --mode target --tid 3 -I ALL

6. Start watching the syslog (on the machine where tgtd will crash):

   tail -f /var/log/syslog

7. From yet another machine (the initiator), log in to that target and
   read data from it (let's say the target is now /dev/sdb):

   iscsiadm -m node -T iqn.2008-06.net.syneticon:superthecus.test -p 10.1.1.2 -l
   while true; do dd if=/dev/sdb of=/dev/null bs=1k; done

8.
Immediately block the traffic between this machine (where tgtd will
crash after a drive has failed in the RAID) and the other target
(immediately, because if you have lots of RAM and the whole device is
cached, it won't be kicked out of the RAID very fast):

   iptables -A INPUT -s 10.1.1.1 -p tcp --sport 3260 -j DROP

The connection will fail after 120 seconds with standard open-iscsi
settings (the default node.session.timeo.replacement_timeout).

After blocking the traffic, the syslog says (with my comments):

This is where we can't communicate with the other target (which is part
of the RAID):

Jun 26 17:39:43 superthecus iscsid: Nop-out timedout after 15 seconds on connection 6:0 state (3). Dropping session.

When a disk in RAID-1 is failing, there is no I/O to the whole array for
about a minute - tgtd notices it [1]:

Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a04 0
Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a05 0
Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found 0 0
Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1c e
Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1e e

The disk is finally failed:

Jun 26 17:41:43 superthecus kernel: session6: session recovery timed out after 120 secs
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: printk: 46 messages suppressed.
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44256
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44264
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44272
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44280
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44288
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44296
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44304
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44312
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44320
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44328
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44000
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: raid1: Disk failure on sdh, disabling device.
Jun 26 17:41:43 superthecus kernel: ^IOperation continuing on 1 devices
Jun 26 17:41:43 superthecus mdadm: Fail event detected on md device /dev/md1, component device /dev/.static/dev/.static/dev/sdh
Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
Jun 26 17:41:43 superthecus kernel:  disk 1, wo:1, o:0, dev:sdh
Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
Jun 26 17:42:43 superthecus mdadm: SpareActive event detected on md device /dev/md1, component device /dev/.static/dev/.static/dev/sdh

We have only one tgtd instance:

# ps -C tgtd
  PID TTY          TIME CMD
 6496 ?        00:00:00 tgtd

With this method, no "segfault" line is logged in the syslog.

Note that simply failing a disk in the RAID (the -f option to mdadm)
doesn't make tgtd crash.

You may need to reproduce the above scenario 1-2 times to actually see
the issue (out of 4 tries, tgtd failed 3 times for me).

[1] While testing this, tgtd also crashed on the other target machine
(which had no RAID issues) - the same machine to which I was blocking
traffic with iptables:

tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6

Definitely, it's not very healthy :(

--
Tomasz Chmielewski
http://wpkg.org
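P.S. A small follow-up sketch, not part of the report above: even without
a gdb backtrace, the kernel's segfault line already carries the
instruction pointer, which can be extracted and resolved with addr2line.
The /usr/sbin/tgtd path and an unstripped (built with -g) binary are
assumptions here, not something confirmed by the report.

```shell
# Sketch: pull the instruction pointer (rip) out of the segfault line
# so it can later be fed to addr2line against an unstripped tgtd binary.
line='tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6'
rip=$(printf '%s\n' "$line" | sed -n 's/.* rip \([0-9a-f]*\) .*/\1/p')
echo "rip=$rip"
# On a machine that still has the matching tgtd binary (assumed path):
#   addr2line -e /usr/sbin/tgtd "$rip"
```

The fault address 8 in that line also hints at a dereference through a
NULL struct pointer (an access at offset 8 into a structure), which would
fit tgtd touching a command/task structure after the aborts.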