Tomasz Chmielewski wrote:
> ronnie sahlberg wrote:
>> Since it is reproducible, can you run it under gdb and get a backtrace?
>
> Unfortunately not.
>
> The disk has now failed completely - some fifteen minutes ago I was
> still able to read from it for a couple of minutes after plugging it in.
>
> Now I can't read from it any more after plugging it in, so I can't add
> it to the RAID.

OK, I found a way to reproduce it without using any broken disks.

You will need the following to build a RAID-1 device:

- losetup, or a separate partition/disk
- 3 machines (1 target, 1 target+initiator, 1 initiator)

If something is not clear enough, please just ask.

1. Create a ~100 MB empty file (or bigger, if you want):

   dd if=/dev/zero of=raid-a bs=1k count=104857

2. Make a block device out of it:

   losetup /dev/loop0 raid-a

3. Log in to another target - let's say the drive will be called /dev/sdh
   (substitute the target address etc.):

   iscsiadm -m node -T iqn......test -p 10.1.1.1 -l

4. Build the RAID-1 array (the target logged in to above is now /dev/sdh):

   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/loop0 /dev/sdh

5. Wait a bit until it's synced (see /proc/mdstat), then make this array
   available to initiators (substitute the values to match your own):

   tgtadm --lld iscsi --op new --mode target --tid 3 -T iqn.2008-06.net.syneticon:superthecus.test
   tgtadm --lld iscsi --op new --mode logicalunit --tid 3 --lun 1 -b /dev/md1
   tgtadm --lld iscsi --op bind --mode target --tid 3 -I ALL

6. Start watching the syslog (on the machine where tgtd will crash):

   tail -f /var/log/syslog

7. From yet another machine (the initiator), log in to that target and
   read data from it (let's say the target is now /dev/sdb):

   iscsiadm -m node -T iqn.2008-06.net.syneticon:superthecus.test -p 10.1.1.2 -l
   while true; do dd if=/dev/sdb of=/dev/null bs=1k; done

8.
Immediately block the traffic between this machine (where tgtd will
crash after a drive has failed in the RAID) and the other target
(immediately, because if you have lots of RAM and the whole device is
cached, it won't be kicked out of the RAID very fast):

   iptables -A INPUT -s 10.1.1.1 -p tcp --sport 3260 -j DROP

The connection will fail after 120 seconds with standard open-iscsi
settings (the default node.session.timeo.replacement_timeout).

After blocking the traffic, the syslog says (with my comments):

This is where we can't communicate with the other target (which is part
of the RAID):

Jun 26 17:39:43 superthecus iscsid: Nop-out timedout after 15 seconds on connection 6:0 state (3). Dropping session.

When a disk in RAID-1 is failing, there is no I/O to the whole array for
about a minute - tgtd notices it [1]:

Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a04 0
Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a05 0
Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found 0 0
Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1c e
Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1e e

The disk is finally failed:

Jun 26 17:41:43 superthecus kernel: session6: session recovery timed out after 120 secs
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: printk: 46 messages suppressed.
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44256
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44264
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44272
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44280
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44288
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44296
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44304
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44312
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44320
Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44328
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44000
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 driverbyte=0x00
Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector 44256
Jun 26 17:41:43 superthecus kernel: raid1: Disk failure on sdh, disabling device.
Jun 26 17:41:43 superthecus kernel: ^IOperation continuing on 1 devices
Jun 26 17:41:43 superthecus mdadm: Fail event detected on md device /dev/md1, component device /dev/.static/dev/.static/dev/sdh
Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
Jun 26 17:41:43 superthecus kernel:  disk 1, wo:1, o:0, dev:sdh
Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
Jun 26 17:42:43 superthecus mdadm: SpareActive event detected on md device /dev/md1, component device /dev/.static/dev/.static/dev/sdh

We have only one tgtd instance:

# ps -C tgtd
  PID TTY          TIME CMD
 6496 ?        00:00:00 tgtd

With this method, no "segfault" line is logged in the syslog.

Note that simply failing a disk in the RAID (the -f option to mdadm)
doesn't make tgtd crash.

You may need to reproduce the above scenario 1-2 times to actually see
the issue (out of 4 tries, tgtd failed 3 times for me).

[1] While testing this, tgtd also crashed on the other target machine
(which had no RAID issues) - the same machine to which I was blocking
traffic with iptables:

tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6

Definitely, it's not very healthy :(

--
Tomasz Chmielewski
http://wpkg.org
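P.S. A small follow-up sketch, not part of the report above: even without
a gdb backtrace, the kernel's segfault line already carries the
instruction pointer, which can be extracted and resolved with addr2line.
The /usr/sbin/tgtd path and an unstripped (built with -g) binary are
assumptions here, not something confirmed by the report.

```shell
# Sketch: pull the instruction pointer (rip) out of the segfault line
# so it can later be fed to addr2line against an unstripped tgtd binary.
line='tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6'
rip=$(printf '%s\n' "$line" | sed -n 's/.* rip \([0-9a-f]*\) .*/\1/p')
echo "rip=$rip"
# On a machine that still has the matching tgtd binary (assumed path):
#   addr2line -e /usr/sbin/tgtd "$rip"
```

The fault address 8 in that line also hints at a dereference through a
NULL struct pointer (an access at offset 8 into a structure), which would
fit tgtd touching a command/task structure after the aborts.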