Hi Tomasz, I could not get that configuration to work. Can you please provide more detailed instructions exactly how to set up hosts A B and C so I can try to reproduce it. Please provide the exact commandline for each and every command I need to run on the three hosts and Ill try to reproduce it under gdb. On Fri, Jun 27, 2008 at 1:51 AM, Tomasz Chmielewski <mangoo at wpkg.org> wrote: > Tomasz Chmielewski schrieb: >> >> ronnie sahlberg schrieb: >>> >>> Since it is reproducible, can you run it under gdb and get a backtrace? >> >> Unfortunately not. >> >> The disk is now totally failed - some fifteen minutes ago I was still able >> to read from it for a couple of minutes after plugging it in. >> >> Now I can't read from it any more after plugging it in, so I can't add it >> to RAID. > > OK, I found a way to reproduce it without using any broken disks. > > You will need these to build a RAID-1 device: > - losetup or a separate partition/disk > - 3 machines (1 target, 1 target+initiator, 1 initiator) > > > If something was not clear enough, please just ask. > > > 1. Create a ~100 MB empty file (or bigger, if you want): > > dd if=/dev/zero of=raid-a bs=1k count=104857 > > > 2. Make a block device out of it: > > losetup /dev/loop0 raid-a > > > 3. Log in to another target - lets say the drive will be called /dev/sdh > (substitute the target address etc.): > > iscsiadm -m node -T iqn......test -p 10.1.1.1 -l > > > 4. Build RAID-1 (above target is now /dev/sdz): > > mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/loop0 /dev/sdh > > > 5. Wait a bit until it's synced (see /proc/mdstat), and make this target > available to initiators (substitute the values to match your own): > > tgtadm --lld iscsi --op new --mode target --tid 3 -T > iqn.2008-06.net.syneticon:superthecus.test > tgtadm --lld iscsi --op new --mode logicalunit --tid 3 --lun 1 -b /dev/md1 > tgtadm --lld iscsi --op bind --mode target --tid 3 -I ALL > > > 6. Start watching the syslog (on the machine where tgtd will crash): > > tail -f /var/log/syslog > > > 7. From yet another machine (initiator), log in to that target, and read > data from it (let's say the target is now /dev/sdb): > > iscsiadm -m node -T iqn.2008-06.net.syneticon:superthecus.test -p 10.1.1.2 > -l > > while true; do dd if=/dev/sdb of=/dev/null bs=1k; done > > > 8. And immediately block the traffic between this machine (where we will > make tgtd fail after a drive failed in RAID) and the other target > (immediately, because if you have lots of RAM and the whole device is > cached, it won't be kicked out of RAID very fast): > > iptables -A INPUT -s 10.1.1.1 -p tcp --sport 3260 -j DROP > > > The connection will fail after 120 seconds with standard open-iscsi > settings. > > After blocking the traffic, the syslog says (with my comments): > > > This is where we can't communicate with the other target (which is a part of > RAID): > > Jun 26 17:39:43 superthecus iscsid: Nop-out timedout after 15 seconds on > connection 6:0 state (3). Dropping session. > > > When a disk in RAID-1 is failing, there is no IO to the whole array for > ~minute - tgtd notices it[1]: > > Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a04 0 > Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a05 0 > Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found 0 0 > Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1c e > Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1e e > > > The disk is finally failed: > > Jun 26 17:41:43 superthecus kernel: session6: session recovery timed out > after 120 secs > Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 > driverbyte=0x00 > Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector > 44256 > Jun 26 17:41:43 superthecus kernel: printk: 46 messages suppressed. > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44256 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44264 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44272 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44280 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44288 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44296 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44304 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44312 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44320 > Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44328 > Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 > driverbyte=0x00 > Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector > 44000 > Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 > driverbyte=0x00 > Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector > 44256 > Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01 > driverbyte=0x00 > Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector > 44256 > Jun 26 17:41:43 superthecus kernel: raid1: Disk failure on sdh, disabling > device. > Jun 26 17:41:43 superthecus kernel: ^IOperation continuing on 1 devices > Jun 26 17:41:43 superthecus mdadm: Fail event detected on md device > /dev/md1, component device /dev/.static/dev/.static/dev/sdh > Jun 26 17:41:43 superthecus kernel: RAID1 conf printout: > Jun 26 17:41:43 superthecus kernel: --- wd:1 rd:2 > Jun 26 17:41:43 superthecus kernel: disk 0, wo:0, o:1, dev:loop0 > Jun 26 17:41:43 superthecus kernel: disk 1, wo:1, o:0, dev:sdh > Jun 26 17:41:43 superthecus kernel: RAID1 conf printout: > Jun 26 17:41:43 superthecus kernel: --- wd:1 rd:2 > Jun 26 17:41:43 superthecus kernel: disk 0, wo:0, o:1, dev:loop0 > Jun 26 17:42:43 superthecus mdadm: SpareActive event detected on md device > /dev/md1, component device /dev/.static/dev/.static/dev/sdh > > > We have only one tgtd instance: > > # ps -C tgtd > PID TTY TIME CMD > 6496 ? 00:00:00 tgtd > > > With the above method, "segfault" is not logged in the syslog. > > > > > Note that simply failing a disk in RAID (-f option to mdadm) doesn't make > tgtd crash. You may need to reproduce the above scenario 1-2 times to > actually see the issue (for 4 tries, tgtd failed 3 times for me). > > > [1] While testing it, tgtd crashed on the other target machine (with no > issues in RAID); on the same machine to which I was blocking traffic with > iptables: > > tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6 > > Definitely, it's not very healthy :( > > > > -- > Tomasz Chmielewski > http://wpkg.org > > |