[Stgt-devel] disk kicked out of RAID -> tgtd segmentation fault

Mon Jun 30 01:56:48 CEST 2008

Hi Tomasz,

I could not get that configuration to work.

Can you please provide more detailed instructions exactly how to set
up hosts A B and C
so I can try to reproduce it.

Please provide the exact commandline for each and every command I need
to run on the three hosts and Ill try to
reproduce it under gdb.

On Fri, Jun 27, 2008 at 1:51 AM, Tomasz Chmielewski <mangoo at wpkg.org> wrote:
> Tomasz Chmielewski schrieb:
>>
>> ronnie sahlberg schrieb:
>>>
>>> Since it is reproducible,  can you run it under gdb and get a backtrace?
>>
>> Unfortunately not.
>>
>> The disk is now totally failed - some fifteen minutes ago I was still able
>> to read from it for a couple of minutes after plugging it in.
>>
>> Now I can't read from it any more after plugging it in, so I can't add it
>> to RAID.
>
> OK, I found a way to reproduce it without using any broken disks.
>
> You will need these to build a RAID-1 device:
> - losetup or a separate partition/disk
> - 3 machines (1 target, 1 target+initiator, 1 initiator)
>
>
> If something was not clear enough, please just ask.
>
>
> 1. Create a ~100 MB empty file (or bigger, if you want):
>
> dd if=/dev/zero of=raid-a bs=1k count=104857
>
>
> 2. Make a block device out of it:
>
> losetup /dev/loop0 raid-a
>
>
> 3. Log in to another target - lets say the drive will be called /dev/sdh
> (substitute the target address etc.):
>
> iscsiadm -m node -T iqn......test -p 10.1.1.1 -l
>
>
> 4. Build RAID-1 (above target is now /dev/sdz):
>
> mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/loop0 /dev/sdh
>
>
> 5. Wait a bit until it's synced (see /proc/mdstat), and make this target
> available to initiators (substitute the values to match your own):
>
> tgtadm --lld iscsi --op new --mode target --tid 3 -T
> iqn.2008-06.net.syneticon:superthecus.test
> tgtadm --lld iscsi --op new --mode logicalunit --tid 3 --lun 1 -b /dev/md1
> tgtadm --lld iscsi --op bind --mode target --tid 3 -I ALL
>
>
> 6. Start watching the syslog (on the machine where tgtd will crash):
>
> tail -f /var/log/syslog
>
>
> 7. From yet another machine (initiator), log in to that target, and read
> data from it (let's say the target is now /dev/sdb):
>
> iscsiadm -m node -T iqn.2008-06.net.syneticon:superthecus.test -p 10.1.1.2
> -l
>
> while true; do dd if=/dev/sdb of=/dev/null bs=1k; done
>
>
> 8. And immediately block the traffic between this machine (where we will
> make tgtd fail after a drive failed in RAID) and the other target
> (immediately, because if you have lots of RAM and the whole device is
> cached, it won't be kicked out of RAID very fast):
>
> iptables -A INPUT -s 10.1.1.1 -p tcp --sport 3260 -j DROP
>
>
> The connection will fail after 120 seconds with standard open-iscsi
> settings.
>
> After blocking the traffic, the syslog says (with my comments):
>
>
> This is where we can't communicate with the other target (which is a part of
> RAID):
>
> Jun 26 17:39:43 superthecus iscsid: Nop-out timedout after 15 seconds on
> connection 6:0 state (3). Dropping session.
>
>
> When a disk in RAID-1 is failing, there is no IO to the whole array for
> ~minute - tgtd notices it[1]:
>
> Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a04 0
> Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found a05 0
> Jun 26 17:39:50 superthecus tgtd: abort_task_set(938) found 0 0
> Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1c e
> Jun 26 17:39:50 superthecus tgtd: abort_cmd(914) found 1e e
>
>
> The disk is finally failed:
>
> Jun 26 17:41:43 superthecus kernel:  session6: session recovery timed out
> after 120 secs
> Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01
> driverbyte=0x00
> Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector
> 44256
> Jun 26 17:41:43 superthecus kernel: printk: 46 messages suppressed.
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44256
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44264
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44272
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44280
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44288
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44296
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44304
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44312
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44320
> Jun 26 17:41:43 superthecus kernel: raid1: sdh: rescheduling sector 44328
> Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01
> driverbyte=0x00
> Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector
> 44000
> Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01
> driverbyte=0x00
> Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector
> 44256
> Jun 26 17:41:43 superthecus kernel: sd 14:0:0:1: [sdh] Result: hostbyte=0x01
> driverbyte=0x00
> Jun 26 17:41:43 superthecus kernel: end_request: I/O error, dev sdh, sector
> 44256
> Jun 26 17:41:43 superthecus kernel: raid1: Disk failure on sdh, disabling
> device.
> Jun 26 17:41:43 superthecus kernel: ^IOperation continuing on 1 devices
> Jun 26 17:41:43 superthecus mdadm: Fail event detected on md device
> /dev/md1, component device /dev/.static/dev/.static/dev/sdh
> Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
> Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
> Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
> Jun 26 17:41:43 superthecus kernel:  disk 1, wo:1, o:0, dev:sdh
> Jun 26 17:41:43 superthecus kernel: RAID1 conf printout:
> Jun 26 17:41:43 superthecus kernel:  --- wd:1 rd:2
> Jun 26 17:41:43 superthecus kernel:  disk 0, wo:0, o:1, dev:loop0
> Jun 26 17:42:43 superthecus mdadm: SpareActive event detected on md device
> /dev/md1, component device /dev/.static/dev/.static/dev/sdh
>
>
> We have only one tgtd instance:
>
> # ps -C tgtd
>  PID TTY          TIME CMD
>  6496 ?        00:00:00 tgtd
>
>
> With the above method, "segfault" is not logged in the syslog.
>
>
>
>
> Note that simply failing a disk in RAID (-f option to mdadm) doesn't make
> tgtd crash. You may need to reproduce the above scenario 1-2 times to
> actually see the issue (for 4 tries, tgtd failed 3 times for me).
>
>
> [1] While testing it, tgtd crashed on the other target machine (with no
> issues in RAID); on the same machine to which I was blocking traffic with
> iptables:
>
> tgtd[2125]: segfault at 8 rip 40e72d rsp 7fff511287d0 error 6
>
> Definitely, it's not very healthy :(
>
>
>
> --
> Tomasz Chmielewski
> http://wpkg.org
>
>