[Stgt-devel] yet another tgtd iSCSI misbehaviour (aborted journal, remounting ro)
Wed Feb 6 10:45:54 CET 2008
It seems there is yet another problem (?) in tgtd.
It can be easily reproduced when the initiator crashes and then starts
again. I tested it only with diskless machines booted off iSCSI.
1. Start tgtd, apply settings with tgtadm
2. Start a diskless initiator:
a) a diskless initiator fetches the kernel and the initrd via PXE/tftp
b) kernel executes initrd; initrd brings the interface up
c) initrd starts the iSCSI connection with "iscsistart" command from
d) we switch to a new root, system boots fine
e) IMPORTANT - system starts iscsid now (/etc/init.d/open-iscsi start)
So far, everything was fine and unproblematic.
3. Now, crash your initiator machine (e.g. press the reboot button).
4. Initiator starts just fine again - the connection was established
5. IMPORTANT - start iscsid now (/etc/init.d/open-iscsi start). The
initiator will report "connection1:0: iscsi: detected conn error (1011)"
and will eventually break the connection, remount the fs read-only, etc.
Scary things will happen.
a) there is a workaround for that: when the initiator reports
"connection1:0: iscsi: detected conn error..." - kill tgtd and start it
again. The initiator will reconnect flawlessly.
b) if you don't kill/start tgtd again, the connection will break and the
fs will be remounted ro.
The issue does not happen with IET or SCST.
It looks like:
- tgtd has an established connection with an initiator
- initiator is killed, but tgtd still thinks initiator is connected to it
- initiator connects from the same IP address
- when we start iscsid on the initiator, it confuses tgtd; tgtd breaks
and has to be restarted
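
To make the suspected sequence concrete, here is a toy Python model of
the target's connection table. It is only a sketch of the guess above,
not tgtd code: ToyTarget, clean_reboot, crash_reboot and the 192.0.2.10
address are made up for illustration. It does not try to model what
iscsid does to trigger the error, only the stale entry left behind by a
crash.

```python
class ToyTarget:
    """Toy stand-in for a target's connection table (hypothetical, not tgtd)."""

    def __init__(self):
        self.connections = []  # list of (conn_id, initiator_ip)
        self._next_id = 1

    def login(self, initiator_ip):
        # Every login gets a new connection entry.
        conn_id = self._next_id
        self._next_id += 1
        self.connections.append((conn_id, initiator_ip))
        return conn_id

    def logout(self, conn_id):
        # In this model only an explicit logout removes an entry.
        self.connections = [c for c in self.connections if c[0] != conn_id]

    def connections_from(self, initiator_ip):
        return [cid for cid, ip in self.connections if ip == initiator_ip]


def clean_reboot(target, ip):
    first = target.login(ip)   # first boot
    target.logout(first)       # clean shutdown closes the connection
    target.login(ip)           # second boot
    return target.connections_from(ip)


def crash_reboot(target, ip):
    target.login(ip)           # first boot
    # crash: no logout, nothing is closed on the initiator side
    target.login(ip)           # second boot, same IP address
    return target.connections_from(ip)


if __name__ == "__main__":
    print(clean_reboot(ToyTarget(), "192.0.2.10"))  # only the new connection
    print(crash_reboot(ToyTarget(), "192.0.2.10"))  # stale and new connection
```

In the crash case the table still holds the old entry next to the new
one, which is the state I suspect tgtd is in when iscsid is started.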
Let me know if you need tcpdumps of any of the following (if so, please
give me all the tcpdump command-line options you would use):
- point 2e) - clean start of iscsid on the initiator
- point 5) - iscsid start on the initiator when connection breaks
- iscsid start on the initiator, target is SCST
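
Unless other options are preferred, a capture restricted to iSCSI
traffic could be built roughly as in the sketch below. The assumptions
are mine: the target listens on the default iSCSI port 3260, and the
interface name (eth0), the target address (192.0.2.1) and the output
file names are placeholders. capture_command is a hypothetical helper
that only assembles the tcpdump command line; it does not run it.

```python
import shlex

ISCSI_PORT = 3260  # default iSCSI port; change if the target listens elsewhere


def capture_command(iface, target_ip, outfile):
    """Return a tcpdump argv that writes full iSCSI packets to a pcap file."""
    return [
        "tcpdump",
        "-i", iface,     # interface to capture on
        "-s", "0",       # do not truncate packets
        "-w", outfile,   # write raw packets to this file
        "host", target_ip, "and", "tcp", "port", str(ISCSI_PORT),
    ]


if __name__ == "__main__":
    # One capture file per case listed above (file names are placeholders).
    for case in ("clean-start", "conn-error", "scst"):
        argv = capture_command("eth0", "192.0.2.1", f"iscsid-{case}.pcap")
        print(shlex.join(argv))
```

Running the capture on the target side is probably the safer choice,
since the initiator's root fs lives on the iSCSI disk and may go
read-only in the middle of the capture.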
 I use kexec here to reboot the machine because it has a buggy BIOS
(an old Supermicro P4SBR/P4SBE server). Randomly, it doesn't reboot when
a normal reboot command is used; the system shuts down, but never
reboots. kexec is a nice workaround for that, but it doesn't close
network sockets, so the target thinks we're still connected.