[stgt] Smooth cluster failover with Microsoft iSCSI initiator

Tue Jul 7 20:08:23 CEST 2009

Hello,

My apologies if this has been inquired about before; if there is a post
I overlooked in the list archives that addresses my issue, please feel
free to point me to it.

I am currently working on iSCSI Target & LU resource agents for the
Pacemaker cluster manager. If interested, please see
http://hg.linux-ha.org/dev/; the relevant resource agents are named
iSCSITarget and iSCSILogicalUnit and can be found in
http://hg.linux-ha.org/dev/file/tip/resources/OCF. The implementation
currently supports IET and STGT, and is intended to be used in
conjunction with other cluster resource types so the following sequence
occurs on resource startup:

- block all access to TCP port 3260 via a firewall rule;
- switch a DRBD (www.drbd.org) device into the Primary role;
- make available an LVM Volume Group that resides on that DRBD device;
- fire up a virtual cluster IP address that initiators use to connect to
the target portal;
- create an iSCSI target and portal;
- assign LUs to that target (these map to LVs on the DRBD-backed VG);
- unblock access to TCP port 3260.

On resource shutdown, the same procedure happens in reverse order.
Resource migration (to the peer cluster node) in essence amounts to
shutdown on node A, then startup on node B. The entire process typically
completes in well under 30 seconds.
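
To make the ordering concrete, here is a minimal Python sketch of the
start, stop, and migrate sequence described above. It only models the
order of operations; the step labels, node names, and function names are
hypothetical and do not correspond to actual resource agent or Pacemaker
calls.

```python
# Illustrative model only: each string stands in for a cluster resource
# action from the list above. Nothing here touches a real system.
STARTUP_STEPS = [
    "block TCP port 3260",
    "promote DRBD device to Primary",
    "activate LVM volume group",
    "bring up virtual cluster IP",
    "create iSCSI target and portal",
    "assign LUs to target",
    "unblock TCP port 3260",
]


def start(node, log):
    """Record the startup steps on a node in forward order."""
    for step in STARTUP_STEPS:
        log.append((node, "start", step))


def stop(node, log):
    """Record the undoing of each startup step, in reverse order."""
    for step in reversed(STARTUP_STEPS):
        log.append((node, "undo", step))


def migrate(src, dst, log):
    """Migration is a full shutdown on src followed by startup on dst."""
    stop(src, log)
    start(dst, log)


log = []
migrate("node-a", "node-b", log)
for node, action, step in log:
    print(f"{node}: {action:5s} {step}")
```

The point of the sketch is that the firewall block brackets everything
else: port 3260 is the first thing closed and the last thing opened on
the node taking over, so initiators never reach a half-configured target.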

Now, when failover completes, connected initiators naturally encounter a
connection reset from the target daemon. The open-iSCSI initiator takes
this in stride, reconnecting immediately and continuing any ongoing I/O
unhampered.

The Microsoft iSCSI initiator (2.08), when connected to an IET target,
also reconnects immediately after target failover. From the Windows
event log:

Event ID 20 (from iSCSIPrt)
Connection to the target was lost. The initiator will attempt to retry
the connection.
Event ID 34 (from iSCSIPrt)
A connection to the target was lost, but Initiator successfully
reconnected to the target. Dump data contains the target name.

Ongoing I/O on connected devices, in this case, continues without a
user-noticeable hiccup.

I see the same messages when the same Microsoft iSCSI initiator is
connected to an STGT target. However, and only when talking to an STGT
target, I also see these (after the connection is re-established):

Event ID 12 (from PlugPlayManager)
The device 'IET      Controller       SCSI Array Device'
(SCSI\Array&Ven_IET_____&Prod_Controller______&Rev_0001\1&2afd7d61&2&000000)
disappeared from the system without first being prepared for removal.

These are repeated for all iSCSI disks the initiator is connected to. I
am also getting these:

Event ID 57 (from Ftdisk)
The system failed to flush data to the transaction log. Corruption may
occur.
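
The two event patterns above can be told apart mechanically. The
following is a rough Python sketch, assuming the events have already
been collected as (source, event ID) pairs; the function name and the
"smooth"/"disruptive" labels are made up for illustration, and only the
sources and IDs come from the log excerpts quoted here.

```python
# Event sources and IDs as quoted from the Windows event log in this post.
RECONNECTED = ("iSCSIPrt", 34)
DISRUPTIVE = {("PlugPlayManager", 12), ("Ftdisk", 57)}


def classify_failover(events):
    """Classify a failover from a list of (source, event_id) pairs."""
    seen = set(events)
    if seen & DISRUPTIVE:
        # Device disappeared or a log flush failed: I/O was interrupted.
        return "disruptive"
    if RECONNECTED in seen:
        # Connection was lost and re-established with nothing else logged.
        return "smooth"
    return "unknown"


iet_case = [("iSCSIPrt", 20), ("iSCSIPrt", 34)]
stgt_case = iet_case + [("PlugPlayManager", 12), ("Ftdisk", 57)]

print(classify_failover(iet_case))
print(classify_failover(stgt_case))
```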

In the STGT case, even though the initiator automatically reconnects to
the target, any I/O on the connected devices is interrupted, and the
Windows box spews out positively alarming messages. What is STGT doing
differently from IET here? Is there any specific target or LU parameter
that should be set in order to avoid this issue?

Any insight would be much appreciated. Thanks very much!

Cheers,
Florian
