[stgt] stgt does not preempt SCSI-2 reservations; may break MS Cluster Service failover

Florian Haas florian.haas at linbit.com
Mon Sep 7 10:16:06 CEST 2009


On 09/07/2009 08:00 AM, Florian Haas wrote:
> On 09/07/2009 07:58 AM, FUJITA Tomonori wrote:
>> On Mon, 07 Sep 2009 07:47:32 +0200
>> Florian Haas <florian.haas at linbit.com> wrote:
>>
>>>> What iSCSI initiator implementation do you use on Linux?
>>> For testing? I use open-iscsi on Debian lenny; the Debian package
>>> version number is 2.0.870~rc3-0.4. My sg3-utils package version (if
>>> that's of any help) is 1.24-2.
>> That's strange. open-iscsi doesn't send TARGET_RESET.
> 
> Interesting. Let me grab a packet dump and I'll be back in touch.

You're right: open-iscsi apparently translates any host or "bus" reset
into a new login sequence. For device resets, it does use a Task
Management Function request (opcode 0x02) with the function "LU reset"
(0x05).

Going back to my original problem, I've now sifted through packet traces
generated on the production iSCSI target server (the one that the MSCS
hosts talk to), and have encountered something that leaves me
confounded. This applies to both IET and STGT (hence yet another
cross-post to both lists), so it's either something that is wrong in
both implementations, or some breakage in MSCS. Perhaps someone can
enlighten me here.

Here is the situation:
- I have two initiator hosts, 10.160.156.24 and 10.160.156.26. Both are
part of the same MSCS cluster.
- The quorum device is on the iSCSI target, LUN 3.
- .24 is the active host. It issues RESERVE commands every three seconds
and gets these confirmed reliably.
- .24 gets forcibly disconnected from the network, by having its
Ethernet cable removed.
- 110 seconds expire. This is in line with a relatively long
Time2Retain, which in this setup is 90 seconds.
- 110 seconds after .24 has issued its last successful RESERVE command,
.26 apparently attempts to acquire the quorum device.
- I now see a SERVICE ACTION IN command (opcode 0x9e) from .26, with a
Service Action of READ CAPACITY (16) (service action code 0x10).
- This fails with a Reservation Conflict (0x18) status.
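
To make the last two steps easier to spot in a trace, here is a minimal
sketch that names the CDB and the SCSI status byte. The opcodes and
status codes are the standard SPC/SBC values (under opcode 0x9e, service
action 0x10 is READ CAPACITY(16)); the function names and the small
lookup tables are my own and only cover the commands mentioned in this
thread.

```python
# Sketch only: a hand-picked subset of standard SCSI opcodes and status
# codes; describe_cdb and describe_status are hypothetical helper names.
SERVICE_ACTION_IN_16 = 0x9E
SAI_READ_CAPACITY_16 = 0x10

OPCODES = {
    0x12: "INQUIRY",
    0x16: "RESERVE(6)",
    0x17: "RELEASE(6)",
    0x25: "READ CAPACITY(10)",
}

STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}


def describe_cdb(cdb: bytes) -> str:
    """Name the command in a CDB, resolving the service action for 0x9e."""
    if len(cdb) < 2:
        raise ValueError("CDB too short")
    opcode = cdb[0]
    if opcode == SERVICE_ACTION_IN_16:
        service_action = cdb[1] & 0x1F   # low five bits of byte 1
        if service_action == SAI_READ_CAPACITY_16:
            return "READ CAPACITY(16)"
        return f"SERVICE ACTION IN(16), service action 0x{service_action:02x}"
    return OPCODES.get(opcode, f"opcode 0x{opcode:02x}")


def describe_status(status: int) -> str:
    """Name a SCSI status byte from the small table above."""
    return STATUS.get(status, f"status 0x{status:02x}")


if __name__ == "__main__":
    print(describe_cdb(bytes([0x9E, 0x10]) + bytes(14)))
    print(describe_status(0x18))
```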

The initiator on .26 then repeats the last two actions indefinitely. It
apparently never even attempts to recover from this situation. Whatever
"bus reset" entries I am seeing in the Windows Event log, none of those
actions ever appear to actually reach the target -- I am not seeing a
renewed login attempt, nor a target reset, nor a LUN reset, nothing.
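
As a way of stating what I would have expected to happen, here is a toy
model of a SCSI-2 style reservation on a single LUN. It is my own
simplification, not how IET or STGT implement reservations: the class
and method names are invented, it ignores the commands that are allowed
through a reservation (INQUIRY and the like), and it assumes that a LU
reset from any initiator clears the reservation.

```python
# Toy model only: ToyLun and its methods are hypothetical and do not
# mirror the IET or STGT code.
GOOD = 0x00
RESERVATION_CONFLICT = 0x18


class ToyLun:
    def __init__(self):
        self.holder = None          # initiator currently holding the LUN

    def reserve(self, initiator):
        """RESERVE: succeeds if the LUN is free or already ours."""
        if self.holder not in (None, initiator):
            return RESERVATION_CONFLICT
        self.holder = initiator
        return GOOD

    def command(self, initiator):
        """Any reservation-checked command, e.g. READ CAPACITY."""
        if self.holder not in (None, initiator):
            return RESERVATION_CONFLICT
        return GOOD

    def lu_reset(self):
        """Assumed effect of a LU reset: the reservation is dropped."""
        self.holder = None


if __name__ == "__main__":
    lun = ToyLun()
    lun.reserve(".24")                   # active node holds the quorum LUN
    print(hex(lun.command(".26")))       # what .26 sees today
    lun.lu_reset()                       # the step that never shows up
    print(hex(lun.reserve(".26")))       # .26 could now take over
```

In the traces, .26 stays stuck at the first of those two outcomes
because nothing equivalent to lu_reset() ever arrives at the target.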

I am also failing to understand why the MS initiator would use the
SERVICE ACTION IN detour when upon initial login it just uses standard
INQUIRY commands and READ CAPACITY.

I have complete pcap traces from the sequence of events, both for IET
and STGT -- I can send them off-list if anyone is interested in looking
into this.

Fujita-san, Ross, Arne -- any ideas at all?

Cheers,
Florian
