[stgt] Infinite Loop on 1.0.26?

Brad.Goodman at emc.com
Tue Apr 17 21:21:13 CEST 2012



> -----Original Message-----
> From: Goodman, Brad
> Sent: Saturday, April 14, 2012 8:09 PM
> To: Goodman, Brad; 'FUJITA Tomonori'
> Cc: 'stgt at vger.kernel.org'; 'Alexander Nezhinsky'
> Subject: RE: Infinite Loop on 1.0.26?
> 
> > > -----Original Message-----
> > > From: FUJITA Tomonori [mailto:fujita.tomonori at lab.ntt.co.jp]
> > > Sent: Sunday, April 08, 2012 9:14 PM
> > > To: Goodman, Brad
> > > Cc: stgt at vger.kernel.org; alexandern at mellanox.com
> > > Subject: Re: Infinite Loop on 1.0.26?
> > >
> > > On Sat, 7 Apr 2012 21:53:23 -0400
> > > <Brad.Goodman at emc.com> wrote:
> > >
> > > >> > I have never seen this type of behavior ever, on prior
> > > >> > versions. Barring that investigation when/if this happens again -
> > > >> > I just wanted to see if this was a "known" issue, or if anyone had
> > > >> > ever seen anything like this before. Is this new? Any ideas?
> > > >>
> > > >> As far as I know, it's new. Which was the last tgt version that
> > > >> worked well for you?
> > > >
> > > > We have done a decent amount of testing with versions 1.0.14 and
> > > > 1.0.23.
> > > >
> > > > Our testing with 1.0.26 has been fairly short (I'm guessing under
> > > > 10 minutes of actual total run-time).
> > > >
> > > > However, in prior versions our testing has been limited to a
> > > > maximum of two initiators, whereas our 1.0.26 testing has been with
> > > > a maximum of 8 initiators. In both cases, again, testing has been
> > > > specific to iSER.
> > >
> > > Can you perform the same test against the old versions?
> >
> >
> > I have done some more testing and have not seen this exact bug
> > reproduced; however, I have seen other issues happen with 1.0.26,
> > which I believe may explain what was happening.
> >
> > When we reproduced this bug with 1.0.26, it appears as though it did
> > not necessarily happen during our intensive data testing - but at
> > some other time around it. I wasn't sure quite when, but it may have
> > been AFTER the testing, possibly associated with other activities,
> > such as adding/removing initiators, etc.
> >
> > In other testing (though I have not seen this exact bug), I have seen
> > cases where, if one were to accidentally use a pre-1.0.26 version of
> > tgtadm to talk to tgtd, it appears as though memory leaks occur. I
> > have seen this manifest itself in several different, reproducible
> > ways. For example, I could create a target and query which targets
> > exist; the name of the target I just created would show up as
> > garbled. I would then try to add a LUN to the target, and it would
> > say the LUN was "in use", although it didn't even exist, etc.
> >
> > So, I would conclude that there is a decent chance that I had been
> > using an older version of tgtadm, which may have caused this problem.
> > One of my engineers on the project had told me at one point that
> > there appeared to be differences in [the data structures associated
> > with] tgtadm communication in the newer 1.0.26, and there may be some
> > compatibility issues. Thus, I believe I am seeing just that.
> >
> > I would possibly advise:
> >
> > 1. That such data structures, which could potentially change to the
> > point of incompatibility, be stamped with some sort of "version
> > number", so that messages sent with incompatible versions may be
> > rejected (a rough sketch of what I mean follows this list).
> >
> > 2. Safeguards against the types of (apparent) buffer leaks that may
> > happen when bad or incompatible data is sent.
> >
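> > Something along these lines, for example - purely a sketch with
> > hypothetical names, not the actual tgt structures:
> >
> >     /* Hypothetical versioned header for tgtadm<->tgtd messages. */
> >     #include <stdint.h>
> >     #include <string.h>
> >
> >     #define MGMT_REQ_VERSION 3      /* bump on any layout change */
> >
> >     struct mgmt_req_hdr {
> >             uint32_t version;       /* layout version of the message */
> >             uint32_t len;           /* total length of the request   */
> >     };
> >
> >     /* Reject requests whose layout version we don't understand. */
> >     static int mgmt_check_version(const void *buf, size_t buflen)
> >     {
> >             struct mgmt_req_hdr hdr;
> >
> >             if (buflen < sizeof(hdr))
> >                     return -1;
> >             memcpy(&hdr, buf, sizeof(hdr));
> >             if (hdr.version != MGMT_REQ_VERSION)
> >                     return -1;      /* incompatible tgtadm - drop it */
> >             return 0;
> >     }
> >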
> > Either way - given this information, I will still keep a watchful
> > eye for issues, but am willing to lay this issue to rest for now.
> >
> > Thanks for your time and attention,
> >
> > Brad Goodman
> > EMC
> 
> 
> Sorry for the craziness - but FURTHER testing indicates that this STILL
> happens with 1.0.26!
> 
> We do have a tiny bit more info, though:
> 
> First, it did happen during "steady-state data testing" - meaning we
> were just running traffic from some initiators. No initiators were
> being logged in or out, and no tgtadm commands were being issued, when
> this happened.
> 
> Second, when it did happen, it was very hard to notice. CPU numbers
> would spike on the target (on all tgtd processes), but this did NOT
> have an effect on performance (see discussion below).
> 
> For background, our testing used eight different tgtd instances, each
> with one target which had a NULL backing-store device and an AIO
> backing device. (There is some debate over which of the two devices we
> were actually testing when the problem arose.) We were testing
> exclusively over iSER.
> 
> When this problem happened (100% CPU time on all tgtd processes - 85%
> system / 15% user), tgtd seemed to be constantly calling epoll_wait,
> and rather than blocking on something, it would always immediately
> return a "1". Further investigation showed that sometimes it would
> return a "2". When the "2" was returned, it would immediately poll
> other devices, like the "timerfd" file descriptor (which seemed to
> fire every 500 ms), and sometimes the "ibverbs-event fd". But the VAST
> majority of the time - and this is telling in and of itself - it
> didn't make *any* type of syscall in its handling of whatever fd it
> was getting. (I.e. it didn't go off and read a file descriptor in
> response to the fact that epoll_wait was apparently telling it an FD
> had an event on it.) It seemed akin to a spurious interrupt -
> constantly (falsely) firing for no reason, maybe. This could have been
> for a few reasons:
> 
> 1. epoll_wait is broken (doubtful)
> 2. Some underlying driver/service was broken, constantly notifying
> epoll_wait
> 3. Something weird happening, which constantly needed actual attention.
> 
> HOWEVER - and this is the big one - what event could have fired that
> tgtd would handle in a manner that didn't require a syscall?? (I don't
> know the answer.) Like some sort of service (RDMA?) which could have
> put data into user memory directly, so that a syscall would not be
> required to poll the result? I don't know if such a mechanism exists
> or is used.
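> 
> For what it's worth, such a mechanism does seem to exist for RDMA:
> with most verbs providers, ibverbs completion queues are mapped into
> the process and polled entirely in user space. A rough sketch of the
> pattern - not tgt's actual code, and "cq" here is just assumed to be a
> CQ the iSER transport has already set up:
> 
>     /* Sketch only: draining an ibverbs completion queue. ibv_poll_cq()
>      * reads a CQ ring mapped into the process's memory, so handling
>      * completions this way makes no syscalls at all. */
>     #include <infiniband/verbs.h>
> 
>     static void drain_cq(struct ibv_cq *cq)
>     {
>             struct ibv_wc wc;
> 
>             while (ibv_poll_cq(cq, 1, &wc) > 0) {
>                     /* inspect wc.status / wc.opcode - still no syscall */
>             }
> 
>             /* re-arm so the completion-channel fd gets the next event */
>             ibv_req_notify_cq(cq, 0);
>     }
> 
> A completion-handling path like that would show up under strace as no
> syscall at all.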
> 
> As to the "second" point above - this behavior was not directly
> noticeable during periods of heavy traffic, because, even though it
> would consume a lot of CPU cycles to constantly poll events which
> (apparently did NOT exist) - when actual events DO exist, the poll is
> doing what it normally needs to - and does not seem to be preempted by
> the "false" events. Therefore, no reduction in throughput is seen. This
> state only seems to be detected once we STOP our normal testing, and
> notice that TGTD is still consuming 100% of each CPU it is on.
> 
> (I think I may have indicated the contrary in a prior message - that
> communication to tgtd was impacted when in this state. In this latest
> test, that certainly was not the case).
> 
> We didn't have a build of tgtd with sufficient debug information at
> the time, so we couldn't do any inspection other than with strace. We
> are in the process of testing with versions prior to 1.0.26, and of
> getting debug symbols and instrumentation on hand for the code we will
> continue to run.
> 
> Thanks,
> 
> -BKG
> 
> P.S. This doesn't change what I said about the problems with
> incompatibilities between tgtd and tgtadm across versions newer than
> and prior to 1.0.25 (or 1.0.26?).
> 

We're definitely still hitting this issue - and I can reproduce it very easily. I have 8 initiators talking to a target instance which has one target with three LUNs (plus the controller LUN) under it. When the targets are connected and I run:

tgtadm -C 0 -m target --op show

tgtd goes to 100% CPU forever (though it still processes events). This happens 100% of the time.

Further investigation:

So the session seems to work normally for me, and I get:

Target 1: target_4
    System information:
        Driver: iser
        State: ready
    I_T nexus information:
        I_T nexus: 9
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.13.51
        I_T nexus: 10
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.13.52
        I_T nexus: 11
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.12.53
        I_T nexus: 12
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.11.56
        I_T nexus: 13
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.11.55
        I_T nexus: 14
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.11.58
        I_T nexus: 15
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.12.54
        I_T nexus: 16
            Initiator: iqn.1994-05.com.redhat:90e8d665bb9f
            Connection: 0
                RDMA IP Address: 10.10.11.57
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET     00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: null
            Backing store path: None
            Backing store flags:
        LUN: 1
            Type: disk
            SCSI ID: IET     00010001
            SCSI SN: beaf11
            Size: 700147 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: aio
            Backing store path: /dev/rssda
            Backing store flags:
        LUN: 2
            Type: disk
            SCSI ID: IET     00010002
            SCSI SN: beaf12
            Size: 1099512 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: null
            Backing store path: null
            Backing store flags:
        LUN: 3
            Type: disk
            SCSI ID: IET     00010003
            SCSI SN: beaf13
            Size: 537 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: aio
            Backing store path: /dev/bnull4
            Backing store flags:
    Account information:
    ACL information:
        ALL

But then we hit the issue.

We're getting a bit more debug on it. The issue seems to be that epoll_wait is constantly waking up with an event coming in from the socket:

    /var/run/tgtd.ipc_abstract_namespace.0

This socket appears to be in the "connected" state.

Note that all initiators have been DISCONNECTED, and there are no processes running tgtadm or otherwise talking to the management port on this or any other machine. (Thus, I don't know where the traffic on this socket is coming from.) I verified this with lsof.

This event is being dispatched to mtask_recv_send_handler, which is in the MTASK_STATE_PDU_SEND state. The mtask is as follows:

{mtask_state = MTASK_STATE_PDU_SEND, 
retry = 0, 
done = 3045, 
req = {mode = MODE_TARGET, op = OP_SHOW,
  lld = '\000' <repeats 63 times>, len = 120, tid = -1, sid = 0, 
  lun = 18446744073709551615, cid = 0, host_no = 0,
  device_type = 0, ac_dir = 0, pack = 0, force = 0}, req_buf = 0x0, 
  req_bsize = 0, rsp = {err = 0, len = 3053},
  rsp_concat = {streamf = 0x0, err = 0, used = 3044,
  buf = 0x48f3b60 "Target 1: target_6\n    System information:\n        
    Driver: iser\n        State: ready\n    I_T nexus information:\n        
    I_T nexus: 1\n", ' ' <repeats 12 times>, "Initiator: iqn.1994-05.com.redhat:90e8d665bb9f\n
    "..., size = 3045}
}


Like I said before, the MTASK_STATE_PDU_SEND state just results in a call to concat_write which does not seem to do any actual I/O (thus incurring no syscalls).

We've dug in a little. tgtd seems to be calling concat_write with the offset close to the buffer size, so it does NOT issue the actual write, but then keeps re-entering the call and doing the same thing forever.

We believe it may have to do with the relatively large size of the data in the buffer being close to the buffer limit. It seems as though "offset" is always non-zero when the bug is hit. We're not really too familiar with this code - this is as far as we've traced it.
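
To illustrate what we think we're seeing - this is NOT tgt's code, just a minimal sketch of the suspected failure mode with hypothetical names: the management fd stays registered for a level-triggered EPOLLOUT event, the send handler decides it has nothing left to write and returns without writing or re-arming the fd, and the next epoll_wait() therefore wakes up immediately, forever, with no syscalls visible inside the handler:

    #include <stddef.h>
    #include <unistd.h>

    struct send_buf {
            char   *buf;
            size_t  used;   /* bytes of valid data in buf        */
            size_t  done;   /* bytes already written to the peer */
    };

    static void send_handler(int fd, struct send_buf *b)
    {
            if (b->done >= b->used)
                    return; /* no write() issued; EPOLLOUT keeps firing */

            ssize_t n = write(fd, b->buf + b->done, b->used - b->done);
            if (n > 0)
                    b->done += n;
            /* once done == used, the fd should be switched back to
             * EPOLLIN (or closed); otherwise a still-writable socket
             * keeps waking epoll_wait() */
    }

In the dump above, done = 3045 while used = 3044, which would land in exactly that "nothing left to write" branch every time.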

However, we're 100% certain that we are using the correct version of tgtadm for the tgtd version when we hit this bug. And with 