[stgt] Mysterious stale cache
Neil Skrypuch
neil at pollstream.com
Wed Nov 17 23:23:57 CET 2010
I'm using stgt to export a block device that's replicated in primary/primary
mode with DRBD, upon which the initiators will ultimately mount a GFS2
filesystem (after some indirection through CLVM). Basically, what I'm finding
is that some nodes are getting stale cached reads, while other nodes are
getting the correct data. The gory details are below...
I have the following systems (VMs, at the moment) set up:
gfs2-1: data node (stgt exports the DRBD block device)
gfs2-2: data node (stgt exports the DRBD block device)
gfs2-3: initiator
gfs2-4: initiator
gfs2-5: initiator
I have disabled the write cache on both gfs2-1 and gfs2-2:
[root at gfs2-1 ~]# grep -Pv '^\s*#' /etc/tgt/targets.conf
default-driver iscsi
<target iqn.2010-11.com.polldev:gfs2-1.gfs2>
backing-store /dev/drbd1
write-cache off
</target>
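As a sanity check, the write-cache setting can also be read back from an initiator via the SCSI caching mode page. A minimal example, assuming sdparm is installed and the exported LUN shows up as /dev/sdb (an illustrative device name, not necessarily what you'll see):

# WCE should report 0 when the target's write cache is off
sdparm --get=WCE /dev/sdb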
DRBD is using protocol "C" (fully synchronous) on /dev/vdb (a virtio disk).
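For completeness, the DRBD resource definition looks roughly like the following (a trimmed sketch; the resource name, addresses and port are illustrative rather than copied from my config):

resource r1 {
    protocol C;                # fully synchronous replication
    net {
        allow-two-primaries;   # needed for primary/primary
    }
    on gfs2-1 {
        device    /dev/drbd1;
        disk      /dev/vdb;
        address   192.168.1.1:7789;
        meta-disk internal;
    }
    on gfs2-2 {
        device    /dev/drbd1;
        disk      /dev/vdb;
        address   192.168.1.2:7789;
        meta-disk internal;
    }
}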
All of the initiator machines import the target from both gfs2-1 and gfs2-2,
which is then accessed in a multibus fashion via dm-multipath. The idea is
that I can reboot or otherwise remove one of the data nodes from service at
any time without any other nodes knowing or caring.
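Concretely, each initiator does something along these lines (the portal names and the multipath stanza are a sketch rather than my exact configuration):

# discover and log in to both data nodes
iscsiadm -m discovery -t sendtargets -p gfs2-1
iscsiadm -m discovery -t sendtargets -p gfs2-2
iscsiadm -m node -L all

# /etc/multipath.conf: group both paths into one active/active map
defaults {
    path_grouping_policy multibus
}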
Now, this all works swimmingly, except that about half the nodes get stale cached data back when they read, depending on which data node they're actually reading from. For example:
1. gfs2-4 and gfs2-5 read sector n from gfs2-2
2. gfs2-3 issues a write to sector n on gfs2-1
3. gfs2-1 commits the write to disk
4. gfs2-1 replicates the write to gfs2-2 via DRBD
At this point, any attempt to read sector n through gfs2-2 over iSCSI will
return the same result it did during step 1 and will /not/ reflect the data
written in step 2. Unfortunately, this means those reads return incorrect
data. Reads issued to gfs2-1 return the correct data.
Now, the especially interesting part is that reading directly from /dev/drbd1
on gfs2-1 or gfs2-2 (avoiding iSCSI) always returns the correct data.
Furthermore, if I issue an "echo 3 > /proc/sys/vm/drop_caches" on both gfs2-1
and gfs2-2 (but not on the rest of the nodes), the correct data is returned
via iSCSI until more writes occur.
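A rough sketch of the kind of comparison that shows this, run against the same sector (the offset and the multipath device name here are illustrative):

# on gfs2-2: read straight from the DRBD device (always correct)
dd if=/dev/drbd1 bs=512 skip=12345 count=1 2>/dev/null | hexdump -C

# on an initiator: the same sector through iSCSI/multipath (stale)
dd if=/dev/mapper/mpath0 bs=512 skip=12345 count=1 2>/dev/null | hexdump -C

# on gfs2-1 and gfs2-2: flush the page cache, after which iSCSI reads
# are correct again until the next write
echo 3 > /proc/sys/vm/drop_caches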
Given the above, I'm fairly certain that the problem is somewhere in tgtd. To
further confirm, I tried exporting the block devices on gfs2-1 and gfs2-2 via
GNBD instead of iSCSI, and the problem disappeared; reads always returned the
correct data. (There are other issues with GNBD that make me hesitant to go
any further with it, but that's neither here nor there.)
My specific test case is to append a '.' to a file every second on one node
and issue a "watch ls -l" in that directory on every node. As the nodes
switch from one path to another in the multipath, some nodes inevitably get
back stale data while others get back fresh data. Though, as mentioned above,
gfs2-1 and gfs2-2 always get fresh data because they mount /dev/drbd1
directly instead of going through iSCSI.
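For anyone who wants to reproduce it, the test boils down to something like this (the mount point is illustrative):

# on one initiator: append a '.' to a file on the GFS2 mount every second
while true; do echo -n . >> /mnt/gfs2/testfile; sleep 1; done

# on every node: watch the file grow (or not)
watch ls -l /mnt/gfs2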
I had a peek at the code and it appears that rdwr does not open the file with
O_DIRECT. With dual-primary DRBD the block device changes behind tgtd's back,
so the target's page cache can hold stale data that is never invalidated as
it should be. I haven't tested it yet, but I'm fairly sure that if rdwr
opened the file with O_DIRECT, this would work correctly for me.
I also tried replacing gfs2-1 and gfs2-2 with otherwise identically configured
RHEL 6 machines, but this produced the same results.
I eventually stumbled upon the bs-type config setting and tried mmap, sg and
aio; only aio seemed to work at all. Both mmap and sg failed with an error
like so:
tgtadm: invalid request
Command:
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/drbd1 --bstype mmap
exited with code: 22.
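For reference, aio is selected the same way, either via a bs-type line in targets.conf or with --bstype aio on the tgtadm command line; roughly:

<target iqn.2010-11.com.polldev:gfs2-1.gfs2>
backing-store /dev/drbd1
bs-type aio
write-cache off
</target>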
Notably, when using aio, all the nodes appear to get the correct data on read.
Unfortunately, when RHEL 5 clients connect to RHEL 6 targets using aio, I'm
unable to mount the filesystem (a slew of I/O errors is returned and GFS2
withdraws from the filesystem on that node).
All of the machines are running RHEL 5.5, which means scsi-target-utils is
version 0.0-6.20091205snap.el5_5.3 and iscsi-initiator-utils is version
6.2.0.871-0.16.el5.
In my brief experiment with RHEL 6, scsi-target-utils was version 1.0.4-3.el6
and iscsi-initiator-utils was version 6.2.0.872-10.el6.
Ultimately, my question is twofold:
1) Is it intentional that rdwr does not use O_DIRECT?
2) Should RHEL 5 clients be able to connect successfully using aio to RHEL 6
targets?
- Neil