[stgt] tgtd does data corruption after "Forcing release of tx task"?

Tue Jun 28 11:38:51 CEST 2011

Hi,

we have a weird data integrity problem with tgtd.

The problem shows up as random rpm checksum errors on the
iSCSI clients when running "rpm -V -a" consistency checks.
(The checksum errors affect different files on each run after
a drop_caches.)
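
To make the "different files on each run" observation easier to check, the output of two verify runs can be compared. The following is only a minimal sketch: it assumes the usual "rpm -V" output layout (a flag string whose third character is "5" on a digest mismatch, followed by an optional attribute marker and the path), and the function names are made up for illustration.

```python
def digest_failures(verify_lines):
    """Return the set of paths whose flag string reports a digest mismatch."""
    failed = set()
    for line in verify_lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        flags, rest = parts
        # Assumption: the third flag character is '5' on a digest mismatch.
        if len(flags) >= 3 and flags[2] == "5":
            idx = rest.find("/")
            if idx >= 0:
                failed.add(rest[idx:].rstrip())
    return failed


def compare_runs(first_run, second_run):
    """Split digest failures into: only in run 1, only in run 2, in both."""
    a = digest_failures(first_run)
    b = digest_failures(second_run)
    return a - b, b - a, a & b
```

Files that fail in both runs may simply have been modified locally (config files, for example), while files that fail in only one run are more likely to indicate corrupted reads.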

Further tests also revealed data corruption when writing to
the target.
About one block per GByte gets corrupted.
The amount is strongly workload dependent.
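
A write/verify test of this kind can be sketched as follows. This is not the exact test that was run, only a minimal illustration: the block size of 4096 bytes, the function names and the pattern scheme are assumptions.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; not taken from the report


def pattern_block(index, block_size=BLOCK_SIZE):
    """Return a deterministic block whose content depends only on its index."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    reps = block_size // len(seed) + 1
    return (seed * reps)[:block_size]


def write_pattern(path, num_blocks, block_size=BLOCK_SIZE):
    """Write num_blocks pattern blocks to path and flush them to storage."""
    with open(path, "wb") as f:
        for i in range(num_blocks):
            f.write(pattern_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())


def find_corrupted_blocks(path, num_blocks, block_size=BLOCK_SIZE):
    """Read the blocks back and return the indices that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(num_blocks):
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad
```

For the read-back to actually hit the target rather than the client's page cache, the caches have to be dropped (or the device reopened with direct I/O) between writing and verifying.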

The tgtd daemon had been running without any problems before.
The current tgtd version is 1.0.8 (as provided by RHEL5).

The only difference from before was a backing storage
expansion that took place the day before.
During the storage expansion the controller disabled its caching
module and the backing storage became very slow for about 4 hours.

During this expansion period messages of this kind were logged to
syslog by tgtd:

...
tgtd: conn_close(129) Forcing release of tx task 0x16eb12c0 0 0
...
tgtd: conn_close(163) Forcing release of tx task 0x16fd1570 0
...
tgtd: conn_close(129) Forcing release of tx task 0x16f9b010 10000038 1
...
(about 190 lines, with different addresses and numbers; no errors were
logged after the expansion completed)
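
For anyone who wants to look for the same messages in their own syslog, a small sketch for counting them follows. It relies only on the message text shown above; the function name is made up, and the meaning of the number in parentheses and of the trailing numbers is not interpreted here.

```python
import re
from collections import Counter

# Matches the message text quoted above; trailing numbers are kept as-is.
FORCED_RELEASE = re.compile(
    r"tgtd: conn_close\((\d+)\) Forcing release of tx task (0x[0-9a-fA-F]+)"
)


def summarize_forced_releases(lines):
    """Count matching messages per conn_close(N) and collect task addresses."""
    per_site = Counter()
    tasks = set()
    for line in lines:
        match = FORCED_RELEASE.search(line)
        if match:
            per_site[match.group(1)] += 1
            tasks.add(match.group(2))
    return per_site, tasks
```

Passing the lines of a syslog file to summarize_forced_releases() gives the number of such messages per conn_close(N) and the set of distinct task addresses.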

No other indication of an error was found.
After restarting tgtd the problem vanished.

IO-Stack:
cciss <-> LVM <-> tgtd <=Ethernet=> open-iscsi <-> KVM

I dumped cores with gcore from the tgtd processes before terminating
the (corrupted) tgtd daemon.
(See http://bach.wu.ac.at/rfried/tgtd_cores.zip)

Has anyone ever had problems like this?

Kind Regards,
Roland
