[stgt] Quick Failover & Scalability questions

ronnie sahlberg ronniesahlberg at gmail.com
Thu Oct 14 12:14:25 CEST 2010

On Thu, Oct 14, 2010 at 3:26 AM, Mark Lehrer <mark at knm.org> wrote:
> The simplest would be for the clients to reconnect to the new server and
> re-establish communications.

This is how almost all initiators work today, and it is quite
transparent to applications, as long as the failover to the "new"
server is reasonably quick and completes before the initiators run out
of writeback cache.
Once you get "Delayed Write Failed" dialogs, that is when the pain starts.
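
A rough sketch of that reconnect behaviour from the client side, in
Python. The function name, the retry delay and the deadline are all
made up for illustration; the deadline stands in for however long the
initiator can keep buffering writes before it has to report an error.
Real initiators do this inside the iscsi layer, so treat it only as an
outline of the logic.

```python
import socket
import time


def reconnect(address, deadline_s, retry_delay_s=0.5,
              connect=socket.create_connection,
              clock=time.monotonic, sleep=time.sleep):
    """Retry connecting to `address` until success or until `deadline_s` elapses.

    Returns the connected socket, or None if the deadline passed first.
    `connect`, `clock` and `sleep` are injectable so the loop can be
    exercised without a network.
    """
    give_up_at = clock() + deadline_s
    while True:
        remaining = give_up_at - clock()
        if remaining <= 0:
            return None  # out of budget: this is where the pain starts
        try:
            # Bound each attempt so one hung connect cannot use the whole budget.
            return connect(address, timeout=min(remaining, 2.0))
        except OSError:
            # Server not reachable yet (failover still in progress): wait and retry.
            sleep(min(retry_delay_s, max(0.0, give_up_at - clock())))
```

Called as reconnect(("192.0.2.10", 3260), deadline_s=30.0) it keeps
trying the portal address for up to 30 seconds. The address is a
documentation placeholder and 3260 is the usual iscsi port.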

For blocking reads, applications only see pauses of milliseconds to a
few seconds during the session failure and reconnect.
If you use ctdb for failover, its "tcptickle" feature should be able
to short-circuit any tcp sessions hung inside tcp retransmission
timeouts and speed up recovery.
During recovery of cluster applications after failover, it is very
common that clients are stuck inside tcp retransmission backoffs,
sitting for 10-20-40 seconds before tcp will detect the session
failure, which greatly increases the recovery time. There are certain
tricks in tcp that ctdb uses, and other applications could too, to
short-circuit the client timeouts and trigger session recovery
immediately.

I.e. most of the time spent paused during failover is actually not
spent in failover at all but rather waiting for the tcp stack on the
clients to
detect the session failure. Tcp retransmission backoff is not your
friend here.
This can usually be short-circuited from the server by clever tcp hacks.
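
To put numbers on the backoff problem, here is a small Python sketch.
It assumes a plain doubling backoff; the starting timeout, the cap and
the function names are illustrative only and are not the defaults of
any particular tcp stack.

```python
from itertools import accumulate


def retransmit_times(initial_rto_s, retries, max_rto_s=120.0):
    """Times (seconds after the first lost segment) at which the client retransmits.

    Assumes the retransmission timeout simply doubles after every
    attempt, capped at `max_rto_s`.
    """
    waits = []
    rto = initial_rto_s
    for _ in range(retries):
        waits.append(rto)
        rto = min(rto * 2, max_rto_s)
    return list(accumulate(waits))


def idle_after_failover(failover_done_s, times):
    """Seconds the client still sits idle after the new server is ready.

    The client only notices the dead session on its next retransmission,
    so it waits until the first retransmit at or after `failover_done_s`.
    Returns None if no retransmission is left.
    """
    for t in times:
        if t >= failover_done_s:
            return t - failover_done_s
    return None


times = retransmit_times(10.0, retries=3)   # waits of 10, 20 and 40 seconds
print(times)                                # [10.0, 30.0, 70.0]
print(idle_after_failover(12.0, times))     # 18.0
```

With waits of 10, 20 and 40 seconds, a failover that completes 12
seconds in still leaves the client idle for another 18 seconds, until
its next retransmission at 30 seconds. That idle time is what a prod
from the server side removes.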

> However, how painful would it be for the new
> server to keep the same sockets open for a truly seamless failover?  Again,
> I am only concerned about the tgtd internal states at this point - assume
> that the block device mirroring as well as the
> keepalived/heartbeat/iptables/fencing/etc issues are handled already (though
> there would obviously be a good bit of integration work there!).

Keeping application state and kernel state (tcp state) across the
failover to make it truly transparent is horribly complex and difficult.
I personally do not think that is required for the iscsi protocol since
*) there is so little state required in iscsi
*) all initiators quickly reconnect and quickly rebuild all required
state in almost all situations anyway.

ronnie sahlberg