[sheepdog] fix embryonic connection

Liu Yuan namei.unix at gmail.com
Tue Sep 4 09:00:56 CEST 2012


On 09/04/2012 12:38 AM, Liu Yuan wrote:
> ESTAB      0      52                                     127.0.0.1:48339                                 127.0.0.1:7001
> timer:(on,49sec,12) users:(("sheep",4961,16)) ino:855713 sk:ffff88007086de80 ts sack cubic wscale:7,7 rto:120000 rtt:18.75/7.5 ato:40 ssthresh:7 send 14.0Mbps rcv_space:32792

I think I found why keepalive doesn't take effect. There are 52 bytes
(our header!) in the send buffer not yet acknowledged by the remote
host, so the RTO (retransmission) timer is armed, and that mutes the
keepalive timer.
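
For reference, the state that ss prints above can also be read from inside
the process. A minimal sketch, assuming Linux and glibc's struct tcp_info;
has_unacked_data() is a hypothetical helper, not existing sheepdog code:

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not sheepdog code): returns 1 if the kernel still
 * holds unacknowledged segments for this TCP socket, 0 if not, -1 on error.
 * If 'retransmits' is non-NULL it receives the kernel's retransmit counter.
 */
static int has_unacked_data(int fd, unsigned int *retransmits)
{
	struct tcp_info info;
	socklen_t len = sizeof(info);

	memset(&info, 0, sizeof(info));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
		return -1;

	if (retransmits)
		*retransmits = info.tcpi_retransmits;

	/* tcpi_unacked counts segments in flight, not bytes. */
	return info.tcpi_unacked > 0;
}
```

While this returns 1, the kernel is retransmitting rather than sending
keepalive probes, which matches the timer:(on,...) state in the output above.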

Keepalive has the following benefit over a plain poll timeout:

   when the remote node is just busy (e.g., doing disk I/O) and not
really down, poll might time out falsely and we would close all the
valid fds of that node.

Actually, if the data has already been sent to the remote node (i.e.
nothing is left in the send buffer) when the remote node crashes, poll &
keepalive work well.
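
For anyone following along, this is roughly how keepalive is switched on for
a socket. A minimal sketch, assuming Linux; enable_keepalive() and its
parameters are hypothetical and the values are illustrative, not sheepdog's
actual settings:

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not sheepdog code): enable TCP keepalive on fd.
 * idle_sec:  idle time before the first probe
 * intvl_sec: interval between probes
 * probes:    unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error.
 */
static int enable_keepalive(int fd, int idle_sec, int intvl_sec, int probes)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
		       &idle_sec, sizeof(idle_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
		       &intvl_sec, sizeof(intvl_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
		       &probes, sizeof(probes)) < 0)
		return -1;
	return 0;
}
```

The probes are answered by the remote kernel, not by the sheep process, so a
node that is merely busy still counts as alive. With an idle connection, a
dead peer is detected after roughly idle_sec + intvl_sec * probes seconds.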

So basically there are two possible fixes:
 1 add a timeout for poll (pros: simple, cons: false timeouts)
 2 add a timeout for poll only if we detect that the send buffer is not
empty; otherwise rely on keepalive (pros: efficient, cons: more complex code)

I prefer 2, because most of the time (during normal operation) we don't
hit this corner case.

What do you think, Kazutaka?

Thanks,
Yuan


