[sheepdog] [PATCH 1/2] test: add a test for sockfd keepalive

Mon Sep 3 15:30:09 CEST 2012

On 09/03/2012 08:24 PM, MORITA Kazutaka wrote:
> No.  The reason I doubt keepalive is that, when the trouble happens,
> the script always takes 15 minutes.  I just guess the connection is
> closed by another timeout, but I'm not sure.  So, I wrote 'perhaps'.
> 
>> > 
>> > I am not sure, but I think the current keepalive implementation looks okay; it is simple
>> > and efficient. I have tested it in various situations besides this script. If there is any
>> > problem inside the code, I'd like to fix the bug instead of running away from it completely.
> Okay, but in the future, it would be worth considering removing TCP keepalive.
> Checking node availability is the job of the cluster driver.

I suspect all of these hangs are governed by the RTO (retransmission timeout) timer rather than
the keepalive timer. Could you please tell me where the thread hangs?

This might be off topic, but during a quick debug I found that connect() is also governed by the
RTO timer rather than the keepalive timer. This can happen when we connect() to another node and
that node crashes in the meantime. This problem (the RTO timer takes minutes to fire) can't be
solved even if you close(fd) when the epoch changes, because we are hung in connect() and the fd
isn't registered yet.

I think we need to find every place that relies on the RTO timer alone and use the keepalive,
send, or receive timers there instead.
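
As a rough sketch of what those timers look like at the socket level, the helper below enables
keepalive with explicit idle/interval/count values and sets send and receive timeouts. It assumes
Linux (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux socket options) and assumes that
"snd timer/recv timer" means SO_SNDTIMEO/SO_RCVTIMEO. The name set_socket_timers() and its
parameters are hypothetical, not existing sheepdog code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not sheepdog code): arm keepalive plus send/receive
 * timeouts on a TCP socket. Linux-specific. Returns 0 on success, -1 on error.
 */
static int set_socket_timers(int fd, int idle_sec, int intvl_sec, int cnt,
			     int io_timeout_sec)
{
	int on = 1;
	struct timeval tv = { .tv_sec = io_timeout_sec, .tv_usec = 0 };

	/* Keepalive: first probe after idle_sec of silence, then one probe
	 * every intvl_sec, dropping the connection after cnt failed probes. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec, sizeof(intvl_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
		return -1;

	/* Send/receive timeouts: a blocking send() or recv() that makes no
	 * progress for io_timeout_sec fails with EAGAIN/EWOULDBLOCK. */
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	return 0;
}
```

For example, set_socket_timers(fd, 1, 1, 3, 5) would start probing after 1 second of idle time,
probe once per second up to 3 times, and time out blocked send/recv calls after 5 seconds. These
values are arbitrary and only for illustration.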

Thanks,
Yuan