[sheepdog] [PATCH 1/2] test: add a test for sockfd keepalive

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Mon Sep 3 16:07:34 CEST 2012


At Mon, 03 Sep 2012 21:30:09 +0800,
Liu Yuan wrote:
> 
> On 09/03/2012 08:24 PM, MORITA Kazutaka wrote:
> > No.  The reason I doubt keepalive is that, when the trouble happens,
> > the script always takes 15 minutes.  I just guess the connection is
> > closed by another timeout, but I'm not sure.  So, I wrote 'perhaps'.
> > 
> >> > 
> >> > I am not sure, but I think the current keepalive implementation looks okay; it is simple
> >> > and efficient. I have tested it in various situations besides this script. If there is any
> >> > problem inside the code, I'd like to fix the bug instead of running away from it completely.
> > Okay, but in the future, it would be worth considering removing TCP keepalive.
> > Checking node availability is the job of the cluster driver.
> 
> All the hangs are suspected to be governed by the RTO timer instead of the keepalive timer. Could you
> please tell me where the thread hangs?

It waits for a response from the unreachable node at poll() in
wait_forward_request().  I'm not sure why it returns after keepalive
timeout...
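
For illustration, here is a minimal sketch (not the actual sheepdog code)
of the two mechanisms discussed here: enabling TCP keepalive on a socket
and waiting in poll() with an explicit upper bound.  The helper names
set_keepalive_sketch() and wait_readable_sketch() and all timer values
are assumptions; TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the
Linux option names.  As far as I know, Linux sends keepalive probes only
on an idle connection; while sent data is still unacknowledged, the
retransmission timer (RTO) decides when the connection fails, which may
be related to the long wait seen here.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable TCP keepalive on fd (Linux option names).
 * idle  = seconds of idleness before the first probe
 * intvl = seconds between probes
 * cnt   = unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int set_keepalive_sketch(int fd, int idle, int intvl, int cnt)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: wait until fd is readable, but never longer than
 * timeout_ms.  Returns 1 if read() will not block (data, EOF or hangup),
 * 0 on timeout, -1 on error.
 */
static int wait_readable_sketch(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);	/* retry if interrupted */

	if (ret <= 0)
		return ret;			/* 0: timeout, -1: error */
	if (pfd.revents & (POLLERR | POLLNVAL))
		return -1;
	return 1;
}
```

With a bounded poll() like this, the caller decides how long to wait for
the peer instead of depending on whichever kernel timer fires first.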

Thanks,

Kazutaka

> 
> This might be off topic, but during a quick debug I found that connect() also uses the RTO timer
> instead of keepalive. This can happen when we connect() to another node and that node crashes in the meantime.
> This problem (the RTO timer takes minutes to fire) can't be solved even if you close(fd) when the epoch changes,
> because we are hung in connect() and this fd isn't registered yet.
> 
> I think we need to find all the places that depend only on the RTO timer and use the keepalive timer or the snd/recv timers instead.
> 
> Thanks,
> Yuan


