[sheepdog] [PATCH 1/2] test: add a test for sockfd keepalive
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Mon Sep 3 16:07:34 CEST 2012
At Mon, 03 Sep 2012 21:30:09 +0800,
Liu Yuan wrote:
>
> On 09/03/2012 08:24 PM, MORITA Kazutaka wrote:
> > No. The reason I doubt keepalive is that, when the trouble happens,
> > the script always takes 15 minutes. I just guess the connection is
> > closed by another timeout, but I'm not sure. So, I wrote 'perhaps'.
> >
> >> >
> >> > I am not sure, but I think the current keepalive implementation looks okay to me; it is simple
> >> > and efficient. I have tested it in various situations besides this script. If there is any
> >> > problem inside the code, I'd like to fix the bug instead of running away from it completely.
> > Okay, but in the future, it would be worth considering removing TCP keepalive.
> > Checking node availability is the job of the cluster driver.
>
> All the hangs are suspected to be governed by the RTO timer instead of the keepalive timer.
> Could you please tell me where the thread hangs?
It waits for a response from the unreachable node in poll() in
wait_forward_request(). I'm not sure why it returns after the keepalive
timeout...
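
For illustration only, a wait that cannot block forever can be sketched
as below. This is not the code in wait_forward_request(); the helper
name wait_readable() and its timeout argument are hypothetical, and the
sketch assumes a plain poll() on a single descriptor.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of sheepdog): wait until fd becomes
 * readable, but give up after timeout_ms milliseconds.
 *
 * Returns 1 if poll() reported an event on fd (data, error or hangup),
 * 0 if the timeout expired first, and -1 if poll() itself failed.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Retry when interrupted by a signal; note that each retry
	 * starts again with the full timeout. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;	/* poll() failed */
	if (ret == 0)
		return 0;	/* nothing arrived within timeout_ms */
	return 1;		/* POLLIN, POLLERR or POLLHUP was reported */
}
```

With a finite timeout the caller gets control back even when neither
keepalive nor RTO has closed the connection yet, and can then decide
whether to retry or to give up on the node.
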
Thanks,
Kazutaka
>
> This might be off topic, but during a quick debug I found that connect() also uses RTO as its timer
> instead of keepalive. This can happen when we connect() to another node and that node crashes meanwhile.
> This problem (the RTO timer takes minutes to fire) can't be solved even if you close(fd) when the epoch changes,
> because we are hung at connect() and this fd isn't registered yet.
>
> I think we need to find all the places that rely on the RTO timer only and use the keepalive/send/receive timers instead.
>
> Thanks,
> Yuan
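
As a rough sketch of what setting the keepalive, send and receive
timers on a socket can look like on Linux: the helper name
set_socket_timers() and all of its parameters are hypothetical, this is
not the sheepdog implementation, and reading "snd timer/recv timer" as
SO_SNDTIMEO/SO_RCVTIMEO is an assumption.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not part of sheepdog): enable TCP keepalive on
 * fd and bound blocking send/receive calls.
 *
 *   idle_sec   idle time before the first keepalive probe
 *   intvl_sec  interval between probes
 *   probes     unanswered probes before the connection is dropped
 *   io_sec     timeout for blocking send and receive calls
 *
 * TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
static int set_socket_timers(int fd, int idle_sec, int intvl_sec,
			     int probes, int io_sec)
{
	int on = 1;
	struct timeval tv = { .tv_sec = io_sec, .tv_usec = 0 };

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec,
		       sizeof(idle_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec,
		       sizeof(intvl_sec)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes,
		       sizeof(probes)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	return 0;
}
```

Note that SO_RCVTIMEO and SO_SNDTIMEO only apply to calls that perform
socket I/O; they have no effect on poll(), so a thread blocked in
poll() still needs its own timeout argument.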