[sheepdog] [PATCH v7 3/7] sheep: rejoin cluster after a zookeeper session timeout

Tue Jun 25 17:06:55 CEST 2013

At Tue, 25 Jun 2013 22:43:41 +0800,
Kai Zhang wrote:
> 
> On Jun 25, 2013, at 8:49 PM, Hitoshi Mitake <mitake.hitoshi at gmail.com> wrote:
> 
> > The timeout handling part which is implemented in this patchset
> > (e.g. zoo_state(zhandle) == ZOO_EXPIRED_SESSION_STATE) is very
> > useful, because current master of sheepdog doesn't handle it well. And
> > it is causing problems in our internal use of sheepdog.
> > 
> > The reconnection part is also useful but the current patchset is big
> > and not review friendly.
> > 
> > I think they can be separated. Could you make a smaller patchset
> > which only solves the problem of zookeeper timeout? If you kindly make
> > it, I'd like to merge it to the stable branch.
> 
> I see. But I think the only way to handle session timeout is the 'rejoin'.
> Panic() means nothing.
> Do you have any other idea?

As you say, rejoin would be the only way to handle session timeout
correctly. But the current zookeeper driver causes serious problems
when network failures happen (e.g. inconsistent epochs).

So I believe panic() or exit() would be better than doing
nothing. If sheep processes using the zookeeper driver exit immediately
in the above case, we can restart them manually.
# I understand this solution goes against the policy of sheepdog... :(
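
For illustration only, here is a minimal sketch of the "exit on session
timeout" idea. It is not taken from the patchset or from the sheepdog tree:
zk_session_expired() and zk_exit_if_expired() are hypothetical names, and the
state values are passed in as plain ints so that the sketch builds without
libzookeeper. In a real driver they would presumably be zoo_state(zhandle) and
ZOO_EXPIRED_SESSION_STATE, as in the check quoted above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of sheepdog.
 * In a real driver the arguments would come from the ZooKeeper C client:
 *   state         = zoo_state(zhandle)
 *   expired_state = ZOO_EXPIRED_SESSION_STATE
 */
bool zk_session_expired(int state, int expired_state)
{
	return state == expired_state;
}

/*
 * Hypothetical check, e.g. called before handling each cluster event.
 * It returns normally while the session is alive and terminates the
 * process once the session has expired, so that the daemon can be
 * restarted by hand instead of running on with stale cluster state.
 */
void zk_exit_if_expired(int state, int expired_state)
{
	if (!zk_session_expired(state, expired_state))
		return;

	fprintf(stderr, "zookeeper session expired, exiting\n");
	exit(EXIT_FAILURE);
}
```

A caller would use it roughly as
zk_exit_if_expired(zoo_state(zhandle), ZOO_EXPIRED_SESSION_STATE);
where to place that call in the driver is a design decision this sketch does
not try to answer.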

And our internal team needs a solution by this Thursday (we have
a local change for this problem, but it is a temporary and dirty
hack). If you can help us, I'd be very happy :)

Thanks,
Hitoshi