[sheepdog] [PATCH 0/3] zookeeper: fix error handling

Thu May 30 16:00:19 CEST 2013

At Thu, 30 May 2013 21:53:16 +0800,
Liu Yuan wrote:
> 
> On 05/30/2013 09:46 PM, MORITA Kazutaka wrote:
> > At Thu, 30 May 2013 21:26:26 +0800,
> > Liu Yuan wrote:
> >>
> >> On 05/30/2013 09:16 PM, MORITA Kazutaka wrote:
> >>> Although ZooKeeper easily causes a timeout, the problem is not
> >>> specific to the zookeeper driver.  I think this should be fixed in
> >>> sheep/group.c.
> >>>
> >>
> >> Compared to the corosync and local drivers, zk is the only one that has
> >> the concept of a timeout, no? With this in mind, I think putting the
> >> timeout problem into sheep just complicates the sheep code. I don't see why
> >> the timeout should be handled in an upper layer in sheep; what is the benefit?
> > 
> > Corosync with many nodes also has a similar problem (temporary split
> > brain), and I believe we will suffer from the same problem with
> > shepherd.  Of course, I don't intend to introduce complex code.  If
> > my implementation becomes complex, I'll give up the idea.
> > 
> 
> Well, a network partition looks to me like a different problem than the timeout

Both are the same problem - how to handle nodes that have left temporarily.

> problem. Suppose the network splits into A and B; are all nodes in A
> and B then going to try to rejoin each other's partition?

Currently, all the nodes in the minority subcluster exit the program
in that case.  However, with my idea they can become gateway nodes and
rejoin the majority subcluster.
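
A minimal sketch of that decision, only to illustrate the idea.  All
names here (node_action, is_majority, decide_action, and the
allow_gateway_rejoin flag) are hypothetical and are not existing sheep
code.  It also assumes that "majority" means strictly more than half of
the known nodes, which is not stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actions; not taken from the sheepdog source. */
enum node_action {
	ACTION_STAY,           /* node is in the majority, keep serving */
	ACTION_EXIT,           /* current behaviour for minority nodes */
	ACTION_REJOIN_GATEWAY, /* proposed: rejoin the majority as a gateway */
};

/*
 * Assumption: a subcluster is the majority only if it sees strictly
 * more than half of all known nodes, so an even split has no majority.
 */
static bool is_majority(size_t nr_visible, size_t nr_total)
{
	assert(nr_total > 0 && nr_visible <= nr_total);
	return nr_visible * 2 > nr_total;
}

/* Decide what a node should do after it notices a membership change. */
static enum node_action decide_action(size_t nr_visible, size_t nr_total,
				      bool allow_gateway_rejoin)
{
	if (is_majority(nr_visible, nr_total))
		return ACTION_STAY;

	/* Minority side: exit today, or rejoin as a gateway with the idea. */
	return allow_gateway_rejoin ? ACTION_REJOIN_GATEWAY : ACTION_EXIT;
}
```

With 5 nodes split 3/2, the 3-node side stays, and the 2-node side
either exits (flag off) or rejoins as gateway nodes (flag on).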

Thanks,

Kazutaka