[sheepdog] [PATCH 0/3] zookeeper: fix error handling

Thu May 30 15:16:48 CEST 2013

At Wed, 29 May 2013 20:38:39 +0800,
Kai Zhang wrote:
> 
> Is there a way for sheep to rejoin the cluster instead of panicking?
> Currently a sheep panic causes a qemu restart, which should be
> avoided in a production environment.

Although timeouts happen easily with ZooKeeper, the problem is not
specific to the zookeeper driver.  I think this should be fixed in
sheep/group.c.

The naive approach I'm trying to implement is:
 1. add a function like 'sd_timeout_handler' to group.c, which the
    cluster driver calls when it detects a timeout.
 2. make the sheep become a gateway node once sd_timeout_handler()
    has been called.
 3. try to rejoin several times, and exit the program if the sheep
    still cannot join the Sheepdog cluster.
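
As a rough illustration of the three steps above, here is a minimal
sketch in C.  It is not the actual patch: apart from the name
sd_timeout_handler, everything here (struct node, the MODE_* states,
MAX_REJOIN_RETRIES, the rejoin callback and its signature) is a
hypothetical stand-in and not taken from sheep/group.c.  Instead of
calling exit() directly, the sketch returns MODE_EXITING so that the
caller can decide how to shut down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node states; not the real sheepdog definitions. */
enum node_mode { MODE_NORMAL, MODE_GATEWAY, MODE_EXITING };

/* Hypothetical per-node bookkeeping used only by this sketch. */
struct node {
	enum node_mode mode;
	int rejoin_attempts;
};

/* Assumed retry limit; the mail only says "several times". */
#define MAX_REJOIN_RETRIES 3

/* Callback supplied by the cluster driver; returns true on a successful rejoin. */
typedef bool (*rejoin_fn)(void *ctx);

/*
 * Step 1: called by the cluster driver when it detects a timeout.
 * Step 2: the node becomes a gateway node.
 * Step 3: retry the join a bounded number of times; give up if all fail.
 */
enum node_mode sd_timeout_handler(struct node *n, rejoin_fn try_rejoin,
				  void *ctx)
{
	assert(n != NULL && try_rejoin != NULL);

	n->mode = MODE_GATEWAY;
	n->rejoin_attempts = 0;

	while (n->rejoin_attempts < MAX_REJOIN_RETRIES) {
		n->rejoin_attempts++;
		if (try_rejoin(ctx)) {
			n->mode = MODE_NORMAL;
			return n->mode;
		}
	}

	/* All retries failed: the caller is expected to exit the program. */
	n->mode = MODE_EXITING;
	return n->mode;
}

/* Demo callback: fails until the counter behind ctx reaches zero. */
bool rejoin_after_n_failures(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}
```

A driver would call sd_timeout_handler() from its timeout path and
exit only when MODE_EXITING comes back.  A real implementation would
probably retry asynchronously rather than in a blocking loop.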

Thanks,

Kazutaka