[Sheepdog] [PATCH] sheep: remove cdrv_handlers and check_join_cb

Wed Apr 25 12:28:22 CEST 2012

On 04/25/2012 05:39 PM, Huxinwei wrote:

> What's the specific problem you had ?
> Several times I have found that sheep fails to elect a master.
> It turns out that the first node failed before it unblocked the other joining messages.
> When that happens, you have to restart all sheep daemons to recover.

> 

Yes, and if we s/joining messages/notify messages/, the same situation
also blocks the whole cluster.

> I thought it was corosync specific. Or are there more subtle issues there?

We recently ran the zookeeper driver to simulate a massive cluster
(around 1000 nodes). I guess the whole blocking mechanism should be
examined carefully later. A hang of the whole cluster is too destructive
for production use. Anyway, we haven't traced the problem deeply enough
yet to reach any useful conclusion. Maybe other subtle bugs are the root
cause of this hang.

Thanks,
Yuan