On 04/25/2012 05:39 PM, Huxinwei wrote:
> What's the specific problem you had?
> Several times I have found that sheep fails to elect a master.
> It turns out that the first node failed before it unblocked the other joining messages.
> When that happens, you have to restart all the sheep daemons to recover.
> Yes, if we s/joining messages/notify messages/, this situation also blocks the whole cluster.
> I thought it was corosync specific. Or are there more subtle issues there?

We have recently been running the zookeeper driver to simulate a massive number of nodes (around 1000). I guess the whole block mechanism should be examined carefully later. A hang of the whole cluster is too destructive for production use.

Anyway, we have not yet traced the problem deeply enough to reach any useful conclusion. Maybe some other subtle bug is the root cause of this hang.

Thanks,
Yuan