> -----Original Message-----
> From: sheepdog-bounces at lists.wpkg.org
> [mailto:sheepdog-bounces at lists.wpkg.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, April 25, 2012 3:54 PM
> To: Liu Yuan
> Cc: sheepdog at lists.wpkg.org
> Subject: Re: [Sheepdog] [PATCH] sheep: remove cdrv_handlers and
> check_join_cb
>
> On Wed, Apr 25, 2012 at 03:51:30PM +0800, Liu Yuan wrote:
> > I am more interested in how do you plan to deal with block_cb()? We
> > already meet some subtle problem that cluster gets hung at block state
> > for ever running a 1000 sheep daemon on dozen of machines, but not yet
> > come to any conclusion useful. We can only say that the block mechanism
> > would leave some holes to hang the whole cluster by only several minor
> > failed nodes (be it whether EIO-exiting or down).

What's the specific problem you had?

Several times I have found that sheep fails to elect a master. It turned
out that the first node had failed before it unblocked the other joining
messages. When that happens, you have to restart all the sheep daemons to
recover. I thought it was corosync-specific. Or are there more subtle
issues there?

> I haven't looked into a better scheme yet - I just identified that the
> area needs way more work than a simple cleanup, that's why I didn't
> touch it for now.
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog