[Sheepdog] [PATCH 4/4] [PATCH 04/10] sheep: update node information and epoch from

Tue May 15 09:35:32 CEST 2012

On Tue, May 15, 2012 at 10:53:55AM +0800, Liu Yuan wrote:
> We actually met this the same problem running zookeeper, when we tried
> to remove register/un-register group_fd to help main thread get more
> responsive. The main thread is very unwieldy for now and become a major
> bottleneck for massive nodes cluster. Our effort is trying to get it
> slimmer.

I definitely like the register/un-register group_fd removal idea: it
helps move work off the main thread and simplifies the code.

I'd actually like to take some of the concepts there even further.
The patch already updates most of the cluster state in join
synchronously; what remains is the vdi_inuse update and lots of tiny
synchronous code paths.  Why not do all of __sd_join_done synchronously
from sd_join_handler, except for the vdi_inuse update, and then make
sure all code looking at vdi_inuse, both updates and reads, is executed
in its own work queue?  That should help lift the VDI operations out of
the main thread, even if the VDI work queue itself remains single
threaded.
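
To make the ownership rule concrete, here is a rough sketch of what I
mean by routing every vdi_inuse access through one queue.  All names
(vdi_wq_submit, vdi_wq_drain, the DEMO_ constants) are made up for
illustration and are not the existing sheep API.  To keep it
deterministic the queue is drained by an explicit call here; in a real
implementation a dedicated worker thread would presumably do that, with
proper locking around the queue itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is sheepdog's real API. */
#define DEMO_NR_VDIS   64
#define DEMO_QUEUE_LEN 16

enum vdi_op { VDI_OP_SET, VDI_OP_CLEAR, VDI_OP_TEST };

struct vdi_work {
	enum vdi_op op;
	uint32_t vid;
	bool *result;	/* written for VDI_OP_TEST only, may be NULL */
};

/* Only vdi_wq_drain() touches this, i.e. only the VDI work queue. */
static uint64_t vdi_inuse_bits;

static struct vdi_work vdi_wq[DEMO_QUEUE_LEN];
static size_t vdi_wq_head, vdi_wq_count;

/* Queue an operation; false if the queue is full or vid is out of range. */
bool vdi_wq_submit(enum vdi_op op, uint32_t vid, bool *result)
{
	size_t tail;

	if (vid >= DEMO_NR_VDIS || vdi_wq_count == DEMO_QUEUE_LEN)
		return false;

	tail = (vdi_wq_head + vdi_wq_count) % DEMO_QUEUE_LEN;
	vdi_wq[tail] = (struct vdi_work){ .op = op, .vid = vid, .result = result };
	vdi_wq_count++;
	return true;
}

/* Run all queued operations in FIFO order; returns how many were run. */
size_t vdi_wq_drain(void)
{
	size_t done = 0;

	while (vdi_wq_count > 0) {
		struct vdi_work *w = &vdi_wq[vdi_wq_head];
		uint64_t mask;

		assert(w->vid < DEMO_NR_VDIS);
		mask = UINT64_C(1) << w->vid;

		switch (w->op) {
		case VDI_OP_SET:
			vdi_inuse_bits |= mask;
			break;
		case VDI_OP_CLEAR:
			vdi_inuse_bits &= ~mask;
			break;
		case VDI_OP_TEST:
			if (w->result)
				*w->result = (vdi_inuse_bits & mask) != 0;
			break;
		}

		vdi_wq_head = (vdi_wq_head + 1) % DEMO_QUEUE_LEN;
		vdi_wq_count--;
		done++;
	}
	return done;
}
```

Because reads are queued behind earlier updates, a VDI_OP_TEST
submitted after a VDI_OP_SET for the same vid always sees the bit set,
and nothing outside the queue needs to look at the bitmap directly.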

Similarly, the only thing executed in the event queue for leaves is
check_majority.  I wonder if we can get away with executing it after
all the cluster state updates currently in __sd_leave_done.  With the
exception of recovery, __sd_leave_done could then be called directly
from sd_leave_handler, avoiding the event queue issues.
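
Roughly, the ordering I have in mind for the leave path looks like the
sketch below.  The structure and function names are hypothetical
stand-ins, and the majority rule used here (more than half of the
pre-leave membership must remain) is an assumption for the example, not
a statement of what check_majority does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cluster state; not sheepdog's real structs. */
struct demo_cluster {
	size_t nr_nodes;	/* current number of members */
	unsigned int epoch;
	bool recovery_pending;	/* stands in for queueing recovery work */
	bool halted;		/* set when the majority check fails */
};

/* Assumed rule: strictly more than half of the old membership remains. */
bool demo_has_majority(size_t remaining, size_t previous)
{
	return remaining * 2 > previous;
}

/*
 * Leave handling done directly in the handler:
 *   1. update cluster state synchronously,
 *   2. run the majority check on the updated state,
 *   3. defer only recovery.
 * Returns false if there was no node to remove or the majority is lost.
 */
bool demo_leave_handler(struct demo_cluster *c)
{
	size_t previous;

	assert(c != NULL);
	if (c->nr_nodes == 0)
		return false;

	/* Step 1: synchronous state update. */
	previous = c->nr_nodes;
	c->nr_nodes--;
	c->epoch++;

	/* Step 2: majority check after the state is up to date. */
	if (!demo_has_majority(c->nr_nodes, previous)) {
		c->halted = true;
		return false;
	}

	/* Step 3: recovery is the only part left to run asynchronously. */
	c->recovery_pending = true;
	return true;
}
```

The point is only the order of the three steps: nothing in the handler
has to wait on the event queue, and recovery is the single piece that
still runs later.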