On Tue, May 15, 2012 at 10:53:55AM +0800, Liu Yuan wrote:
> We actually met this the same problem running zookeeper, when we tried
> to remove register/un-register group_fd to help main thread get more
> responsive. The main thread is very unwieldy for now and become a major
> bottleneck for massive nodes cluster. Our effort is trying to get it
> slimmer.

I definitely like the register/un-register group_fd removal idea: it
helps move work off the main thread and simplifies the code.

I'd actually like to take some of the concepts there even further. The
patch already updates most of the cluster state in join synchronously;
what remains is the vdi_inuse update and lots of tiny synchronous code
paths. Why not do all of __sd_join_done synchronously from
sd_join_handler except for the vdi_inuse update, and then make sure all
code looking at vdi_inuse, both updates and reads, is executed in its
own work queue? That should help lift the VDI operations out of the
main thread, even if the VDI workqueue itself remains single threaded.

Similarly, the only thing executed in the event queue for leaves is
check_majority. I wonder if we can get away with executing that after
all the cluster state updates currently in __sd_leave_done, which,
except for recovery, could then be called directly from
sd_leave_handler, avoiding the event queue issues.
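
A minimal sketch in C of the split suggested above, to make the shape of
the idea concrete. It is not sheepdog code: every name in it
(cluster_state, work_queue, queue_work, run_queue, join_handler,
leave_handler, has_majority, vdi_inuse_update) is a hypothetical
stand-in, vdi_inuse is reduced to a counter, the majority rule is a
simplified assumption, and the queue is drained inline where a real
daemon would use a dedicated worker thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16
#define MAX_WORK  16

/* Hypothetical stand-in for the cluster state; not sheepdog's layout. */
struct cluster_state {
	int nodes[MAX_NODES];
	size_t nr_nodes;
	int vdi_inuse_count;	/* stands in for the vdi_inuse bitmap */
};

typedef void (*work_fn)(struct cluster_state *cs, int arg);

struct work {
	work_fn fn;
	int arg;
};

/* Single-consumer FIFO: only code run from here touches vdi_inuse. */
struct work_queue {
	struct work items[MAX_WORK];
	size_t head, tail;
};

static bool queue_work(struct work_queue *q, work_fn fn, int arg)
{
	if (q->tail == MAX_WORK)
		return false;	/* queue full */
	q->items[q->tail].fn = fn;
	q->items[q->tail].arg = arg;
	q->tail++;
	return true;
}

/* Runs pending work in FIFO order and returns how many items ran. */
static size_t run_queue(struct work_queue *q, struct cluster_state *cs)
{
	size_t n = 0;

	while (q->head < q->tail) {
		struct work *w = &q->items[q->head++];
		w->fn(cs, w->arg);
		n++;
	}
	q->head = q->tail = 0;	/* drained, reuse the slots */
	return n;
}

/* The one piece of join handling that stays deferred. */
static void vdi_inuse_update(struct cluster_state *cs, int node_id)
{
	(void)node_id;
	cs->vdi_inuse_count++;
}

/* Join: membership is updated synchronously, vdi_inuse is queued. */
static bool join_handler(struct cluster_state *cs, struct work_queue *vdi_wq,
			 int node_id)
{
	if (cs->nr_nodes == MAX_NODES)
		return false;
	cs->nodes[cs->nr_nodes++] = node_id;
	return queue_work(vdi_wq, vdi_inuse_update, node_id);
}

/* Assumed rule: the remaining nodes must outnumber half the old set. */
static bool has_majority(size_t nr_remaining, size_t nr_before)
{
	return nr_remaining * 2 > nr_before;
}

/* Leave: update membership first, then check majority directly. */
static bool leave_handler(struct cluster_state *cs, int node_id)
{
	size_t before = cs->nr_nodes;

	for (size_t i = 0; i < cs->nr_nodes; i++) {
		if (cs->nodes[i] == node_id) {
			cs->nodes[i] = cs->nodes[--cs->nr_nodes];
			break;
		}
	}
	return has_majority(cs->nr_nodes, before);
}
```

In this model the membership list is already current when join_handler
returns, while vdi_inuse_count only changes once run_queue drains the
VDI queue. Because a single consumer executes every update (and, in a
fuller version, every read) of vdi_inuse, that state would need no
locking of its own even after it leaves the main thread.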