On Tue, May 15, 2012 at 3:35 PM, Christoph Hellwig <hch at infradead.org> wrote:
> On Tue, May 15, 2012 at 10:53:55AM +0800, Liu Yuan wrote:
>> We actually met the same problem running zookeeper, when we tried
>> to remove register/un-register group_fd to help the main thread get more
>> responsive. The main thread is very unwieldy for now and has become a major
>> bottleneck for clusters with massive numbers of nodes. Our effort is
>> trying to get it slimmer.
>
> I definitely like the register/un-register group_fd removal idea, it
> helps moving work off the main thread, and simplifies the code.
>
> I'd actually like to take some of the concepts there even further,
> e.g. the patch already updates most of the cluster state in join
> synchronously, and the rest of it are the vdi_inuse update and lots
> of tiny synchronous code paths. Why not do all of __sd_join_done
> synchronously from sd_join_handler except for the vdi_inuse update,
> and then make sure all code looking at vdi_inuse is executed in its
> own work queue, both for updates and reads from it, which should
> help lifting the VDI operations out of the main thread, even if the
> VDI workqueue remains single threaded by itself.
>
> Pretty similarly the only thing executed in the event queue for
> leaves is check_majority, and I wonder if we can get away with
> executing that after all the cluster state updates currently in
> __sd_leave_done, which except for recovery could then be called
> directly from sd_leave_handler, avoiding the event queue issues.

Agreed. The current check_majority code also makes a lot of redundant
connections, one to each node, which is not a good idea. Why not leave the
health check logic to the membership management module (e.g.
zookeeper/accord/corosync)?

/*
 * Check whether the majority of Sheepdog nodes are still alive or not
 */
static int check_majority(struct sd_node *nodes, int nr_nodes)
{
	int nr_majority, nr_reachable = 0, fd, i;
	char name[INET6_ADDRSTRLEN];

	nr_majority = nr_nodes / 2 + 1;

	/* we need at least 3 nodes to handle network partition
	 * failure */
	if (nr_nodes < 3)
		return 1;

	for (i = 0; i < nr_nodes; i++) {
		addr_to_str(name, sizeof(name), nodes[i].addr, 0);
		fd = connect_to(name, nodes[i].port);
		if (fd < 0)
			continue;

		close(fd);
		nr_reachable++;
		if (nr_reachable >= nr_majority) {
			dprintf("the majority of nodes are alive\n");
			return 1;
		}
	}
	dprintf("%d, %d, %d\n", nr_nodes, nr_majority, nr_reachable);
	eprintf("the majority of nodes are not alive\n");
	return 0;
}

>
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
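
A minimal sketch of the suggestion above, assuming the membership management module (zookeeper/accord/corosync) can already report how many of the previously known nodes are still members. The helper name has_majority and its nr_live parameter are hypothetical and do not come from the Sheepdog tree; only the majority rule (nr_nodes / 2 + 1, with fewer than 3 nodes always passing) mirrors check_majority as quoted.

```c
/*
 * Hypothetical helper, not part of the Sheepdog source: decide whether
 * the majority of nodes is still alive from a member count supplied by
 * the cluster driver, instead of opening a connection to every node.
 *
 * nr_nodes: number of nodes in the membership before the leave event
 * nr_live:  number of those nodes the driver still reports as members
 *           (assumed to be available from the driver's current view)
 */
static int has_majority(int nr_nodes, int nr_live)
{
	int nr_majority = nr_nodes / 2 + 1;

	/* same rule as check_majority: with fewer than 3 nodes a
	 * network partition cannot be detected, so always pass */
	if (nr_nodes < 3)
		return 1;

	return nr_live >= nr_majority;
}
```

With 5 previous members, has_majority(5, 3) returns 1 and has_majority(5, 2) returns 0. Because nothing here blocks on the network, a check of this shape could run directly in the leave handler rather than in the event queue, provided the driver's view can be trusted as the health check.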