On 05/15/2012 03:31 AM, Christoph Hellwig wrote:
> Update the node and vnodes lists as well as the epoch information on the
> master node before replying to the slaves, so that we can avoid a race
> window which gives different sheep the same starting epoch.
>
> This detailed order of corosync messages:
>
>   slave0:cfgchange(join) ->
>       master:send-response(0),
>
>   slave1:cfgchange(join) ->
>       master:send-response(1)
>       master:recv-response(0) -> inc_epoch
>       master:recv-response(1) -> inc_epoch
>
> will cause two responses to contain the same epoch, which gives one slave
> the wrong starting epoch, and thus causes epoch mismatches in the cluster.
>
> It can be fairly easily reproduced by starting a number of sheep very
> quickly on a formatted cluster. The actual reproducer is part of a bigger
> software project, but my coworkers and I hope to contribute a simpler
> reproducer as part of a test suite soon.
>
> Implementing the fix requires passing the authoritative node list from the
> cluster drivers to sd_check_join_cb, similar to how we do it for other
> callbacks from the cluster drivers. The callers in corosync and zookeeper
> don't actually have this list, as they haven't added the joining node yet,
> so this adds additional complications. The corosync version of this has
> been heavily tested, but the zookeeper variant is entirely untested so far.

We actually hit the same problem running zookeeper, when we tried to remove
the register/un-register of group_fd to make the main thread more
responsive. The main thread is very unwieldy right now and has become a
major bottleneck for clusters with many nodes; our effort is aimed at
slimming it down.

Thanks for all your work on it. I am going to give this patch a review.

Thanks,
Yuan