[Sheepdog] [PATCH 4/4] [PATCH 04/10] sheep: update node information and epoch from
Liu Yuan
namei.unix at gmail.com
Tue May 15 04:53:55 CEST 2012
On 05/15/2012 03:31 AM, Christoph Hellwig wrote:
> Update the node and vnodes lists as well as the epoch information on the
> master node before replying to the slaves so that we can avoid a race
> window which gives different sheep the same starting epoch.
>
> This detailed order of corosync messages:
>
> slave0:cfgchange(join) ->
> master:send-response(0),
>
> slave1:cfgchange(join) ->
> master:send-response(1)
> master:recv-response(0) -> inc_epoch
> master:recv-response(1) -> inc_epoch
>
> will cause two responses to contain the same epoch, which gives one slave
> the wrong starting epoch, and thus causes epoch mismatches in the cluster.
>
> It can be fairly easily reproduced by starting a number of sheep very
> quickly on a formatted cluster. The actual reproducer is part of a bigger
> software project, but I and my coworkers hope to contribute a simpler
> reproducer as part of a test suite soon.
>
> Implementing the fix requires passing the authoritative node list from the
> cluster drivers to sd_check_join_cb, similar to how we do it for other
> callbacks from the cluster drivers. The callers in corosync and zookeeper
> don't actually have this list as they didn't add the joining node yet, so
> this adds additional complications. The corosync version of this has been
> heavily tested, but the zookeeper variant is entirely untested so far.
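
For readers following along, the message ordering quoted above can be modelled with a small standalone sketch. This is only a toy model under stated assumptions: `struct master`, `struct join_result`, `join_two_late` and `join_two_early` are hypothetical names made up for illustration, they are not sheepdog functions, and the epoch carried in the reply is assumed to be "current epoch + 1". It only shows why bumping the epoch on receipt of the response lets two replies carry the same value, and why bumping it before replying does not.

```c
/* Toy model of the join race; none of these names exist in sheepdog. */

struct master {
	unsigned int epoch;	/* master's current epoch */
};

struct join_result {
	unsigned int slave0;	/* starting epoch handed to the first joiner */
	unsigned int slave1;	/* starting epoch handed to the second joiner */
	unsigned int master;	/* master's epoch after both joins */
};

/*
 * Old ordering: both responses are sent before either one is received
 * back, and the epoch is only incremented on receipt.
 */
struct join_result join_two_late(unsigned int start)
{
	struct master m = { .epoch = start };
	struct join_result r;

	r.slave0 = m.epoch + 1;	/* master:send-response(0) */
	r.slave1 = m.epoch + 1;	/* master:send-response(1), same value */
	m.epoch++;		/* master:recv-response(0) -> inc_epoch */
	m.epoch++;		/* master:recv-response(1) -> inc_epoch */
	r.master = m.epoch;
	return r;
}

/*
 * Ordering described in the patch: the master updates its epoch before
 * replying, so each response sees the previous join already applied.
 */
struct join_result join_two_early(unsigned int start)
{
	struct master m = { .epoch = start };
	struct join_result r;

	m.epoch++;		/* update first ... */
	r.slave0 = m.epoch;	/* ... then send-response(0) */
	m.epoch++;
	r.slave1 = m.epoch;	/* send-response(1) carries a distinct epoch */
	r.master = m.epoch;
	return r;
}
```

Starting from epoch 1, `join_two_late(1)` hands epoch 2 to both joiners while the master ends at 3, which is the mismatch described in the quoted text; `join_two_early(1)` hands out 2 and 3, and the master also ends at 3.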
We actually hit this same problem when running zookeeper, when we tried
to remove the register/un-register of group_fd to help the main thread
become more responsive. The main thread is very unwieldy for now and has
become a major bottleneck for clusters with a massive number of nodes.
Our effort is aimed at making it slimmer.
Thanks for all your work on it, I am going to give this patch a review.
Thanks,
Yuan