On 05/15/2012 03:31 AM, Christoph Hellwig wrote:
> Update the node and vnodes lists as well as the epoch information on the
> master node before replying to the slaves, so that we can avoid a race
> window which gives different sheep the same starting epoch.
>
> This detailed order of corosync messages:
>
>   slave0:cfgchange(join) ->
>       master:send-response(0),
>
>   slave1:cfgchange(join) ->
>       master:send-response(1)
>       master:recv-response(0) -> inc_epoch
>       master:recv-response(1) -> inc_epoch
>
> will cause two responses to contain the same epoch, which gives one slave
> the wrong starting epoch, and thus causes epoch mismatches in the cluster.
>
> It can be fairly easily reproduced by starting a number of sheep very
> quickly on a formatted cluster. The actual reproducer is part of a bigger
> software project, but my coworkers and I hope to contribute a simpler
> reproducer as part of a test suite soon.
>
> Implementing the fix requires passing the authoritative node list from the
> cluster drivers to sd_check_join_cb, similar to how we do it for other
> callbacks from the cluster drivers. The callers in corosync and zookeeper
> don't actually have this list, as they haven't added the joining node yet,
> so this adds additional complications. The corosync version of this has
> been heavily tested, but the zookeeper variant is entirely untested so far.

We actually hit the same problem running zookeeper, when we tried to remove
the register/un-register of group_fd to make the main thread more
responsive. The main thread is very unwieldy right now and has become a
major bottleneck for clusters with many nodes; our effort is aimed at
slimming it down.

Thanks for all your work on it. I am going to give this patch a review.

Thanks,
Yuan