[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request
Shevek
shevek at anarres.org
Fri Apr 27 10:24:50 CEST 2012
On Fri, 2012-04-27 at 00:12 -0700, Shevek wrote:
> On Fri, 2012-04-27 at 12:54 +0800, Liu Yuan wrote:
To clarify, here is a fuller failure analysis, laid out as a timeline,
that shows how the two patches relate:
Slave inits CPG
  * Failure: don't care
Slave joins CPG
  * Slave crashes: confchg does cleanup and unblocks queue
  * Master crashes: Our patch S003 is required
    - JOIN cevent has msg = null, so __dispatch_one does not process it
  * Slave hangs: Cluster hangs. We have a pending fix for this.
Slave sends join_message
  * Slave crashes: We get two epochs - a join and a leave - untested
  * Master crashes: Our patch S003 is required
    - JOIN cevent is blocked, so __dispatch_one does not process it
  * Slave hangs: OK until it is elected master
Master sends join_response
  * Slave crashes: Cleanup is OK
  * Master crashes: Prior c4e3 patch marks master as gone
    - JOIN cevent is now unblocked, so everybody else processes it OK
The remaining case is a slave that starts in an uncoalesced corosync and
elects itself master, in which case we get:
  * Master joins CPG. Everything breaks.
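
As an illustration of the "Master crashes" rows above, here is a minimal toy
model in C of an in-order event queue whose head is a blocked JOIN event.
It is a sketch under stated assumptions, not sheepdog code: every name
(toy_event, toy_queue, toy_dispatch, toy_master_left, toy_simulate) is
hypothetical, and the recovery step shown (unblocking pending JOIN events
when the master is seen to leave) is only one plausible reaction, not
necessarily what patch S003 does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; not sheepdog's real data structures. */
enum toy_type { TOY_JOIN, TOY_LEAVE };

struct toy_event {
	enum toy_type type;
	int node;      /* node the event is about */
	bool blocked;  /* true while waiting for the master's join_response */
};

#define TOY_QUEUE_MAX 8

struct toy_queue {
	struct toy_event ev[TOY_QUEUE_MAX];
	size_t head, tail;
};

static void toy_push(struct toy_queue *q, enum toy_type type, int node,
		     bool blocked)
{
	assert(q->tail < TOY_QUEUE_MAX);
	q->ev[q->tail].type = type;
	q->ev[q->tail].node = node;
	q->ev[q->tail].blocked = blocked;
	q->tail++;
}

/*
 * Dispatch events strictly in order and stop at the first blocked one.
 * Returns the number of events dispatched.
 */
static size_t toy_dispatch(struct toy_queue *q)
{
	size_t done = 0;

	while (q->head < q->tail && !q->ev[q->head].blocked) {
		q->head++;
		done++;
	}
	return done;
}

/*
 * Assumed recovery step: the master left without answering, so no
 * join_response will ever arrive; unblock the JOIN events still waiting.
 */
static void toy_master_left(struct toy_queue *q)
{
	for (size_t i = q->head; i < q->tail; i++)
		if (q->ev[i].type == TOY_JOIN)
			q->ev[i].blocked = false;
}

/*
 * Node 2 joins, then master node 1 crashes before sending join_response.
 * Returns how many queued events node 2 manages to dispatch.
 */
static size_t toy_simulate(bool recover)
{
	struct toy_queue q = { .head = 0, .tail = 0 };

	toy_push(&q, TOY_JOIN, 2, true);    /* our own JOIN, still blocked */
	toy_push(&q, TOY_LEAVE, 1, false);  /* master's LEAVE from confchg */
	if (recover)
		toy_master_left(&q);
	return toy_dispatch(&q);
}
```

In this model toy_simulate(false) dispatches nothing, because the master's
LEAVE is stuck behind the blocked JOIN, while toy_simulate(true) drains both
events.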
> > Thanks for your patch
> >
> > I think commit c4e3559758b2e is dedicated to this problem:
> > mastership is actually transferred to the second sheep. So I suspect
> > that the hang is caused by another bug.
>
> Our patch is correct and required in addition to the c4e3 patch.
> Christoph Hellwig and I proved it with gdb, and we have detailed log
> files and memory dumps to prove this patch correct.
>
> Our patch is required because if the master never sends a join_response,
> the secondary sheep will have a blocked JOIN event for itself in its
> queue, and has never set join_finished, so has never built cpg_nodes,
> and cannot set .gone. The c4e3 patch only works if the master has
> unblocked the JOIN event by sending a join_response.
S.