On Fri, 2012-04-27 at 00:12 -0700, Shevek wrote:
> On Fri, 2012-04-27 at 12:54 +0800, Liu Yuan wrote:

To clarify: a fuller failure analysis, demonstrating the relationship
between the two patches, as a timeline:

Slave inits CPG
* Failure: don't care

Slave joins CPG
* Slave crashes: confchg does cleanup and unblocks queue
* Master crashes: Our patch S003 is required - JOIN cevent has msg =
  null, so __dispatch_one does not process it
* Slave hangs: Cluster hangs. We have a pending fix for this.

Slave sends join_message
* Slave crashes: We get two epochs - a join and a leave - untested
* Master crashes: Our patch S003 is required - JOIN cevent is blocked,
  so __dispatch_one does not process it
* Slave hangs: OK until it is elected master

Master sends join_response
* Slave crashes: Cleanup is OK
* Master crashes: Prior c4e3 patch marks master as gone - JOIN cevent
  is now unblocked, so everybody else processes it OK

The remaining case is a slave starts in an uncoalesced corosync and
elects itself, in which case we get:
* Master joins CPG. Everything breaks.

> > Thanks for your patch
> >
> > I think the (commit: c4e3559758b2e) dedicated to this problem.
> > mastership is actually transferred to the second sheep. So I suspect
> > that hang is caused by other bug.
>
> Our patch is correct and required in addition to the c4e3 patch.
> Christoph Hellwig and I proved it with gdb, and we have detailed log
> files and memory dumps to prove this patch correct.
>
> Our patch is required because if the master never sends a join_response,
> the secondary sheep will have a blocked JOIN event for itself in its
> queue, and has never set join_finished, so has never built cpg_nodes,
> and cannot set .gone. The c4e3 patch only works if the master has
> unblocked the JOIN event by sending a join_response.

S.
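
The "Master crashes" cases in the timeline above can be illustrated with
a toy model. This is only a sketch, not sheepdog code: the Sheep and
Event classes, on_master_left() and the unblock_own_join flag are all
hypothetical, and the thread does not show what S003 actually changes.
The sketch simply assumes that some extra step must unblock the slave's
own pending JOIN event when the master leaves before sending a
join_response.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str              # "JOIN" or "LEAVE"
    node: str              # node the event is about
    blocked: bool = False  # a blocked event stalls everything behind it


@dataclass
class Sheep:
    # Toy stand-in for one node; field names echo the email, the class is hypothetical.
    name: str
    join_finished: bool = False
    cpg_nodes: dict = field(default_factory=dict)  # node name -> {"gone": bool}
    queue: deque = field(default_factory=deque)
    processed: list = field(default_factory=list)

    def dispatch_one(self) -> bool:
        """Process the head event unless it is blocked (models __dispatch_one)."""
        if not self.queue or self.queue[0].blocked:
            return False
        ev = self.queue.popleft()
        self.processed.append((ev.kind, ev.node))
        return True

    def drain(self) -> None:
        while self.dispatch_one():
            pass

    def on_master_left(self, master: str, unblock_own_join: bool) -> None:
        """Hypothetical confchg handler for the master leaving the group."""
        if self.join_finished and master in self.cpg_nodes:
            # Assumed c4e3-style handling: needs cpg_nodes, so needs join_finished.
            self.cpg_nodes[master]["gone"] = True
        elif unblock_own_join:
            # ASSUMPTION standing in for S003: release our own pending JOIN.
            for ev in self.queue:
                if ev.kind == "JOIN" and ev.node == self.name:
                    ev.blocked = False
        self.queue.append(Event("LEAVE", master))
        self.drain()


def master_crashes_before_response(unblock_own_join: bool) -> Sheep:
    """Slave has sent join_message; master dies before join_response."""
    slave = Sheep("slave")
    slave.queue.append(Event("JOIN", "slave", blocked=True))
    slave.on_master_left("master", unblock_own_join)
    return slave
```

With unblock_own_join=False the slave's queue keeps both the blocked JOIN
and the master's LEAVE and nothing is ever processed, which is the hang
described above. With unblock_own_join=True both events drain in order.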