[Sheepdog] PATCH S003: Handle master crashing before sending JOIN request

Fri Apr 27 15:55:53 CEST 2012

On 04/27/2012 04:24 PM, Shevek wrote:

> To clarify: a fuller failure analysis, demonstrating the relationship
> between the two patches, as a timeline:
> 
> Slave inits CPG
> 
>   * Failure: don't care
> 
> Slave joins CPG
> 
>   * Slave crashes: confchg does cleanup and unblocks queue
>   * Master crashes: Our patch S003 is required
>     - JOIN cevent has msg = null, so __dispatch_one does not process it
>   * Slave hangs: Cluster hangs. We have a pending fix for this.
> 
> Slave sends join_message
> 
>   * Slave crashes: We get two epochs - a join and a leave - untested
>   * Master crashes: Our patch S003 is required
>     - JOIN cevent is blocked, so __dispatch_one does not process it
>   * Slave hangs: OK until it is elected master
> 
> Master sends join_response
> 
>   * Slave crashes: Cleanup is OK
>   * Master crashes: Prior c4e3 patch marks master as gone
>     - JOIN cevent is now unblocked, so everybody else processes it OK
> 
> The remaining case is a slave starts in an uncoalesced corosync and
> elects itself, in which case we get:
> 
>   * Master joins CPG. Everything breaks.

Excellent analysis. Thanks very much. These cases are hard to
reproduce but very destructive. Looking forward to your next patch series.
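
The master-crash rows of the quoted timeline can be sketched as a small
model. This is only an illustration, not Sheepdog code: the enum, the struct,
and every function name below are hypothetical, and the only behaviour
modelled is what the timeline states (msg is null before join_message, the
JOIN cevent is blocked until join_response, and it is unblocked afterwards).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical join phases, named after the timeline steps. */
enum join_phase {
    PHASE_JOINED_CPG,     /* slave joined the CPG, no join_message yet */
    PHASE_SENT_JOIN_MSG,  /* join_message sent, no join_response yet */
    PHASE_GOT_RESPONSE    /* master has sent join_response */
};

/* Hypothetical stand-in for the queued JOIN cevent; not Sheepdog's struct. */
struct join_cevent {
    const char *msg;  /* NULL until the join_message has arrived */
    bool blocked;     /* true while the join_response is outstanding */
};

/* Build the JOIN cevent state the timeline describes for each phase. */
static struct join_cevent cevent_for_phase(enum join_phase phase)
{
    struct join_cevent ev = { NULL, false };

    assert(phase <= PHASE_GOT_RESPONSE);
    if (phase >= PHASE_SENT_JOIN_MSG)
        ev.msg = "join_message";
    if (phase == PHASE_SENT_JOIN_MSG)
        ev.blocked = true;
    return ev;
}

/* In this model a dispatcher skips events with no message or still blocked. */
static bool can_dispatch(const struct join_cevent *ev)
{
    return ev->msg != NULL && !ev->blocked;
}

/* True when a master crash in this phase leaves the JOIN cevent stuck. */
static bool master_crash_stalls_queue(enum join_phase phase)
{
    struct join_cevent ev = cevent_for_phase(phase);

    return !can_dispatch(&ev);
}
```

In this model master_crash_stalls_queue() is true for the first two phases
(the cases the timeline assigns to S003) and false once join_response has
been sent. It does not attempt to show how the patch itself recovers.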

Thanks,
Yuan

To Yunkai,
I think our zk_driver has the same syndrome, so we'd better give it a
full examination.