On Fri, 2012-04-27 at 12:54 +0800, Liu Yuan wrote:
> Hi, Shevek
>
> On 04/27/2012 08:15 AM, Shevek wrote:
> >
> > A problem arises if a node joins the cluster and generates a
> > confchg event, then crashes or leaves without sending a join
> > request and receiving a join response. The second node to join
> > never becomes master, and the entire cluster hangs.
> >
> > This patch allows a node to detect whether it should promote itself
> > to master after an arbitrary confchg event. Every node except the
> > master creates a blocked JOIN event for every node that joined
> > after itself, therefore the master is the node which has a JOIN
> > event for every node in the members list.
> >
> > A following patch will handle the case where a join request
> > is sent, but the master crashes before sending a join response.
>
> Thanks for your patch.
>
> I think commit c4e3559758b2e is dedicated to this problem:
> mastership is actually transferred to the second sheep. So I suspect
> that the hang is caused by another bug.

Our patch is correct and is required in addition to the c4e3 patch.
Christoph Hellwig and I verified this with gdb, and we have detailed
log files and memory dumps which demonstrate the problem.

Our patch is required because if the master never sends a
join_response, the secondary sheep still has a blocked JOIN event for
itself in its queue, has never set join_finished, so has never built
cpg_nodes, and therefore cannot set .gone. The c4e3 patch only works
if the master has already unblocked the JOIN event by sending a
join_response.

> Is there a way to confirm or reproduce this hang reliably?

Yes, we did it in our lab. If you want to reproduce the cluster hang
without a complicated network setup, hack a master sheep to:

a) start a cluster
b) join itself fully
c) format
d) wait for the confchg event for a secondary sheep
e) exit/crash/net-fail before sending any join_response messages to
   the second sheep

and the cluster will hang. When this patch is applied, the cluster
un-hangs.

The .gone patch is not sufficient, because it does not unblock the
message queue if the crashing master has not yet sent a join_response
to the second sheep. The .gone patch only works if the master accepted
the second sheep before crashing or losing network connectivity.

> > There is a third outstanding issue if two clusters merge, also to be
> > addressed in a following patch.
>
> What kind of issue?

Issue 1: There is a race condition around inc_epoch which causes sheep
to end up with mismatched epochs. We have a good patch which solves
this, and we will publish it in the next couple of days.

Issue 2: Sheepdog builds the cpg_nodes array on the assumption that
corosync always delivers messages in the same order to every node, but
there are circumstances where corosync can deliver messages
(especially confchg events) in a different order to different sheep.
We have reproduced this; it is not a bug in corosync, it is simply a
fact of networking. If corosync is started very rapidly after
sheepdog, or the network partitions and then rejoins, or sheepdog is
started before the corosync ring has coalesced, then different sheep
end up with inconsistent copies of cpg_nodes, and therefore with an
inconsistent view of who is the master. This causes really weird
effects. We do not yet have a patch for this, but we understand the
problem well and can reproduce it in our lab, so we will write a patch
soon.
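To make the promotion rule from the patch description concrete, here
is a minimal sketch of the check a node can run after a confchg event.
The structure and helper names (work_event, have_join_event_for,
should_become_master) are illustrative only, not the exact code in our
patch:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the real sheepdog structures. */
struct cpg_node {
	uint32_t nodeid;
	uint32_t pid;
};

enum event_type { EVENT_JOIN, EVENT_LEAVE };

struct work_event {
	enum event_type type;
	struct cpg_node sender;
	bool blocked;	/* JOIN events stay blocked until a join_response */
};

static bool same_node(const struct cpg_node *a, const struct cpg_node *b)
{
	return a->nodeid == b->nodeid && a->pid == b->pid;
}

/* Do we have a queued JOIN event (blocked or not) for this member? */
static bool have_join_event_for(const struct work_event *queue, int nr_events,
				const struct cpg_node *member)
{
	int i;

	for (i = 0; i < nr_events; i++)
		if (queue[i].type == EVENT_JOIN &&
		    same_node(&queue[i].sender, member))
			return true;
	return false;
}

/*
 * A node queues a JOIN event for itself and for every node that joined
 * after it, so the only node holding a JOIN event for every current
 * member is the oldest survivor: if that is us, promote ourselves to
 * master after the confchg instead of waiting forever for a
 * join_response that will never arrive.
 */
static bool should_become_master(const struct work_event *queue, int nr_events,
				 const struct cpg_node *member_list,
				 int member_list_entries)
{
	int i;

	for (i = 0; i < member_list_entries; i++)
		if (!have_join_event_for(queue, nr_events, &member_list[i]))
			return false;
	return true;
}

The point is that the decision is purely local: the node only needs
its own event queue and the member list from the confchg callback, so
it does not depend on any message from the crashed master.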
> +	// Exactly one non-master member has seen join events for all other
> +	// members, because events are ordered.
> +	for (i = 0; i < member_list_entries; i++) {
> +		struct cpg_node member = {
>
> please use the /* */ to comment multiple lines.

I will do that in future. Thank you.

S.