On 09/18/2011 11:35 AM, Liu Yuan wrote:
> From: Liu Yuan <tailai.ly at taobao.com>
>
> [Problem]
>
> Currently, sheepdog cannot recover the cluster into a fully functional state if any node in
> the cluster fails to join the cluster after *shutdown* because it is considered unhealthy
> (e.g. the targeted epoch content is corrupted). That is, the cluster can work
> again *only if* all the nodes join the cluster successfully.
>
> For 14 nodes in the cluster,
>
> ==========*===<--- cluster refuses to work
>           ^
>           |
>     unhealthy node
>
> This is quite awkward. After a cluster with many nodes has been shut down, we easily hit
> the condition that some of the nodes are unhealthy and are rejected by the master during the join
> stage. This patch gives sheepdog some intelligence to deal with unhealthy nodes and
> proceed to recovery when all the live nodes reach agreement.
>
> [Design]
>
> This patch adds a new concept to sheepdog, the *leave node*. A _leave node_ is one
> that the master checks and finds unhealthy (unmatched epoch content), so it marks it as a
> 'leave node', meaning that it is supposed to leave the cluster. The leave nodes are queued
> in the leave list, which *only* exists during SD_STATUS_WAIT_FOR_FORMAT. All the leave nodes stop
> the sheep daemon automatically after being started.
>

Typo. Should be SD_STATUS_WAIT_FOR_JOIN.

Yuan
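
For readers following the thread, the leave-list idea from [Design] could be sketched roughly as below. This is only an illustration, not the code in the patch: every name here (struct cluster, master_handle_join, cluster_can_start, the checksum field standing in for "epoch content") is hypothetical, and the start condition (joined nodes plus leave nodes equals the number of nodes in the last epoch) is an assumption about what "all the nodes alive reach the agreement" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64

/* Hypothetical types; not sheepdog's actual structures. */
enum join_result { JOIN_OK, JOIN_LEAVE };

struct cluster {
    uint32_t epoch;            /* epoch the master expects */
    uint64_t epoch_checksum;   /* stand-in for the "epoch content" */
    size_t nr_expected;        /* nodes recorded in the last epoch */
    size_t nr_joined;          /* healthy nodes that have joined */
    int leave_list[MAX_NODES]; /* ids of nodes marked as leave nodes */
    size_t nr_leave;
};

static void cluster_init(struct cluster *c, uint32_t epoch,
                         uint64_t checksum, size_t nr_expected)
{
    assert(nr_expected <= MAX_NODES);
    c->epoch = epoch;
    c->epoch_checksum = checksum;
    c->nr_expected = nr_expected;
    c->nr_joined = 0;
    c->nr_leave = 0;
}

/* Master-side check while waiting for nodes to join: a node whose
 * epoch content does not match is queued in the leave list instead
 * of blocking the whole cluster. */
static enum join_result master_handle_join(struct cluster *c, int node_id,
                                           uint32_t epoch, uint64_t checksum)
{
    if (epoch != c->epoch || checksum != c->epoch_checksum) {
        assert(c->nr_leave < MAX_NODES);
        c->leave_list[c->nr_leave++] = node_id;
        return JOIN_LEAVE; /* the node is expected to stop itself */
    }
    c->nr_joined++;
    return JOIN_OK;
}

/* Assumed agreement rule: every node from the last epoch has either
 * joined or been marked as a leave node. */
static bool cluster_can_start(const struct cluster *c)
{
    return c->nr_joined + c->nr_leave == c->nr_expected;
}
```

With 14 expected nodes and one node reporting mismatched epoch content, this sketch lets the cluster start once the other 13 have joined, which is the behaviour the [Problem] section asks for.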