On 10/31/2011 06:29 PM, Yibin Shen wrote:
> On Mon, Oct 31, 2011 at 6:15 PM, Liu Yuan <namei.unix at gmail.com> wrote:
>>
>> On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:
>>
>>> At Tue, 25 Oct 2011 15:06:05 +0800,
>>> Liu Yuan wrote:
>>>>
>>>> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
>>>>
>>>>> From: Yibin Shen <zituan at taobao.com>
>>>>>
>>>>> In some situation, sheep may disconnected from corosync instantaneously,
>>>>> at the same time, both sheep and corosync will keep running but
>>>>> none of them exit, then the disconnected sheep may receive a confchg
>>>>> message from corosync which notify this sheep has left.
>>>>> that will lead to a network partition, this patch fix it.
>>>>>
>>>>> Signed-off-by: Yibin Shen <zituan at taobao.com>
>>>>> ---
>>>>>  sheep/group.c | 3 +++
>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/sheep/group.c b/sheep/group.c
>>>>> index e22dabc..ab5a9f0 100644
>>>>> --- a/sheep/group.c
>>>>> +++ b/sheep/group.c
>>>>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
>>>>>  	struct work_leave *w = NULL;
>>>>>  	int i, size;
>>>>>
>>>>> +	if (node_cmp(left, &sys->this_node) == 0)
>>>>> +		panic("BUG: this node can't be on the left list\n");
>>>>> +
>>>>
>>>>
>>>> Hmm, the panic output looks confusing. how about "Network Patition Bug:
>>>> I should have exited.\n"? since the output will be seen by
>>>> administrators, not only programmer.
>>>
>>> Applied after modifying output text, thanks!
>>>
>>> Kazutaka
>>
>>
>> Kazutaka,
>> Maybe we should not panic out when it becomes a single node cluster.
>> The node will change into HALT state which doesn't any harm to its data.
>>
>> This is introduced by a corosync bug which is fixed by Yunkai in latest
>> corosync. During patch fixing, Yunkai found that corosync will likely
>> re-join the old configuration soon after a short break-out with other
>> corosync nodes.
> they are two different bugs,
> and this one is somewhat more sophisticated than YunKai's fixed bug
> (http://lists.corosync.org/pipermail/discuss/2011-October/000150.html),
>
> anyway, before We know what exactly occurred inside corosync,
> I think let sheep die in such situation is safe.

Yes, panicking right now is sufficient. If corosync gets this sorted out
later, we should consider not panicking.

Thanks,
Yuan
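
For reference, the two behaviours discussed in this thread (panic when this
node finds itself on the left list, versus staying up in a halted state) can
be sketched as a small standalone decision function. This is only an
illustration under stated assumptions, not the sheepdog code: struct
node_entry, node_entry_cmp(), enum leave_action and on_leave() are
hypothetical stand-ins for struct sheepdog_node_list_entry, node_cmp() and
the logic inside sd_leave_handler().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sheepdog_node_list_entry. */
struct node_entry {
	uint8_t addr[16];
	uint16_t port;
};

/* Hypothetical comparison: returns 0 when both entries name the same node. */
static int node_entry_cmp(const struct node_entry *a,
			  const struct node_entry *b)
{
	int c = memcmp(a->addr, b->addr, sizeof(a->addr));

	if (c != 0)
		return c;
	return (a->port > b->port) - (a->port < b->port);
}

enum leave_action {
	LEAVE_PROCESS,	/* another node left: handle the membership change */
	LEAVE_PANIC,	/* this node is on the left list: stop (the patch) */
	LEAVE_HALT,	/* this node is on the left list: stay up, halted */
};

/*
 * Decide what to do with a leave notification.  halt_instead_of_panic
 * selects the alternative raised in the thread; it is an assumed knob,
 * not an existing sheep option.
 */
static enum leave_action on_leave(const struct node_entry *left,
				  const struct node_entry *self,
				  int halt_instead_of_panic)
{
	assert(left != NULL && self != NULL);

	if (node_entry_cmp(left, self) != 0)
		return LEAVE_PROCESS;

	return halt_instead_of_panic ? LEAVE_HALT : LEAVE_PANIC;
}
```

With halt_instead_of_panic set to 0 this mirrors the applied patch: a leave
notification naming the local node is treated as fatal. Setting it to 1
models the option that could be revisited once the corosync side is
understood.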