At Mon, 31 Oct 2011 21:44:34 +0800,
Chaos Eternal wrote:
>
> IMHO,
>
> In case cluster partitioning happens, we should introduce some
> STONITH techniques to avoid data corruption.
> Generally, the steps are as follows:
> 1. partitioning detected
> 2. wait some interval to confirm the partition
> 3. vote to STONITH

I'm not familiar with STONITH. What exactly is done in the "vote"
phase?

Thanks,

Kazutaka

>
> STONITH can truly shut the node down, or just seize the running of
> those nodes. We can discuss further.
>
>
>
> On Mon, Oct 31, 2011 at 8:36 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> > On 10/31/2011 07:49 PM, MORITA Kazutaka wrote:
> >
> >> At Mon, 31 Oct 2011 19:37:51 +0800,
> >> Liu Yuan wrote:
> >>>
> >>> On 10/31/2011 07:10 PM, MORITA Kazutaka wrote:
> >>>
> >>>> At Mon, 31 Oct 2011 18:15:06 +0800,
> >>>> Liu Yuan wrote:
> >>>>>
> >>>>> On 10/31/2011 06:00 PM, MORITA Kazutaka wrote:
> >>>>>
> >>>>>> At Tue, 25 Oct 2011 15:06:05 +0800,
> >>>>>> Liu Yuan wrote:
> >>>>>>>
> >>>>>>> On 10/25/2011 02:55 PM, zituan at taobao.com wrote:
> >>>>>>>
> >>>>>>>> From: Yibin Shen <zituan at taobao.com>
> >>>>>>>>
> >>>>>>>> In some situations, sheep may be disconnected from corosync
> >>>>>>>> instantaneously. Both sheep and corosync keep running and
> >>>>>>>> neither of them exits, so the disconnected sheep may receive a
> >>>>>>>> confchg message from corosync notifying it that this sheep has left.
> >>>>>>>> That will lead to a network partition; this patch fixes it.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Yibin Shen <zituan at taobao.com>
> >>>>>>>> ---
> >>>>>>>>  sheep/group.c |    3 +++
> >>>>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/sheep/group.c b/sheep/group.c
> >>>>>>>> index e22dabc..ab5a9f0 100644
> >>>>>>>> --- a/sheep/group.c
> >>>>>>>> +++ b/sheep/group.c
> >>>>>>>> @@ -1467,6 +1467,9 @@ static void sd_leave_handler(struct sheepdog_node_list_entry *left,
> >>>>>>>>  	struct work_leave *w = NULL;
> >>>>>>>>  	int i, size;
> >>>>>>>>
> >>>>>>>> +	if (node_cmp(left, &sys->this_node) == 0)
> >>>>>>>> +		panic("BUG: this node can't be on the left list\n");
> >>>>>>>> +
> >>>>>>>
> >>>>>>>
> >>>>>>> Hmm, the panic output looks confusing. How about "Network Partition Bug:
> >>>>>>> I should have exited.\n"? The output will be seen by
> >>>>>>> administrators, not only programmers.
> >>>>>>
> >>>>>> Applied after modifying output text, thanks!
> >>>>>>
> >>>>>> Kazutaka
> >>>>>
> >>>>>
> >>>>> Kazutaka,
> >>>>> Maybe we should not panic out when it becomes a single node cluster.
> >>>>> The node will change into HALT state, which doesn't do any harm to its data.
> >>>>
> >>>> It is much better. Currently, Sheepdog kills a minority cluster in
> >>>> __sd_leave() when network partition occurs because it is the simplest
> >>>> solution to keep data consistency.
> >>>>
> >>>> But this looks like a different issue from this patch. Does corosync
> >>>> include the local node in the left list when network partition occurs? If
> >>>> so, we should handle it in the corosync cluster driver because it
> >>>> looks like a corosync specific issue to me.
> >>>>
> >>>
> >>> I am not sure, but if corosync includes the local node in the left list, it
> >>> should be a bug in corosync.
> >>>
> >>> Let's assume three nodes (a,b,c).
> >>> I suspect that the left message is for n(b,c), but after n(a)
> >>> rejoins, for whatever reason, the message is being broadcast, and
> >>> n(a) just gets it wrongly.
> >>
> >> IIUC, n(a) should receive the left message of n(b,c).
> >
> > Yes, but n(a) should not receive the leave_message(a) which is intended
> > for n(b,c).
> >
> > So the correct sequence should be:
> >
> > network partition happens
> > lm(a) -> n(b,c), lm(b,c) -> n(a).
> > then n(a) rejoins.
> > jm(a) -> n(a,b,c)
> >
> > Anyway, I am not sure, because I didn't look at the log.
> >
> > Thanks,
> > Yuan
> > --
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog
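
For reference, the three-node sequence Yuan describes (a,b,c) can be sketched in C as below. This is only an illustration under assumptions: `struct node`, `node_equal()`, `local_in_left_list()` and `is_majority()` are hypothetical names, not taken from sheep/group.c or the corosync API, and the strict-majority rule is one assumed reading of "kills a minority cluster", not the actual `__sd_leave()` logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node identity; NOT Sheepdog's sheepdog_node_list_entry. */
struct node {
	char addr[16];
	unsigned short port;
};

/* Two nodes are the same if address and port both match. */
static bool node_equal(const struct node *a, const struct node *b)
{
	return a->port == b->port &&
	       strncmp(a->addr, b->addr, sizeof(a->addr)) == 0;
}

/*
 * Return true if the local node shows up in the list of nodes that
 * the cluster driver reported as having left. In the expected
 * sequence this never happens: lm(a) goes to n(b,c) only.
 */
static bool local_in_left_list(const struct node *left, size_t nr_left,
			       const struct node *self)
{
	for (size_t i = 0; i < nr_left; i++)
		if (node_equal(&left[i], self))
			return true;
	return false;
}

/*
 * Assumed rule: a partition may keep running only if it still holds
 * a strict majority of the nodes that were members before the split.
 */
static bool is_majority(size_t nr_remaining, size_t nr_before)
{
	assert(nr_before > 0);
	return nr_remaining * 2 > nr_before;
}
```

With nodes a, b and c, the leave list delivered to n(a) is {b, c}, so `local_in_left_list()` is false on a. It only becomes true if a is handed lm(a), which is the case the patch panics on. Under the assumed majority rule, a alone (1 of 3) is the minority side, while b and c together (2 of 3) keep running.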