[Sheepdog] [PATCH v2] sheep: tame sheep to recover the crash cluster

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Tue Sep 27 00:09:21 CEST 2011


At Mon, 26 Sep 2011 11:43:34 -0700 (PDT),
Ski Mountain wrote:
> 
> What happens if one of the nodes in the cluster is not recoverable at all (e.g. a fried motherboard)?  Can you just start up the VMs that were on the dead machine on another machine in the cluster?

If the unrecoverable node doesn't have the latest epoch info, nothing
special is needed.  Start the sheep daemon on all the other machines
and the cluster will work again.

But if the failed node has the latest epoch, we need a manual
recovery, because there is a risk of data loss in this case.  I think
this rarely happens, though.
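
To make the two cases above concrete, here is a minimal sketch in C.
It is not Sheepdog code: ex_max_epoch() and ex_needs_manual_recovery()
are hypothetical helpers, and the sketch assumes that "has the latest
epoch" means the lost node recorded a higher epoch than every
surviving node.  It also assumes the lost node's epoch is known from
somewhere, which may not be true in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Sheepdog: returns the highest
 * epoch recorded by any surviving node, or 0 if there are none. */
uint32_t ex_max_epoch(const uint32_t *epochs, size_t nr)
{
    uint32_t max = 0;

    for (size_t i = 0; i < nr; i++)
        if (epochs[i] > max)
            max = epochs[i];
    return max;
}

/* Hypothetical helper: returns true when the lost node held an epoch
 * newer than every survivor's (assumed reading of "has the latest
 * epoch"), i.e. the case that needs manual recovery. */
bool ex_needs_manual_recovery(uint32_t lost_epoch,
                              const uint32_t *survivor_epochs,
                              size_t nr_survivors)
{
    assert(survivor_epochs != NULL || nr_survivors == 0);
    return lost_epoch > ex_max_epoch(survivor_epochs, nr_survivors);
}
```

With survivors at epochs 3 and 4, a lost node at epoch 5 would fall
into the manual recovery case, while a lost node at epoch 4 or lower
would not.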


Thanks,

Kazutaka


> 
> 
> 
> I do love the fact that there is no longer a startup order constraint.
> 
> 
> I also agree that you should be able to rebalance the cluster in the future; the cluster should readily accept change.
> 
> 
> ---
> 
> Message: 1
> Date: Mon, 26 Sep 2011 00:01:02 +0900
> From: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> To: Liu Yuan <namei.unix at gmail.com>
> Cc: sheepdog at lists.wpkg.org
> Subject: Re: [Sheepdog] [PATCH v2] sheep: tame sheep to recover the
>     crash cluster
> Message-ID: <87d3eoll8x.wl%morita.kazutaka at lab.ntt.co.jp>
> Content-Type: text/plain; charset=US-ASCII
> 
> At Sun, 25 Sep 2011 12:00:20 +0800,
> Liu Yuan wrote:
> > 
> > From: Liu Yuan <tailai.ly at taobao.com>
> > 
> > Hi Kazum,
> > would this solve the data loss problem you mentioned when there is
> > no epoch overlap? This patch would allow the cluster to recover as
> > if the master (the last failed node) had never crashed.
> 
> The concept of mastership transfer is great!  I think this is the
> right way to go.
> 
> But your patch has some problems.  Here is a test case to reproduce
> the problems:
> 
>     #!/bin/bash
>     
>     # create a directory which has a different creation time
>     sheep /store/1 -p 7001
>     sleep 1
>     collie cluster format -p 7001
>     collie cluster shutdown -p 7001
>     sleep 1
>     
>     # start Sheepdog
>     sheep /store/0 -p 7000
>     sleep 1
>     collie cluster format -p 7000
>     
>     while true; do
>         sheep /store/1 -p 7001
>         sheep /store/2 -p 7002
>     
>         # wait for node join
>         while [ "`collie cluster info -p 7002 -r 2>&1 | head -1`" != 'running' ]; do
>             sleep 0.1
>         done
>     
>         if [ "`collie node list -p 7002 -r | wc -l`" -ne 2 ]; then
>             # break if the result is not correct
>             break
>         fi
>     
>         pkill -f "sheep /store/2"
>     done
>     
>     # show results
>     collie cluster info -p 7000
>     collie cluster info -p 7002
> 
> 
> The detailed reasons are below.
> 
> > @@ -975,17 +992,6 @@ static void __sd_deliver(struct cpg_event *cevent)
> >          addr_to_str(name, sizeof(name), m->from.addr, m->from.port),
> >          m->pid);
> >
> > -    /*
> > -     * we don't want to perform any deliver events until we
> > -     * join; we wait for our JOIN message.
> > -     */
> > -    if (!sys->join_finished) {
> > -        if (m->pid != sys->this_pid || m->nodeid != sys->this_nodeid) {
> > -            cevent->skip = 1;
> > -            return;
> > -        }
> > -    }
> > -
> 
> Sheepdog assumes that only joined nodes handle the delivered messages,
> so we cannot remove this block.  Instead, you should let only
> mastership transfer events pass through here.
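
One way to read this suggestion is sketched below in C.  It is not the
actual Sheepdog code: the ex_ types, the op codes and their values are
assumptions made up for illustration.  The sketch keeps the check that
the patch removes and adds a single exception for a hypothetical
master-transfer op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative op codes; names and values are assumptions. */
enum ex_op { EX_MSG_JOIN = 1, EX_MSG_LEAVE, EX_MSG_MASTER_TRANSFER };

struct ex_msg {
    enum ex_op op;
    uint32_t nodeid;
    uint32_t pid;
};

struct ex_self {
    bool join_finished;
    uint32_t nodeid;
    uint32_t pid;
};

/* Returns true if the delivered message should be skipped. */
bool ex_skip_deliver(const struct ex_self *self, const struct ex_msg *m)
{
    assert(self != NULL && m != NULL);

    if (self->join_finished)
        return false;   /* joined nodes handle every message */
    if (m->pid == self->pid && m->nodeid == self->nodeid)
        return false;   /* our own message, e.g. our JOIN */
    if (m->op == EX_MSG_MASTER_TRANSFER)
        return false;   /* the only event let through before join */
    return true;        /* everything else waits until we have joined */
}
```

The first two branches mirror the condition in the removed block; only
the third branch is new.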
> 
> 
> >      if (m->op == SD_MSG_JOIN) {
> >          uint32_t nodeid = m->nodeid;
> >          uint32_t pid = m->pid;
> > @@ -1052,7 +1058,15 @@ static void send_join_response(struct work_deliver *w)
> >              jm->nr_leave_nodes++;
> >          }
> >          print_node_list(&sys->leave_list);
> > +    } else if (jm->result != SD_RES_SUCCESS &&
> > +            jm->epoch > sys->epoch &&
> > +            jm->cluster_status == SD_STATUS_WAIT_FOR_JOIN) {
> > +        eprintf("Transfer mastership.\n");
> > +        leave_cluster();
> > +        eprintf("Restart me later when master is up, please.Bye.\n");
> > +        exit(1);
> >      }
> > +    jm->epoch = sys->epoch;
> >      send_message(sys->handle, m);
> >  }
> >
> > @@ -1090,15 +1104,23 @@ static void __sd_deliver_done(struct cpg_event *cevent)
> >                  lm = (struct leave_message *)m;
> >                  add_node_to_leave_list(m);
> >
> > -                if (lm->epoch > sys->leave_epoch)
> > -                    sys->leave_epoch = lm->epoch;
> > +                /* Sheep needs this to identify itself as master.
> > +                 * Now mastership transfer is done.
> > +                 */
> > +                if (!sys->join_finished) {
> > +                    sys->join_finished = 1;
> > +                    move_node_to_sd_list(sys->this_nodeid, sys->this_pid, sys->this_node);
> > +                    sys->epoch = get_latest_epoch();
> > +                }
> 
> IIUC, this code assumes that all other nodes will send leave messages
> because this node has a newer epoch, so this node can become the
> master.  But that assumption is wrong, because a node which has
> completely wrong epoch information (e.g. a node with a different
> creation time) also sends a leave message.
> 
> My suggestion is to introduce another message type, something like
> SD_MSG_MASTER_TRANSFER.  I think we should clearly distinguish master
> transfer events from leave messages.
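
A rough sketch in C of what that separation could look like.  All of
the names here (ex_msg_type, ex_node, ex_handle) are hypothetical and
the real message layout may well differ.  The point is only that a
leave message never promotes the receiver, while an explicit transfer
message does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative message types; names and values are assumptions. */
enum ex_msg_type { EX_LEAVE = 1, EX_MASTER_TRANSFER };

struct ex_node {
    bool join_finished;
    bool is_master;
    unsigned int nr_leave;  /* nodes recorded on the leave list */
};

void ex_handle(struct ex_node *n, enum ex_msg_type type)
{
    assert(n != NULL);

    switch (type) {
    case EX_LEAVE:
        /* Any node may send this, including one with completely
         * wrong epoch information, so it is only recorded. */
        n->nr_leave++;
        break;
    case EX_MASTER_TRANSFER:
        /* Explicit hand-over: only this message promotes us. */
        if (!n->join_finished) {
            n->join_finished = true;
            n->is_master = true;
        }
        break;
    }
}
```

With this split, a node with a different creation time that sends a
leave message only ends up on the leave list and cannot make the
receiver believe it is the master.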
> 
> 
> Thanks,
> 
> Kazutaka
> 
> 
> >
> >                  nr_local = get_nodes_nr_epoch(sys->epoch);
> >                  nr = get_nodes_nr_from(&sys->sd_node_list);
> >                  nr_leave = get_nodes_nr_from(&sys->leave_list);
> > +
> > +                dprintf("%d == %d + %d \n", nr_local, nr, nr_leave);
> >                  if (nr_local == nr + nr_leave) {
> >                      sys->status = SD_STATUS_OK;
> > -                    sys->epoch = sys->leave_epoch + 1;
> > +                    sys->epoch = sys->epoch;
> >                      update_epoch_log(sys->epoch);
> >                      update_epoch_store(sys->epoch);
> >                  }
> > @@ -1931,7 +1953,6 @@ join_retry:
> >      sys->handle = cpg_handle;
> >      sys->this_nodeid = nodeid;
> >      sys->this_pid = getpid();
> > -    sys->leave_epoch = 0;
> >
> >      ret = set_addr(nodeid, port);
> >      if (ret)
> > diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
> > index 6680f79..4711cdd 100644
> > --- a/sheep/sheep_priv.h
> > +++ b/sheep/sheep_priv.h
> > @@ -144,7 +144,6 @@ struct cluster_info {
> >      int nr_outstanding_reqs;
> >
> >      uint32_t recovered_epoch;
> > -    uint32_t leave_epoch; /* The highest number in the clsuter */
> >
> >      int use_directio;
> >
> > -- 
> > 1.7.6.1
> > 
> > -- 
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog