At Tue, 27 Sep 2011 11:43:27 +0800,
Yibin Shen wrote:
> If the latest epoch is unrecoverable, or is a transient epoch,
> should it fall back to the last available epoch?

Yes.  There is no other way to recover the cluster.

Note that "collie cluster recover" would be a dangerous operation.
For example,

  $ sheep /store/0 -p 7000
  $ sheep /store/1 -p 7001
  $ collie cluster format
  $ pkill -f "sheep /store/0"
  $ collie vdi create test 4G  # vdi will be created only on the second node
  $ collie cluster shutdown
  $ sheep /store/0 -p 7000
  $ collie cluster recover     # start Sheepdog with only the first node

Then Sheepdog starts working, but the vdi "test" will be discarded,
because the first node never saw the epoch in which it was created.

In future, I want a force option for "cluster format" and "cluster
recover".

Thanks,

Kazutaka

>
> Yibin Shen
>
> On Tue, Sep 27, 2011 at 11:13 AM, MORITA Kazutaka <
> morita.kazutaka at lab.ntt.co.jp> wrote:
>
> > At Tue, 27 Sep 2011 09:45:49 +0800,
> > Liu Yuan wrote:
> > >
> > > On 09/27/2011 06:09 AM, MORITA Kazutaka wrote:
> > > > At Mon, 26 Sep 2011 11:43:34 -0700 (PDT),
> > > > Ski Mountain wrote:
> > > >> What happens if one of the nodes in the cluster is not
> > > >> recoverable at all, e.g. a fried motherboard? Can you just
> > > >> start up the VMs that were on the dead machine on another
> > > >> machine in the cluster?
> > > > If the unrecoverable node doesn't have the latest epoch info, we need
> > > > to do nothing special.  If you start the sheep daemon on all other
> > > > machines, then the cluster will work again.
> > > >
> > > > But if the failed node has the latest epoch, this is the case where
> > > > we need a manual recovery.  It is because there is a risk of data
> > > > loss in this case, though I think this rarely happens.
> > > >
> > > >
> > >
> > > Hi Kazutaka,
> > >      I do have some idea like 'collie cluster recover' hanging over in
> > > my head.  This kind of brute-force manual recovery would be the last
> > > resort to handle physical highest-epoch node failure in a crashed
> > > cluster or physical node failure in a shut-down cluster.
> >
> > Good point.
> >
> > >
> > > The implementation might be rather easy.  I am thinking of adding a
> > > new SD_MSG_RECOVERY event and broadcasting this event to recover the
> > > cluster with the epoch incremented by 1.  What do you think of it?
> >
> > How about adding a new operation SD_OP_CLUSTER_RECOVERY and
> > broadcasting it with SD_MSG_VDI_OP?  I think it should work like a
> > "collie cluster format" command.
> >
> >
> > Thanks,
> >
> > Kazutaka
> > --
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog
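
To restate the rule discussed in this thread, here is a rough sketch
in plain shell.  It is only a toy model, not Sheepdog code: the
function name recovery_kind and the epoch numbers are made up for
illustration.  It assumes we somehow know the latest epoch each node
has seen, and it only answers whether restarting the surviving nodes
is enough or whether a manual recovery (falling back to the last
available epoch) would be needed.

```shell
#!/bin/sh
# Toy model only (hypothetical helper, not part of Sheepdog).
# Usage: recovery_kind FAILED_EPOCH ALIVE_EPOCH...
#   FAILED_EPOCH  latest epoch the unrecoverable node had seen
#   ALIVE_EPOCH   latest epoch seen by each surviving node
recovery_kind() {
    failed_epoch=$1
    shift

    # Find the newest epoch that still exists on a surviving node.
    max_alive=0
    for e in "$@"; do
        if [ "$e" -gt "$max_alive" ]; then
            max_alive=$e
        fi
    done

    if [ "$failed_epoch" -gt "$max_alive" ]; then
        # Only the dead node knew the latest epoch: data written in
        # that epoch may be lost, so an operator has to decide.
        echo "manual (fall back to epoch $max_alive)"
    else
        # Some surviving node has the latest epoch: just restart sheep.
        echo "automatic"
    fi
}

recovery_kind 2 3 3   # failed node was behind the others
recovery_kind 3 2     # failed node alone held the latest epoch
```

The second call corresponds to the dangerous case in the example
above: the only node that saw the vdi creation is missing, so a
recovery from the remaining node discards it.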