At Tue, 11 Oct 2011 17:27:13 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly at taobao.com>
> 
> We use SD_STATUS_HALT to identify the cluster state when it should not serve
> IO requests.
> 
> [Test Case]
> 
> steps:
> 
> for i in 0 1 2 3; do ./sheep/sheep -d /store/$i -z $i -p 700$i; sleep 1; done
> ./collie/collie cluster format --copies=3;
> for i in 0 1; do pkill -f "sheep -d /store/$i"; sleep 1; done
> for i in 2 3; do ./collie/collie cluster info -p 700$i; done
> for i in 0 1; do ./sheep/sheep -d /store/$i -z $i -p 700$i; sleep 1; done
> for i in 0 1 2 3; do ./collie/collie cluster info -p 700$i; done
> 
> output:
> 
> Cluster status: The node is stopped doing IO, short of living nodes
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> Cluster status: The node is stopped doing IO, short of living nodes
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> Cluster status: running
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      5 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      4 [192.168.0.1:7000, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 
> ...
> 
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
>  sheep/group.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)

The following test doesn't work in my environment:

  $ for i in 0 1; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ collie cluster format
  $ for i in 0; do pkill -f "sheep /store/$i"; sleep 1; done
  $ for i in 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
  $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 0 1 2; do collie cluster info -p 700$i;done
  Cluster status: running

  Creation time        Epoch Nodes
  2011-10-12 17:56:12      4 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-10-12 17:56:12      3 [10.68.14.1:7001]
  2011-10-12 17:56:12      2 [10.68.14.1:7001]
  2011-10-12 17:56:12      1 [10.68.14.1:7000, 10.68.14.1:7001]
  Cluster status: running

  Creation time        Epoch Nodes
  2011-10-12 17:56:12      4 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-10-12 17:56:12      3 [10.68.14.1:7001, 10.68.14.1:7002]
  2011-10-12 17:56:12      2 [10.68.14.1:7001]
  2011-10-12 17:56:12      1 [10.68.14.1:7000, 10.68.14.1:7001]
  failed to connect to localhost:7002, Connection refused

localhost:7002 seems to have a wrong creation time.  Perhaps, the
master multicasts a wrong join_message when its state is
SD_STATUS_HALT?

Thanks,

Kazutaka

> 
> diff --git a/sheep/group.c b/sheep/group.c
> index 2871e97..756f8a6 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -1212,6 +1212,13 @@ static void __sd_notify_done(struct cpg_event *cevent)
>  		}
>  		start_recovery(sys->epoch);
>  	}
> +
> +	if (sys->status == SD_STATUS_HALT) {
> +		int nr_zones = get_zones_nr_from(&sys->sd_node_list);
> +
> +		if (nr_zones >= sys->nr_sobjs)
> +			sys->status = SD_STATUS_OK;
> +	}
>  }
>  
>  static void sd_notify_handler(struct sheepid *sender, void *msg, size_t msg_len)
> @@ -1451,6 +1458,13 @@ static void __sd_leave_done(struct cpg_event *cevent)
>  	if (node_left &&
>  	    (sys->status == SD_STATUS_OK || sys->status == SD_STATUS_HALT))
>  		start_recovery(sys->epoch);
> +
> +	if (sys->status == SD_STATUS_OK) {
> +		int nr_zones = get_zones_nr_from(&sys->sd_node_list);
> +
> +		if (nr_zones < sys->nr_sobjs)
> +			sys->status = SD_STATUS_HALT;
> +	}
>  }
>  
>  static void cpg_event_free(struct cpg_event *cevent)
> -- 
> 1.7.6.1
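
For reference, the rule the two quoted hunks add can be written as a
standalone sketch.  This is not the Sheepdog code: the enum and the
function names status_after_join() and status_after_leave() are
hypothetical, and the plain ints nr_zones and nr_copies stand in for
get_zones_nr_from(&sys->sd_node_list) and sys->nr_sobjs.  It covers
only the OK/HALT switch, not recovery or the other cluster states.

```c
/*
 * Standalone sketch of the OK <-> HALT rule from the quoted patch.
 * All names here are hypothetical; nothing is taken from sheep/group.c
 * except the two comparisons.
 */
enum cluster_status { STATUS_OK, STATUS_HALT };

/* After a join/notify: a halted cluster resumes once enough zones exist. */
enum cluster_status status_after_join(enum cluster_status cur,
				      int nr_zones, int nr_copies)
{
	if (cur == STATUS_HALT && nr_zones >= nr_copies)
		return STATUS_OK;
	return cur;
}

/* After a leave: a running cluster halts when zones drop below copies. */
enum cluster_status status_after_leave(enum cluster_status cur,
				       int nr_zones, int nr_copies)
{
	if (cur == STATUS_OK && nr_zones < nr_copies)
		return STATUS_HALT;
	return cur;
}
```

With three copies, status_after_leave(STATUS_OK, 2, 3) gives
STATUS_HALT and status_after_join(STATUS_HALT, 3, 3) gives STATUS_OK
again, which is the sequence the quoted test case expects.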