[Sheepdog] [PATCH v2 3/3] sheep: use SD_STATUS_HALT to stop serving IO

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Wed Oct 12 11:12:46 CEST 2011


At Tue, 11 Oct 2011 17:27:13 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <tailai.ly at taobao.com>
> 
> We use SD_STATUS_HALT to mark the cluster state in which it should not
> serve IO requests.
> 
> [Test Case]
> 
> steps:
> 
> for i in 0 1 2 3; do ./sheep/sheep -d /store/$i -z $i -p 700$i; sleep 1; done
> ./collie/collie cluster format --copies=3;
> for i in 0 1; do pkill -f "sheep -d /store/$i"; sleep 1; done
> for i in 2 3; do ./collie/collie cluster info -p 700$i; done
> for i in 0 1; do ./sheep/sheep -d /store/$i -z $i -p 700$i; sleep 1; done
> for i in 0 1 2 3; do ./collie/collie cluster info -p 700$i; done
> 
> output:
> 
> Cluster status: The node is stopped doing IO, short of living nodes
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> Cluster status: The node is stopped doing IO, short of living nodes
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> Cluster status: running
> 
> Creation time        Epoch Nodes
> 2011-10-11 16:26:02      5 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      4 [192.168.0.1:7000, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      3 [192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      2 [192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 2011-10-11 16:26:02      1 [192.168.0.1:7000, 192.168.0.1:7001, 192.168.0.1:7002, 192.168.0.1:7003]
> 
> ...
> 
> Signed-off-by: Liu Yuan <tailai.ly at taobao.com>
> ---
>  sheep/group.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)

The following test doesn't work in my environment:

  $ for i in 0 1; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ collie cluster format
  $ for i in 0; do pkill -f "sheep /store/$i"; sleep 1; done
  $ for i in 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
  $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i;sleep 1;done
  $ for i in 0 1 2; do collie cluster info -p 700$i;done
  Cluster status: running
  
  Creation time        Epoch Nodes
  2011-10-12 17:56:12      4 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-10-12 17:56:12      3 [10.68.14.1:7001]
  2011-10-12 17:56:12      2 [10.68.14.1:7001]
  2011-10-12 17:56:12      1 [10.68.14.1:7000, 10.68.14.1:7001]
  Cluster status: running
  
  Creation time        Epoch Nodes
  2011-10-12 17:56:12      4 [10.68.14.1:7000, 10.68.14.1:7001]
  2011-10-12 17:56:12      3 [10.68.14.1:7001, 10.68.14.1:7002]
  2011-10-12 17:56:12      2 [10.68.14.1:7001]
  2011-10-12 17:56:12      1 [10.68.14.1:7000, 10.68.14.1:7001]
  failed to connect to localhost:7002, Connection refused

localhost:7002 seems to have received a wrong creation time.  Perhaps
the master multicasts a wrong join_message while its state is
SD_STATUS_HALT?


Thanks,

Kazutaka

> 
> diff --git a/sheep/group.c b/sheep/group.c
> index 2871e97..756f8a6 100644
> --- a/sheep/group.c
> +++ b/sheep/group.c
> @@ -1212,6 +1212,13 @@ static void __sd_notify_done(struct cpg_event *cevent)
>  		}
>  		start_recovery(sys->epoch);
>  	}
> +
> +	if (sys->status == SD_STATUS_HALT) {
> +		int nr_zones = get_zones_nr_from(&sys->sd_node_list);
> +
> +		if (nr_zones >= sys->nr_sobjs)
> +			sys->status = SD_STATUS_OK;
> +	}
>  }
>  
>  static void sd_notify_handler(struct sheepid *sender, void *msg, size_t msg_len)
> @@ -1451,6 +1458,13 @@ static void __sd_leave_done(struct cpg_event *cevent)
>  	if (node_left &&
>  	    (sys->status == SD_STATUS_OK || sys->status == SD_STATUS_HALT))
>  		start_recovery(sys->epoch);
> +
> +	if (sys->status == SD_STATUS_OK) {
> +		int nr_zones = get_zones_nr_from(&sys->sd_node_list);
> +
> +		if (nr_zones < sys->nr_sobjs)
> +			sys->status = SD_STATUS_HALT;
> +	}
>  }
>  
>  static void cpg_event_free(struct cpg_event *cevent)
> -- 
> 1.7.6.1
> 
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
