[Sheepdog] [PATCH RFC 2/2] sheep: teach sheepdog to better recovery the shut-down cluster

Tue Sep 20 11:33:03 CEST 2011

On 09/20/2011 04:30 PM, MORITA Kazutaka wrote:
> Looks great, but there seems to be some other cases we need to
> consider.  For example:
>
> 1. Start Sheepdog with three daemons.
>    $ for i in 0 1 2; do sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster format
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> 2. Then, kill sheep daemons, and start again in the same order.
>
>    $ for i in 0 1 2; do pkill -f "sheep /store/$i"; sleep 1; done
>    $ for i in 0 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      2 [10.68.14.1:7000]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> The first daemon regards the other two nodes as left nodes, and starts
> working.
>
> 3. Start the other two nodes again.
>
>    $ for i in 1 2; do ./sheep/sheep /store/$i -z $i -p 700$i; sleep 1; done
>    $ collie cluster info
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7000]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    $ collie cluster info -p 7001
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7000, 10.68.14.1:7002]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    $ collie cluster info -p 7002
>    Cluster status: running
>
>    Creation time        Epoch Nodes
>    2011-09-20 16:43:10      4 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      3 [10.68.14.1:7000, 10.68.14.1:7001]
>    2011-09-20 16:43:10      2 [10.68.14.1:7001, 10.68.14.1:7002]
>    2011-09-20 16:43:10      1 [10.68.14.1:7000, 10.68.14.1:7001, 10.68.14.1:7002]
>
> The epoch informations become inconsistent.  It is because the first
> node overwrote the epochs in the other nodes.  Similar situations
> could happen if we start from the daemon which doesn't have the latest
> epoch.
>
> We can get away with claiming that this doesn't happen if the
> administrator is careful enough.  But is there any good idea to solve
> this problem?
>

Good catch. But actually, this patch set doesn't deal with the epoch 
older or newer problem when is started up.

This patch just resolves the cluster startup problem when they are 
*shutdowned* by 'collie cluster shutdown' command. That is, the epoch 
number is the same, but with corrupted epoch content or different ctime.

I think this case (all nodes are down abnormally instead of being 
shutdowned, for e.g power outage) should be solved by another patch, 
because it is, IMHO, a different problem.

When nodes with newer epoch or older epoch should *not* be regarded 
_leave nodes_, they should be processed as soon as they are started up. 
Though, this patch set wrongly take them as leave nodes.

I'll cook a different patch targeting for this problem, well, based on 
this shutdown patch set.
> Some comments are below:
>
>> Signed-off-by: Liu Yuan<tailai.ly at taobao.com>
>> ---
>>   sheep/group.c      |  180 ++++++++++++++++++++++++++++++++++++++++-----------
>>   sheep/sheep_priv.h |   12 ++++
>>   sheep/store.c      |    5 ++
>>   3 files changed, 158 insertions(+), 39 deletions(-)
>>
>> diff --git a/sheep/group.c b/sheep/group.c
>> index 83d85cf..43c352d 100644
>> --- a/sheep/group.c
>> +++ b/sheep/group.c
>> @@ -25,13 +25,6 @@
>>   #include "logger.h"
>>   #include "work.h"
>>
>> -struct node {
>> -	uint32_t nodeid;
>> -	uint32_t pid;
>> -	struct sheepdog_node_list_entry ent;
>> -	struct list_head list;
>> -};
>> -
>>   enum deliver_msg_state {
>>   	DM_INIT = 1,
>>   	DM_CONT,
>> @@ -64,6 +57,12 @@ struct join_message {
>>   		uint32_t pid;
>>   		struct sheepdog_node_list_entry ent;
>>   	} nodes[SD_MAX_NODES];
>> +	uint32_t nr_leave_nodes;
>> +	struct {
>> +		uint32_t nodeid;
>> +		uint32_t pid;
>> +		struct sheepdog_node_list_entry ent;
>> +	} leave_nodes[SD_MAX_NODES];
>>   };
>>
>>   struct leave_message {
>> @@ -396,13 +395,91 @@ static int get_nodes_nr_epoch(int epoch)
>>   	return nr;
>>   }
>>
>> +static struct node * find_leave_node(struct node *node)
> We follow the linux kernel coding style, so "foo * bar" should be
> "foo *bar".
>
Okay.

>> +{
>> +	struct node *n;
>> +	list_for_each_entry(n,&sys->leave_list, list) {
>> +		if (n->nodeid == node->nodeid&&
>> +		    node_cmp(&n->ent,&node->ent) == 0)
> If node_cmp() returns zero, their nodeids are same, no?
>
Ah, I am not sure when corosync also dies. But I think IP+port would be 
enough information to identify a unique node.

>> +			return n;
>> +		dprintf("%d, %d\n", n->nodeid == node->nodeid, node_cmp(&n->ent,&node->ent));
>> +	}
>> +	return NULL;
>> +
>> +}
>> +static int add_node_to_leave_list(struct message_header *msg)
>> +{
>> +	int ret = SD_RES_SUCCESS;
>> +	int nr, i;
>> +	LIST_HEAD(tmp_list);
>> +	struct node *n;
>> +	struct join_message *jm;
>> +
>> +	switch (msg->op) {
>> +	case SD_MSG_LEAVE:
>> +		n = zalloc(sizeof (*n));
> Remove a space between 'sizeof' and '('.
>
>> +		if (!n) {
>> +			ret = SD_RES_NO_MEM;
>> +			goto err;
>> +		}
>> +
>> +		n->nodeid = msg->nodeid;
>> +		n->pid = msg->pid;
>> +		n->ent = msg->from;
>> +		if (find_leave_node(n)) {
>> +			free(n);
>> +			goto ret;
>> +		}
>> +
>> +		list_add_tail(&n->list,&sys->leave_list);
>> +		goto ret;
>> +	case SD_MSG_JOIN:
>> +		jm = (struct join_message *)msg;
>> +		nr = jm->nr_leave_nodes;
>> +		break;
>> +	default:
>> +		ret = SD_RES_INVALID_PARMS;
>> +		goto err;
>> +	}
>> +
>> +	for (i = 0; i<  nr; i++) {
>> +		n = zalloc(sizeof (*n));
> Same here.
>
>> +		if (!n) {
>> +			ret = SD_RES_NO_MEM;
>> +			goto free;
>> +		}
>> +
>> +		n->nodeid = jm->leave_nodes[i].nodeid;
>> +		n->pid = jm->leave_nodes[i].pid;
>> +		n->ent = jm->leave_nodes[i].ent;
>> +		if (find_leave_node(n)) {
>> +			free(n);
>> +			continue;
>> +		}
>> +
>> +		list_add_tail(&n->list,&tmp_list);
>> +	}
>> +	list_splice_init(&tmp_list,&sys->leave_list);
>> +	goto ret;
>> +
>> +free:
>> +	list_for_each_entry(n,&tmp_list, list) {
>> +		free(n);
>> +	}
> Use list_for_each_entry_safe if we remove the list in the loop safely.
> This should be like
>
> 	list_for_each_entry_safe(n, _,&tmp_list, list) {
> 		list_del(&n->list);
> 		free(n);
> 	}
>
>> +ret:
>> +	dprintf("%d\n", get_nodes_nr_from(&sys->leave_list));
>> +	print_node_list(&sys->leave_list);
>> +err:
>> +	return ret;
>> +}
>> +
>>   static int get_cluster_status(struct sheepdog_node_list_entry *from,
>>   			      struct sheepdog_node_list_entry *entries,
>>   			      int nr_entries, uint64_t ctime, uint32_t epoch,
>>   			      uint32_t *status, uint8_t *inc_epoch)
>>   {
>>   	int i;
>> -	int nr_local_entries;
>> +	int nr_local_entries, nr_leave_entries;
>>   	struct sheepdog_node_list_entry local_entries[SD_MAX_NODES];
>>   	struct node *node;
>>   	uint32_t local_epoch;
>> @@ -480,8 +557,18 @@ static int get_cluster_status(struct sheepdog_node_list_entry *from,
>>
>>   		nr_entries = get_nodes_nr_from(&sys->sd_node_list) + 1;
>>
>> -		if (nr_entries != nr_local_entries)
>> +		if (nr_entries != nr_local_entries) {
>> +			nr_leave_entries = get_nodes_nr_from(&sys->leave_list);
>> +			if (nr_local_entries == nr_entries + nr_leave_entries) {
>> +				/* Even though some nodes leave, we can make do with it.
>> +				 * Order cluster to do recovery right now.
>> +				 */
>> +				*inc_epoch = 1;
>> +				*status = SD_STATUS_OK;
>> +				return SD_RES_SUCCESS;
>> +			}
>>   			return SD_RES_SUCCESS;
>> +		}
>>
>>   		for (i = 0; i<  nr_local_entries; i++) {
>>   			if (node_cmp(local_entries + i, from) == 0)
>> @@ -637,13 +724,16 @@ static void update_cluster_info(struct join_message *msg)
>>   {
>>   	int i;
>>   	int ret, nr_nodes = msg->nr_nodes;
>> -	struct sheepdog_node_list_entry entry[SD_MAX_NODES];
>>
>>   	eprintf("status = %d, epoch = %d, %d, %d\n", msg->cluster_status, msg->epoch, msg->result, sys->join_finished);
>>   	if (msg->result != SD_RES_SUCCESS) {
>>   		if (is_myself(msg->header.from.addr, msg->header.from.port)) {
>>   			eprintf("failed to join sheepdog, %d\n", msg->result);
>> -			sys->status = SD_STATUS_JOIN_FAILED;
>> +			leave_cluster();
>> +			eprintf("I am really hurt and gonna leave cluster.\n");
>> +			eprintf("Fix yourself and restart me later, pleaseeeee...Bye.\n");
>> +			exit(1);
>> +			/* sys->status = SD_STATUS_JOIN_FAILED; */
>>   		}
>>   		return;
>>   	}
>> @@ -655,7 +745,7 @@ static void update_cluster_info(struct join_message *msg)
>>   		sys->nr_sobjs = msg->nr_sobjs;
>>
>>   	if (sys->join_finished)
>> -		goto out;
>> +		goto join_finished;
>>
>>   	sys->epoch = msg->epoch;
>>   	for (i = 0; i<  nr_nodes; i++) {
>> @@ -671,19 +761,15 @@ static void update_cluster_info(struct join_message *msg)
>>   				msg->nodes[i].nodeid, msg->nodes[i].pid);
>>   	}
>>
>> -	sys->join_finished = 1;
>> +	if (msg->cluster_status != SD_STATUS_OK)
>> +		add_node_to_leave_list((struct message_header *)msg);
>>
>> -	if (msg->cluster_status == SD_STATUS_OK&&  msg->inc_epoch) {
>> -		nr_nodes = get_ordered_sd_node_list(entry);
>> +	sys->join_finished = 1;
>>
>> -		dprintf("update epoch, %d, %d\n", sys->epoch, nr_nodes);
>> -		ret = epoch_log_write(sys->epoch, (char *)entry,
>> -				      nr_nodes * sizeof(struct sheepdog_node_list_entry));
>> -		if (ret<  0)
>> -			eprintf("can't write epoch %u\n", sys->epoch);
>> -	}
>> +	if (msg->cluster_status == SD_STATUS_OK&&  msg->inc_epoch)
>> +		update_epoch_log(sys->epoch);
>>
>> -out:
>> +join_finished:
>>   	ret = move_node_to_sd_list(msg->header.nodeid, msg->header.pid, msg->header.from);
>>   	/*
>>   	 * this should not happen since __sd_deliver() checks if the
>> @@ -695,16 +781,8 @@ out:
>>
>>   	if (msg->cluster_status == SD_STATUS_OK) {
>>   		if (msg->inc_epoch) {
>> -			nr_nodes = get_ordered_sd_node_list(entry);
>> -
>> -			dprintf("update epoch, %d, %d\n", sys->epoch + 1, nr_nodes);
>> -			ret = epoch_log_write(sys->epoch + 1, (char *)entry,
>> -					      nr_nodes * sizeof(struct sheepdog_node_list_entry));
>> -			if (ret<  0)
>> -				eprintf("can't write epoch %u\n", sys->epoch + 1);
>> -
>>   			sys->epoch++;
>> -
>> +			update_epoch_log(sys->epoch);
>>   			update_epoch_store(sys->epoch);
>>   		}
>>   		if (sys->status != SD_STATUS_OK) {
>> @@ -930,10 +1008,25 @@ static void __sd_deliver(struct cpg_event *cevent)
>>   static void send_join_response(struct work_deliver *w)
>>   {
>>   	struct message_header *m;
>> +	struct join_message *jm;
>> +	struct node *node;
>>
>>   	m = w->msg;
>> -	join((struct join_message *)m);
>> +	jm = (struct join_message *)m;
>> +	join(jm);
>>   	m->state = DM_FIN;
>> +
>> +	dprintf("%d, %d\n", jm->result, jm->cluster_status);
>> +	if (jm->result == SD_RES_SUCCESS&&  jm->cluster_status != SD_STATUS_OK) {
>> +		jm->nr_leave_nodes = 0;
>> +		list_for_each_entry(node,&sys->leave_list, list) {
>> +			jm->leave_nodes[jm->nr_leave_nodes].nodeid = node->nodeid;
>> +			jm->leave_nodes[jm->nr_leave_nodes].pid = node->pid;
>> +			jm->leave_nodes[jm->nr_leave_nodes].ent = node->ent;
>> +			jm->nr_leave_nodes++;
>> +		}
>> +		print_node_list(&sys->leave_list);
>> +	}
>>   	send_message(sys->handle, m);
>>   }
>>
>> @@ -944,8 +1037,7 @@ static void __sd_deliver_done(struct cpg_event *cevent)
>>   	char name[128];
>>   	int do_recovery;
>>   	struct node *node;
>> -	struct sheepdog_node_list_entry e[SD_MAX_NODES];
>> -	int nr;
>> +	int nr, nr_local, nr_leave;
>>
>>   	m = w->msg;
>>
>> @@ -962,13 +1054,22 @@ static void __sd_deliver_done(struct cpg_event *cevent)
>>   				list_del(&node->list);
>>   				free(node);
>>   				if (sys->status == SD_STATUS_OK) {
>> -					nr = get_ordered_sd_node_list(e);
>> -					dprintf("update epoch, %d, %d\n", sys->epoch + 1, nr);
>> -					epoch_log_write(sys->epoch + 1, (char *)e,
>> -							nr * sizeof(struct sheepdog_node_list_entry));
>> -
>>   					sys->epoch++;
>> +					update_epoch_log(sys->epoch);
>> +					update_epoch_store(sys->epoch);
>> +				}
>> +			}
>> +			if (sys->status == SD_STATUS_WAIT_FOR_JOIN) {
>> +				add_node_to_leave_list(m);
>> +
>> +				nr_local = get_nodes_nr_epoch(sys->epoch);
>> +				nr = get_nodes_nr_from(&sys->sd_node_list);
>> +				nr_leave = get_nodes_nr_from(&sys->leave_list);
>>
>> +				if (nr_local == nr + nr_leave) {
>> +					sys->status = SD_STATUS_OK;
>> +					sys->epoch++;
>> +					update_epoch_log(sys->epoch);
>>   					update_epoch_store(sys->epoch);
>>   				}
>>   			}
>> @@ -1815,6 +1916,7 @@ join_retry:
>>   	INIT_LIST_HEAD(&sys->sd_node_list);
>>   	INIT_LIST_HEAD(&sys->cpg_node_list);
>>   	INIT_LIST_HEAD(&sys->pending_list);
>> +	INIT_LIST_HEAD(&sys->leave_list);
>>
>>   	INIT_LIST_HEAD(&sys->outstanding_req_list);
>>   	INIT_LIST_HEAD(&sys->req_wait_for_obj_list);
>> diff --git a/sheep/sheep_priv.h b/sheep/sheep_priv.h
>> index dc56f54..56a5620 100644
>> --- a/sheep/sheep_priv.h
>> +++ b/sheep/sheep_priv.h
>> @@ -38,6 +38,13 @@
>>
>>   #define SD_RES_NETWORK_ERROR    0x81 /* Network error between sheeps */
>>
>> +struct node {
>> +	uint32_t nodeid;
>> +	uint32_t pid;
>> +	struct sheepdog_node_list_entry ent;
>> +	struct list_head list;
>> +};
>> +
>>   enum cpg_event_type {
>>   	CPG_EVENT_CONCHG,
>>   	CPG_EVENT_DELIVER,
>> @@ -117,6 +124,11 @@ struct cluster_info {
>>   	struct list_head cpg_node_list;
>>   	struct list_head sd_node_list;
>>
>> +	/* leave list is only used to account for bad nodes when we start
>> +	 * up the cluster nodes after we shutdown the cluster through collie.
>> +	 */
>> +	struct list_head leave_list;
>> +
>>   	/* this array contains a list of ordered virtual nodes */
>>   	struct sheepdog_vnode_list_entry vnodes[SD_MAX_VNODES];
>>   	int nr_vnodes;
>> diff --git a/sheep/store.c b/sheep/store.c
>> index 8583455..2bcb952 100644
>> --- a/sheep/store.c
>> +++ b/sheep/store.c
>> @@ -1547,6 +1547,7 @@ static void recover_done(struct work *work, int idx)
>>   {
>>   	struct recovery_work *rw = container_of(work, struct recovery_work, work);
>>   	uint64_t oid;
>> +	struct node *node;
>>
>>   	if (rw->state == RW_INIT)
>>   		rw->state = RW_RUN;
>> @@ -1588,6 +1589,10 @@ static void recover_done(struct work *work, int idx)
>>   	free(rw->oids);
>>   	free(rw);
>>
>> +	list_for_each_entry(node,&sys->leave_list, list) {
>> +		list_del(&node->list);
>> +	};
>> +
> Use list_for_each_entry_safe.
>
> Is it possible to delete lists before starting recovery (in group.c)?
> If yes, we don't need to move 'struct node' to sheep_priv.h.
>

Umm, I think, yes.

Thanks for your comments.

I'll address them in v2.

Yuan