[Sheepdog] [PATCH] sheep: get vdi bitmap correctly in join phase

Tue Sep 13 06:46:04 CEST 2011

At Tue, 13 Sep 2011 11:12:32 +0800,
Liu Yuan wrote:
> 
> On 09/13/2011 08:05 AM, MORITA Kazutaka wrote:
> > At Sun, 11 Sep 2011 00:49:20 +0800,
> > Liu Yuan wrote:
> >> From: Liu Yuan<tailai.ly at taobao.com>
> >>
> >> After the new node joins, command *collie-vdi-list*
> >> cannot list any vdi information, that are created
> >> before this node is joined. This is because in the
> >> join phase, get_vdi_bitmap_from_all() cannot get corret
> >> vdi bitmap *if* there is no node in sd_node_list.
> >> This patch moves this function after the function
> >> which updates the sd_node_list.
> >>
> >> With this patch, collie-vdi-list works as expected in
> >> newly added nodes.
> >>
> >> Signed-off-by: Liu Yuan<tailai.ly at taobao.com>
> >> ---
> >>   sheep/group.c |   13 ++-----------
> >>   1 files changed, 2 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/sheep/group.c b/sheep/group.c
> >> index eb0c4e2..22c4f66 100644
> >> --- a/sheep/group.c
> >> +++ b/sheep/group.c
> >> @@ -863,17 +863,6 @@ static void __sd_deliver(struct cpg_event *cevent)
> >>   			break;
> >>   		}
> >>   	}
> >> -
> >> -	if (m->state == DM_FIN) {
> >> -		switch (m->op) {
> >> -		case SD_MSG_JOIN:
> >> -			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> >> -				if (sys->status != SD_STATUS_OK)
> >> -					get_vdi_bitmap_from_all();
> >> -			break;
> >> -		}
> >> -	}
> >> -
> >>   }
> >>
> >>   static void send_join_response(struct work_deliver *w)
> >> @@ -902,6 +891,8 @@ static void __sd_deliver_done(struct cpg_event *cevent)
> >>   		switch (m->op) {
> >>   		case SD_MSG_JOIN:
> >>   			update_cluster_info((struct join_message *)m);
> >> +			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> >> +					get_vdi_bitmap_from_all();
> >>   			break;
> >>   		case SD_MSG_LEAVE:
> >>   			node = find_node(&sys->sd_node_list, m->nodeid, m->pid);
> Hi
>      Thanks for your viewing and explanation
> > Thanks for your contribution, but this doesn't work.
> >
> > We cannot sleep in __sd_deliver_done (the main thread), so we cannot
> > call get_vdi_bitmap_from_all() in it; if all the sheep daemons call
> > get_vdi_bitmap_from_all() at the same time, the cluster will stuck.
> >
> > I think the right way is doing the followings in __sd_deliver (the
> > worker thread):
> >
> >   - call get_vdi_bitmap_from_all()
> >   - extract a new node from join_message, and get a vdi bitmap from the node
> >
> 
>      I don't quit get it. did you mean first call 
> get_vdi_bitmap_from_all() and check then sys->vdi_inuse ourselves or 
> something that can help us understand vdi_bitmap. if not as expected, we 
> then
> extract a new node from join_message, and get a vdi bitmap from the node?

Sorry, my intention was just like:

static void __sd_deliver(struct cpg_event *cevent)
{
	...

	if (m->state == DM_FIN) {
		switch (m->op) {
		case SD_MSG_JOIN:
			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
				if (sys->status != SD_STATUS_OK) {
					get_vdi_bitmap_from_all();
					get_vdi_bitmap_from_one(m->from);
				}
			break;
		}
	}
}

get_vdi_bitmap_from_one() is a function to get a vdi bitmap from one
node, though it is not defined yet.

> 
>      I don't see helper that get vdi bitmap from one node, and from 
> get_vdi_bitmap_from_all(), I gather that
> we need to iterate all the nodes to update the vdi_inuse bitmap, no?
> 
>       after reading the code more, I think __sd_confchg_done should get 
> vdi bitmap done for every node,
> *not* only for first cpg node in the group as it is now. Because of it, 
> all the nodes except the first-cpg-node will have to call 
> update_cluster_info() in a later deliver messge phase (DM_FIN). So my 
> question
> is, why we call udpate_cluster_info() in different places 
> (__sd_confchg_done for first-cpg-node, DM_FIN for
> other).

It is because we cannot decide the first joined node only from
delivered messages.  The first node checks whether the next nodes can
join Sheepdog, so it is necessary to know which node is the first one.

> 
>      I came up the idea to do the following minimal changes (just move 
> update_cluster_info() upwards)
> 
>          if (m->state == DM_FIN) {
>                  switch (m->op) {
>                  case SD_MSG_JOIN:
>                          update_cluster_info((struct join_message *)m);
>                          if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
>                                  get_vdi_bitmap_from_all();
>                          break;
> 
> It fixes the problem on my environment. Is it okay with you? If so, I will send it as V2.

No, this causes a race condition.  update_cluster_info() updates
global info, so it can be called only in main thread
(__sd_deliver_done() and __sd_confchg_done()).

Thanks,

Kazutaka