[Sheepdog] [PATCH] sheep: get vdi bitmap correctly in join phase

Tue Sep 13 06:53:43 CEST 2011

At Tue, 13 Sep 2011 13:46:04 +0900,
MORITA Kazutaka wrote:
> 
> At Tue, 13 Sep 2011 11:12:32 +0800,
> Liu Yuan wrote:
> > 
> > On 09/13/2011 08:05 AM, MORITA Kazutaka wrote:
> > > At Sun, 11 Sep 2011 00:49:20 +0800,
> > > Liu Yuan wrote:
> > >> From: Liu Yuan<tailai.ly at taobao.com>
> > >>
> > >> After the new node joins, command *collie-vdi-list*
> > >> cannot list any vdi information, that are created
> > >> before this node is joined. This is because in the
> > >> join phase, get_vdi_bitmap_from_all() cannot get corret
> > >> vdi bitmap *if* there is no node in sd_node_list.
> > >> This patch moves this function after the function
> > >> which updates the sd_node_list.
> > >>
> > >> With this patch, collie-vdi-list works as expected in
> > >> newly added nodes.
> > >>
> > >> Signed-off-by: Liu Yuan<tailai.ly at taobao.com>
> > >> ---
> > >>   sheep/group.c |   13 ++-----------
> > >>   1 files changed, 2 insertions(+), 11 deletions(-)
> > >>
> > >> diff --git a/sheep/group.c b/sheep/group.c
> > >> index eb0c4e2..22c4f66 100644
> > >> --- a/sheep/group.c
> > >> +++ b/sheep/group.c
> > >> @@ -863,17 +863,6 @@ static void __sd_deliver(struct cpg_event *cevent)
> > >>   			break;
> > >>   		}
> > >>   	}
> > >> -
> > >> -	if (m->state == DM_FIN) {
> > >> -		switch (m->op) {
> > >> -		case SD_MSG_JOIN:
> > >> -			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> > >> -				if (sys->status != SD_STATUS_OK)
> > >> -					get_vdi_bitmap_from_all();
> > >> -			break;
> > >> -		}
> > >> -	}
> > >> -
> > >>   }
> > >>
> > >>   static void send_join_response(struct work_deliver *w)
> > >> @@ -902,6 +891,8 @@ static void __sd_deliver_done(struct cpg_event *cevent)
> > >>   		switch (m->op) {
> > >>   		case SD_MSG_JOIN:
> > >>   			update_cluster_info((struct join_message *)m);
> > >> +			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> > >> +					get_vdi_bitmap_from_all();
> > >>   			break;
> > >>   		case SD_MSG_LEAVE:
> > >>   			node = find_node(&sys->sd_node_list, m->nodeid, m->pid);
> > Hi
> >      Thanks for your viewing and explanation
> > > Thanks for your contribution, but this doesn't work.
> > >
> > > We cannot sleep in __sd_deliver_done (the main thread), so we cannot
> > > call get_vdi_bitmap_from_all() in it; if all the sheep daemons call
> > > get_vdi_bitmap_from_all() at the same time, the cluster will stuck.
> > >
> > > I think the right way is doing the followings in __sd_deliver (the
> > > worker thread):
> > >
> > >   - call get_vdi_bitmap_from_all()
> > >   - extract a new node from join_message, and get a vdi bitmap from the node
> > >
> > 
> >      I don't quit get it. did you mean first call 
> > get_vdi_bitmap_from_all() and check then sys->vdi_inuse ourselves or 
> > something that can help us understand vdi_bitmap. if not as expected, we 
> > then
> > extract a new node from join_message, and get a vdi bitmap from the node?
> 
> Sorry, my intention was just like:
> 
> static void __sd_deliver(struct cpg_event *cevent)
> {
> 	...
> 
> 	if (m->state == DM_FIN) {
> 		switch (m->op) {
> 		case SD_MSG_JOIN:
> 			if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> 				if (sys->status != SD_STATUS_OK) {
> 					get_vdi_bitmap_from_all();
> 					get_vdi_bitmap_from_one(m->from);

And, for newly joined nodes, we also need to get a bitmap from each
node in ((struct join_message *)m)->nodes.

Thanks,

Kazutaka

> 				}
> 			break;
> 		}
> 	}
> }
> 
> get_vdi_bitmap_from_one() is a function to get a vdi bitmap from one
> node, though it is not defined yet.
> 
> > 
> >      I don't see helper that get vdi bitmap from one node, and from 
> > get_vdi_bitmap_from_all(), I gather that
> > we need to iterate all the nodes to update the vdi_inuse bitmap, no?
> > 
> >       after reading the code more, I think __sd_confchg_done should get 
> > vdi bitmap done for every node,
> > *not* only for first cpg node in the group as it is now. Because of it, 
> > all the nodes except the first-cpg-node will have to call 
> > update_cluster_info() in a later deliver messge phase (DM_FIN). So my 
> > question
> > is, why we call udpate_cluster_info() in different places 
> > (__sd_confchg_done for first-cpg-node, DM_FIN for
> > other).
> 
> It is because we cannot decide the first joined node only from
> delivered messages.  The first node checks whether the next nodes can
> join Sheepdog, so it is necessary to know which node is the first one.
> 
> > 
> >      I came up the idea to do the following minimal changes (just move 
> > update_cluster_info() upwards)
> > 
> >          if (m->state == DM_FIN) {
> >                  switch (m->op) {
> >                  case SD_MSG_JOIN:
> >                          update_cluster_info((struct join_message *)m);
> >                          if (((struct join_message *)m)->cluster_status == SD_STATUS_OK)
> >                                  get_vdi_bitmap_from_all();
> >                          break;
> > 
> > It fixes the problem on my environment. Is it okay with you? If so, I will send it as V2.
> 
> No, this causes a race condition.  update_cluster_info() updates
> global info, so it can be called only in main thread
> (__sd_deliver_done() and __sd_confchg_done()).
> 
> 
> Thanks,
> 
> Kazutaka
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog