[sheepdog] leave event is not dispatched in the corosync driver

Hitoshi Mitake mitake.hitoshi at lab.ntt.co.jp
Fri Sep 12 09:57:00 CEST 2014


At Fri, 12 Sep 2014 13:00:58 +0900,
tuji wrote:
> 
> Hi
> 
> I found a problem where a node does not leave the cluster when one of the
> nodes is stopped while recovery is running.
> It was reported to Launchpad (https://bugs.launchpad.net/sheepdog-project/+bug/1368503).
> 
> To solve this problem, I made a patch for corosync.c.
> 
> [root at node001 BUILD]# diff -u sheepdog-0.7.6-org/sheep/cluster/corosync.c sheepdog-0.7.6/sheep/cluster/corosync.c
> --- sheepdog-0.7.6-org/sheep/cluster/corosync.c 2013-12-22 18:07:34.000000000 +0900
> +++ sheepdog-0.7.6/sheep/cluster/corosync.c     2014-09-12 09:47:37.840975169 +0900
> @@ -368,8 +368,9 @@
>                  * number of alive nodes correctly, we postpone
>                  * processsing events if there are incoming ones.
>                  */
> -               sd_debug("wait for a next dispatch event");
> -               return;
> +               sd_debug("wait for a next dispatch event.not return");
> +               //sd_debug("wait for a next dispatch event");
> +               //return;
>         }
> 
>         nr_majority = 0;
> 
> The problem was solved by this patch.
> I know this patch is insufficient because it disables the behavior described in the comment:
> 
>                 /*
>                  * Corosync dispatches leave events one by one even
>                  * when network partition has occured.  To count the
>                  * number of alive nodes correctly, we postpone
>                  * processsing events if there are incoming ones.
>                  */
> 
> I don't understand this comment.
> Could anyone give me advice about it?

Thanks a lot for your analysis! The delayed delivery of node leave
events seems to be caused by a past commit
(15df161958a38cf3f7bc83b5bc2c8a1817b3072e). That commit was intended
to handle network partitions well, but it appears to be the root
cause of this problem.
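
To illustrate what the quoted comment is describing, here is a minimal
sketch in C. It is not the sheepdog code: keeps_majority() and its
parameters are hypothetical names, and it assumes that each majority
check compares the surviving nodes against the membership accepted at
the previous check. Under that assumption, checking after every single
leave event never notices a partition, while postponing the check until
no more events are pending does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model, not the sheepdog implementation.
 *
 * nr_members: cluster size before the partition
 * nr_leaves:  leave events delivered one by one
 * postpone:   defer the majority check while more events are pending
 *
 * Returns true if the node concludes it is still in the majority.
 */
static bool keeps_majority(size_t nr_members, size_t nr_leaves, bool postpone)
{
	size_t base = nr_members;	/* membership the check compares against */
	size_t alive = nr_members;	/* nodes still believed alive */

	assert(nr_leaves <= nr_members);

	for (size_t i = 0; i < nr_leaves; i++) {
		bool more_pending = (i + 1 < nr_leaves);

		alive--;
		if (postpone && more_pending)
			continue;	/* wait for the next dispatched event */

		if (alive * 2 <= base)
			return false;	/* lost the majority */

		base = alive;		/* accept the smaller membership */
	}
	return true;
}
```

With 5 members and 3 leave events, keeps_majority(5, 3, false) returns
true (4 of 5, then 3 of 4, then 2 of 3 each look like a majority), while
keeps_majority(5, 3, true) returns false because 2 of 5 is checked once.
In this model a leave event is only acted on once nothing else is
pending, so it stays unprocessed for as long as further events keep
arriving.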

I created a branch that reverts the above commit. Could you test it?
https://github.com/sheepdog/sheepdog/tree/corosync-leave
# I cannot test it because I don't have a corosync cluster now,
# sorry...

Thanks,
Hitoshi

> 
> 
> 
> --------------------------
> Masahiro Tsuji
> 
> A.T.WORKS, INC
> URL http://www.atworks.co.jp
> 


