[sheepdog] [PATCH V2] sheep: remove check_majority()
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Wed May 16 17:12:52 CEST 2012
At Wed, 16 May 2012 08:27:45 -0400,
Christoph Hellwig wrote:
>
> On Wed, May 16, 2012 at 08:54:01PM +0900, MORITA Kazutaka wrote:
> > I also think it's the right way to go to check network partition in
> > cluster drivers, but the corosync driver doesn't support it yet. Is
> > it possible to implement a network partition handling in the corosync
> > driver before removing the code from __sd_leave()? There are already
> > some users who use Sheepdog with corosync.
>
> I don't even think the current code works in practice, as it probes
> only the nodes in w->member_list, which doesn't include the nodes that
> left with the current confchg event.
Ah yes, we should take into account the nodes that left in the current
confchg event. The majority count in __sd_leave() looks wrong.
>
> The untested patch below implements what I think the intention of the
> check was, can you confirm that?
>
>
> Index: sheepdog/sheep/cluster/corosync.c
> ===================================================================
> --- sheepdog.orig/sheep/cluster/corosync.c 2012-05-16 13:46:25.207717214 +0200
> +++ sheepdog/sheep/cluster/corosync.c 2012-05-16 14:18:11.747699030 +0200
> @@ -541,11 +541,22 @@ static void cdrv_cpg_confchg(cpg_handle_
> int i;
> struct cpg_node joined_sheep[SD_MAX_NODES];
> struct cpg_node left_sheep[SD_MAX_NODES];
> + int nr_total = member_list_entries + left_list_entries;
>
> dprintf("mem:%zu, joined:%zu, left:%zu\n",
> member_list_entries, joined_list_entries,
> left_list_entries);
>
> + /*
> + * Abort as quickly as we can to save ourselves from running into
> + * a split brain scenario in case of cluster partition. We can
> + * only reasonably handle this with more than three nodes.
> + */
> + if (nr_total >= 3 && member_list_entries < nr_total / 2 + 1) {
> + eprintf("the majority of nodes are not alive\n");
> + abort();
> + }
> +
> /* convert cpg_address to cpg_node */
> for (i = 0; i < left_list_entries; i++) {
> left_sheep[i].nodeid = left_list[i].nodeid;
In my environment, when a network partition happens, the nodes that
left are not reported in a single confchg event. For example, if there
are 4 nodes (A, B, C, D) and node A is partitioned from the other
members, node A receives the following 3 events:
- confchg(members: {A, B, C}, joined: {}, left: {D})
- confchg(members: {A, B}, joined: {}, left: {C})
- confchg(members: {A}, joined: {}, left: {B})
This means that we cannot detect a network partition from
member_list_entries and left_list_entries alone. This is why I use
connect() to detect it. Although that approach is crude, it's more
accurate than the above code. If there is a better approach, I'd
really appreciate hearing about it.
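To make the problem concrete, here is a small standalone sketch (not
sheepdog code; struct confchg_event, patch_would_abort() and
count_aborts() are names made up for illustration) that replays the
three events above through the condition from the quoted patch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the counts carried by one confchg event. */
struct confchg_event {
	size_t member_list_entries;	/* nodes still visible after the event */
	size_t left_list_entries;	/* nodes reported as left by this event */
};

/* The condition from the quoted patch, evaluated for a single event. */
static bool patch_would_abort(const struct confchg_event *ev)
{
	size_t nr_total = ev->member_list_entries + ev->left_list_entries;

	return nr_total >= 3 && ev->member_list_entries < nr_total / 2 + 1;
}

/* Events seen by node A when B, C and D disappear one at a time. */
static const struct confchg_event events_seen_by_a[] = {
	{ 3, 1 },	/* members {A, B, C}, left {D} */
	{ 2, 1 },	/* members {A, B},    left {C} */
	{ 1, 1 },	/* members {A},       left {B} */
};

/* Count how many events in a sequence trip the per-event check. */
static size_t count_aborts(const struct confchg_event *evs, size_t n)
{
	size_t i, aborts = 0;

	for (i = 0; i < n; i++)
		if (patch_would_abort(&evs[i]))
			aborts++;
	return aborts;
}
```

With these inputs count_aborts(events_seen_by_a, 3) is 0: the first two
events still show a per-event majority, and the last one falls below
the nr_total >= 3 threshold. The check would only fire if B, C and D
were all reported as left in one event (members 1, left 3).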
Even if we drop general network partition handling for now, I'd
like to keep supporting the NIC error handling that check_majority()
provides, because a NIC error is a special case of the network
partition problem (one node is partitioned from the other nodes).
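As a rough sketch of the probing idea (this is not the actual
check_majority() code; reaches_majority(), probe_fn and
probe_from_table() are hypothetical names, and the probe callback
stands in for a real connect() attempt to each node), a node could
count how many of the previous members it can still reach:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe callback: returns true if node `idx` answered. */
typedef bool (*probe_fn)(size_t idx, void *ctx);

/*
 * Return true if this node can still reach a majority of the nr_nodes
 * that were in the cluster before the membership change, itself included.
 */
static bool reaches_majority(size_t nr_nodes, size_t self,
			     probe_fn probe, void *ctx)
{
	size_t nr_majority = nr_nodes / 2 + 1;
	size_t nr_reachable = 1;	/* we can always reach ourselves */
	size_t i;

	/* Assumption: like the quoted patch, skip clusters below three nodes. */
	if (nr_nodes < 3)
		return true;

	/* Stop probing as soon as a majority has been confirmed. */
	for (i = 0; i < nr_nodes && nr_reachable < nr_majority; i++) {
		if (i == self)
			continue;
		if (probe(i, ctx))
			nr_reachable++;
	}
	return nr_reachable >= nr_majority;
}

/* Example probe backed by a plain reachability table (for illustration). */
static bool probe_from_table(size_t idx, void *ctx)
{
	const bool *reachable = ctx;

	return reachable[idx];
}
```

In the NIC error case with 4 nodes, the failed node reaches nobody but
itself (1 of 4) and the function returns false, regardless of how the
departures were split across confchg events.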
Thanks,
Kazutaka