[sheepdog] [PATCH V2] sheep: remove check_majority()

Wed May 16 17:12:52 CEST 2012

At Wed, 16 May 2012 08:27:45 -0400,
Christoph Hellwig wrote:
> 
> On Wed, May 16, 2012 at 08:54:01PM +0900, MORITA Kazutaka wrote:
> > I also think it's the right way to go to check network partition in
> > cluster drivers, but the corosync driver doesn't support it yet.  Is
> > it possible to implement a network partition handling in the corosync
> > driver before removing the code from __sd_leave()?  There are already
> > some users who use Sheepdog with corosync.
> 
> I don't even think the current code works in practice, as it probes
> only the nodes in w->member_list, which doesn't include the nodes that
> left with the current confchg event.

Ah yes, we should take into account the nodes that left in the current
confchg event.  The majority count in __sd_leave() looks wrong.

> 
> The untested patch below implements what I think the intention of the
> check was, can you confirm that?
> 
> 
> Index: sheepdog/sheep/cluster/corosync.c
> ===================================================================
> --- sheepdog.orig/sheep/cluster/corosync.c	2012-05-16 13:46:25.207717214 +0200
> +++ sheepdog/sheep/cluster/corosync.c	2012-05-16 14:18:11.747699030 +0200
> @@ -541,11 +541,22 @@ static void cdrv_cpg_confchg(cpg_handle_
>  	int i;
>  	struct cpg_node joined_sheep[SD_MAX_NODES];
>  	struct cpg_node left_sheep[SD_MAX_NODES];
> +	int nr_total = member_list_entries + left_list_entries;
>  
>  	dprintf("mem:%zu, joined:%zu, left:%zu\n",
>  		member_list_entries, joined_list_entries,
>  		left_list_entries);
>  
> +	/*
> +	 * Abort as quickly as we can to save ourselves from running into
> +	 * a split brain scenario in case of cluster partition.  We can
> +	 * only reasonably handle this with more than three nodes.
> +	 */
> +	if (nr_total >= 3 && member_list_entries < nr_total / 2 + 1) {
> +		eprintf("the majority of nodes are not alive\n");
> +		abort();
> +	}
> +
>  	/* convert cpg_address to cpg_node */
>  	for (i = 0; i < left_list_entries; i++) {
>  		left_sheep[i].nodeid = left_list[i].nodeid;

In my environment, when a network partition happens, the nodes that
left are not reported in a single confchg event.  For example, if
there are 4 nodes (A, B, C, D) and node A is partitioned from the
other members, node A receives the following 3 events:

 - confchg(members: {A, B, C}, joined: {}, left: {D})
 - confchg(members: {A, B},    joined: {}, left: {C})
 - confchg(members: {A},       joined: {}, left: {B})

This means that we cannot detect a network partition from the
member_list_entries and left_list_entries of a single event.  This is
why I use connect() to detect it.  Although that approach is crude,
it's more accurate than the above code.  If there is a better
approach, I'd really appreciate hearing about it.
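
To make the arithmetic concrete, here is a minimal sketch that applies
the condition from the quoted patch to the three events node A receives.
The struct and helper names are hypothetical and only for illustration;
only the condition itself is taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* One confchg notification, reduced to the two counts the check uses. */
struct confchg_counts {
	size_t members;	/* member_list_entries */
	size_t left;	/* left_list_entries */
};

/* Same condition as the quoted patch: true means "would abort". */
static bool lost_majority(struct confchg_counts ev)
{
	size_t nr_total = ev.members + ev.left;

	return nr_total >= 3 && ev.members < nr_total / 2 + 1;
}

/* Number of events in a sequence for which the check fires. */
static size_t count_aborts(const struct confchg_counts *evs, size_t n)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (lost_majority(evs[i]))
			hits++;
	return hits;
}

/* What node A sees above: D, C and B leave in three separate events. */
static const struct confchg_counts staggered[] = {
	{ 3, 1 }, { 2, 1 }, { 1, 1 },
};

/* Hypothetical case: the same partition reported in one event. */
static const struct confchg_counts single[] = {
	{ 1, 3 },
};
```

With the staggered sequence the check never fires: 3 of 4 and 2 of 3
still count as a majority, and the last event has nr_total == 2, which
is below the three-node threshold.  Only the single-event case would
abort.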

Even if we drop general network partition handling for now, I'd like
to keep supporting NIC error handling, which is done in
check_majority(), because a NIC error is a special case of the network
partition problem (one node is partitioned from the other nodes).
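
As a rough illustration of the connect()-based idea, the sketch below
counts how many peers the local node can still reach and compares that
with a majority of the whole cluster.  It is not the actual
check_majority() code: struct peer, probe_fn and the probe_* helpers
are hypothetical names, and a real daemon would need a connect timeout.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical peer description; not Sheepdog's real node structure. */
struct peer {
	const char *ip;		/* dotted-quad IPv4 address */
	uint16_t port;
};

typedef bool (*probe_fn)(const struct peer *p);

/* Try a blocking TCP connect(); assumes IPv4 and no timeout handling. */
static bool probe_connect(const struct peer *p)
{
	struct sockaddr_in sa;
	bool ok;
	int fd;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(p->port);
	if (inet_pton(AF_INET, p->ip, &sa.sin_addr) != 1)
		return false;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return false;
	ok = connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0;
	close(fd);
	return ok;
}

/* Stand-ins for a dead NIC and a healthy network, for illustration. */
static bool probe_none(const struct peer *p) { (void)p; return false; }
static bool probe_all(const struct peer *p) { (void)p; return true; }

/*
 * nr_peers excludes the local node, which always counts as reachable.
 * Returns true if this node can still reach a majority of the cluster.
 */
static bool sees_majority(const struct peer *peers, size_t nr_peers,
			  probe_fn probe)
{
	size_t nr_total = nr_peers + 1;
	size_t nr_reachable = 1;	/* ourselves */
	size_t i;

	for (i = 0; i < nr_peers; i++)
		if (probe(&peers[i]))
			nr_reachable++;
	return nr_reachable >= nr_total / 2 + 1;
}
```

In the 4-node example, node A with a failed NIC reaches none of B, C
and D, so it counts 1 of 4 and fails the check no matter how corosync
splits the confchg events.  Passing probe_connect as the probe gives
the real network behaviour.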

Thanks,

Kazutaka