[Sheepdog] Cluster appears down but nodes report different epochs

Thu Nov 10 06:30:54 CET 2011

At Thu, 10 Nov 2011 12:20:04 +0800,
Liu Yuan wrote:
> 
> On 11/08/2011 07:22 PM, MORITA Kazutaka wrote:
> 
> >>>
> >>> Currently, Sheepdog cannot handle this kind of false detection.
> >>
> >> This false detection is passed on to sheep from corosync. I guess it
> >> is triggered by corosync's built-in timeout mechanism. Corosync's
> >> heartbeat messages might be discarded or delayed for whatever reason.
> >>
> >> I suspect the network is fully saturated by heavy I/O, leaving no
> >> channel for corosync's heartbeat messages.
> > 
> > Probably, we need to add support for using different NICs for data
> > I/Os and monitoring.
> 
> 
> If we use accord, will this problem get mitigated or even removed by
> side effect?

It's up to the implementation of the cluster driver.  If we return a
different IP address when cdrv->init() is called, we can use different
NICs for data and control.

Currently, the corosync driver gets its IP address from
corosync_cfg_get_node_addrs(), which may be a bad idea if we don't
want to use the same address as corosync.
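
To illustrate the idea, here is a rough sketch of an init hook that
hands back a separate data address when one is configured and falls
back to the control address otherwise.  This is not Sheepdog's actual
code: struct demo_cdrv_config, demo_cdrv_init() and the 16-byte address
layout (IPv4 in the last four bytes) are all assumptions made up for
this example, and it handles IPv4 only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical driver configuration; not a real Sheepdog structure. */
struct demo_cdrv_config {
	const char *control_addr; /* address used by the membership layer */
	const char *data_addr;    /* address for data I/O, NULL to reuse control_addr */
};

/*
 * Hypothetical init hook.  Stores the address that the node should
 * advertise for data I/O in myaddr (assumed layout: 16 bytes, IPv4 in
 * the last four bytes, network byte order).
 * Returns 0 on success, -1 if no usable IPv4 address is configured.
 */
int demo_cdrv_init(const struct demo_cdrv_config *cfg, uint8_t myaddr[16])
{
	const char *chosen;
	struct in_addr v4;

	assert(cfg != NULL && myaddr != NULL);

	/* Prefer the dedicated data address over the control address. */
	chosen = cfg->data_addr ? cfg->data_addr : cfg->control_addr;
	if (chosen == NULL || inet_pton(AF_INET, chosen, &v4) != 1)
		return -1;

	memset(myaddr, 0, 16);
	memcpy(myaddr + 12, &v4, sizeof(v4));
	return 0;
}
```

With something like this, the heartbeat traffic keeps using the
control address while data I/O is directed to whatever address the
init hook reports, so the two can sit on different NICs.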

Thanks,

Kazutaka

> 
> If not, we might need to assure QoS for corosync's heartbeat, either
> through a physically dedicated channel or software network control,
> since sheep should not and cannot handle this type of network
> partition problem satisfactorily.