[Sheepdog] Cluster appears down but nodes report different epochs
MORITA Kazutaka
morita.kazutaka at lab.ntt.co.jp
Thu Nov 10 06:30:54 CET 2011
At Thu, 10 Nov 2011 12:20:04 +0800,
Liu Yuan wrote:
>
> On 11/08/2011 07:22 PM, MORITA Kazutaka wrote:
>
> >>>
> >>> Currently, Sheepdog cannot handle this kind of false detection.
> >>
> >> This false detection is passed on to sheep from corosync. I guess it
> >> is triggered by corosync's built-in timeout mechanism. Corosync's
> >> heartbeat messages might be discarded or delayed for whatever reason.
> >>
> >> I suspect the network is fully saturated by heavy I/O, leaving no
> >> channel for corosync's heartbeat messages.
> >
> > Probably, we need to add support for using different NICs for data
> > I/Os and monitoring.
>
>
> If we use accord, will this problem get mitigated or even removed as a
> side effect?
It's up to the implementation of the cluster driver. If the driver
returns a different IP address when cdrv->init() is called, we can use
different NICs for data and control.
Currently, the corosync driver gets its IP address from
corosync_cfg_get_node_addrs(), but that may be a bad idea if we don't
want to use the same address as corosync.
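A minimal sketch of that idea, under assumptions: the struct, the
function names, the optional data-address argument and the hard-coded
192.168.0.10 are all hypothetical and only for illustration; this is
not the actual sheepdog driver interface. The point is only that init()
can hand back an operator-chosen data address and fall back to the
address reported by the membership layer when none is given.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified driver interface (not the real sheepdog
 * struct): init() fills in the address sheep would use for data I/O. */
struct cluster_driver_sketch {
	const char *name;
	int (*init)(const char *data_addr_opt, uint8_t *myaddr);
};

/* Hypothetical stand-in for the address the membership layer reports.
 * In the corosync driver this is where the result of
 * corosync_cfg_get_node_addrs() would be used instead. */
int membership_addr(uint8_t *addr)
{
	return inet_pton(AF_INET, "192.168.0.10", addr) == 1 ? 0 : -1;
}

/* Use the operator-supplied data address if there is one; otherwise
 * fall back to the membership address. Returns 0 on success, -1 if the
 * address string cannot be parsed. myaddr must hold 16 bytes. */
int sketch_init(const char *data_addr_opt, uint8_t *myaddr)
{
	memset(myaddr, 0, 16);
	if (data_addr_opt != NULL)
		return inet_pton(AF_INET, data_addr_opt, myaddr) == 1 ? 0 : -1;
	return membership_addr(myaddr);
}

struct cluster_driver_sketch sketch_driver = {
	.name = "sketch",
	.init = sketch_init,
};
```

With something like this, heartbeat traffic would stay on the NIC that
corosync is bound to, while data I/O would go to whatever address the
operator passes in.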
Thanks,
Kazutaka
>
> If not, we might need to assure QoS for corosync's heartbeat, either
> through a physically dedicated channel or software network control,
> since sheep should not and cannot handle this type of network
> partition problem satisfactorily.