At Thu, 10 Nov 2011 12:20:04 +0800,
Liu Yuan wrote:
>
> On 11/08/2011 07:22 PM, MORITA Kazutaka wrote:
>
> >>>
> >>> Currently, Sheepdog cannot handle this kinds of false detection.
> >>
> >> This false detection is passed on to sheep from corosync. I guess this
> >> is triggered from corosync's built-in timeout mechanism. The corosync's
> >> heartbeat message might be discarded or jammed for whatever reason.
> >>
> >> I am suspecting networking is hijacked fully by heavy IO, and no channel
> >> for corosync's heart-beat messages.
> >
> > Probably, we need to add support for using different NICs for data
> > I/Os and monitoring.
> >
> If we use accord, will this problem get mitigated or even removed by
> side effect?

It's up to the implementation of the cluster driver.  If we return a
different IP address when cdrv->init() is called, we can use different
NICs for data and control.  Currently, the corosync driver gets its IP
address from corosync_cfg_get_node_addrs(), but that is a bad choice
if we don't want data I/O to use the same address as corosync.

Thanks,

Kazutaka

>
> If not, we might need assure QoS of heartbeat of corosync either by
> physical dedicated channel or software networking control, since sheep
> should not and can not handle this type of network partition problem
> satisfactorily.
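
As a rough illustration of the cdrv->init() idea above (returning an
address other than the one corosync uses), the address selection could
look something like the sketch below.  This is not Sheepdog code:
struct node_addr, choose_data_addr(), is_valid_ip() and the notion of a
configured data_ip override are all hypothetical names made up for the
example.  In the corosync driver, control_ip would be whatever
corosync_cfg_get_node_addrs() reports.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the address a cluster driver
 * reports from its init hook; not Sheepdog's actual type. */
struct node_addr {
	char ip[INET6_ADDRSTRLEN];
};

/* Returns 1 if s parses as an IPv4 or IPv6 literal, 0 otherwise. */
int is_valid_ip(const char *s)
{
	unsigned char buf[sizeof(struct in6_addr)];

	return inet_pton(AF_INET, s, buf) == 1 ||
	       inet_pton(AF_INET6, s, buf) == 1;
}

/*
 * Sketch of the address selection an init hook could perform.
 *
 * control_ip: address used by the membership layer (assumption: in the
 *             corosync driver this comes from
 *             corosync_cfg_get_node_addrs()).
 * data_ip:    optional, hypothetical override naming the NIC to use
 *             for data I/O; NULL means "reuse the control address".
 *
 * Returns 0 on success, -1 if the chosen address is missing or is not
 * a valid IP literal.
 */
int choose_data_addr(const char *control_ip, const char *data_ip,
		     struct node_addr *out)
{
	const char *chosen = data_ip ? data_ip : control_ip;

	assert(out != NULL);

	if (!chosen || !is_valid_ip(chosen))
		return -1;

	snprintf(out->ip, sizeof(out->ip), "%s", chosen);
	return 0;
}
```

With no override the node keeps advertising the corosync address, so
the current behaviour is unchanged; with an override, data I/O goes to
the other NIC while corosync's heartbeat stays on its own interface.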