On 11/08/2011 07:22 PM, MORITA Kazutaka wrote:
>>>
>>> Currently, Sheepdog cannot handle this kinds of false detection.
>>
>> This false detection is passed on to sheep from corosync. I guess this
>> is triggered from corosync's built-in timeout mechanism. The corosync's
>> heartbeat message might be discarded or jammed for whatever reason.
>>
>> I am suspecting networking is hijacked fully by heavy IO, and no channel
>> for corosync's heart-beat messages.
>
> Probably, we need to add support for using different NICs for data
> I/Os and monitoring.

If we use accord, will this problem be mitigated, or even eliminated,
as a side effect?

If not, we might need to ensure QoS for corosync's heartbeat, either
through a dedicated physical channel or through software network
control, since sheep should not and cannot handle this type of network
partition problem satisfactorily.

Thanks,
Yuan