[Sheepdog] Cluster appears down but nodes report different epochs

Liu Yuan namei.unix at gmail.com
Thu Nov 10 05:20:04 CET 2011


On 11/08/2011 07:22 PM, MORITA Kazutaka wrote:

>>>
>>> Currently, Sheepdog cannot handle this kind of false detection.
>>
>> This false detection is passed on to sheep from corosync. I guess it
>> is triggered by corosync's built-in timeout mechanism. Corosync's
>> heartbeat messages might be dropped or delayed for whatever reason.
>>
>> I suspect the network is fully saturated by heavy I/O, leaving no
>> bandwidth for corosync's heartbeat messages.
> 
> We probably need to add support for using separate NICs for data
> I/O and monitoring.


If we use Accord, will this problem be mitigated or even eliminated as
a side effect?

If not, we might need to guarantee QoS for corosync's heartbeat, either
through a dedicated physical channel or through software network
control, since sheep should not and cannot handle this type of network
partition problem satisfactorily.
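
To illustrate the concern, here is a minimal Python sketch of a
timeout-based detector. It is a toy model under stated assumptions, not
corosync's actual totem protocol: the function name false_detections,
the millisecond values and the simple "silence longer than the timeout"
rule are all made up for illustration.

```python
def false_detections(send_ms, delay_ms, timeout_ms):
    """Return the times (ms) at which a live peer would be declared down.

    Hypothetical toy model: the peer sends a heartbeat at each time in
    send_ms, and each heartbeat is held up in the network for the matching
    delay_ms. The detector declares the peer down once it has heard
    nothing for longer than timeout_ms.
    """
    # Heartbeats are observed in arrival order, not in send order.
    arrivals = sorted(s + d for s, d in zip(send_ms, delay_ms))
    if not arrivals:
        return []

    suspected_at = []
    last_seen = arrivals[0]
    for arrival in arrivals[1:]:
        # A silence gap longer than the timeout triggers a detection,
        # even though the peer never stopped sending.
        if arrival - last_seen > timeout_ms:
            suspected_at.append(last_seen + timeout_ms)
        last_seen = arrival
    return suspected_at
```

With heartbeats sent every 1000 ms and a 2000 ms timeout, holding two
heartbeats up for 3000 ms and 2500 ms makes this detector fire at
4010 ms although the peer is alive; with a uniform 10 ms delay it never
fires. A dedicated channel or traffic prioritization would aim to keep
the delays in the second regime.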

Thanks,
Yuan


