[sheepdog] [PATCH] sheep: add a kill node operation

Fri Jul 20 09:43:17 CEST 2012

On 07/20/2012 03:33 PM, Chris Webb wrote:
> Liu Yuan <namei.unix at gmail.com> writes:
> 
>> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
> [brief maintenance on a node causes automatic recovery]
>>> Such a large amount of data saturates the network at 100% until the
>>> rebooted node comes up again.
>>>
>>> Is that expected behavior?
>>
>> Yes, for now. A mechanism to detect temporary node failure is not easy
>> to implement: it needs fundamental changes to the current recovery and
>> IO path code. How to handle IOs routed to the temporarily failed node
>> is the most difficult part to get right.
> 
> Perhaps the simplest interface conceivable here is a collie command to
> temporarily disable and later re-enable node recovery for the entire
> cluster? Switch it off during the kinds of maintenance described above, and
> then switch it back on again once we're running normally.
> 
> I think distinguishing between nodes that are intentionally down and nodes
> which have failed and need to be recovered will be hard, as you say.
> 

Yes, maybe. Manual recovery (only update internal state, without doing
data load balancing) could be a better approach to this narrower
problem. It also needs smaller code changes than temporary failure
detection.
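
To make the idea concrete, here is a rough sketch of what "update
internal state only" could look like. This is not sheep code: every
name in it (struct cluster, node_left, manual_recover, the
auto_recovery flag) is made up for illustration, and the real recovery
and IO paths are far more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration only; none of these names exist in sheepdog. */
#define MAX_NODES 16

struct cluster {
	int node_ids[MAX_NODES];
	size_t nr_nodes;
	unsigned int epoch;              /* bumped on every membership change */
	bool auto_recovery;              /* cluster-wide switch */
	bool recovery_pending;           /* membership changed while switch was off */
	unsigned int recoveries_started; /* stand-in for real data movement */
};

/* Stand-in for the expensive part: moving object data between nodes. */
static void start_data_recovery(struct cluster *c)
{
	c->recoveries_started++;
	c->recovery_pending = false;
}

/*
 * A node left the cluster. Internal state (membership, epoch) is always
 * updated; data recovery starts only when auto_recovery is on.
 * Returns false if the node id was not a member.
 */
static bool node_left(struct cluster *c, int id)
{
	for (size_t i = 0; i < c->nr_nodes; i++) {
		if (c->node_ids[i] != id)
			continue;
		/* Fill the hole with the last entry; order is not preserved. */
		c->node_ids[i] = c->node_ids[--c->nr_nodes];
		c->epoch++;
		if (c->auto_recovery)
			start_data_recovery(c);
		else
			c->recovery_pending = true;
		return true;
	}
	return false;
}

/*
 * Manual trigger, e.g. from an admin command once maintenance is over.
 * Returns true if a recovery was actually started.
 */
static bool manual_recover(struct cluster *c)
{
	if (!c->recovery_pending)
		return false;
	start_data_recovery(c);
	return true;
}
```

With auto_recovery off, a node leaving only bumps the epoch and marks
recovery as pending; no data moves until someone calls manual_recover().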

Thanks,
Yuan