[sheepdog] [PATCH] sheep: add a kill node operation
Liu Yuan
namei.unix at gmail.com
Fri Jul 20 09:43:17 CEST 2012
On 07/20/2012 03:33 PM, Chris Webb wrote:
> Liu Yuan <namei.unix at gmail.com> writes:
>
>> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
> [brief maintenance on a node causes automatic recovery]
>>> Such a large amount of data saturates the network at 100% until the
>>> rebooted node comes up again.
>>>
>>> Is that expected behavior?
>>
>> Yes, for now. A temporary node failure detection mechanism is not that
>> easy to implement. It needs fundamental changes to the current recovery
>> and IO path code, and handling IOs routed to the temporarily failed node
>> is the hardest part to get right.
>
> Perhaps the simplest interface conceivable here is a collie command to
> temporarily disable and later re-enable node recovery for the entire
> cluster? Switch it off during the kinds of maintenance described above, and
> then switch it back on again once we're running normally.
>
> I think distinguishing between nodes that are intentionally down and nodes
> which have failed and need to be recovered will be hard, as you say.
>
Yes, maybe. Manual recovery (only update internal state, without
rebalancing data) could be a better approach for this narrower problem.
It also needs smaller code changes than temporary failure detection.
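
A rough sketch of the idea in C, only to make the discussion concrete.
All names here (struct cluster_state, handle_node_leave,
handle_node_join, set_auto_recovery) are hypothetical and are not taken
from the sheepdog tree. The assumption is that a membership change
always updates internal state (epoch, node count), while data movement
is started only when a cluster-wide switch is on:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cluster state; not the real sheepdog structures. */
struct cluster_state {
	uint32_t epoch;          /* bumped on every membership change */
	int nr_nodes;            /* nodes currently in the cluster */
	bool auto_recovery;      /* false while the admin has switched recovery off */
	bool recovery_pending;   /* membership changed while recovery was off */
	int recoveries_started;  /* counter, for illustration only */
};

/* Stand-in for the real data recovery (object copy / rebalance). */
static void start_recovery(struct cluster_state *c)
{
	c->recoveries_started++;
	c->recovery_pending = false;
}

/* Common path: always update internal state, move data only if enabled. */
static void membership_changed(struct cluster_state *c)
{
	c->epoch++;
	if (c->auto_recovery)
		start_recovery(c);
	else
		c->recovery_pending = true;
}

void handle_node_leave(struct cluster_state *c)
{
	assert(c != NULL && c->nr_nodes > 0);
	c->nr_nodes--;
	membership_changed(c);
}

void handle_node_join(struct cluster_state *c)
{
	assert(c != NULL);
	c->nr_nodes++;
	membership_changed(c);
}

/* Admin switch; re-enabling starts one recovery if changes are pending. */
void set_auto_recovery(struct cluster_state *c, bool enable)
{
	assert(c != NULL);
	c->auto_recovery = enable;
	if (enable && c->recovery_pending)
		start_recovery(c);
}
```

With the switch off, a node that reboots for maintenance bumps the epoch
twice (leave, then join) but triggers no data movement. A single
recovery is started once the switch is turned back on. How IOs aimed at
the missing node are handled in the meantime is not covered by this
sketch.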
Thanks,
Yuan