[sheepdog] [PATCH] sheep: add a kill node operation

Fri Jul 20 09:33:25 CEST 2012

Liu Yuan <namei.unix at gmail.com> writes:

> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
[brief maintenance on a node causes automatic recovery]
> > Such a large amount of data keeps the network at 100% utilization until
> > the rebooted node comes up again.
> > 
> > Is that expected behavior?
> 
> Yes, for now. A mechanism for detecting temporarily failed nodes is not
> easy to implement: it needs fundamental changes to the current recovery
> and IO path code. In particular, how we handle IOs routed to the
> temporarily failed node is the most difficult part to get right.

Perhaps the simplest interface conceivable here is a collie command to
temporarily disable and later re-enable node recovery for the entire
cluster? Switch it off during the kinds of maintenance described above, and
then switch it back on again once we're running normally.
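
To make the idea concrete, here is a rough sketch of how such a switch
might behave on the sheep side. Everything in it is hypothetical: the
struct, the function names and the "pending" flag are made up for
illustration and are not existing sheepdog code, and it ignores how IOs
to the missing node would be handled in the meantime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cluster-wide state; none of these names come from sheepdog. */
struct cluster_state {
    bool recovery_enabled;   /* toggled by the proposed collie command */
    bool recovery_pending;   /* membership changed while recovery was off */
    int  recoveries_started; /* counter, for illustration only */
};

static void start_recovery(struct cluster_state *c)
{
    c->recoveries_started++;
    c->recovery_pending = false;
    printf("recovery started (#%d)\n", c->recoveries_started);
}

/* Called whenever a node leaves or joins the cluster. */
static void on_membership_change(struct cluster_state *c)
{
    assert(c != NULL);
    if (c->recovery_enabled)
        start_recovery(c);
    else
        c->recovery_pending = true; /* remember it, but move no data */
}

/* Backend of the hypothetical "disable/enable recovery" command. */
static void set_recovery(struct cluster_state *c, bool enable)
{
    assert(c != NULL);
    c->recovery_enabled = enable;
    /* On re-enable, catch up once for any change seen in the meantime. */
    if (enable && c->recovery_pending)
        start_recovery(c);
}
```

With recovery switched off, any number of leave/join events only set the
pending flag; switching it back on starts at most one recovery. A real
implementation would presumably skip even that if the membership ended
up unchanged.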

I think distinguishing between nodes that are intentionally down and nodes
which have failed and need to be recovered will be hard, as you say.

Cheers,

Chris.