[sheepdog] [PATCH] sheep: add a kill node operation

Chris Webb chris at arachsys.com
Fri Jul 20 09:33:25 CEST 2012


Liu Yuan <namei.unix at gmail.com> writes:

> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
[brief maintenance on a node causes automatic recovery]
> > Such a large amount of data keeps the network at 100% utilization
> > until the rebooted node comes up again.
> > 
> > Is that expected behavior?
> 
> Yes, for now. A mechanism for detecting temporary node failures is not
> easy to implement: it needs fundamental changes to the current recovery
> and IO path code. How to handle IOs routed to the temporarily failed
> node is the hardest part to get right.

Perhaps the simplest interface conceivable here is a collie command to
temporarily disable and later re-enable node recovery for the entire
cluster? Switch it off during the kinds of maintenance described above, and
then switch it back on again once we're running normally.
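
As a rough illustration of the semantics such a switch might have, here is
a toy Python model. It is not sheepdog code: the class Cluster and all of
its methods are hypothetical names invented for this sketch, and "recovery"
is reduced to appending a node name to a list. It also ignores the hard
part mentioned above, namely what happens to IOs routed to the node while
it is down.

```python
class Cluster:
    """Toy model of a cluster-wide recovery switch (hypothetical, not sheepdog)."""

    def __init__(self, nodes):
        self.live = set(nodes)
        self.recovery_enabled = True
        self.pending = set()   # nodes that left while recovery was disabled
        self.recovered = []    # log of nodes whose loss triggered recovery

    def node_left(self, node):
        self.live.discard(node)
        if self.recovery_enabled:
            self._recover(node)
        else:
            # Defer: the node may only be down for maintenance.
            self.pending.add(node)

    def node_joined(self, node):
        self.live.add(node)
        # The node came back, so no deferred recovery is needed for it.
        self.pending.discard(node)

    def set_recovery(self, enabled):
        self.recovery_enabled = enabled
        if enabled:
            # Recover only for nodes that are still missing.
            for node in sorted(self.pending):
                self._recover(node)
            self.pending.clear()

    def _recover(self, node):
        self.recovered.append(node)


cluster = Cluster(["a", "b", "c"])
cluster.set_recovery(False)   # before maintenance
cluster.node_left("b")        # reboot node b
cluster.node_joined("b")      # b is back
cluster.set_recovery(True)    # after maintenance
print(cluster.recovered)      # prints []
```

In this model a node that leaves and rejoins while the switch is off causes
no recovery at all, while a node that is still missing when the switch is
turned back on is recovered at that point.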

I think distinguishing between nodes that are intentionally down and nodes
which have failed and need to be recovered will be hard, as you say.

Cheers,

Chris.


