Liu Yuan <namei.unix at gmail.com> writes:

> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
[brief maintenance on a node causes automatic recovery]
> > Such large amount of data utilizes the network for 100% until the
> > rebooted node comes up again.
> >
> > That is expected behavior?
>
> Yes, for now. Temporary node detection mechanism is not that easy to
> implement, it needs fundamental change to current recovery and IO path
> code, especially how do we handle IOs routed to the temporarily failed
> node is most difficult to get it right.

Perhaps the simplest interface conceivable here is a collie command to
temporarily disable and later re-enable node recovery for the entire
cluster? Switch it off during the kinds of maintenance described above,
and then switch it back on again once we're running normally.

I think distinguishing between nodes that are intentionally down and
nodes which have failed and need to be recovered will be hard, as you
say.

Cheers,
Chris.