On 07/20/2012 03:33 PM, Chris Webb wrote:
> Liu Yuan <namei.unix at gmail.com> writes:
>
>> On 07/20/2012 02:55 PM, Dietmar Maurer wrote:
> [brief maintenance on a node causes automatic recovery]
>>> Such large amount of data utilizes the network for 100% until the
>>> rebooted node comes up again.
>>>
>>> That is expected behavior?
>>
>> Yes, for now. Temporary node detection mechanism is not that easy to
>> implement, it needs fundamental change to current recovery and IO path
>> code, especially how do we handle IOs routed to the temporarily failed
>> node is most difficult to get it right.
>
> Perhaps the simplest interface conceivable here is a collie command to
> temporarily disable and later re-enable node recovery for the entire
> cluster? Switch it off during the kinds of maintenance described above, and
> then switch it back on again once we're running normally.
>
> I think distinguishing between nodes that are intentionally down and nodes
> which have failed and need to be recovered will be hard, as you say.
>

Yes, maybe. Manual recovery (only updating the internal state, without
rebalancing data) could be a better approach for this narrower problem.
It would also need smaller code changes than temporary failure detection.

Thanks,
Yuan
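
P.S. To make the idea concrete, here is a rough sketch of the manual recovery switch discussed above. It is only a toy model in Python, not Sheepdog code: the Cluster class, its fields, and the disable_recovery/enable_recovery names are all hypothetical, and the actual data copy is reduced to a counter. The point is that a membership change always updates internal state (the epoch), while data rebalancing runs only when recovery is enabled.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy cluster model; every name here is hypothetical, not Sheepdog's."""
    nodes: set
    epoch: int = 1
    recovery_enabled: bool = True
    recoveries_run: int = 0  # how many times data rebalancing was triggered

    def __post_init__(self):
        # Membership that the current data placement was computed for.
        self.placed_on = frozenset(self.nodes)

    def _recover(self):
        # Stand-in for copying objects to restore redundancy.
        self.recoveries_run += 1
        self.placed_on = frozenset(self.nodes)

    def _membership_changed(self):
        # Internal state is always updated, even with recovery switched off.
        self.epoch += 1
        if self.recovery_enabled:
            self._recover()

    def node_leave(self, node):
        self.nodes.discard(node)
        self._membership_changed()

    def node_join(self, node):
        self.nodes.add(node)
        self._membership_changed()

    def disable_recovery(self):
        self.recovery_enabled = False

    def enable_recovery(self):
        self.recovery_enabled = True
        # Rebalance only if membership differs from the last placement.
        if frozenset(self.nodes) != self.placed_on:
            self._recover()
```

With this model, switching recovery off, rebooting one node, and switching recovery back on moves no data, because the membership ends up where it started. A node that is still missing when recovery is re-enabled triggers one rebalancing pass at that point. How IOs aimed at the absent node are handled in the meantime is exactly the hard part and is not modelled here.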