On Mon, Jul 30, 2012 at 4:54 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 07/30/2012 04:35 PM, Yunkai Zhang wrote:
>> Good question. In order to simplify this patchset, I'll let sheep continue
>> to process LEAVE events even if we have started delay recovery.
>
> Then how do you handle group kill in one epoch? I think an actively killed
> node and a passively killed node are almost the same thing. If you can
> handle a group kill operation in one epoch, randomly failed events can
> also be handled the same way, but this will open a potentially fatal problem:
>
> the node failure events (be it killed or failed) will be blocked for
> some time window. If this window exceeds some value (e.g., more than
> 120s), some VMs will malfunction because of their internal timeouts on
> issued IOs, which unfortunately were routed to the failed nodes.
>
> I think the most useful usage of your patch set is for group join, which
> joins multiple nodes in one go; this can be done relatively safely. For
> group kill, there might be some potential problems brought in together.

Agreed. I originally just wanted to implement the 'group join' feature;
'group kill' has not been thought through carefully. I just found that
control recovery *maybe* can implement these two functions at the same
time, but I'm not very sure. I need to dig more.

>
> Thanks,
> Yuan

--
Yunkai Zhang
Work at Taobao