On 07/30/2012 04:35 PM, Yunkai Zhang wrote:
> Good question, in order to simplify these patchset, I'll let sheep continue
> to process LEAVE event even if we have start delay recovery.

Then how do you handle group kill in one epoch?

I think actively killed nodes and passively killed nodes are almost the same
thing. If you can handle a group kill operation in one epoch, randomly
failing nodes can be handled the same way. But this opens up a potentially
fatal problem: node failure events (whether the node was killed or failed)
will be blocked for some time window. If this window exceeds some value
(e.g., more than 120s), some VMs will malfunction because of their internal
timeouts on issued IOs, which unfortunately were routed to the failed nodes.

I think the most useful application of your patch set is group join, that
is, joining multiple nodes in one go; this can be done relatively safely.
Group kill, however, might bring in some potential problems.

Thanks,
Yuan