[sheepdog] [PATCH 1/3] collie: add delay_recovery {start|stop} command

Mon Jul 30 11:02:49 CEST 2012

On Mon, Jul 30, 2012 at 4:54 PM, Liu Yuan <namei.unix at gmail.com> wrote:
> On 07/30/2012 04:35 PM, Yunkai Zhang wrote:
>> Good question. To simplify this patchset, I'll let sheep continue
>> to process LEAVE events even if delay recovery has been started.
>
> Then how do you handle group kill in one epoch? I think actively killed
> nodes & passively killed nodes are almost the same thing. If you can
> handle a group kill operation in one epoch, random failure events can
> also be handled the same way, but this opens a potentially fatal problem:
>
>   the node failure events (whether killed or failed) will be blocked for
> some time window. If this window exceeds some value (e.g., more than
> 120s), some VMs will malfunction because of their internal timeouts on
> issued IOs, which unfortunately were routed to the failed nodes.
>
> I think the most useful application of your patch set is group join, that
> is, joining multiple nodes in one go; this can be done relatively safely.
> Group kill might bring in some potential problems.

Agreed. I originally just wanted to implement the 'group join' feature;
'group kill' has not been thought through carefully. I just found that
controlling recovery *may* be able to implement both functions at the
same time, but I'm not sure yet. I need to dig more.

>
> Thanks,
> Yuan

-- 
Yunkai Zhang
Work at Taobao