[sheepdog] [PATCH 1/3] collie: add delay_recovery {start|stop} command

Mon Jul 30 10:54:21 CEST 2012

On 07/30/2012 04:35 PM, Yunkai Zhang wrote:
> Good question, in order to simplify these patchset, I'll let sheep continue
> to process LEAVE event even if we have start delay recovery.

Then how do you handle a group kill in one epoch? I think an actively killed
node and a passively killed node are almost the same thing. If you can
handle a group kill operation in one epoch, random failure events can
also be handled the same way, but this opens up a potentially fatal problem:

  Node failure events (whether the node was killed or failed) will be
blocked for some time window. If this window exceeds some value (e.g., more
than 120s), some VMs will malfunction because of their internal timeouts on
issued I/Os, which unfortunately were routed to the failed nodes.
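
To make the concern concrete, here is a minimal sketch in C of the check I
have in mind. It is not taken from the sheepdog code: the helper name
delay_window_unsafe() and the constant GUEST_IO_TIMEOUT_SEC are hypothetical,
and 120s is just the example value from above, not a measured guest timeout.

```c
#include <stdbool.h>

/* Hypothetical constant: example guest-side I/O timeout (120s, as above). */
#define GUEST_IO_TIMEOUT_SEC 120u

/*
 * Hypothetical helper, not part of sheepdog: returns true when a node
 * failure event has been blocked for longer than the guest I/O timeout,
 * i.e. I/Os routed to the failed node may already have timed out in the VM.
 */
static bool delay_window_unsafe(unsigned int blocked_sec)
{
	return blocked_sec > GUEST_IO_TIMEOUT_SEC;
}
```

With something like this, a delayed LEAVE event would have to be processed
before delay_window_unsafe() becomes true, regardless of whether
delay_recovery is still active.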

I think the most useful application of your patch set is group join, that
is, joining multiple nodes in one go; this can be done relatively safely.
For group kill, it might bring some potential problems along with it.

Thanks,
Yuan