On 05/24/2012 11:12 AM, Liu Yuan wrote:
> I have created a branch called 'live-lock-fix' to get a better review of
> this patch set.
>
> Generally speaking, this patch set tries to fix live lock(busy retrying)
> on some of mutually influenced nodes in recovery phase, observed by our
> simulated 960 nodes cluster.

I am now also fixing yet another fatal problem: a deadlock between nodes that send recovery requests to each other, get a retry error code, and then retry on a timer. Those nodes never run to completion, because the outstanding recovery requests block the confchg event (because of sys->nr_outstanding_io), so the epoch is never lifted to an agreed value.

Thanks,
Yuan