[sheepdog] [PATCH 0/8] make IO requests to wait in recovery instead of busy retrying

Tue May 22 04:51:00 CEST 2012

In many cases a failed request is retried without waiting: it is put
straight back into the request queue to run again, which can keep the
CPU too busy.

In our cluster with 960 nodes, when 10 nodes leave while the VMs are
issuing heavy IO, recovery does not run well: the request queue fills
with pending IO requests that keep retrying, and the CPU is too busy to
process the recovery requests.

There is also a race condition in recovery that keeps nodes retrying
the recovery of a single object, so the recovery work hangs there.

We should not retry a request at the moment it fails. Instead, we
should put it into a wait queue and let it sleep until the epoch or
whatever else the request needs is satisfied, and then wake it up to
retry.
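
To illustrate the idea, here is a minimal single-threaded sketch of such
a wait queue. It is not the code from this series: struct wait_req,
struct wait_queue, wq_init(), wq_sleep() and wq_wake() are hypothetical
names, and it assumes the only wake-up condition is the cluster epoch
reaching the value the request waits for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request type for illustration, not sheepdog's own. */
struct wait_req {
	uint32_t wait_epoch;      /* epoch this request needs before retrying */
	struct wait_req *next;
};

/* FIFO of sleeping requests. */
struct wait_queue {
	struct wait_req *head;
	struct wait_req *tail;
};

void wq_init(struct wait_queue *q)
{
	q->head = NULL;
	q->tail = NULL;
}

/* Park a failed request instead of putting it back on the run queue. */
void wq_sleep(struct wait_queue *q, struct wait_req *req, uint32_t epoch)
{
	assert(req != NULL);
	req->wait_epoch = epoch;
	req->next = NULL;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
}

/*
 * Wake every request whose awaited epoch has been reached. Ready
 * requests are unlinked first and retried afterwards, so retry() may
 * safely put a request back to sleep on the same queue.
 * Returns the number of requests woken.
 */
size_t wq_wake(struct wait_queue *q, uint32_t cur_epoch,
	       void (*retry)(struct wait_req *))
{
	struct wait_req *ready = NULL, **ready_tail = &ready;
	struct wait_req **pp = &q->head;
	struct wait_req *last = NULL;
	size_t n = 0;

	while (*pp) {
		struct wait_req *r = *pp;

		if (r->wait_epoch <= cur_epoch) {
			*pp = r->next;          /* unlink from the wait queue */
			r->next = NULL;
			*ready_tail = r;        /* append to the ready list */
			ready_tail = &r->next;
		} else {
			last = r;
			pp = &r->next;
		}
	}
	q->tail = last;

	while (ready) {
		struct wait_req *r = ready;

		ready = r->next;
		r->next = NULL;
		if (retry)
			retry(r);
		n++;
	}
	return n;
}
```

With this shape, a sleeping request costs no CPU until wq_wake() is
called on an epoch change, so recovery requests are not starved by
retries. A real implementation would also need locking and the other
wake-up conditions mentioned above.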

thanks,

levin