From: levin li <xingke.lwp at taobao.com>

v2 --> v3:
1. ported list_splice_tail_init() from the Linux kernel for
   clear_wait_obj_requests()
2. moved process_event_request_queue() out of the loops

In many cases a failed request is retried without waiting: it is put
straight back into the request queue to run again, which can keep the CPU
too busy. In our cluster with 960 nodes, when 10 nodes leave while the VMs
are issuing heavy IO, recovery does not run well, because the request queue
holds too many pending IO requests that are being retried, and the CPU is
too busy to process the recovery requests.

There is also a race condition in recovery that keeps nodes retrying to
recover a single object, so the recovery work hangs there.

We should not retry a request the moment it fails. Instead, we should put
it into a queue and let it sleep until the epoch, or whatever else the
request needs, is ready, and then wake it up to retry.

There are 4 cases in which a request needs to wait before retrying:

1. The epoch of the request sender is older than the system epoch.
   In this case, we respond to the sender with SD_RES_OLD_NODE_VER to make
   the gateway retry. The gateway then puts the request into wait_rw_queue
   to wait for its system epoch to change.

2. The epoch of the request sender is newer than the system epoch.
   In this case, we put the request into wait_rw_queue to wait for the
   local system epoch to change, and then retry the request locally.

3. The requested object doesn't exist and the recovery work is in the
   RW_INIT state.
   In this case, we make is_recoverying_oid() check whether the requested
   object exists. If it does, we process the request. If not, we put the
   request into wait_rw_queue to wait for the recovery work to start.

4. The requested object doesn't exist and is pending recovery.
   In this case, we put the request into wait_obj_queue, and every time we
   recover an object we try to wake up the request in wait_obj_queue that
   is waiting for the object just recovered.
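The four cases above can be read as one decision made per incoming request.
The following is only a rough sketch of that decision, not code from this
series: struct req_ctx, classify_request(), the REQ_* actions, RW_IDLE and
the obj_pending flag are hypothetical names invented for illustration, and
RW_RUN is assumed to be the state that follows RW_INIT. Only
SD_RES_OLD_NODE_VER, RW_INIT, wait_rw_queue and wait_obj_queue come from the
description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the recovery state; only RW_INIT is named above. */
enum rw_state { RW_IDLE, RW_INIT, RW_RUN };

/* Hypothetical outcomes of looking at one request. */
enum req_action {
	REQ_PROCESS,       /* run the request now */
	REQ_REPLY_OLD_VER, /* case 1: reply SD_RES_OLD_NODE_VER to the sender */
	REQ_WAIT_RW,       /* cases 2 and 3: sleep on wait_rw_queue */
	REQ_WAIT_OBJ       /* case 4: sleep on wait_obj_queue */
};

/* Hypothetical snapshot of what the decision depends on. */
struct req_ctx {
	uint32_t req_epoch;  /* epoch carried by the request */
	uint32_t sys_epoch;  /* epoch of the local node */
	bool obj_exists;     /* requested object is already on disk */
	bool obj_pending;    /* requested object is queued for recovery */
	enum rw_state state; /* current recovery state */
};

static enum req_action classify_request(const struct req_ctx *c)
{
	assert(c != NULL);

	/* Case 1: the sender is behind, let the gateway wait and retry. */
	if (c->req_epoch < c->sys_epoch)
		return REQ_REPLY_OLD_VER;

	/* Case 2: we are behind, sleep until the local epoch changes. */
	if (c->req_epoch > c->sys_epoch)
		return REQ_WAIT_RW;

	/* Epochs match and the object is present: nothing to wait for. */
	if (c->obj_exists)
		return REQ_PROCESS;

	/* Case 3: recovery has not started yet, wait for it to start. */
	if (c->state == RW_INIT)
		return REQ_WAIT_RW;

	/* Case 4: recovery is running and will get to this object. */
	if (c->state == RW_RUN && c->obj_pending)
		return REQ_WAIT_OBJ;

	/* Otherwise let the normal path handle it (it may simply fail). */
	return REQ_PROCESS;
}
```

The point of the sketch is that none of the branches puts the request
straight back into the request queue: a request either runs, is answered
with an error that makes the gateway wait, or sleeps on one of the two wait
queues until the condition it depends on changes.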

levin li (8):
  sheep: port list_splice_tail_init() from linux kernel
  sheep: make requests with new epoch sleep until epoch is updated
  sheep: make gateway to retry when received SD_RES_OLD_NODE_VER
  recovery: make IO request to wait when recovery is in RW_INIT
  recovery: make IO request to wait when the requested object is in recovery
  recovery: clear the object wait queue when new recovery work comes
  recovery: fix a race condition in recovery
  sheep: make gateway requests only retry in io_op_done()

 include/list.h           |    9 +++++
 include/sheepdog_proto.h |    1 +
 sheep/group.c            |    2 ++
 sheep/recovery.c         |   50 ++++++++++++++++++++++----
 sheep/sdnet.c            |   88 +++++++++++++++++++++++++++++++++++++---------
 sheep/sheep_priv.h       |    5 +++
 6 files changed, 133 insertions(+), 22 deletions(-)

-- 
1.7.10