From: Yunkai Zhang <qiushu.zyk at taobao.com>

A deadlock was found in the following scenario:

Suppose that there are two sheep, S1 and S2, and that their
event_queues are empty.

Now S1 receives a notify message, M1, and calls sd_notify_handler(),
which adds a notify event to its event_queue and then calls
process_request_event_queues() to queue_work this event.

At the same time, S2 sends a notify message, M2, to the cluster, and
while S2 is in zk_dispatch() handling M2 it submits an I/O request
(e.g. a do_lookup_vdi operation) to S1.

After S1 receives the I/O request from S2, it eventually calls
process_request_event_queues() to deal with it. If S1 calls this
function before M1's event_done() has finished, the I/O request is not
processed, because the event_queue is not empty.

This leads to a deadlock between S1 and S2: S2 blocks in read(),
waiting for the response data from S1, and the whole cluster is
suspended forever.

To fix this problem, modify event_done() so that it calls
process_request_event_queues() unconditionally. The request_queue is
then processed even when the event_queue is already empty.

Signed-off-by: Yunkai Zhang <qiushu.zyk at taobao.com>
---
 sheep/group.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/sheep/group.c b/sheep/group.c
index b4cf2da..7e19d33 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -964,8 +964,7 @@ static void event_done(struct work *work)
 	if (ret)
 		panic("failed to register event fd");
 
-	if (!list_empty(&sys->event_queue))
-		process_request_event_queues();
+	process_request_event_queues();
 }
 
 int is_access_to_busy_objects(uint64_t oid)
-- 
1.7.7.6
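For readers who want to see the stall in isolation, here is a minimal toy model of the scenario described above. It is not sheepdog code: every `toy_*` name is hypothetical, and the queue semantics (events run one at a time, and requests are only served when no event is pending or running) are assumptions drawn from the description in this message rather than from sheep/group.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the state of one sheep (S1). */
struct toy_sheep {
	int pending_events;    /* entries waiting in the event queue */
	int pending_requests;  /* entries waiting in the request queue */
	bool event_running;    /* an event is between start and event_done */
	int served_requests;   /* requests that were actually processed */
};

/* Toy analogue of process_request_event_queues() (assumed semantics). */
static void toy_process_queues(struct toy_sheep *s)
{
	if (s->event_running)
		return;  /* events are serialized; nothing else may start */
	if (s->pending_events > 0) {
		s->pending_events--;
		s->event_running = true;  /* the event is now in flight */
		return;
	}
	/* No event pending or running: drain the request queue. */
	s->served_requests += s->pending_requests;
	s->pending_requests = 0;
}

/* Toy analogue of event_done(); 'patched' selects the new behaviour. */
static void toy_event_done(struct toy_sheep *s, bool patched)
{
	s->event_running = false;
	/* Old code: only re-run the queues if another event is waiting. */
	if (patched || s->pending_events > 0)
		toy_process_queues(s);
}

/* Replays the scenario on S1 and returns how many requests were served. */
static int toy_scenario(bool patched)
{
	struct toy_sheep s1 = {0};

	s1.pending_events = 1;    /* M1 arrives and is queued */
	toy_process_queues(&s1);  /* M1 starts running */
	s1.pending_requests = 1;  /* I/O request from S2 arrives */
	toy_process_queues(&s1);  /* returns early: M1 is still in flight */
	toy_event_done(&s1, patched);
	return s1.served_requests;
}

/* Self-check: the request stalls without the fix and is served with it. */
static void toy_self_check(void)
{
	assert(toy_scenario(false) == 0);
	assert(toy_scenario(true) == 1);
}
```

Calling `toy_self_check()` from a `main()` exercises both variants. In the unpatched variant the request stays queued because nothing re-runs the queues once the event queue is empty, which corresponds to S2 blocking in read().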