On 06/04/2012 02:52 PM, MORITA Kazutaka wrote:
> At Mon, 04 Jun 2012 14:12:10 +0800,
> Liu Yuan wrote:
>>
>> On 06/04/2012 02:04 PM, Liu Yuan wrote:
>>
>>> The current object_cache_pull() causes the bug below:
>>> ...
>>> do_gateway_request(288) 2, 80d6d76e00000000 , 1
>>> Jun 04 10:16:37 connect_to(241) 2126, 10.232.134.3:7000
>>> Jun 04 10:16:37 client_handler(747) closed connection 2116
>>> Jun 04 10:16:37 destroy_client(678) connection from: 127.0.0.1:60214
>>> Jun 04 10:16:37 listen_handler(797) accepted a new connection: 2116
>>> Jun 04 10:16:37 client_rx_handler(586) connection from: 127.0.0.1:60216
>>> Jun 04 10:16:37 queue_request(385) 2
>>> Jun 04 10:16:37 do_gateway_request(288) 2, 80d6d76e00000000 , 1
>>> Jun 04 10:16:37 do_gateway_request(308) failed: 2, 80d6d76e00000000 , 1, 54014b01
>>> ...
>>>
>>> This is because we use forward_read_obj_req(), which tries to multiplex a socket
>>> FD if concurrent requests access the same object and are unfortunately routed to
>>> the same node.
>>>
>>> The object cache is under very high pressure from concurrent requests accessing
>>> the same COW object from cloned VMs, so this problem emerges. It looks to me that,
>>> besides the object cache, QEMU requests are also subject to this problem,
>>> because QEMU's sheepdog block layer can issue multiple requests in one go.
>>
>>
>> The alternative fix is to write a new fd cache, which allows multiple FDs
>> to the same node. This looks like a better fix that sorts out all the related
>> problems.
>
> Can you explain how the current fd cache causes the above problem
> against the concurrent accesses to the same node in more detail?
>

I am not 100% sure about this issue. It comes from my experience developing
sheepfs, where I used a single FD for reads and writes. Since FUSE issues
highly concurrent requests, I noticed the same error as in the example above:
the error code is quite random (above it is '54014b01').
After a long time debugging, I came to the conclusion that the problem *might*
be this: subsequent read/write requests interleave with the previous one and
read the wrong response. When I used a dedicated FD for each request, the
problem simply went away.

Thanks,
Yuan
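
P.S. To make the suspected interleaving concrete, here is a minimal sketch in
Python. It is not sheepdog or sheepfs code. The 4-byte "request id" wire
format, the toy echo server, and all function names are assumptions made up
for illustration. Two requesters, A and B, send requests over one shared
socket. B happens to read first and picks up the reply meant for A. With one
socket per requester the mix-up cannot happen.

```python
import socket
import struct

# Hypothetical wire format for illustration only: a 4-byte request id.
# This is NOT the sheepdog protocol.
HDR = struct.Struct("!I")


def send_id(sock, req_id):
    """Send one fixed-size message carrying a request id."""
    sock.sendall(HDR.pack(req_id))


def recv_id(sock):
    """Read exactly one fixed-size message and return its request id."""
    buf = b""
    while len(buf) < HDR.size:
        chunk = sock.recv(HDR.size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return HDR.unpack(buf)[0]


def serve(sock, count):
    """Toy server: answer `count` requests in arrival order, echoing each id."""
    for _ in range(count):
        send_id(sock, recv_id(sock))


def shared_fd():
    """Requesters A (id 1) and B (id 2) share one socket."""
    client, server = socket.socketpair()
    try:
        send_id(client, 1)       # requester A sends its request
        send_id(client, 2)       # requester B sends before A has read its reply
        serve(server, 2)         # server replies in arrival order: 1, then 2
        got_b = recv_id(client)  # B reads first and consumes A's reply
        got_a = recv_id(client)  # A is left with B's reply
        return {"A": got_a, "B": got_b}
    finally:
        client.close()
        server.close()


def dedicated_fds():
    """Each requester gets its own socket, so replies cannot cross."""
    ids = {"A": 1, "B": 2}
    pairs = {name: socket.socketpair() for name in ids}
    try:
        for name, req_id in ids.items():
            send_id(pairs[name][0], req_id)
        for name in ids:
            serve(pairs[name][1], 1)
        # Read in the "wrong" order (B first) on purpose; it no longer matters.
        return {name: recv_id(pairs[name][0]) for name in ("B", "A")}
    finally:
        for client, server in pairs.values():
            client.close()
            server.close()


if __name__ == "__main__":
    print("shared fd:    ", shared_fd())
    print("dedicated fds:", dedicated_fds())
```

In the shared case each requester ends up holding the other's reply, which
would be consistent with the random-looking error codes if the same thing
happens in the gateway path. I have not confirmed that it does.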