At Mon, 04 Jun 2012 15:07:43 +0800,
Liu Yuan wrote:
>
> On 06/04/2012 02:52 PM, MORITA Kazutaka wrote:
>
> > At Mon, 04 Jun 2012 14:12:10 +0800,
> > Liu Yuan wrote:
> >>
> >> On 06/04/2012 02:04 PM, Liu Yuan wrote:
> >>
> >>> The current object_cache_pull() causes the bug below:
> >>> ...
> >>> do_gateway_request(288) 2, 80d6d76e00000000 , 1
> >>> Jun 04 10:16:37 connect_to(241) 2126, 10.232.134.3:7000
> >>> Jun 04 10:16:37 client_handler(747) closed connection 2116
> >>> Jun 04 10:16:37 destroy_client(678) connection from: 127.0.0.1:60214
> >>> Jun 04 10:16:37 listen_handler(797) accepted a new connection: 2116
> >>> Jun 04 10:16:37 client_rx_handler(586) connection from: 127.0.0.1:60216
> >>> Jun 04 10:16:37 queue_request(385) 2
> >>> Jun 04 10:16:37 do_gateway_request(288) 2, 80d6d76e00000000 , 1
> >>> Jun 04 10:16:37 do_gateway_request(308) failed: 2, 80d6d76e00000000 , 1, 54014b01
> >>> ...
> >>>
> >>> This is because we use forward_read_obj_req(), which tries to multiplex a socket
> >>> FD if concurrent requests access the same object and are unfortunately routed to
> >>> the same node.
> >>>
> >>> Object cache has a very high pressure of concurrent requests accessing the same
> >>> COW object from cloned VMs, so this problem emerges. It looks to me that,
> >>> besides object cache, QEMU requests are also subject to this problem
> >>> because QEMU's sheepdog block layer can issue multiple requests in one go.
> >>
> >>
> >> The alternative fix is to write a new fd cache, which allows multiple FDs
> >> to the same node. This looks like a better fix that sorts out all the related
> >> problems
> >
> > Can you explain how the current fd cache causes the above problem
> > against the concurrent accesses to the same node in more detail?
> >
>
>
> I am not 100% sure about this issue. It comes from my experience
> developing sheepfs, when I used a single FD to read/write.
> Since FUSE will issue highly concurrent requests, I noticed the same
> error as in the example above: the error code is quite random (above it
> is '54014b01').
>
> After a long time debugging, I came to the conclusion that the problem
> *might* be:
>
> The subsequent read/write requests interleave with the previous one,
> and wrongly read the response.

I think we should find out how they interleave before working out how
to fix it.

The current fd cache seems to allow multiple accesses to the same node,
because cached_fds is a thread-local variable and no fd is used by
multiple threads at the same time.

Thanks,

Kazutaka

>
> When I use a dedicated FD for each request, the problem simply goes away.
>
> Thanks,
> Yuan
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
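
To illustrate the thread-local point, here is a minimal sketch of a per-thread connection cache. It is written in Python for brevity and is not sheepdog's code: `get_cached_sock`, `run_demo`, and the `_local` object are hypothetical names, and `socket.socketpair()` stands in for a real connection to the node. It only shows that, with a thread-local cache, two threads asking for the same node end up with different sockets, while one thread reuses its own.

```python
import socket
import threading

# Hypothetical stand-in for a per-thread fd cache such as cached_fds.
_local = threading.local()


def get_cached_sock(node):
    """Return the calling thread's socket for `node`, creating it on first use."""
    cache = getattr(_local, "socks", None)
    if cache is None:
        cache = _local.socks = {}
    if node not in cache:
        # Assumption: socketpair() replaces a real connect() to the node
        # so that the sketch runs without a network.
        cache[node] = socket.socketpair()
    return cache[node][0]


def run_demo(node="10.232.134.3:7000", nthreads=2):
    """Have each thread look up the same node twice and collect the sockets."""
    results = {}

    def worker(idx):
        first = get_cached_sock(node)
        second = get_cached_sock(node)
        results[idx] = (first, second)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    res = run_demo()
    # Same thread: the cached socket is reused.
    print(res[0][0] is res[0][1])
    # Different threads: separate sockets for the same node.
    print(res[0][0] is not res[1][0])
```

If the cache behaves like this, responses can only interleave when a single thread has more than one request in flight on its own socket, which is the case that would need to be checked first.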