At Wed, 16 May 2012 13:27:31 -0400,
Christoph Hellwig wrote:
>
> On Thu, May 17, 2012 at 01:33:14AM +0900, MORITA Kazutaka wrote:
> > > I think the rationale is to not change the cluster configuration while
> > > I/O is in progress.  With the vnode_info structure making the cluster
> > > state during an I/O operation explicit (together with hdr->epoch)
> > > I suspect this isn't needed any more, but I want to do a full-blown
> > > audit of the I/O path first.
> >
> > The reason is that the recovery algorithm assumes that all objects in
> > the older epoch are immutable, which means only the objects in the
> > current epoch are writable.  If outstanding I/Os update objects in the
> > previous epoch after they are recovered to the current epoch, their
> > replicas end up in an inconsistent state.
>
> That's definitely a problem for both approaches to update the epoch and
> node information directly from the main thread, as I/O can still be in
> flight at this point.

Hmm, in the previous implementation, Sheepdog flushed all in-flight
I/Os before processing join/leave events, and blocked any new I/Os until
sheep had updated the epoch and node information.

It seems that the current code is broken...  IIUC,
process_request_queue() must not be called while event_running == 1.

Thanks,

Kazutaka
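
To make the invariant above concrete, here is a minimal sketch of the
"flush in-flight I/Os, then update the epoch, then resume" ordering.  It
is only an illustration, not the actual sheep code: struct cluster_state,
its fields, and every function except the names process_request_queue()
and event_running are hypothetical, and I/O is modelled as plain counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; not the real sheep data structures. */
struct cluster_state {
	unsigned int epoch;   /* current cluster epoch */
	bool event_running;   /* a join/leave event is being processed */
	int nr_inflight;      /* I/Os dispatched but not yet completed */
	int nr_pending;       /* I/Os queued and waiting to be dispatched */
};

/*
 * Dispatch queued I/Os.  While a membership event is running nothing is
 * dispatched, so no new I/O can start against the old epoch.
 * Returns the number of I/Os dispatched.
 */
int process_request_queue(struct cluster_state *s)
{
	int dispatched;

	if (s->event_running)
		return 0;

	dispatched = s->nr_pending;
	s->nr_inflight += dispatched;
	s->nr_pending = 0;
	return dispatched;
}

/*
 * Finish the event only after every in-flight I/O has drained: bump the
 * epoch, clear event_running, then release the blocked I/Os.
 */
bool try_finish_event(struct cluster_state *s)
{
	if (!s->event_running || s->nr_inflight > 0)
		return false;

	s->epoch++;
	s->event_running = false;
	process_request_queue(s);
	return true;
}

/* A new I/O arrives; it is dispatched at once unless an event is running. */
void queue_request(struct cluster_state *s)
{
	s->nr_pending++;
	process_request_queue(s);
}

/* An in-flight I/O completes; this may let a waiting event proceed. */
void complete_request(struct cluster_state *s)
{
	assert(s->nr_inflight > 0);
	s->nr_inflight--;
	try_finish_event(s);
}

/* A join/leave event arrives; it waits for in-flight I/Os to drain. */
void start_event(struct cluster_state *s)
{
	s->event_running = true;
	try_finish_event(s);
}
```

With this ordering, an I/O that arrives after start_event() stays queued
until the last in-flight I/O has completed and the epoch has been bumped,
so no write can reach an object of the previous epoch after recovery has
started.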