At Thu, 17 May 2012 09:56:15 +0800,
Liu Yuan wrote:
>
> On 05/17/2012 01:57 AM, MORITA Kazutaka wrote:
>
> > At Wed, 16 May 2012 13:27:31 -0400,
> > Christoph Hellwig wrote:
> >>
> >> On Thu, May 17, 2012 at 01:33:14AM +0900, MORITA Kazutaka wrote:
> >>>> I think the rationale is to not change the cluster configuration while
> >>>> I/O is in progress.  With the vnode_info structure making the cluster
> >>>> state during an I/O operation explicit (together with hdr->epoch)
> >>>> I suspect this isn't needed any more, but I want to do a full blown
> >>>> audit of the I/O path first.
> >>>
> >>> The reason is that the recovery algorithm assumes that all objects in
> >>> the older epoch are immutable, which means only the objects in the
> >>> current epoch are writable.  If outstanding I/Os update objects in the
> >>> previous epoch after they are recovered to the current epoch, their
> >>> replicas result in inconsistent state.
> >>
>
> This assumption seems not necessary, at least to Farm, where I/O will
> always be routed into objects in the working directory.

Really?  I thought that this problem does not depend on the underlying
storage driver.

If there is one node, A, and the number of copies is 1, how does Farm
handle the following case?

 - the user adds a second node B while there are in-flight I/Os on node A
 - node A increments the epoch from 1 to 2, and node B recovers objects
   from epoch 1 on node A
 - after node B has recovered the objects into epoch 2, the in-flight
   I/Os on node A update the objects in epoch 1 on node A
 - node A returns success to the clients, but the updated data will be
   lost??

> I think both recovery code and any assumptions need to be revisited and
> possibly this is a long term issue.

Agreed.

Thanks,

Kazutaka
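
[Editorial sketch] To make the interleaving in the four bullets above concrete, here is a minimal C sketch of the lost-update race under the stated assumptions (one node A, one copy, node B joining at epoch 2). The names epoch1_obj_on_A and epoch2_obj_on_B are illustrative only and do not correspond to actual sheepdog or Farm data structures; the program simply replays the steps in the problematic order.

    /*
     * Illustration of the lost-update race: recovery copies an epoch-1
     * object before an in-flight write to it has landed.
     */
    #include <stdio.h>
    #include <string.h>

    static char epoch1_obj_on_A[16] = "old";  /* object stored in epoch 1 on node A */
    static char epoch2_obj_on_B[16];          /* copy recovered into epoch 2 on node B */

    int main(void)
    {
            /* 1. A client write against epoch 1 on node A is in flight. */

            /* 2. Node B joins, the epoch is bumped from 1 to 2, and B
             *    recovers the epoch-1 object from A.  Recovery assumes
             *    this epoch-1 object can no longer change. */
            memcpy(epoch2_obj_on_B, epoch1_obj_on_A, sizeof(epoch2_obj_on_B));

            /* 3. Only now does the in-flight write land, updating the
             *    epoch-1 object on A, which was already recovered. */
            strcpy(epoch1_obj_on_A, "new");

            /* 4. A acknowledges the write, yet the epoch-2 copy that the
             *    cluster will serve still holds the old data. */
            printf("acked to client: new, epoch-2 copy on B: %s\n",
                   epoch2_obj_on_B);
            return 0;
    }

Running the sketch prints "acked to client: new, epoch-2 copy on B: old", which is exactly the inconsistency the scenario describes: the acknowledged update exists only in the superseded epoch-1 object and is lost once epoch 2 becomes authoritative.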