At Thu, 17 May 2012 09:56:15 +0800,
Liu Yuan wrote:
>
> On 05/17/2012 01:57 AM, MORITA Kazutaka wrote:
>
> > At Wed, 16 May 2012 13:27:31 -0400,
> > Christoph Hellwig wrote:
> >>
> >> On Thu, May 17, 2012 at 01:33:14AM +0900, MORITA Kazutaka wrote:
> >>>> I think the rationale is to not change the cluster configuration while
> >>>> I/O is in progress.  With the vnode_info structure making the cluster
> >>>> state during an I/O operation explicit (together with hdr->epoch)
> >>>> I suspect this isn't needed any more, but I want to do a full blown
> >>>> audit of the I/O path first.
> >>>
> >>> The reason is that the recovery algorithm assumes that all objects in
> >>> the older epoch are immutable, which means only the objects in the
> >>> current epoch are writable.  If outstanding I/Os update objects in the
> >>> previous epoch after they are recovered to the current epoch, their
> >>> replicas result in inconsistent state.
> >>
>
> This assumption seems not necessary, at least to Farm, where I/O will
> always be routed into objects in the working directory.

Really?  I thought that this problem does not depend on the underlying
storage driver.

If there is one node, A, and the number of copies is 1, how does Farm
handle the following case?

 - the user adds a second node B while there are in-flight I/Os on node A
 - node A increments the epoch from 1 to 2, and node B recovers objects
   from epoch 1 on node A
 - after node B has recovered the objects into epoch 2, the in-flight
   I/Os on node A update the objects in epoch 1 on node A
 - node A returns success to the clients, but the updated data will be
   lost??

> I think both recovery code and any assumptions need to be revisited and
> possibly this is a long term issue.

Agreed.

Thanks,

Kazutaka
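
[Editorial sketch] To make the interleaving in the four bullets above concrete, here is a minimal C sketch of the lost-update race under the stated assumptions (one node A, one copy, node B joining at epoch 2). The names epoch1_obj_on_A and epoch2_obj_on_B are illustrative only and do not correspond to actual sheepdog or Farm data structures; the program simply replays the steps in the problematic order.

    /*
     * Illustration of the lost-update race: recovery copies an epoch-1
     * object before an in-flight write to it has landed.
     */
    #include <stdio.h>
    #include <string.h>

    static char epoch1_obj_on_A[16] = "old";  /* object stored in epoch 1 on node A */
    static char epoch2_obj_on_B[16];          /* copy recovered into epoch 2 on node B */

    int main(void)
    {
            /* 1. A client write against epoch 1 on node A is in flight. */

            /* 2. Node B joins, the epoch is bumped from 1 to 2, and B
             *    recovers the epoch-1 object from A.  Recovery assumes
             *    this epoch-1 object can no longer change. */
            memcpy(epoch2_obj_on_B, epoch1_obj_on_A, sizeof(epoch2_obj_on_B));

            /* 3. Only now does the in-flight write land, updating the
             *    epoch-1 object on A, which was already recovered. */
            strcpy(epoch1_obj_on_A, "new");

            /* 4. A acknowledges the write, yet the epoch-2 copy that the
             *    cluster will serve still holds the old data. */
            printf("acked to client: new, epoch-2 copy on B: %s\n",
                   epoch2_obj_on_B);
            return 0;
    }

Running the sketch prints "acked to client: new, epoch-2 copy on B: old", which is exactly the inconsistency the scenario describes: the acknowledged update exists only in the superseded epoch-1 object and is lost once epoch 2 becomes authoritative.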