[Sheepdog] [PATCH v2] sheep: update inode cache first

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Wed May 16 02:36:10 CEST 2012


At Mon, 14 May 2012 11:51:14 +0800,
Liu Yuan wrote:
> 
> On 05/11/2012 10:59 PM, MORITA Kazutaka wrote:
> 
> >> I have tried 'qemu-img convert -t writeback' (NOT writethrough); it
> >> works well with qemu-img snapshot.
> >>
> >> So what is the point of using unsafe mode for 'qemu-img convert' with
> >> Sheepdog?
> >>
> >> I'd suggest patching our README to use 'qemu-img convert -t
> >> writethrough'. How about it?
> > I believe we can solve the above problem without it.
> > 
> > Note that there still exists a problem; if the cached data is on
> > another node, it is really difficult to read the latest data.  You may
> > think this is a corner case, but we should pay attention to it as long
> > as we call Sheepdog block storage.  My suggestion is to document the
> > restriction (e.g. we shouldn't access the same image from different
> > nodes, even at different times) instead of solving this difficult
> > problem.
> 
> 
> So it seems that even with the modification to read_object(), we can't
> solve this corner case. This restriction is very expensive for the
> upper layer to afford; how would it know where the old snapshot was
> created?
> 
> Mixing snapshot creation with any kind of cache is insane, resulting in
> lots of code to maintain consistency, but that code will not be
> executed at all most of the time (at least, cautious usage can avoid it
> completely). That is why I insisted on fixing the upper layer first,
> and then seeing what was left to be done in sheepdog.
> 
> I am wondering if we can patch QEMU's sd_snapshot_create() so that it
> never uses cached requests at all for snapshot creation (ignoring the
> qemu cache mode). Initially, I only used the cache for normal I/O
> requests; a later review asked me to patch all the code paths to use
> the cache flag. Maybe it is time for us to revisit the qemu block layer
> to see if we really need the cache flag for all the code paths,
> including snapshot, migration, etc.

Even if we limit the object cache to normal I/O requests, a similar
problem still exists.  For example:

 - there are two nodes (A and B) in the Sheepdog cluster
 - a VM runs with write-cache enabled on node A
 - the VM crashes and is restarted on node B
 - the VM is shut down on node B and restarted on node A; the VM can
   then read the old cache on node A (see the sketch below)
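
To make the last step concrete, here is a toy sketch in plain C
(hypothetical names, not Sheepdog code): a per-node write-back cache in
front of a shared store, with no cross-node invalidation.

#include <stdio.h>
#include <string.h>

#define OID 1

struct node_cache {
    int cached_oid;                 /* -1 means nothing cached */
    char data[32];
};

static char shared_store[32];       /* the shared object store */
static struct node_cache node_a = { -1, "" };
static struct node_cache node_b = { -1, "" };

/* write-back: update the local cache, then (eventually) the store */
static void vm_write(struct node_cache *n, const char *buf)
{
    n->cached_oid = OID;
    strcpy(n->data, buf);
    strcpy(shared_store, buf);
}

/* read: prefer the local cache; nothing says the entry is stale */
static const char *vm_read(struct node_cache *n)
{
    if (n->cached_oid == OID)
        return n->data;             /* may be stale */
    return shared_store;
}

int main(void)
{
    vm_write(&node_a, "v1");        /* VM writes on node A */
    vm_write(&node_b, "v2");        /* VM crashed, restarted on B, writes */
    /* VM is shut down on B and restarted on A */
    printf("node A reads \"%s\", cluster has \"%s\"\n",
           vm_read(&node_a), shared_store);   /* "v1" vs "v2" */
    return 0;
}

Nothing invalidates node A's entry when node B writes the same object,
which is exactly the stale read described above.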

Restarting VMs on other nodes actually happens in real use cases.  For
example, OpenStack (open source cloud software) selects the nodes where
VMs start up with its own scheduler.  So, IMHO, disabling the object
cache for snapshots does not solve the fundamental cache coherency
problem, and it cannot weaken the restriction that we shouldn't access
the same image from different nodes.

With my read_object() modification, the object cache feature seems to
work well as long as we don't write data to the same object from
different nodes.  I think it is better to document the restriction
somewhere and leave this problem untouched for now, even though the
restriction is expensive.
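
To illustrate that condition with the toy model above (again
hypothetical, not the actual read_object() patch): if a read that
misses the local cache always falls back to the cluster, a node that
has never written, and therefore never cached, the object sees the
latest flushed data; stale reads remain possible only on a node that
wrote and cached the object itself earlier.

/* read-through variant of vm_read() from the sketch above */
static const char *vm_read_through(struct node_cache *n)
{
    if (n->cached_oid == OID)
        return n->data;             /* stale only if this node wrote before */
    return shared_store;            /* cache miss: fetch the latest data */
}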

Thanks,

Kazutaka


