[sheepdog] [Sheepdog] [PATCH v2] sheep: update inode cache first

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Wed May 16 05:05:51 CEST 2012


At Wed, 16 May 2012 10:10:48 +0800,
Liu Yuan wrote:
> 
> On 05/16/2012 08:36 AM, MORITA Kazutaka wrote:
> 
> > Even if we limit the object cache only for normal I/O requests, a
> > similar problem still exists.  For example:
> > 
> >  - there are two nodes (A and B) in the Sheepdog cluster
> >  - run a VM with write-cache enabled on node A
> >  - the VM crashes and is restarted on node B
> 
> 
> I think the upper layer (OpenStack fail-over management) should take
> care of flushing the cache; we already have a mechanism to do this:
> qemu-io -c "flush".

IMHO, the upper layer should assume that the underlying one provides
block storage semantics, to simplify the implementation.  And
actually, there is no other block driver in QEMU or OpenStack that
needs a flush request to read the latest data from another node.
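
To make that concrete, the behaviour I have in mind looks roughly like
the sketch below.  It is only an illustration: none of these names
(cache_entry, fetch_inode_from_cluster, and so on) exist in sheep, and
this is not my patch.  The point is just that the gateway can re-read
the inode object from the cluster before trusting the local object
cache, so a VM restarted on another node never sees stale cached data
and the upper layer never has to issue a flush just to read.

/*
 * Hypothetical sketch only -- not the actual sheep code.  Every name
 * here is made up for illustration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct inode_hdr {
	uint32_t vdi_id;
	uint64_t snap_ctime;	/* changes whenever the inode is rewritten */
};

struct cache_entry {
	struct inode_hdr cached_inode;
	bool valid;
};

/* Placeholder for a gateway read of the inode object from the cluster. */
static int fetch_inode_from_cluster(uint32_t vdi_id, struct inode_hdr *out)
{
	(void)vdi_id;
	memset(out, 0, sizeof(*out));
	return 0;
}

/* Placeholder for discarding the locally cached data objects. */
static void drop_cached_objects(struct cache_entry *e)
{
	e->valid = false;
}

/*
 * Called before serving cached reads for a VDI: refresh the inode from
 * the cluster first, and invalidate the local cache if it has moved on.
 */
static int refresh_inode_cache(struct cache_entry *e, uint32_t vdi_id)
{
	struct inode_hdr fresh;
	int ret = fetch_inode_from_cluster(vdi_id, &fresh);

	if (ret < 0)
		return ret;

	if (!e->valid ||
	    memcmp(&fresh, &e->cached_inode, sizeof(fresh)) != 0) {
		/* Another node updated the VDI; our cache is stale. */
		drop_cached_objects(e);
		e->cached_inode = fresh;
		e->valid = true;
	}
	return 0;
}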

> 
> In the worst case, where the host node crashes into wreckage (can't
> boot up again), users who use the cache should tolerate losing the
> dirty updates in the cache, because there is no way to retrieve them.
> Like the page cache and a disk cache (without power-failure
> protection), it doesn't promise anything about the data at all.  Any
> request for persistent data should come with a sync flag.
> 
> >  - shut down the VM on node B and restart it on node A; the VM
> >    could then read the old cache on node A
> > 
> > Restarting VMs on other nodes actually happens in real use cases.
> > For example, OpenStack (open source cloud software) selects the
> > nodes where VMs start up with its own scheduler.  So, IMHO, disabling
> > the object cache for snapshots doesn't solve the fundamental cache
> > coherency problem, and it cannot weaken the restriction that we
> > shouldn't access the same image from different nodes.
> 
> 
> It depends on the use case; I think we can't expect the cache to
> survive every crash.  Data-sensitive applications should run their
> VMs without the cache.  They do have that choice via the QEMU and
> sheep options.

It's okay to discard the cached data when the system crashes; that is
completely correct block device semantics.  The problem is that the VM
can read old (and invalid) data if the cache coherency problem exists.
In my example, the VM can read the old cached data on node A after it
has updated the data on node B.  And in the worst case, the VM can
flush that invalid data from node A back to Sheepdog.
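
One possible safeguard against that worst case is sketched below.
Again, this is only an illustration under assumed names and fields
(cached_generation, cluster_generation(), ...), not what my patch or
the current sheep code does: check that the generation the cache was
built against still matches the cluster's copy of the inode before
pushing dirty objects back, and drop them if it does not, so a stale
cache can never overwrite the newer data written from the other node.

/* Hypothetical sketch only; all names are made up for illustration. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct cached_vdi {
	uint32_t vdi_id;
	uint64_t cached_generation;	/* generation the cache was built against */
	int dirty;			/* non-zero if there are unflushed writes */
};

/* Placeholder: would read the generation from the cluster copy of the inode. */
static uint64_t cluster_generation(uint32_t vdi_id)
{
	(void)vdi_id;
	return 0;
}

static int flush_dirty_cache(struct cached_vdi *v)
{
	if (!v->dirty)
		return 0;

	if (cluster_generation(v->vdi_id) != v->cached_generation) {
		fprintf(stderr, "vdi %" PRIx32 ": cache is stale, dropping dirty objects\n",
			v->vdi_id);
		v->dirty = 0;	/* discard instead of clobbering newer data */
		return -1;
	}

	/* ... push the dirty objects to the cluster here ... */
	v->dirty = 0;
	return 0;
}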

Anyway, we can avoid the problem if we use Sheepdog carefully, and if
users want to use Sheepdog as in my example, they can disable the
object cache with my patch.  That's fine with me.

Thanks,

Kazutaka

> 
> Well, that being said, I am fine with taking your read_object() cache,
> but it should be considered a best effort to get the freshest data;
> when users want a guarantee, they are advised to avoid this problem by
> correctly using the upper-layer tools.
> 
> Thanks,
> Yuan
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog


