[Sheepdog] [PATCH] sheep: disable object cache by default

Mon May 7 19:48:12 CEST 2012

At Mon, 07 May 2012 10:53:31 +0800,
Liu Yuan wrote:
> 
> On 05/07/2012 03:37 AM, MORITA Kazutaka wrote:
> 
> > Though it is important to support a write caching feature in Sheepdog,
> > I think we should disable it by default for now because there are
> > still some cache coherency problems which look hard to solve:
> 
> 
> Disagree with the patch.
> 
> QEMU already disables the object cache by default (writethrough mode).
> If a user explicitly enables it, he knows better than we do which policy
> suits him.

qemu-img uses cache=unsafe or cache=writeback by default, so users may
enable it implicitly.

> 
> > 
> >  - When we create a CoW object, we call read_copy_from_replica() to
> >    read the source object.  However, there is no guarantee that the
> >    object is up to date if another node caches it.
> 
> 
> No, the COW object is actually coherent with the snapshot object when we
> do a snapshot. In your scenario, it is an offline snapshot, so the data is
> guaranteed to be consistent. For a live snapshot, we don't have any
> problem either.
> 
> Have you hit any problem with snapshots? Our production system relies
> heavily on the snapshot mechanism, and so far we have not hit any problem
> with the object cache.
> 
> >  - Similarly, there is no guarantee that recover_object_from_replica()
> >    can read the latest object when doing object recovery.
> > 
> 
> 
> Why do we need to keep it up to date during recovery all the time? VMs
> interact with the object cache on the local node. All we need to do is
> guarantee consistency between replicas. If one needs the replicas and the
> object in the object cache to be consistent *all the time*, one should
> disable the object cache.

Sorry, I was a bit confused.  The first two are not real problems.

> 
> >  - Some vdi operations don't care about cached data.
> > 
> 
> 
> Any problem? If so, we'd solve it.

For example:

  $ qemu-img convert ~/linux.img sheepdog:linux
  $ collie vdi read linux 0 512 | hd | head -1
  00000000  fa eb 7c 6c 62 61 4c 49  4c 4f 01 00 15 04 09 00  |..|lbaLILO......|
  $ qemu-img snapshot -c snap sheepdog:linux
  $ collie vdi read linux 0 512 | hd | head -1
  00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

We cannot read the correct data after snapshotting because the
snapshot operation doesn't read the cached data.

  $ qemu-img create sheepdog:test 1G
  Formatting 'sheepdog:test', fmt=raw size=1073741824 
  $ qemu-img snapshot -c snap sheepdog:test
  $ collie vdi list
    Name        Id    Size    Used  Shared    Creation time   VDI id  Tag
    test         1  1.0 GB  0.0 MB  0.0 MB 2012-05-08 01:35   7c2b25  snap
    test         2  1.0 GB  0.0 MB  0.0 MB 2012-05-08 01:35   7c2b26

The first 'test' is not marked as a snapshot because 'collie vdi list'
doesn't take the cached vdi object into account.

I can imagine many other similar examples.
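
The pattern behind both examples can be sketched with a toy model.  This
is not Sheepdog code: the class names, method names and the 4-byte object
size are all hypothetical, chosen only to show how an operation that
bypasses a write-back cache sees stale data until a flush happens.

```python
# Toy model only: hypothetical names, not Sheepdog's actual code.

class Cluster:
    """Replicated object store, reduced to a dict of object id -> bytes."""

    def __init__(self):
        self.objects = {}

    def read(self, oid):
        # Objects that were never written read back as zeros.
        return self.objects.get(oid, b"\x00" * 4)

    def write(self, oid, data):
        self.objects[oid] = data


class WritebackCache:
    """Per-node cache that holds writes until flush() is called."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.dirty = {}

    def write(self, oid, data):
        # The write stays local; the cluster is not updated yet.
        self.dirty[oid] = data

    def read(self, oid):
        if oid in self.dirty:
            return self.dirty[oid]
        return self.cluster.read(oid)

    def flush(self):
        # Push every dirty object to the cluster, then forget it.
        for oid, data in self.dirty.items():
            self.cluster.write(oid, data)
        self.dirty.clear()


cluster = Cluster()
cache = WritebackCache(cluster)

cache.write("obj0", b"LILO")          # the VM writes through the cache
via_cache = cache.read("obj0")        # a cache-aware reader sees new data
bypass_before = cluster.read("obj0")  # a cache-unaware operation sees zeros
cache.flush()
bypass_after = cluster.read("obj0")   # after a flush both paths agree
```

In this model, 'bypass_before' plays the role of 'collie vdi read' after
the snapshot above: it goes straight to the cluster and misses the dirty
data that only the cache holds.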

> 
> >  - Cached data should be synced periodically.  Currently, the data is
> >    not synced at all until SD_OP_FLUSH_VDI is requested.
> > 
> 
> 
> I don't think so. It is simpler and cleaner for us to implement only the
> flush mechanism and let the flush policy be determined by the VM. We
> should only guarantee data consistency when we are asked for it.

This is a better way to reduce the cost of each flush than the current
async flush implementation, which doesn't replicate data on each flush
at all.

> 
> The only problem we have is what to do with the dirty objects if the VM
> crashes. I thought of a timeout mechanism, but then I dropped it
> because
> 
> 1) upper layer knows better than us
> 2) upper layer can make use of qemu-io to issue flush request or simply
> restart the VM.
> 
> > Let's make it default after it becomes stable and mature.
> 
> 
> I don't see any real argument for this patch. I think you are being
> over-defensive and unnecessarily concerned about consistency between the
> object cache and cluster storage. If a user wants to use the object cache
> (he has to manually set the writeback flag when starting QEMU), we should
> trust that he knows what he is doing.

Even if users don't want to use the object cache, they enable it if
they don't pass '-t writethrough' to qemu-img explicitly.

> 
> Besides, for a cluster of hundreds of nodes, the system is actually not
> usable at all without the object cache.
> 
> 1) network bandwidth is the bottleneck, and the membership heartbeat
> messages are affected so heavily that the whole cluster becomes
> completely unstable (membership thrashing)
> 2) any single node join/leave event will panic most of the VMs, because
> many VMs periodically issue requests for metadata updates that the VM
> times out internally, and a single failure of such a request puts the
> filesystem inside the VM into read-only mode.

I also think a cache feature is necessary for production use.  I'm
just saying that we should disable the current object cache by default
for now.  To make it the default, at least the following are necessary:

 - We should provide documentation about the object cache.  Otherwise,
   users cannot know how to specify the cache options of qemu and
   qemu-img.  It is not obvious that we must pass '-t writethrough' to
   qemu-img to disable the object cache.

 - The cache feature must take into account that users may specify
   different cache modes in some cases.  Currently, the object
   cache doesn't handle this well.

Thanks,

Kazutaka