[Sheepdog] [PATCH] sheep: disable object cache by default

Mon May 7 04:53:31 CEST 2012

On 05/07/2012 03:37 AM, MORITA Kazutaka wrote:

> Though it is important to support a write caching feature in Sheepdog,
> I think we should disable it by default for now because there are
> still some cache coherency problems which look hard to solve:

Disagree with the patch.

The QEMU setup already disables the object cache by default
(writethrough mode). If a user explicitly enables it, he knows better
than we do which policy suits him.

> 
>  - When we create a CoW object, we call read_copy_from_replica() to
>    read the source object.  However, there is no guarantee that the
>    object is up to date if another node caches it.

No, the CoW object is actually coherent with the snapshot object when we
take the snapshot. Your scenario is an offline snapshot, so the data is
guaranteed to be consistent. For a live snapshot, we don't have any
problem either.

Have you hit any problem with snapshots? Our production system relies
heavily on the snapshot mechanism, and so far we have not hit any
problem with the object cache.

>  - Similarly, there is no guarantee that recover_object_from_replica()
>    can read the latest object when doing object recovery.
> 

Why do we need to keep it up to date all the time during the recovery
stage? VMs interact with the object cache on the local node. All we need
to do is guarantee consistency between replicas. If someone needs the
replicas and the objects in the object cache to be consistent *all the
time*, he should disable the object cache.

>  - Some vdi operations don't care about cached data.
> 

Is there a concrete problem? If so, we should solve it.

>  - Cached data should be synced periodically.  Currently, the data is
>    not synced at all until SD_OP_FLUSH_VDI is requested.
> 

I don't think so. It is simpler and cleaner for us to implement only the
flush mechanism and let the VM determine the flush policy. We should
only guarantee data consistency when we are asked for it.
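
To make the flush-on-request policy concrete, here is a minimal sketch
in C. It is not Sheepdog's object cache code: toy_cache, its fields, and
all function names are hypothetical, and a plain array stands in for the
cluster replicas. It only illustrates that in writeback mode nothing
reaches the backend until the upper layer asks for a flush, while in
writethrough mode every write goes straight through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, not Sheepdog's real data structures. */
#define NR_OBJECTS 4
#define OBJ_SIZE   8

struct toy_cache {
    char cached[NR_OBJECTS][OBJ_SIZE];   /* local cached copies */
    char backend[NR_OBJECTS][OBJ_SIZE];  /* stands in for the replicas */
    bool dirty[NR_OBJECTS];              /* cached copy newer than backend */
    bool writeback;                      /* false means writethrough */
};

static void toy_cache_init(struct toy_cache *c, bool writeback)
{
    memset(c, 0, sizeof(*c));
    c->writeback = writeback;
}

/* Write a short string into object idx (truncated to OBJ_SIZE - 1). */
static void toy_cache_write(struct toy_cache *c, size_t idx, const char *data)
{
    assert(idx < NR_OBJECTS);
    strncpy(c->cached[idx], data, OBJ_SIZE - 1);
    c->cached[idx][OBJ_SIZE - 1] = '\0';

    if (c->writeback)
        c->dirty[idx] = true;   /* defer; backend is stale until flush */
    else
        memcpy(c->backend[idx], c->cached[idx], OBJ_SIZE);
}

/* Push every dirty object to the backend; return how many were flushed. */
static size_t toy_cache_flush(struct toy_cache *c)
{
    size_t flushed = 0;

    for (size_t i = 0; i < NR_OBJECTS; i++) {
        if (!c->dirty[i])
            continue;
        memcpy(c->backend[i], c->cached[i], OBJ_SIZE);
        c->dirty[i] = false;
        flushed++;
    }
    return flushed;
}
```

In this model the cache never decides on its own when to sync; the
caller of toy_cache_flush() plays the role of the VM issuing a flush
request.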

The only problem we have is what to do with the dirty objects if the VM
crashes. I thought of a timeout mechanism, but then I dropped it
because

1) the upper layer knows better than we do
2) the upper layer can use qemu-io to issue a flush request, or simply
restart the VM.

> Let's make it default after it becomes stable and mature.

I don't see any real argument or point for this patch. I think you are
being over-defensive and unnecessarily concerned about consistency
between the object cache and cluster storage. If a user wants to use the
object cache (he has to manually set the writeback flag when starting up
QEMU), we should trust that he knows what he is doing.

Besides, for a cluster of hundreds of nodes, the system is actually not
usable at all without the object cache.

1) network bandwidth is the bottleneck, and the membership heartbeat
messages will be affected so heavily that the whole cluster becomes
completely unstable (membership thrashing)
2) any single node join/leave event will panic most of the VMs, because
many VMs periodically issue requests for metadata updates, these
requests are timed out internally by the VM, and a single failure of
such a request will put the filesystem inside the VM into read-only
mode.

Thanks,
Yuan