On 05/07/2012 03:37 AM, MORITA Kazutaka wrote:
> Though it is important to support a write caching feature in Sheepdog,
> I think we should disable it by default for now because there are
> still some cache coherency problems which look hard to solve:

I disagree with the patch. The QEMU setup already disables the object
cache by default (writethrough mode). If a user explicitly enables it,
he knows better than we do which policy he wants.

> - When we create a CoW object, we call read_copy_from_replica() to
>   read the source object. However, there is no guarantee that the
>   object is up to date if another node caches it.

No, the CoW object is actually coherent with the snapshot object at the
time we take the snapshot. In your scenario it is an offline snapshot,
so the data is guaranteed to be consistent. For a live snapshot we
don't have any problem either. Have you run into any problem with
snapshots? Our production system relies heavily on the snapshot
mechanism, and so far we have not hit any problem with the object
cache.

> - Similarly, there is no guarantee that recover_object_from_replica()
>   can read the latest object when doing object recovery.

Why do we need to keep it up to date all the time during the recovery
stage? VMs interact with the object cache on the local node. All we
need to do is guarantee consistency between replicas. If someone needs
the replicas and the objects in the object cache to be consistent *all
the time*, he is advised to disable the object cache.

> - Some vdi operations don't care about cached data.

Is there any problem? If so, we should solve it.

> - Cached data should be synced periodically. Currently, the data is
>   not synced at all until SD_OP_FLUSH_VDI is requested.

I don't think so. It is simpler and cleaner for us to implement only
the flush mechanism and let the VM determine the flush policy. We
should only guarantee data consistency when we are asked for it. The
only problem we have is: if the VM crashes, what should we do with the
dirty objects?
I thought of a timeout mechanism, but then I dropped it because 1) the
upper layer knows better than we do, and 2) the upper layer can use
qemu-io to issue a flush request or simply restart the VM.

> Let's make it default after it becomes stable and mature.

I don't see any real argument or point for this patch. I think you are
over-defensive and unnecessarily concerned about consistency between
the object cache and cluster storage. If a user wants to use the object
cache (he has to set the writeback flag manually when starting QEMU),
we should trust that he knows what he is doing.

Besides, for a cluster of hundreds of nodes, the system is actually not
usable at all without the object cache:

1) Network bandwidth is the bottleneck, and the membership heartbeat
   messages are affected so heavily that the whole cluster becomes
   completely unstable (membership thrashing).

2) Any single node join/leave event will panic most of the VMs, because
   many VMs periodically issue requests for metadata updates that are
   subject to a timeout inside the VM; a single failure of such a
   request puts the filesystem inside the VM into read-only mode.

Thanks,
Yuan
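
For illustration, the flush-on-request behaviour argued for above can
be modelled roughly as follows. This is only a toy sketch: the class
and method names are hypothetical and are not Sheepdog's actual code,
and a plain dict stands in for the replicated cluster storage. Writes
land in the local cache and are marked dirty; nothing reaches the
cluster until the caller asks for a flush, which plays the role of
SD_OP_FLUSH_VDI here.

```python
# Toy model only: all names are hypothetical, not Sheepdog's API.

class WritebackObjectCache:
    """Local cache that pushes dirty objects to the cluster only on flush()."""

    def __init__(self, cluster):
        self.cluster = cluster   # dict standing in for replicated storage
        self.cache = {}          # oid -> data held on the local node
        self.dirty = set()       # oids written locally but not yet flushed

    def write(self, oid, data):
        # Writeback: update the local copy and remember that it is dirty.
        self.cache[oid] = data
        self.dirty.add(oid)

    def read(self, oid):
        # Serve from the local cache, falling back to the cluster on a miss.
        if oid not in self.cache:
            self.cache[oid] = self.cluster[oid]
        return self.cache[oid]

    def flush(self):
        # Stand-in for a flush request from the VM: push every dirty object.
        flushed = len(self.dirty)
        for oid in sorted(self.dirty):
            self.cluster[oid] = self.cache[oid]
        self.dirty.clear()
        return flushed


cluster = {1: b"old"}
cache = WritebackObjectCache(cluster)
cache.write(1, b"new")
stale = cluster[1]   # the cluster still holds the old data before the flush
cache.flush()
fresh = cluster[1]   # the cluster matches the cache after the flush
```

The window between write() and flush() is exactly where the cluster
copy lags behind the cache. In this model the caller decides how wide
that window is, and a crash before flush() leaves the dirty set
unflushed.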