[Sheepdog] [PATCH 2/2] object cache: introduce async flush

MORITA Kazutaka morita.kazutaka at gmail.com
Fri Apr 6 17:19:40 CEST 2012


At Fri, 06 Apr 2012 17:22:51 +0800,
Liu Yuan wrote:
> 
> On 04/05/2012 11:09 PM, MORITA Kazutaka wrote:
> 
> > At Wed, 04 Apr 2012 01:19:10 +0800,
> > Liu Yuan wrote:
> >>
> >> On 04/04/2012 01:00 AM, MORITA Kazutaka wrote:
> >>> At Mon,  2 Apr 2012 16:21:11 +0800,
> >>> Liu Yuan wrote:
> >>>>
> >>>> From: Liu Yuan <tailai.ly at taobao.com>
> >>>>
> >>>> We flush dirty objects asynchronously by default to achieve the best
> >>>> performance.  If users prefer strong consistency over performance, they
> >>>> can launch sheep with the -S or --sync option.
> >>>>
> >>>> We need async flush because:
> >>>> 	1) Some applications are response-time sensitive; writing back dirty
> >>>> 	data from the guest hurts response time because the guest has to wait
> >>>> 	for the writeback to complete, which is a considerably long operation
> >>>> 	in the sheep cluster.
> >>>> 	2) Some applications are just memory and CPU intensive and have little
> >>>> 	concern for disk data (e.g. they only use the disk to store logs).
> >>>> 	3) People simply prefer performance over consistency.
> >>>
> >>> Sheepdog is block device storage.  This kind of feature must NOT be the
> >>> default.  In addition, we had better show a warning about the risk of
> >>> reading old data, which could cause filesystem corruption, when users
> >>> enable this feature.
> >>>
> >>
> >> Okay, I'll submit a patch to make it optional.
> >>
> >> But in what way do we risk reading stale data?  Guests always read
> >> objects from the cache when the cache is enabled, IMO.
> > 
> > If the gateway node fails, the data the guest believes it has flushed
> > would be lost.
> > 
> 
> 
> Yes, but a disk is not volatile media like memory, so a failure doesn't
> necessarily mean data loss.  In the normal case where the node reboots
> and the disk is okay, we can flush the data again.

If a disk failure happens, the data the guest believes it has flushed
would be lost.  That means local disks would be SPOFs, and Sheepdog
should be a no-SPOF system.


> 
> I guess maybe we really need async flush for putting cluster to good use
> currently, because
> 
> 1) I noticed that sync requests are mostly issued by the file system's
> metadata updates, which are very harsh on errors: a single EIO is enough
> to make the file system remount itself read-only.  This is evidenced by
> our test cluster: while the cluster is recovering, flush requests are
> quite likely to fail, and in sync flush mode that returns EIO to the
> guest, leaving a large number of guests remounted read-only.
> 
> 2) The current object cache flushes objects in a very coarse unit (4 MB),
> which makes the flush operation slower than necessary.  This can be
> mitigated later if we support flushing in finer units by adding a more
> complex data structure to manage the dirty data.

It is easy to get higher performance at the expense of reliability, but
that is not the goal of Sheepdog.  We should find a way to improve
Sheepdog without losing reliability.
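
For reference, the finer-grained flushing mentioned in (2) looks like a
better direction to me, because it speeds up flushes without giving up
reliability.  A rough sketch of per-object dirty tracking, only to
illustrate the idea (the structure and names below are made up for this
mail, not taken from the current sheep code):

/*
 * Illustrative sketch only -- not the current sheep code.  Track which
 * 4 KB blocks of a cached 4 MB object are dirty, so that a flush can
 * write back only the dirty blocks instead of the whole object.
 */
#include <stdint.h>

#define CACHE_OBJECT_SIZE	(4 * 1024 * 1024)
#define CACHE_BLOCK_SIZE	(4 * 1024)
#define CACHE_BLOCK_COUNT	(CACHE_OBJECT_SIZE / CACHE_BLOCK_SIZE)
#define BITS_PER_LONG		(8 * sizeof(unsigned long))

struct object_cache_entry {
	uint64_t oid;		/* cached object id */
	/* one bit per 4 KB block of the cached object */
	unsigned long dirty_bitmap[CACHE_BLOCK_COUNT / BITS_PER_LONG];
};

/* Mark the blocks touched by a write as dirty. */
static void cache_mark_dirty(struct object_cache_entry *entry,
			     uint64_t offset, uint64_t len)
{
	uint64_t first, last, i;

	if (!len)
		return;

	first = offset / CACHE_BLOCK_SIZE;
	last = (offset + len - 1) / CACHE_BLOCK_SIZE;

	for (i = first; i <= last; i++)
		entry->dirty_bitmap[i / BITS_PER_LONG] |=
			1UL << (i % BITS_PER_LONG);
}

A flush would then walk the bitmap, write back only the dirty blocks, and
clear the bits once they have reached the cluster.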


> 
> That being said, I suggest making async flushing the default for now.

If you explain the risk in the log messages, I'm not against introducing
this feature as an optional one, but I disagree with making it the default.
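
For example, something along the following lines would be enough for me;
the flag name and the wording of the message below are only illustrative,
not a requirement on the actual patch:

/*
 * Illustrative only: keep sync flush (writethrough) as the default and
 * print a warning when the user explicitly enables async flush.
 */
#include <stdbool.h>
#include <stdio.h>

static bool async_flush;	/* false by default: sync flush */

static void enable_async_flush(void)
{
	async_flush = true;
	fprintf(stderr, "WARNING: async flush enabled; if the gateway node or "
		"its local disk fails, guests may read stale data and their "
		"file systems may be corrupted.\n");
}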


Thanks,

Kazutaka


