[sheepdog-users] sheepdog replication got stuck

Wed Jan 8 06:52:26 CET 2014

On Wed, Jan 08, 2014 at 06:34:17AM +0100, Gerald Richter - ECOS wrote:
> Hi,
> 
> I made a further test regarding the cache size issue.
> 
> When I import only 6 images (120GB) which is less than the free disk space, everything is fine.
> 
> So the problem seems not related to the parallel operation, but all cache operations seem to get on hold when the object cache runs out of disk space. Also it seems that the object cache does not honor the cache size limit.
> 

Okay, I think this is a valid bug in the object cache: as you observed, we
don't honor the size limit. The object cache can outgrow the size specified by
the user under heavy workload.

This is an inherent problem of the current object reclaim algorithm, which
does no direct reclaim, only background reclaim. When the threshold is reached,
a single background reclaimer is triggered and tries to reclaim objects, so
under heavy load new objects can be cached faster than it frees them.
I think what we need is to introduce direct reclaim: when a read/write needs
to allocate a new slot in the cache and the cache is full, we should reclaim
objects directly before allocating the new one.
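
As a rough illustration of the direct reclaim idea, here is a minimal Python
sketch. It is not sheepdog code: the ObjectCache class, its method names, and
the choice of LRU eviction are all assumptions made up for this example. The
point is only that the allocating path itself evicts objects until the new one
fits, so the cache cannot grow past its limit.

```python
from collections import OrderedDict


class ObjectCache:
    """Toy object cache with a hard size limit (hypothetical, not sheepdog code)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # Maps object id -> size in bytes, ordered from least to most
        # recently used (LRU order is an assumption of this sketch).
        self.objects = OrderedDict()

    def _reclaim_one(self):
        # Evict the least recently used object and release its space.
        oid, size = self.objects.popitem(last=False)
        self.used_bytes -= size
        return oid

    def touch(self, oid):
        # Mark an object as recently used on a cache hit.
        self.objects.move_to_end(oid)

    def allocate(self, oid, size):
        """Insert an object, reclaiming synchronously if the cache is full."""
        if size > self.max_bytes:
            raise ValueError("object larger than the whole cache")
        if oid in self.objects:
            # Replacing an existing object: release its old space first.
            self.used_bytes -= self.objects.pop(oid)
        # Direct reclaim: the caller evicts objects itself until the new
        # object fits, instead of waiting for a background reclaimer.
        while self.used_bytes + size > self.max_bytes:
            self._reclaim_one()
        self.objects[oid] = size
        self.used_bytes += size


cache = ObjectCache(max_bytes=10)
cache.allocate("a", 4)
cache.allocate("b", 4)
cache.touch("a")        # "b" is now the least recently used object
cache.allocate("c", 4)  # cache is full, so "b" is reclaimed first
```

A real implementation would also have to flush dirty objects back to the
cluster before evicting them, which this sketch leaves out.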

P.S. I'd suggest you use QEMU v1.7, which supports auto-reconnect and has full
object cache support that honors 'sync' requests from inside the VM.

Thanks
Yuan