[sheepdog] Users inputs for new reclaim algorithm, please

Mon Mar 17 14:43:25 CET 2014

On Mon, Mar 17, 2014 at 10:16:09PM +0900, Hitoshi Mitake wrote:
> At Mon, 17 Mar 2014 16:12:03 +0800,
> Liu Yuan wrote:
> > 
> > Hi all,
> > 
> >    I think this will be a big topic regarding the new deletion algorithm, which
> > is currently being undertaken by Hitoshi.
> > 
> >  The motivation is very well explained as follows:
> > 
> >  $ dog vdi create image
> >  $ dog vdi write image < some_data
> >  $ dog vdi snapshot image -s snap1
> >  $ dog vdi write image < some_data
> >  $ dog vdi delete image            <- this doesn't reclaim the objects
> >                                          of the image
> >  $ dog vdi delete image -s snap1   <- this reclaims all the data objects
> >                                          of both image and image:snap1
> > 
> > Simply put, we use a simple and stupid algorithm: only when all the vdis on
> > the snapshot chain are deleted is the space released.
> > 
> > The new algorithm adds more complexity to handle this problem, but also
> > introduces a new big problem.
> > 
> > With new algorithm,
> > 
> >  $ dog vdi create image
> >  $ dog vdi snapshot image -s snap1
> >  $ dog vdi clone -s snap1 image clone
> >  $ dog vdi delete clone  <-- surprisingly, this operation won't release
> >                              space but will instead increase space usage.
> > 
> > The following is a real case: we can see that deleting a clone, which uses
> > 316MB of space, actually causes 5.2GB more space to be used.
> 
> The patchset v7 consumes that much disk space for ledger objects
> because of an incorrect implementation of sparse objects. It was just a
> bug. The latest snapshot-object-reclaim branch has the fix.
> 
> > 
> > yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi list
> >   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
> > c clone        0   40 GB  316 MB  1.5 GB 2014-03-17 14:35   72a1e2    2:2              
> > s test         1   40 GB  1.8 GB  0.0 MB 2014-03-17 14:16   7c2b25    2:2              
> >   test         0   40 GB  0.0 MB  1.8 GB 2014-03-17 14:34   7c2b26    2:2              
> > yliu@ubuntu-precise:~/sheepdog$ dog/dog node info
> > Id	Size	Used	Avail	Use%
> >  0	39 GB	932 MB	38 GB	  2%
> >  1	39 GB	878 MB	38 GB	  2%
> >  2	39 GB	964 MB	38 GB	  2%
> >  3	39 GB	932 MB	38 GB	  2%
> >  4	39 GB	876 MB	38 GB	  2%
> >  5	39 GB	978 MB	38 GB	  2%
> > Total	234 GB	5.4 GB	229 GB	  2%
> > 
> > Total virtual image size	80 GB
> > yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi delete clone
> > yliu@ubuntu-precise:~/sheepdog$ dog/dog node info
> > Id	Size	Used	Avail	Use%
> >  0	34 GB	1.7 GB	33 GB	  4%
> >  1	34 GB	1.7 GB	33 GB	  4%
> >  2	35 GB	1.9 GB	33 GB	  5%
> >  3	35 GB	1.8 GB	33 GB	  5%
> >  4	35 GB	1.8 GB	33 GB	  5%
> >  5	35 GB	1.9 GB	33 GB	  5%
> > Total	208 GB	11 GB	197 GB	  5%
> > 
> > Total virtual image size	40 GB
> > yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi list
> >   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
> > s test         1   40 GB  1.8 GB  0.0 MB 2014-03-17 14:16   7c2b25    2:2              
> >   test         0   40 GB  0.0 MB  1.8 GB 2014-03-17 14:34   7c2b26    2:2              
> > 
> > With the old algorithm, the clone's 316MB would be released without any problem.
> > 
> > I think this is a very important issue for the following use case:
> > 
> >  - suppose you are providing VM services with pre-defined images as bases
> >  - these pre-defined images are actually snapshots in sheepdog and you
> >    seldom delete them
> >  - VM instances are provided by the clone operation
> >  - since VM instances are all created on demand, they are likely to be
> >    released or recreated very often.
> > 
> > So if you have this usage in mind, you can expect a catastrophic problem:
> >  - frequent release and creation of cloned instances will put much more
> >    space pressure on you.
> >  - when space is near the low watermark, you are not allowed to delete clones
> >    because deletion will actually increase space usage and end up destroying
> >    your cluster. You have no choice but to either add more nodes, or deny
> >    creation of new clones and never try to delete clones later.
> > 
> > Any ideas?
> 
> Letting a sheepdog cluster run with little free disk space is really
> dangerous, because the death of a few nodes can exhaust the remaining space
> and kill the entire cluster. At least, our team has a guideline that a
> sheepdog cluster should run with enough free disk space (ideally, 50%).
> 

I don't buy this idea. Sheepdog is said to be a cheap storage solution, and if
we can only run safely at 50% capacity, it means that we double the storage
cost. More importantly, any storage system that can't reclaim space by deletion
once it is nearly full should be considered unacceptable.

Suppose you are writing a file system and telling people they should not fill
it more than 50%, and that if they do, say up to 90%, the file system is dead:
they can't reclaim any space at all by deletion and, worse, any deletion
will destroy the file system.

The bottom line is that deletion should reduce space consumption instead of
increasing it, because for any user and any system, deletion *means*
reclaiming space. We should never break this intuition.

> If admins have to delete VDIs to free up disk space, it means that
> they have already committed a serious fault.

As I commented above, the serious fault to me is that I can't reclaim space
by simply deleting existing vdis.

>
> Disk space shortage should be
> resolved by adding more disks/nodes or avoided by controlling the number
> of VDIs and their size (including snapshots, clones).
> 
> As an emergency solution, I can implement VDI family deletion which
> requires no additional disk space (almost every process of the
> deletion would be done in dog). With this solution and qemu-img
> convert, users can free disk space in a safe manner.
> 

We need this dirty workaround only because of the new algorithm. So what do we
gain from it, while introducing more and more problems (performance degradation,
the deletion problem and other unseen problems)? We just solve one problem: people
want to free space after snapshots are deleted!

Note that, ironically, the initial motivation was to reclaim space on snapshot
deletion, but with this new algorithm we actually can't reclaim space in a much
broader use case: deletion of clones is forbidden once storage is out of space!

Please think twice about the new algorithm. Do we need it just because we wrote
it, rather than because of real needs....

There might be some users who need this new algorithm for their specific usage,
but I'd suggest that we:

 1 make the old algorithm the default reclaim algorithm
 2 modularize the reclaim algorithm and add the new algorithm as an option for
   users. In this way, we can improve the new algorithm step by step and
   possibly introduce more algorithms to meet various needs.
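
To make the two suggestions a bit more concrete, here is a minimal sketch in
Python (not sheepdog code) of what a pluggable reclaim layer could look like.
Everything in it is an assumption for illustration: the names RECLAIMERS,
register, get_reclaimer, "chain" and "eager" are hypothetical, a vdi family is
modelled as a plain dict of object ids, and the "eager" entry is only a
placeholder for a finer-grained algorithm, not the ledger-based patchset.

```python
from typing import Callable, Dict, Set

# Hypothetical model, not sheepdog's data structures: a family maps each
# vdi name to the set of data object ids it references.
Family = Dict[str, Set[int]]
# A reclaimer takes the family and the set of deleted vdi names, and
# returns the object ids that may be released.
Reclaimer = Callable[[Family, Set[str]], Set[int]]

RECLAIMERS: Dict[str, Reclaimer] = {}
DEFAULT_RECLAIMER = "chain"  # suggestion 1: the old rule stays the default


def register(name: str) -> Callable[[Reclaimer], Reclaimer]:
    """Register a reclaim algorithm under a user-selectable name."""
    def decorator(fn: Reclaimer) -> Reclaimer:
        RECLAIMERS[name] = fn
        return fn
    return decorator


@register("chain")
def reclaim_chain(family: Family, deleted: Set[str]) -> Set[int]:
    """Old rule as described in this thread: release nothing until every
    vdi on the chain is deleted, then release everything."""
    if deleted >= set(family):
        return set().union(*family.values())
    return set()


@register("eager")
def reclaim_eager(family: Family, deleted: Set[str]) -> Set[int]:
    """Placeholder for a finer-grained algorithm: release any object that
    no live vdi references. This is NOT the ledger-based patchset."""
    live: Set[int] = set()
    dead: Set[int] = set()
    for name, objects in family.items():
        if name in deleted:
            dead |= objects
        else:
            live |= objects
    return dead - live


def get_reclaimer(name: str = DEFAULT_RECLAIMER) -> Reclaimer:
    """Look up an algorithm by name; unknown names raise KeyError."""
    return RECLAIMERS[name]


# The image/snap1 example from the top of this thread, with made-up object ids.
family = {"image": {1, 2, 3}, "image:snap1": {1, 2}}
print(get_reclaimer()(family, {"image"}))                 # set()
print(get_reclaimer()(family, {"image", "image:snap1"}))  # {1, 2, 3}
print(get_reclaimer("eager")(family, {"image"}))          # {3}
```

The point of the sketch is only the shape: the default never changes behaviour
for existing users, and a new algorithm is one more registered entry that can
be improved or replaced without touching the others.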

Thanks
Yuan