[sheepdog-users] Users inputs for new reclaim algorithm, please

Tue Mar 18 14:08:48 CET 2014

I have to agree with Yuan here.  If I'm deleting something, I expect 
that space to be available for reuse elsewhere, not end up taking 
several times the deleted amount.
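
As a rough check of "several times", the per-node figures from the
`dog node info` output quoted below can be summed before and after the clone
deletion. This is only a back-of-the-envelope sketch in Python: the numbers are
copied as rounded by dog, and treating 1 GB as 1024 MB is an assumption, so the
result is approximate.

```python
# Per-node "Used" values copied from the quoted `dog node info` output.
# dog prints rounded values, so everything here is approximate.
MB_PER_GB = 1024  # assumption: dog reports binary units

used_before_mb = [932, 878, 964, 932, 876, 978]   # before `dog vdi delete clone`
used_after_gb = [1.7, 1.7, 1.9, 1.8, 1.8, 1.9]    # after the deletion
clone_used_mb = 316                                # "Used" column of the clone


def growth_after_delete(before_mb, after_gb):
    """Return how much cluster-wide usage grew, in MB."""
    return sum(after_gb) * MB_PER_GB - sum(before_mb)


growth_mb = growth_after_delete(used_before_mb, used_after_gb)
ratio = growth_mb / clone_used_mb

print(f"usage grew by about {growth_mb / MB_PER_GB:.1f} GB")
print(f"that is roughly {ratio:.0f}x the size of the deleted clone")
```

The exact figure depends on rounding (the quoted message puts it at 5.2GB), but
either way the growth is far larger than the 316 MB clone that was deleted.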

On 03/17/2014 09:43 AM, Liu Yuan wrote:
> On Mon, Mar 17, 2014 at 10:16:09PM +0900, Hitoshi Mitake wrote:
>> At Mon, 17 Mar 2014 16:12:03 +0800,
>> Liu Yuan wrote:
>>> Hi all,
>>>
>>>     I think this would be a big topic regarding the new deletion algorithm, which is
>>> currently being undertaken by Hitoshi.
>>>
>>>   The motivation is very well explained as follows:
>>>
>>>   $ dog vdi create image
>>>   $ dog vdi write image < some_data
>>>   $ dog vdi snapshot image -s snap1
>>>   $ dog vdi write image < some_data
>>>   $ dog vdi delete image            <- this doesn't reclaim the objects
>>>                                           of the image
>>>   $ dog vdi delete image -s snap1   <- this reclaims all the data objects
>>>                                           of both image and image:snap1
>>>
>>> Simply put, we use a simple and stupid algorithm: the space is released only
>>> when all the vdis on the snapshot chain are deleted.
>>>
>>> The new algorithm adds more complexity to handle this problem, but also introduces
>>> a big new problem.
>>>
>>> With the new algorithm,
>>>
>>>   $ dog vdi create image
>>>   $ dog vdi snapshot image -s snap1
>>>   $ dog vdi clone -s snap1 image clone
>>>   $ dog vdi delete clone  <-- surprisingly, this operation won't release
>>>                               space but will instead increase space usage.
>>>
>>> The following is a real case: we can see that deleting a clone, which uses 316MB
>>> of space, will actually cause 5.2GB more space to be used.
>> Patchset v7 consumes that amount of disk space for ledger objects
>> because of an incorrect implementation of sparse objects. It was just a
>> bug. The latest snapshot-object-reclaim branch has the fix.
>>
>>> yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi list
>>>    Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
>>> c clone        0   40 GB  316 MB  1.5 GB 2014-03-17 14:35   72a1e2    2:2
>>> s test         1   40 GB  1.8 GB  0.0 MB 2014-03-17 14:16   7c2b25    2:2
>>>    test         0   40 GB  0.0 MB  1.8 GB 2014-03-17 14:34   7c2b26    2:2
>>> yliu@ubuntu-precise:~/sheepdog$ dog/dog node info
>>> Id	Size	Used	Avail	Use%
>>>   0	39 GB	932 MB	38 GB	  2%
>>>   1	39 GB	878 MB	38 GB	  2%
>>>   2	39 GB	964 MB	38 GB	  2%
>>>   3	39 GB	932 MB	38 GB	  2%
>>>   4	39 GB	876 MB	38 GB	  2%
>>>   5	39 GB	978 MB	38 GB	  2%
>>> Total	234 GB	5.4 GB	229 GB	  2%
>>>
>>> Total virtual image size	80 GB
>>> yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi delete clone
>>> yliu@ubuntu-precise:~/sheepdog$ dog/dog node info
>>> Id	Size	Used	Avail	Use%
>>>   0	34 GB	1.7 GB	33 GB	  4%
>>>   1	34 GB	1.7 GB	33 GB	  4%
>>>   2	35 GB	1.9 GB	33 GB	  5%
>>>   3	35 GB	1.8 GB	33 GB	  5%
>>>   4	35 GB	1.8 GB	33 GB	  5%
>>>   5	35 GB	1.9 GB	33 GB	  5%
>>> Total	208 GB	11 GB	197 GB	  5%
>>>
>>> Total virtual image size	40 GB
>>> yliu@ubuntu-precise:~/sheepdog$ dog/dog vdi list
>>>    Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
>>> s test         1   40 GB  1.8 GB  0.0 MB 2014-03-17 14:16   7c2b25    2:2
>>>    test         0   40 GB  0.0 MB  1.8 GB 2014-03-17 14:34   7c2b26    2:2
>>>
>>> With the old algorithm, the clone's 316MB would be released without posing any problem.
>>>
>>> I think this is a very important issue for the following use case:
>>>
>>>   - suppose you are providing VM services with pre-defined images as bases
>>>   - these pre-defined images are actually snapshots in sheepdog and
>>>     you seldom delete them
>>>   - VM instances are provided by the clone operation
>>>   - since VM instances are all created on demand, they are likely to be released
>>>     or recreated very often.
>>>
>>> So if you have this usage in mind, you can expect a catastrophic problem:
>>>   - frequent release and creation of cloned instances will put much more space
>>>     pressure on you.
>>>   - when space is near the low watermark, you are not allowed to delete clones because
>>>     deletion will actually increase space usage and end up destroying your cluster.
>>>     You have no choice but to either add more nodes or deny creation of new clones
>>>     and never try to delete clones later.
>>>
>>> Any ideas?
>> Letting a sheepdog cluster run with little free disk space is really
>> dangerous, because the death of a few nodes can exhaust the remaining space
>> and kill the entire cluster. At least, our team has a guideline that a sheepdog
>> cluster should run with enough free disk space (ideally, 50%).
>>
> I don't buy this idea. Sheepdog is said to be a cheap storage solution and if
> we can only run safely at 50% capacity, it means that we double the storage
> cost. More importantly, any storage system that can't reclaim space through
> deletion when it is nearly full should be considered unacceptable.
>
> Suppose you are writing a file system and telling people they should not fill
> it more than 50%, and that if they do, say we reach 90%, the file system is dead:
> you can't reclaim any space at all by deleting from it and, worse, any deletion
> will destroy the file system.
>
> The bottom line is that deletion should reduce space consumption instead of
> increasing it, because for any user and any system, deletion
> *means* reclaiming space. We should never try to break this intuition.
>
>> If admins have to delete VDIs to free up disk space, it means that
>> they have already committed a serious fault.
> As commented above, it is a serious fault to me that I can't reclaim space
> by simply deleting existing vdis.
>
>> Disk space shortage should be
>> resolved by adding more disks/nodes or avoided by controlling the number
>> of VDIs and their size (including snapshots, clones).
>>
>> As an emergency solution, I can implement VDI family deletion which
>> requires no additional disk space (almost every process of the
>> deletion would be done in dog). With this solution and qemu-img
>> convert, users can free disk space in a safe manner.
>>
> We need this dirty workaround only because of the new algorithm. So what do we gain from
> it by introducing more and more problems (performance degradation, the deletion problem
> and some other unseen problems)? We just solve the problem that people want to free
> space after snapshots are deleted!
>
> Note, ironically, the initial motivation was to reclaim space on snapshot deletion.
> But with this new algorithm, we actually can't reclaim space in a broader
> use case: deletion of clones is forbidden once you find storage is out of space!
>
> Please think twice about the new algorithm, and whether we need it only because
> we wrote it rather than because of real needs....
>
> There might be some users who need this new algorithm for their specific usage, but
> I'd suggest that we:
>
>   1 make the old algorithm the default reclaim algorithm
>   2 modularize the reclaim algorithm and add the new algorithm as an option for users.
>     In this way, we can improve the new algorithm step by step and possibly
>     introduce more algorithms to meet various needs.
>
> Thanks
> Yuan
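
Yuan's two suggestions could look roughly like the following. This is a toy
Python model, not sheepdog code: `Vdi`, `ReclaimPolicy`, `ChainReclaim`, and
`delete_vdi` are hypothetical names invented for illustration, and the sizes
are made up. The only behaviour taken from the thread is the old rule that
space is released once every vdi on the snapshot chain has been deleted.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Vdi:
    name: str
    used_mb: int          # space held by this vdi's own data objects
    deleted: bool = False


class ReclaimPolicy(Protocol):
    """Hypothetical plug-in point: returns the MB freed after a deletion."""
    def reclaim(self, chain: List[Vdi]) -> int: ...


class ChainReclaim:
    """Old rule from the thread: free nothing until the whole chain is deleted."""
    def reclaim(self, chain: List[Vdi]) -> int:
        if all(v.deleted for v in chain):
            freed = sum(v.used_mb for v in chain)
            chain.clear()
            return freed
        return 0


def delete_vdi(chain: List[Vdi], name: str,
               policy: ReclaimPolicy = ChainReclaim()) -> int:
    """Mark `name` as deleted and let the selected policy decide what to free."""
    for v in chain:
        if v.name == name:
            v.deleted = True
    return policy.reclaim(chain)


# Mirrors the first example in the thread: image plus its snapshot snap1.
chain = [Vdi("image:snap1", 100), Vdi("image", 50)]
freed_first = delete_vdi(chain, "image")         # snapshot still alive
freed_second = delete_vdi(chain, "image:snap1")  # whole chain now deleted
print(freed_first, freed_second)
```

A different algorithm would then be another class with the same `reclaim`
method, passed in as `policy`, while the old behaviour stays the default.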
