[sheepdog] [PATCH v2 0/5] garbage collect needless VIDs and inode objects

Liu Yuan namei.unix at gmail.com
Mon Mar 16 08:39:26 CET 2015


On Mon, Mar 16, 2015 at 03:55:43PM +0900, Hitoshi Mitake wrote:
> At Mon, 16 Mar 2015 14:48:47 +0800,
> Liu Yuan wrote:
> > 
> > On Mon, Mar 16, 2015 at 03:39:17PM +0900, Hitoshi Mitake wrote:
> > > At Mon, 16 Mar 2015 14:31:44 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Mon, Mar 16, 2015 at 02:08:25PM +0800, Liu Yuan wrote:
> > > > > On Mon, Mar 16, 2015 at 02:36:57PM +0900, Hitoshi Mitake wrote:
> > > > > > At Mon, 16 Mar 2015 10:21:50 +0800,
> > > > > > Liu Yuan wrote:
> > > > > > > 
> > > > > > > On Thu, Mar 12, 2015 at 08:14:33PM +0900, Hitoshi Mitake wrote:
> > > > > > > > At Thu, 12 Mar 2015 14:41:56 +0800,
> > > > > > > > Liu Yuan wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, Jan 13, 2015 at 10:37:40AM +0900, Hitoshi Mitake wrote:
> > > > > > > > > > Current sheepdog never recycles VIDs. But it will cause problems
> > > > > > > > > > e.g. VID space exhaustion, too much garbage inode objects.
> > > > > > > > > > 
> > > > > > > > > > Keeping deleted inode objects is required because living inodes
> > > > > > > > > > (snapshots or clones) can point objects of the deleted inodes. So if
> > > > > > > > > > every member of VDI family is deleted, it is safe to remove deleted
> > > > > > > > > > inode objects.
> > > > > > > > > > 
> > > > > > > > > > v2:
> > > > > > > > > >  - update test scripts
> > > > > > > > > 
> > > > > > > > > All the nodes of our test cluster panic out for the following problem:
> > > > > > > > > 
> > > > > > > > > Mar 12 00:05:03  DEBUG [main] zk_handle_notify(1216) NOTIFY
> > > > > > > > > Mar 12 00:05:03  DEBUG [main] sd_notify_handler(960) op NOTIFY_VDI_ADD, size: 96, from: IPv4 ip:192.168.39.177 port:7000
> > > > > > > > > Mar 12 00:05:03  DEBUG [main] do_add_vdi_state(362) 7c2b2b, 3, 0, 22, 0
> > > > > > > > > Mar 12 00:05:03  DEBUG [main] do_add_vdi_state(362) 7c2b2c, 3, 0, 22, 7c2b2b
> > > > > > > > > Mar 12 00:05:03  EMERG [main] update_vdi_family(127) PANIC: parent VID: 7c2b2b not found
> > > > > > > > > Mar 12 00:05:03  EMERG [main] crash_handler(286) sheep exits unexpectedly (Aborted), si pid 4786, uid 0, errno 0, code -6
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) sheep.c:288: crash_handler
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libpthread.so.0() [0x338200f4ff]
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libc.so.6(gsignal+0x34) [0x3381c328a4]
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libc.so.6(abort+0x174) [0x3381c34084]
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) vdi.c:127: update_vdi_family
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) vdi.c:398: add_vdi_state
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) ops.c:711: cluster_notify_vdi_add
> > > > > > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) group.c:975: sd_notify_handler
> > > > > > > > > 
> > > > > > > > > So I tracked back to this patch set. The problem of this patch set tried to
> > > > > > > > > solve is very clear and come along with sheepdog since its born. This reveals
> > > > > > > > > actually the defeciency of our vdi allocation algorithm, which we need rethink
> > > > > > > > > a completely new algorithm to replace it and is not fixable, unfortunately.
> > > > > > > > > 
> > > > > > > > > One simple rule, we can't recyle any vid if it is once created because of its
> > > > > > > > > current hash collision handling. Our current implementation forbigs recycling.
> > > > > > > > > 
> > > > > > > > > Instead of fixing the above panic bug, I'd suggest we revert this patch set.
> > > > > > > > > For the problem this patch set mentioned, I think we need a new algoirthm and
> > > > > > > > > implementation. But before that, we should stay with old one, it is stable and
> > > > > > > > > reliable and should work for small size cluster.
> > > > > > > > > 
> > > > > > > > > How do you think, Hitoshi and Kazutaka?
> > > > > > > > 
> > > > > > > > How about providing switch turn on/off VID recycling? e.g. dog cluster
> > > > > > > > format --enable-vid-recycle. The code can easily be pushed into
> > > > > > > > conditional branches. I can post a patch if this way is good for you.
> > > > > > > > 
> > > > > > > 
> > > > > > > This temporary workaroud looks okay but not good enough to me, what I am
> > > > > > > concerned is that vdi recycle will probably never be implemented if we stick to
> > > > > > > current vdi allocation algorithm. Once the new vdi allocation is intruduced
> > > > > > > someday in the future, the new algorithm would have no this kind of problem at
> > > > > > > all. If this is the case, the above code we leave here is also useless.
> > > > > > > 
> > > > > > > I think we should focus on the new vdi allocation algorithm, e.g, store
> > > > > > > {name, vid} directly into a kv engine either implemented by sheep or by with the
> > > > > > > help of other software like zookeeper.
> > > > > > > 
> > > > > > > I'm inclined to revert above patch set, for
> > > > > > > 1. it can't fix a non-fixable problem inherently
> > > > > > > 2. the code is probalematic and can cause a catastraphic disaster (all node die)
> > > > > > > 3. we might not need it in the future because it is specific for current vdi
> > > > > > >    allocation algorithm.
> > > > > > 
> > > > > > We can simply employ whole range lookup of bitmap as a VID allocation
> > > > > > algorithm for recycling policy. Of course it would be harmful for
> > > > > > snapshot and clone creation, but it can work correctly (and we have
> > > > > > optimization e.g. parallelizing, caching, etc). In addition, the
> > > > > > performance degradation can happen potentially even if we use the
> > > > > > existing VID allocation algorithm (e.g. hash collision, although
> > > > > > course it can happen rearely).
> > > > > 
> > > > > Do we really need vdi recycle if we bring very complex lines of code? Current
> > > > > algorithm can *reuse* of deleted vdi IDs and inodes. So the very problem is
> > > > > actually the space effeciency, so you try to reclaim the space occupied by
> > > > > deleted vdis.
> > > > > 
> > > > > If it is very easy to reclaim deleted inodes, I'd say great and let's go ahead.
> > > > > But it apparently not. We have this patch set and then the lookup algorithm is
> > > > > heavily degrated.
> > > > > 
> > > > > I'm afraid, lookup the whole range is too costy, considering the deleted inodes
> > > > > space we reclaim. I think, most of users can bear the very little space overhead
> > > > > for better performance. So this patch set trade the code complexity and
> > > > > performance for the space efficiency. Note, we can reclaim inodes only in the
> > > > > case that we delete the whole snapshot chain and parent. This is actually a rare
> > > > > case.
> > > > 
> > > > Delete the vdi & snapthos data objects is really good enough to me. Your patch
> > > > set is the one of efforts to perfect current algorithms. But the cost is too
> > > > high because the hottest path of vdi_lookup() is heavily degrated for gerenal
> > > > cases, even though later we can fix all the bugs related to the this patch set.
> > > > 
> > > > Please consider it.
> > > 
> > > Of course the increased cost of vdi_lookup() is problem. So I'm
> > > posting a patch for providing an option for enabling/disabling vid
> > > recycling. In default, the recycling will be disabled with the
> > > patch. So users can choose two different policies with different
> > > pros/cons.
> > > 
> > > The recycling VID is an actual requirement from the development team
> > > of NTT DATA. I need to provide it at least as an option.
> >  
> > THe requrement is for reclaim the deleted inodes? The vid exaustion problem
> > mentioned in your patch set, is actually not a problem, no? We can reuse deleted
> > inode and vid. If so, reclaim the deleted inodes, which are very little and
> > reuable, is so important?
> 
> The primary problem is VID. But reusing deleted VID correctly will
> require mucm more complex code e.g. rewriting parent/child
> relationship (and it would be superset of current code).

I guess so. I don't have a glance of this problem in a code manner, but we have
two state to indicate if the vid is used or not,

1. bit int bitmap of system_info
2. inode's name field

For the non-snapshot case, the valid vid is a) bit is set, b) name is not empty.
So if we zero name, we can resue this bit by check if name is empty.

For the snapshot case, the above criteria still hold true, no?

001111000
  ABCD

D is the working vid, ABC are the snapshots. After we remove B,

0011 11000
  AB'CD

B' mean its name is empty. So the snapshot chain becomes A->C->D. B is still
reusable, no?

Thanks,
Yuan



More information about the sheepdog mailing list