[sheepdog] [PATCH v2 0/5] garbage collect needless VIDs and inode objects

Hitoshi Mitake mitake.hitoshi at lab.ntt.co.jp
Mon Mar 16 07:31:31 CET 2015


At Mon, 16 Mar 2015 14:08:25 +0800,
Liu Yuan wrote:
> 
> On Mon, Mar 16, 2015 at 02:36:57PM +0900, Hitoshi Mitake wrote:
> > At Mon, 16 Mar 2015 10:21:50 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Thu, Mar 12, 2015 at 08:14:33PM +0900, Hitoshi Mitake wrote:
> > > > At Thu, 12 Mar 2015 14:41:56 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Tue, Jan 13, 2015 at 10:37:40AM +0900, Hitoshi Mitake wrote:
> > > > > > Current sheepdog never recycles VIDs, which causes problems
> > > > > > such as VID space exhaustion and accumulation of garbage inode
> > > > > > objects.
> > > > > > 
> > > > > > Keeping deleted inode objects is required because living inodes
> > > > > > (snapshots or clones) can point to objects of the deleted inodes.
> > > > > > So only after every member of a VDI family has been deleted is it
> > > > > > safe to remove the family's deleted inode objects.
> > > > > > 
> > > > > > v2:
> > > > > >  - update test scripts
> > > > > 
> > > > > All the nodes of our test cluster panicked with the following problem:
> > > > > 
> > > > > Mar 12 00:05:03  DEBUG [main] zk_handle_notify(1216) NOTIFY
> > > > > Mar 12 00:05:03  DEBUG [main] sd_notify_handler(960) op NOTIFY_VDI_ADD, size: 96, from: IPv4 ip:192.168.39.177 port:7000
> > > > > Mar 12 00:05:03  DEBUG [main] do_add_vdi_state(362) 7c2b2b, 3, 0, 22, 0
> > > > > Mar 12 00:05:03  DEBUG [main] do_add_vdi_state(362) 7c2b2c, 3, 0, 22, 7c2b2b
> > > > > Mar 12 00:05:03  EMERG [main] update_vdi_family(127) PANIC: parent VID: 7c2b2b not found
> > > > > Mar 12 00:05:03  EMERG [main] crash_handler(286) sheep exits unexpectedly (Aborted), si pid 4786, uid 0, errno 0, code -6
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) sheep.c:288: crash_handler
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libpthread.so.0() [0x338200f4ff]
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libc.so.6(gsignal+0x34) [0x3381c328a4]
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(847) /lib64/libc.so.6(abort+0x174) [0x3381c34084]
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) vdi.c:127: update_vdi_family
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) vdi.c:398: add_vdi_state
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) ops.c:711: cluster_notify_vdi_add
> > > > > Mar 12 00:05:03  EMERG [main] sd_backtrace(833) group.c:975: sd_notify_handler
> > > > > 
> > > > > So I tracked this back to this patch set. The problem this patch set
> > > > > tries to solve is very clear and has been with sheepdog since its
> > > > > birth. It actually reveals a deficiency of our VDI allocation
> > > > > algorithm, which unfortunately is not fixable; we need to design a
> > > > > completely new algorithm to replace it.
> > > > > 
> > > > > One simple rule: because of the current hash collision handling, we
> > > > > can't recycle any VID once it has been created. Our current
> > > > > implementation forbids recycling.
> > > > > 
> > > > > Instead of fixing the above panic bug, I'd suggest we revert this
> > > > > patch set. For the problem it addresses, I think we need a new
> > > > > algorithm and implementation. But until then, we should stay with the
> > > > > old one; it is stable and reliable and should work for small clusters.
> > > > > 
> > > > > What do you think, Hitoshi and Kazutaka?
> > > > 
> > > > How about providing a switch to turn VID recycling on/off, e.g. dog
> > > > cluster format --enable-vid-recycle? The code can easily be pushed into
> > > > conditional branches. I can post a patch if this approach works for you.
> > > > 
> > > 
> > > This temporary workaround looks okay, but not good enough to me. What
> > > concerns me is that VDI recycling will probably never be implemented if
> > > we stick to the current VDI allocation algorithm. Once a new VDI
> > > allocation algorithm is introduced someday, it would have no problem of
> > > this kind at all. In that case, the code we leave here now would also be
> > > useless.
> > > 
> > > I think we should focus on the new VDI allocation algorithm, e.g.
> > > storing {name, vid} directly in a kv engine, either implemented by sheep
> > > itself or with the help of other software like zookeeper.
> > > 
> > > I'm inclined to revert the above patch set, because
> > > 1. it can't fix an inherently non-fixable problem
> > > 2. the code is problematic and can cause a catastrophic failure (all
> > >    nodes die)
> > > 3. we might not need it in the future, because it is specific to the
> > >    current VDI allocation algorithm.
> > 
> > We can simply employ a whole-range lookup of the bitmap as the VID
> > allocation algorithm for the recycling policy. Of course it would hurt
> > snapshot and clone creation, but it would work correctly (and we have
> > optimizations, e.g. parallelizing, caching, etc.). In addition,
> > performance degradation can potentially happen even with the existing
> > VID allocation algorithm (e.g. on hash collision, although of course
> > that happens rarely).
> 
> Do we really need VID recycling if it brings very complex lines of code?
> The current algorithm can *reuse* deleted VDI IDs and inodes. So the real
> problem is actually space efficiency: you are trying to reclaim the space
> occupied by deleted VDIs.

Space is just one part of the problem. Inode objects of deleted VDIs
can also hurt recovery speed, because they cause additional request
round trips between sheep daemons.

> 
> If it were very easy to reclaim deleted inodes, I'd say great, let's go
> ahead. But apparently it is not. With this patch set, the lookup
> algorithm is heavily degraded.
> 
> I'm afraid looking up the whole range is too costly, considering how
> little deleted-inode space we reclaim. I think most users can bear the
> very small space overhead in exchange for better performance. So this
> patch set trades code complexity and performance for space efficiency.
> Note that we can reclaim inodes only when the whole snapshot chain and
> its parent are deleted, which is actually a rare case.

Recycling VIDs is an actual requirement from the development team at
NTT DATA. Not recycling VIDs is harmful for long-lived clusters that
take frequent periodic snapshots. And updating the VID allocation
policy without reformatting the cluster is virtually impossible, so
providing an option would be helpful for users who run such clusters.

In addition, current sheepdog supports deleting a whole snapshot chain
and its parent via dog vdi clone --no-share. The option cuts the
relationship between the parent snapshot and the newly created clone,
so we can easily reclaim the snapshot's original family.

Thanks,
Hitoshi
