[sheepdog] A radical rethinking on struct vdi_state

Wed May 13 18:57:24 CEST 2015

Hi y'all,

Based on some recent frustrating[1] debugging on our test cluster, I'd like to
propose something that might look very radical to you: we should remove struct
vdi_state completely from the sheepdog code.

Let me give you the background on how it was introduced. It was introduced by
Leven Li to provide per-volume redundancy, an idea that amazed me at the time.
To implement per-volume redundancy, the natural way is to associate a runtime
state with each vdi. It was born as

struct vdi_copy {
	uint32_t vid;
	uint32_t nr_copies;
};

There is no central place to store these runtime states, so every node first
generates the vdi states on its own at startup and then exchanges them with
every other node. This kind of sync-up, in hindsight, is the root of all evil.
vdi_copy evolved into vdi_state, and by now a single vdi state entry takes more
than 1k. It serves more than per-volume redundancy, e.g.

a) it allows a fast lookup of whether a vdi is a snapshot or not
b) it allows lock/unlock/share for tgtd

Ever since it was born we have been hassled by it, if you remember. As far as I
remember, in some corner cases the vdi state is not synced as we expect and an
annoying bug shows up: a vdi can't find its copy number and sometimes has to
fall back to the global copy number. This problem, unfortunately, has never been
resolved because it is too hard for developers to reproduce. More importantly, I
have never seen real use of per-volume redundancy. Everyone I know just uses the
global copy number for all the vdis created in production.

Leaving aside the instability, which might be solved after a long period of
testing & development, vdi scalability is a real problem inherent in vdi state:

1. The bloated size causes lethal network traffic for sync-up.
   Suppose we have 10,000 vdis; then the vdi states will occupy 1GB of memory,
   meaning that a sync-up between two nodes needs to transfer 1GB of data. A
   node joining a 30-node cluster will need 30GB to sync up! At startup, a
   sync-up storm might also cause problems, since every node needs to sync up
   with every other node. This means we can only support fewer than 10,000 vdis
   at most. This is a really big scalability problem that keeps users away from
   sheepdog.

2. The current vdi state sync-up is more complex than you think. It serves too
   many features: tgtd locks, information lookup, the inode coherence protocol,
   family trees and so on. The complexity of the sync-up algorithm is made worse
   by the distributed nature of sheepdog. It is very easy to panic sheep if the
   state is not synced as expected. Just look at how many times panic() is
   called in these algorithms! We have actually already seen many of these
   panics really happen.
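
For point 1, a rough back-of-envelope helper can make the estimate easy to
redo with different numbers. It is only a sketch: it assumes the traffic is
simply entry size times vdi count, and that a joining node pulls one full table
per peer. The function names are made up for illustration. The result depends
heavily on the per-entry size, so it is left as a parameter: 10,000 entries of
1 KB each come to about 10 MB, while a 1 GB table at that vdi count implies
roughly 100 KB per entry.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to ship one full copy of the vdi state table.
 * Assumption: every vdi has exactly one fixed-size entry. */
static uint64_t state_table_bytes(uint64_t nr_vdis, uint64_t entry_bytes)
{
	/* Guard against 64-bit overflow in the multiplication. */
	assert(entry_bytes == 0 || nr_vdis <= UINT64_MAX / entry_bytes);
	return nr_vdis * entry_bytes;
}

/* Bytes a joining node transfers if it syncs one full table with each peer.
 * Assumption: no deduplication or delta encoding between peers. */
static uint64_t join_sync_bytes(uint64_t nr_vdis, uint64_t entry_bytes,
				uint64_t nr_peers)
{
	uint64_t table = state_table_bytes(nr_vdis, entry_bytes);

	assert(table == 0 || nr_peers <= UINT64_MAX / table);
	return table * nr_peers;
}
```

Calling join_sync_bytes(nr_vdis, entry_bytes, 30) gives the total for a node
joining a 30-node cluster under these assumptions.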

So, if we remove vdi state, what about those benefits we get from vdi states
mentioned above?

1 For the information cache, we don't really need it. Yes, it speeds up the
lookup process. But the lookup process isn't in the critical path, so we can
live without it and just read the inode and operate on it. That is much slower
for some operations, but it doesn't hurt much.
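
As a minimal sketch of living without the cache: the answers the vdi state
gives today could be derived from the inode once it has been read. The struct
below is a cut-down stand-in for the real inode, not its actual layout, and the
helper names are made up for illustration. It assumes that a non-zero
snap_ctime marks a snapshot and that the inode carries its own nr_copies; both
should be checked against the tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the on-disk inode (not the real layout). */
struct mini_inode {
	uint64_t snap_ctime;	/* assumed: non-zero once the vdi is a snapshot */
	uint8_t nr_copies;	/* assumed: per-vdi copy number stored in the inode */
};

/* Replaces the cached "is this vdi a snapshot?" lookup. */
static bool is_snapshot_from_inode(const struct mini_inode *inode)
{
	return inode->snap_ctime != 0;
}

/* Replaces the cached copy-number lookup. */
static uint8_t nr_copies_from_inode(const struct mini_inode *inode)
{
	return inode->nr_copies;
}
```

The cost is one inode read per query instead of an in-memory lookup, which is
the slowdown accepted above.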

2 For the tgtd lock stuff, as these are distributed locks, I'd suggest we make
use of zookeeper to implement them. We already have cluster->lock/unlock
implemented, and I think adding a shared state is not so hard. With zookeeper,
we don't need to sync up any lock state between sheep nodes at all. Zookeeper
stores it for us as a central system.
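
To make the "shared state" idea concrete, here is a small in-memory model of
the lock record each vdi would need: unlocked, exclusively locked, or shared by
N holders. It is only a sketch and does not talk to zookeeper; in the proposal
this record would live in zookeeper instead of being synced between sheep
nodes. All type and function names are invented for illustration and are not
existing sheepdog or zookeeper APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock modes for one vdi. */
enum demo_lock_mode { DEMO_UNLOCKED, DEMO_EXCLUSIVE, DEMO_SHARED };

/* Hypothetical per-vdi lock record (what a central store would hold). */
struct demo_lock {
	enum demo_lock_mode mode;
	uint32_t nr_holders;	/* 0 when unlocked, 1 when exclusive */
};

/* Exclusive lock: only succeeds when nobody holds the vdi. */
static bool demo_lock_exclusive(struct demo_lock *l)
{
	if (l->mode != DEMO_UNLOCKED)
		return false;
	l->mode = DEMO_EXCLUSIVE;
	l->nr_holders = 1;
	return true;
}

/* Shared lock: succeeds unless an exclusive holder exists. */
static bool demo_lock_shared(struct demo_lock *l)
{
	if (l->mode == DEMO_EXCLUSIVE)
		return false;
	l->mode = DEMO_SHARED;
	l->nr_holders++;
	return true;
}

/* Unlock: drops one holder; the record becomes unlocked at zero holders. */
static bool demo_unlock(struct demo_lock *l)
{
	if (l->mode == DEMO_UNLOCKED)
		return false;
	assert(l->nr_holders > 0);
	if (--l->nr_holders == 0)
		l->mode = DEMO_UNLOCKED;
	return true;
}
```

With the record kept in one central place, each transition becomes a single
update there, and no node needs to replay lock state to a newly joined peer.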

To conclude, vdi state eats away at stability and scalability. It is not
something we can't live without.

Any ideas? By the way, it's okay for me to tailor sheepdog and remove it as an
in-house patch. I'm not writing this mail to bash it; rather, I hope you guys
can rethink it for the long run.

[1] How frustrating? We just set up a cluster of fewer than 20 storage nodes,
    created as many vdis as possible and put several TB of data in it. We ran
    fio tests, vdi deletion, vdi snapshot & clone. With just these normal
    operations, the cluster sometimes crashes twice a day: many panics,
    unexplainable behaviours and logs, failures to recover objects, etc. Even
    restarting one node sometimes crashes another node.

Thanks,
Yuan