[sheepdog] A radical rethinking on struct vdi_state

Mon May 18 02:52:03 CEST 2015

At Thu, 14 May 2015 00:57:24 +0800,
Liu Yuan wrote:
> 
> Hi y'all,
> 
> Based on recent frustrating[1] debug on our test cluster, I'd like propsose,
> which might looks very radical to you, that we should remove struct vdi_state
> completely from sheepdog code.
> 
> Let me show you the background picture how it was introduced. It was
> introduced by Leven Li by the idea to provide per volume redundancy, which
> amazed me at the time. To implement the per volume redundancy, the naturaul way
> is to associate each vdi a runtime state. It was born as
> 
> struct vdi_copy {
> 	uint32_t vid;
> 	uint32_t nr_copies;
> };
> 
> There is no centric place to store there runtime states, so every node generates
> this vdi states on their own first at their startup and then exchange this state
> with every other nodes. This kind of sync-up, from a afterthought, is the root
> of evil. The vdi_copy evoles into vdi_state and up to now, it has more than 1k
> for a single vdi state entry. It servers more than per volume redundancy, for e.g
> 
> a) it allow fast lookup whether one vdi is snapshot or not
> b) allow lock/unlock/share for tgtd
> 
> Since it was born, we had been hassled by it, if you remember well. As far as I
> remember, in some corner case, the vdi state is not synced as we expect and a
> annoying bug shows up: vdi can't find it is copy number and has to set it as a
> global copy number sometimes. This problem, unfortunately, is never resolved
> because it is too hard to reproduce for developers. More importantly, I never
> saw the real use of per volume redundancy. Every one I know just use the global
> copy number for the whole vdis created for production.
> 
> Despite unstability, which might be sovled after a long time test & dev, the vdi
> scalibiity is a real problem inherent in vdi state:
> 
> 1. The bloated size will cause lethal network communication traffic for sync-up
>    Suppose we have 10,000 vdis then vdi states will ocuppy 1GB memory, meaning
>    that two nodes sync-up will need 1GB data to transfer. A node join a 30 nodes
>    cluster will need 30GB to sync-up! At start up, a sync-up storm might also
>    cause problem since every node need to sync up with each other. This means we
>    might only suppose vdi number less than 10,000 at most. This is really big
>    scalability problem that keeps users away from sheepdog.

Current implementation of syncing vdi bitmap and vdi state is cleally
inefficient. The amount of data can be reduced when a cluster is
already in SD_STATUS_OK (simply copying it from existing node from
newly joining one is enough). I'm already working on it.

> 
> 2. Current vdi states sync-up is more complex than you think. It serves too many
>    features, tgtd locks, information lookup, inode coherence protocol, family
>    trees and so on. The complexity of sync-up algorithm is worsen by the
>    distributed nature of sheepdog. It is very easy to panic sheep if state is
>    not synced as expected. Just see how many panic() is called on these algorithms!
>    We actually already saw many panics of these algorhtm really happens.
> 
> So, if we remove vdi state, what about those benefits we get from vdi states
> mentioned above?
> 
> 1 for information cache, we don't really need it. Yes, it faster the lookup
> process. But there lookup process isn't in the critical path, so we can live
> without it and just read the inode and operate on it. It is way slower for some
> operation but not so hurt.

No. The vdi state is used in hotpath. oid_is_readonly() is called in a
path of handling COW requests.

> 
> 2 for tgtd lock stuff, as it is distributed locks, I'd sugguest
> we make use of zookeeper to implement it. For now, we already have cluster->lock/unlock
> implemented, I think add a shared state is not so hard. With zookeeper, we don't
> need to sync up any lock state at all between sheep nodes. Zookeeper store them
> for use as a centric system.

Using zookeeper as a metadata server conflicts with the principle of
sheepdog. Every node should have a same role. Zookeeper based locking
must be avoided in a case of the block storage feature.

Thanks,
Hitoshi

> 
> To conclude, vdi state eats stability and scalability. It is not something we
> can't live without it.
> 
> any ideas? It's okay for me to tail sheepdog to remove it as a in-house patch
> by the way. I'm not writing this mail to bash it, instead I hope you guys can
> rethink of it in the long run.
> 
> [1] How frustrating? We just set up a less than 20 storage nodes cluster and
>     create as many vdis as possible and with several TB data in it. We did
>     fio tests, vdi deletion, vdi snapshot & clone. But just these normal
>     operations, the cluster sometimes crashes twice a day. Many panics,
>     unexplainable behaviours and logs, failures to recvover objects, etc. Even
>     restart one node sometimes crash another node.
> 
> Thanks,
> Yuan