[sheepdog] A radical rethinking of struct vdi_state

Hitoshi Mitake mitake.hitoshi at lab.ntt.co.jp
Mon May 18 06:42:59 CEST 2015


At Mon, 18 May 2015 09:46:25 +0800,
Liu Yuan wrote:
> 
> On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> > At Thu, 14 May 2015 00:57:24 +0800,
> > Liu Yuan wrote:
> > > 
> > > Hi y'all,
> > > 
> > > Based on recent frustrating[1] debugging on our test cluster, I'd like to
> > > propose, and this might look very radical to you, that we remove struct
> > > vdi_state completely from the sheepdog code.
> > > 
> > > Let me show you the background picture of how it was introduced. It was
> > > introduced by Leven Li with the idea of providing per-volume redundancy,
> > > which amazed me at the time. To implement per-volume redundancy, the
> > > natural way is to associate each vdi with a runtime state. It was born as
> > > 
> > > struct vdi_copy {
> > > 	uint32_t vid;		/* volume id */
> > > 	uint32_t nr_copies;	/* per-volume redundancy */
> > > };
> > > 
> > > There is no central place to store these runtime states, so every node
> > > first generates the vdi states on its own at startup and then exchanges
> > > them with every other node. This kind of sync-up, in hindsight, is the root
> > > of the evil. vdi_copy evolved into vdi_state, and by now a single vdi state
> > > entry is more than 1k. It serves more than per-volume redundancy, e.g.
> > > 
> > > a) it allows a fast lookup of whether a vdi is a snapshot or not
> > > b) it implements lock/unlock/share for tgtd (see the sketch below)
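> > > 
> > > To give an idea of the bloat, the entry now looks something like this
> > > (a sketch from memory; the exact field list is illustrative):
> > > 
> > > struct vdi_state {
> > > 	uint32_t vid;
> > > 	uint8_t nr_copies;	/* the original per-volume redundancy */
> > > 	uint8_t snapshot;	/* (a) fast snapshot lookup */
> > > 	uint8_t deleted;
> > > 	uint8_t copy_policy;
> > > 	uint32_t lock_state;	/* (b) tgtd lock/unlock/share */
> > > 	uint32_t nr_participants;
> > > 	/* holders of the shared lock */
> > > 	struct node_id participants[SD_MAX_COPIES];
> > > };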
> > > 
> > > Since it was born, we have been hassled by it, if you remember well. As far
> > > as I remember, in some corner cases the vdi state is not synced as we
> > > expect and an annoying bug shows up: a vdi can't find its copy number and
> > > sometimes has to fall back to the global copy number. This problem,
> > > unfortunately, has never been resolved because it is too hard for
> > > developers to reproduce. More importantly, I have never seen per-volume
> > > redundancy really used. Everyone I know just uses the global copy number
> > > for all the vdis created for production.
> > > 
> > > Instability aside, which might be solved after a long period of test & dev,
> > > vdi scalability is a real problem inherent in vdi state:
> > > 
> > > 1. The bloated size will cause lethal network traffic for sync-up.
> > >    Suppose we have 10,000 vdis; then the vdi states will occupy 1GB of
> > >    memory, meaning that a sync-up between two nodes needs to transfer 1GB
> > >    of data. A node joining a 30-node cluster will need 30GB to sync up! At
> > >    startup, a sync-up storm might also cause problems, since every node
> > >    needs to sync up with every other. This means we can support at most
> > >    about 10,000 vdis. This is a really big scalability problem that keeps
> > >    users away from sheepdog.
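> > > 
> > >    In other words, the cost scales multiplicatively (an illustrative model
> > >    of the figures above):
> > > 
> > > uint64_t sync_traffic(uint64_t nr_vdis, uint64_t entry_size)
> > > {
> > > 	return nr_vdis * entry_size;	/* one pairwise sync-up */
> > > }
> > > 
> > > uint64_t join_traffic(uint64_t nr_vdis, uint64_t entry_size,
> > > 		      uint64_t nr_nodes)
> > > {
> > > 	/* a joining node syncs up with every existing node */
> > > 	return sync_traffic(nr_vdis, entry_size) * nr_nodes;
> > > }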
> > 
> > The current implementation of syncing the vdi bitmap and vdi state is
> > clearly inefficient. The amount of data can be reduced when a cluster is
> > already in SD_STATUS_OK (simply copying it from an existing node to the
> > newly joining one is enough). I'm already working on it.
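> > 
> > Roughly, the join-time shortcut looks like this (a sketch; the helper names
> > are illustrative, not the real API):
> > 
> > static void sync_vdi_state_on_join(struct cluster_info *ci)
> > {
> > 	if (ci->status == SD_STATUS_OK) {
> > 		/* any single live node holds the full, settled state */
> > 		struct sd_node *peer = pick_any_existing_node(ci);
> > 
> > 		fetch_vdi_state_from(peer);
> > 		return;
> > 	}
> > 	/* otherwise fall back to the all-to-all sync-up */
> > 	collect_vdi_state_from_all_nodes(ci);
> > }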
> > 
> > > 
> > > 2. The current vdi state sync-up is more complex than you think. It serves
> > >    too many features: tgtd locks, information lookup, the inode coherence
> > >    protocol, family trees and so on. The complexity of the sync-up
> > >    algorithm is worsened by the distributed nature of sheepdog. It is very
> > >    easy to panic sheep if state is not synced as expected. Just see how
> > >    many panic() calls there are in these algorithms! We have actually
> > >    already seen many of these panics happen.
> > > 
> > > So, if we remove vdi state, what about the benefits we get from it, as
> > > mentioned above?
> > > 
> > > 1. For the information cache, we don't really need it. Yes, it speeds up
> > > the lookup process. But the lookup process isn't on the critical path, so
> > > we can live without it and just read the inode and operate on it. That is
> > > way slower for some operations, but it doesn't hurt much.
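> > > 
> > > For example, 'is this vdi a snapshot?' can be answered by reading the inode
> > > header directly instead of the cached state, along these lines (a sketch,
> > > ignoring error handling):
> > > 
> > > static bool vdi_is_snapshot_slow(uint32_t vid)
> > > {
> > > 	struct sd_inode inode;
> > > 
> > > 	/* read only the header of the vdi's inode object */
> > > 	read_object(vid_to_vdi_oid(vid), (char *)&inode,
> > > 		    SD_INODE_HEADER_SIZE, 0);
> > > 	/* a snapshot has a non-zero snapshot creation time */
> > > 	return !!inode.snap_ctime;
> > > }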
> > 
> > No. The vdi state is used on the hot path. oid_is_readonly() is called in
> > the path that handles COW requests.
> 
> I noticed it, and I think it can be improved with a private state that has
> no need to be synced up.
> 
> oid_is_readonly() is used to implement online snapshots. When the client
> (VM) issues a 'snapshot' to the connected sheep, the sheep turns the working
> vid into the parent of the new vdi and then marks the parent vdi read-only
> so that the client refreshes its inode. For this logic, we only need to
> store this kind of vdi state in the private state of the connected vdi.
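> 
> That is, something along these lines, kept per connection instead of in the
> globally synced table (a sketch; the field and function signatures are
> illustrative):
> 
> struct client_info {
> 	/* ... */
> 	uint32_t working_vid;
> 	uint32_t readonly_vid;	/* parent made read-only by a snapshot */
> };
> 
> static void handle_snapshot(struct client_info *ci, uint32_t new_vid)
> {
> 	/* the old working vdi becomes the read-only parent */
> 	ci->readonly_vid = ci->working_vid;
> 	ci->working_vid = new_vid;
> }
> 
> static bool oid_is_readonly(const struct client_info *ci, uint64_t oid)
> {
> 	return oid_to_vid(oid) == ci->readonly_vid;
> }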

Do you mean the qemu daemons need to process the snapshot request?

> 
> > 
> > > 
> > > 2. For the tgtd lock stuff, since these are distributed locks, I'd suggest
> > > we make use of zookeeper to implement them. We already have
> > > cluster->lock/unlock implemented, and I think adding a shared state is not
> > > so hard. With zookeeper, we don't need to sync up any lock state at all
> > > between sheep nodes. Zookeeper stores it for us as a central system.
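> > > 
> > > With the existing cluster driver ops this would look roughly like the
> > > following (a sketch; the shared-lock op is the proposed addition and the
> > > exact signatures are illustrative):
> > > 
> > > struct cluster_driver {
> > > 	/* ... */
> > > 	int (*lock)(uint64_t lock_id);
> > > 	int (*unlock)(uint64_t lock_id);
> > > 	/* proposed: shared (reader) mode for tgtd's 'share' */
> > > 	int (*lock_shared)(uint64_t lock_id);
> > > };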
> > 
> > Using zookeeper as a metadata server conflicts with the principle of
> > sheepdog: every node should have the same role. Zookeeper-based locking
> > must be avoided in the case of the block storage feature.
> 
> It might look less attractive from the perspective of design principles, but
> for practical use, zookeeper-like software is the only way to achieve
> scalability. Corosync-like software can only support 10 nodes at most
> because of the huge node sync-up traffic.

In our past benchmarking, corosync could support on the order of 100 nodes.

> If we scale out with zookeeper, why not base other features on it as well?
> It removes the need for syncs between nodes and indeed looks more attractive
> than 'fully symmetric' in terms of scalability and stability.
> 
> For a distributed system to win over other distributed systems, the tuple
> {scalability, stability, performance} is the top factor. 'vdi state' has
> scalability and stability problems and would stop people from making use of
> sheepdog, I'm afraid.

Symmetry contributes to simplicity of management. If sheepdog uses
zookeeper as a metadata server, it will require more complicated
capacity planning and change the way the system is operated.

Thanks,
Hitoshi

