[sheepdog] A radical rethinking on struct vdi_state

Hitoshi Mitake mitake.hitoshi at lab.ntt.co.jp
Mon May 18 08:07:50 CEST 2015


At Mon, 18 May 2015 13:40:20 +0800,
Liu Yuan wrote:
> 
> On Mon, May 18, 2015 at 01:42:59PM +0900, Hitoshi Mitake wrote:
> > At Mon, 18 May 2015 09:46:25 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> > > > At Thu, 14 May 2015 00:57:24 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > Hi y'all,
> > > > > 
> > > > > Based on some recent frustrating[1] debugging on our test cluster, I'd like to
> > > > > propose, in what might look very radical to you, that we remove struct vdi_state
> > > > > completely from the sheepdog code.
> > > > > 
> > > > > Let me show you the background of how it was introduced. It was introduced by
> > > > > Leven Li with the idea of providing per-volume redundancy, which amazed me at
> > > > > the time. To implement per-volume redundancy, the natural way is to associate
> > > > > a runtime state with each vdi. It was born as
> > > > > 
> > > > > struct vdi_copy {
> > > > > 	uint32_t vid;        /* vdi id */
> > > > > 	uint32_t nr_copies;  /* per-vdi redundancy (copy number) */
> > > > > };
> > > > > 
> > > > > There is no central place to store these runtime states, so every node first
> > > > > generates the vdi states on its own at startup and then exchanges them with
> > > > > every other node. This kind of sync-up, in hindsight, is the root of the evil.
> > > > > vdi_copy has evolved into vdi_state, and by now a single vdi state entry takes
> > > > > more than 1K. It serves more than per-volume redundancy, e.g.
> > > > > 
> > > > > a) it allows a fast lookup of whether a vdi is a snapshot or not
> > > > > b) it allows lock/unlock/share for tgtd
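> > > > > 
> > > > > To illustrate that bloat, here is a rough, hypothetical sketch of the direction
> > > > > vdi_state has grown in; the fields below are illustrative only, not the actual
> > > > > sheepdog definition:
> > > > > 
> > > > > 	struct vdi_state_sketch {
> > > > > 		uint32_t vid;              /* the original vdi_copy part */
> > > > > 		uint8_t nr_copies;
> > > > > 		uint8_t snapshot;          /* fast snapshot lookup, point a) */
> > > > > 		uint8_t lock_state;        /* tgtd lock/unlock/share, point b) */
> > > > > 		uint8_t copy_policy;
> > > > > 		uint32_t nr_participants;  /* nodes currently sharing the vdi */
> > > > > 		struct node_id participants[SD_MAX_COPIES];
> > > > > 					   /* per-node arrays like this are what
> > > > > 					    * inflate each entry */
> > > > > 	};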
> > > > > 
> > > > > Since it was born, it has hassled us, if you remember well. As far as I
> > > > > remember, in some corner cases the vdi state is not synced as we expect and an
> > > > > annoying bug shows up: a vdi can't find its copy number and sometimes has to
> > > > > fall back to the global copy number. This problem, unfortunately, has never
> > > > > been resolved because it is too hard for developers to reproduce. More
> > > > > importantly, I have never seen per-volume redundancy really used. Everyone I
> > > > > know just uses the global copy number for all the vdis created for production.
> > > > > 
> > > > > Instability aside, which might be solved after a long period of test & dev,
> > > > > scalability is a real problem inherent in vdi state:
> > > > > 
> > > > > 1. The bloated size will cause lethal network communication traffic for sync-up.
> > > > >    Suppose we have 10,000 vdis; then the vdi states will occupy 1GB of memory,
> > > > >    meaning that a sync-up between two nodes needs to transfer 1GB of data. A
> > > > >    node joining a 30-node cluster will need 30GB to sync up! At startup, a
> > > > >    sync-up storm might also cause problems since every node needs to sync up
> > > > >    with every other. This means we can probably support fewer than 10,000 vdis
> > > > >    at most. This is a really big scalability problem that keeps users away
> > > > >    from sheepdog.
> > > > 
> > > > The current implementation of syncing the vdi bitmap and vdi state is clearly
> > > > inefficient. The amount of data can be reduced when a cluster is already in
> > > > SD_STATUS_OK (simply copying it from an existing node to the newly joining
> > > > one is enough). I'm already working on it.
> > > > 
> > > > > 
> > > > > 2. The current vdi state sync-up is more complex than you think. It serves too
> > > > >    many features: tgtd locks, information lookup, the inode coherence protocol,
> > > > >    family trees and so on. The complexity of the sync-up algorithm is worsened
> > > > >    by the distributed nature of sheepdog. It is very easy to panic sheep if
> > > > >    state is not synced as expected. Just see how many times panic() is called
> > > > >    in these algorithms! We have actually already seen many of these panics
> > > > >    really happen.
> > > > > 
> > > > > So, if we remove vdi state, what about those benefits we get from vdi states
> > > > > mentioned above?
> > > > > 
> > > > > 1 For the information cache, we don't really need it. Yes, it speeds up the
> > > > > lookup process. But the lookup process isn't in the critical path, so we can
> > > > > live without it and just read the inode and operate on it. That is way slower
> > > > > for some operations, but it doesn't hurt that much.
> > > > 
> > > > No. The vdi state is used in a hot path: oid_is_readonly() is called in the
> > > > path handling COW requests.
> > > 
> > > I noticed it, and I think it can be improved with a private state that needs
> > > no sync-up.
> > > 
> > > oid_is_readonly() is used to implement online snapshots. When the client (VM)
> > > issues a 'snapshot' to the connected sheep, the sheep turns the working vid
> > > into the parent of the new vdi and then marks the parent vdi read-only so the
> > > client refreshes its inode. For this logic, we only need to store this kind of
> > > vdi state in the private state of the connected vdi.
> > 
> > Do you mean QEMU daemons need to process snapshot requests?
> 
> No modification to the QEMU daemon: it will refresh the inode if it receives
> SD_RES_READONLY. With a private vdi state on the connected sheep, we still
> return SD_RES_READONLY.
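> 
> A minimal sketch of the private state I mean, kept only on the connected sheep
> and never synced; struct client_info and check_object_write() are hypothetical
> names, not existing sheepdog code, and only the SD_RES_* codes are real:
> 
> 	/* Private, per-connection state; never exchanged between nodes. */
> 	struct client_info {
> 		uint32_t working_vid;  /* vid the VM is currently writing to */
> 		int vid_readonly;      /* set once a snapshot turns it into a parent */
> 	};
> 
> 	static int check_object_write(const struct client_info *ci)
> 	{
> 		/*
> 		 * If an online snapshot has made the working vdi a read-only
> 		 * parent, tell the client so it refreshes the inode and switches
> 		 * to the new working vdi; no cluster-wide vdi state is needed.
> 		 */
> 		if (ci->vid_readonly)
> 			return SD_RES_READONLY;
> 		return SD_RES_SUCCESS;
> 	}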
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > 2 For the tgtd lock stuff, since these are distributed locks, I'd suggest we
> > > > > make use of zookeeper to implement them. We already have cluster->lock/unlock
> > > > > implemented, and I think adding a shared state is not so hard. With zookeeper,
> > > > > we don't need to sync up any lock state at all between sheep nodes; zookeeper
> > > > > stores it for us as a central system.
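> > > > > 
> > > > > A rough sketch of the idea with the ZooKeeper C client (not existing sheepdog
> > > > > code; the lock path layout and error handling are simplified, and it assumes
> > > > > the parent znode /sheep/lock already exists):
> > > > > 
> > > > > 	#include <errno.h>
> > > > > 	#include <stdio.h>
> > > > > 	#include <stdint.h>
> > > > > 	#include <zookeeper/zookeeper.h>
> > > > > 
> > > > > 	/* Take an exclusive vdi lock by creating an ephemeral znode. The
> > > > > 	 * znode disappears automatically when the holder's session dies,
> > > > > 	 * so no lock state ever has to be synced between sheep nodes. */
> > > > > 	static int vdi_lock(zhandle_t *zh, uint32_t vid)
> > > > > 	{
> > > > > 		char path[64];
> > > > > 		int rc;
> > > > > 
> > > > > 		snprintf(path, sizeof(path), "/sheep/lock/%08x", vid);
> > > > > 		rc = zoo_create(zh, path, "", 0, &ZOO_OPEN_ACL_UNSAFE,
> > > > > 				ZOO_EPHEMERAL, NULL, 0);
> > > > > 		if (rc == ZOK)
> > > > > 			return 0;           /* we hold the lock */
> > > > > 		if (rc == ZNODEEXISTS)
> > > > > 			return -EBUSY;      /* someone else holds it */
> > > > > 		return -EIO;                /* other ZooKeeper error */
> > > > > 	}
> > > > > 
> > > > > 	static void vdi_unlock(zhandle_t *zh, uint32_t vid)
> > > > > 	{
> > > > > 		char path[64];
> > > > > 
> > > > > 		snprintf(path, sizeof(path), "/sheep/lock/%08x", vid);
> > > > > 		zoo_delete(zh, path, -1);   /* -1: any version */
> > > > > 	}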
> > > > 
> > > > Using zookeeper as a metadata server conflicts with the principle of
> > > > sheepdog: every node should have the same role. Zookeeper-based locking
> > > > must be avoided in the case of the block storage feature.
> > > 
> > > It might look attractive from the perspective of design principles, but for
> > > practical use, zookeeper-like software is the only way to achieve scalability.
> > > Corosync-like software can only support 10 nodes at most because of the huge
> > > node sync-up traffic.
> > 
> > In our past benchmarking, corosync could support 100 nodes.
> 
> I guess you use a 10Gb network or even better. We use a 1Gb network and found
> 10 is the top number.
> 
> By the way, did you test stability with 100 nodes? Stress the cluster and run
> tests for days with thousands of VDIs, not just benchmarks. We ran Corosync
> (version 1.4.x) in production for several weeks with ten nodes. It broke
> several times, and was less stable than zookeeper, which we had run for half
> a year.

AFAIK, just benchmarking. But corosync 1.x has serious bugs in
messaging. 2.x is stable as far as we know.

> 
> > 
> > > If we scale out with zookeeper, why not rely on it for other features as
> > > well? It removes the need for syncs between nodes and indeed looks more
> > > attractive than 'fully symmetric' in terms of scalability and stability.
> > > 
> > > For a distributed system to win over other distributed ones, the tuple
> > > {scalability, stability, performance} is the top factor. 'vdi state' has
> > > scalability and stability problems and would stop people from making use of
> > > sheepdog, I'm afraid.
> > 
> > Symmetry contributes to simplicity of management. If sheepdog uses
> > zookeeper as a metadata server, it will require more complicated
> > capacity planning and a change in the way of operation.
> 
> It always cuts both ways. From the perspective of admins, setting up a zk
> cluster is not always harder than setting up corosync on every node.

We (NTT) had a discussion about your opinion and concluded the change
cannot coexist with the principle of sheepdog. If you want to use
zookeeper as a metadata server, please fork the project.

Thanks,
Hitoshi

