[sheepdog] A radical rethinking on struct vdi_state

Mon May 18 03:46:25 CEST 2015

On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> At Thu, 14 May 2015 00:57:24 +0800,
> Liu Yuan wrote:
> > 
> > Hi y'all,
> > 
> > Based on recent frustrating[1] debugging on our test cluster, I'd like to
> > propose, though it might look very radical to you, that we remove struct
> > vdi_state completely from the sheepdog code.
> > 
> > Let me show you the background of how it was introduced. It was introduced by
> > Leven Li with the idea of providing per-volume redundancy, which amazed me at
> > the time. To implement per-volume redundancy, the natural way is to associate
> > a runtime state with each vdi. It was born as
> > 
> > struct vdi_copy {
> > 	uint32_t vid;
> > 	uint32_t nr_copies;
> > };
> > 
> > There is no central place to store these runtime states, so every node generates
> > the vdi states on its own at startup and then exchanges them with every other
> > node. This kind of sync-up, in hindsight, is the root of evil. vdi_copy evolved
> > into vdi_state, and by now a single vdi state entry takes more than 1k. It
> > serves more than per-volume redundancy, e.g.
> > 
> > a) it allows fast lookup of whether a vdi is a snapshot or not
> > b) it allows lock/unlock/share for tgtd
> > 
> > Since it was born, we have been hassled by it, if you remember well. As far as I
> > remember, in some corner cases the vdi state is not synced as we expect and an
> > annoying bug shows up: a vdi can't find its copy number and sometimes has to
> > fall back to the global copy number. This problem, unfortunately, was never
> > resolved because it is too hard for developers to reproduce. More importantly,
> > I have never seen real use of per-volume redundancy. Everyone I know just uses
> > the global copy number for all vdis created for production.
> > 
> > Despite the instability, which might be solved after a long period of testing
> > & development, vdi scalability is a real problem inherent in vdi state:
> > 
> > 1. The bloated size will cause lethal network traffic for sync-up.
> >    Suppose we have 10,000 vdis; then the vdi states will occupy 1GB of memory,
> >    meaning that a sync-up between two nodes needs to transfer 1GB of data. A node
> >    joining a 30-node cluster will need 30GB to sync up! At startup, a sync-up
> >    storm might also cause problems since every node needs to sync up with every
> >    other. This means we might only support fewer than 10,000 vdis at most. This
> >    is a really big scalability problem that keeps users away from sheepdog.
> 
> The current implementation of syncing the vdi bitmap and vdi state is clearly
> inefficient. The amount of data can be reduced when a cluster is
> already in SD_STATUS_OK (simply copying it from an existing node to the
> newly joining one is enough). I'm already working on it.
> 
> > 
> > 2. The current vdi state sync-up is more complex than you think. It serves too
> >    many features: tgtd locks, information lookup, the inode coherence protocol,
> >    family trees and so on. The complexity of the sync-up algorithm is worsened
> >    by the distributed nature of sheepdog. It is very easy to panic sheep if the
> >    state is not synced as expected. Just see how many times panic() is called
> >    in these algorithms! We have actually already seen many of these panics
> >    really happen.
> > 
> > So, if we remove vdi state, what about the benefits we get from vdi state
> > mentioned above?
> > 
> > 1 for the information cache, we don't really need it. Yes, it speeds up the
> > lookup process. But the lookup process isn't in the critical path, so we can
> > live without it and just read the inode and operate on it. It is way slower for
> > some operations but doesn't hurt much.
> 
> No. The vdi state is used in the hot path. oid_is_readonly() is called in the
> path that handles COW requests.

I noticed it, and I think it can be improved with a private state that needs no
sync-up.

oid_is_readonly() is used to implement online snapshots. When the client (VM)
issues a 'snapshot' to the connected sheep, the sheep makes the working vid the
parent of the new vdi and then marks the parent vdi as read-only so that the
client refreshes the inode. For this logic, we only need to store this kind of
vdi state in the private state of the connected vdi.
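
As a rough illustration of the private-state idea, here is a minimal sketch in
C. It is not sheepdog code: struct private_vdi_state and the private_* and
sketch_* helpers are hypothetical names, and the object id layout (a 24-bit vid
starting at bit 32) is an assumption made for the example. The connected sheep
records the read-only parent locally when it handles a snapshot, so the hot-path
check never consults any synced state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the vid occupies 24 bits starting at bit 32
 * of a 64-bit object id. */
#define SKETCH_VID_SHIFT 32
#define SKETCH_NR_VIDS   (1U << 24)

/* Hypothetical node-private state: one read-only bit per vid (2MB in total).
 * Only the sheep the client is connected to touches it, so it is never
 * exchanged with other nodes. */
struct private_vdi_state {
	uint8_t readonly[SKETCH_NR_VIDS / 8];
};

static uint32_t sketch_oid_to_vid(uint64_t oid)
{
	return (uint32_t)(oid >> SKETCH_VID_SHIFT) & (SKETCH_NR_VIDS - 1);
}

/* Build the id of data object number idx belonging to vid. */
static uint64_t sketch_vid_to_data_oid(uint32_t vid, uint32_t idx)
{
	return ((uint64_t)vid << SKETCH_VID_SHIFT) | idx;
}

static void private_mark_readonly(struct private_vdi_state *st, uint32_t vid)
{
	assert(vid < SKETCH_NR_VIDS);
	st->readonly[vid / 8] |= (uint8_t)(1U << (vid % 8));
}

/* Local replacement for the hot-path check: a pure memory lookup. */
static bool private_oid_is_readonly(const struct private_vdi_state *st,
				    uint64_t oid)
{
	uint32_t vid = sketch_oid_to_vid(oid);

	return (st->readonly[vid / 8] & (1U << (vid % 8))) != 0;
}

/* On 'snapshot': the old working vid becomes the read-only parent and the
 * client continues on new_vid. */
static void private_on_snapshot(struct private_vdi_state *st,
				uint32_t *working_vid, uint32_t new_vid)
{
	private_mark_readonly(st, *working_vid);
	*working_vid = new_vid;
}
```

After private_on_snapshot() runs, writes to objects of the old working vid are
seen as read-only and can take the COW path, while objects of the new vid stay
writable.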

> 
> > 
> > 2 for the tgtd lock stuff, as these are distributed locks, I'd suggest we make
> > use of zookeeper to implement them. For now, we already have
> > cluster->lock/unlock implemented, and I think adding a shared state is not so
> > hard. With zookeeper, we don't need to sync up any lock state at all between
> > sheep nodes. Zookeeper stores them for us as a central system.
> 
> Using zookeeper as a metadata server conflicts with the principle of
> sheepdog. Every node should have the same role. Zookeeper-based locking
> must be avoided in the case of the block storage feature.

It might look attractive from the perspective of design principles, but in
practical use, zookeeper-like software is the only way to achieve scalability.
Corosync-like software can only support 10 nodes at most because of the huge
sync-up traffic between nodes. If we scale out with zookeeper, why not build
other features on it? It removes the need for syncs between nodes and indeed
looks more attractive than 'fully symmetric' in terms of scalability and
stability.

For a distributed system to win over other distributed ones, the tuple
{scalability, stability, performance} is the top factor. 'vdi state' has
scalability and stability problems and would stop people from making use of
sheepdog, I'm afraid.

Thanks,
Yuan