[sheepdog] A radical rethinking on struct vdi_state
Liu Yuan
namei.unix at gmail.com
Mon May 18 07:40:20 CEST 2015
On Mon, May 18, 2015 at 01:42:59PM +0900, Hitoshi Mitake wrote:
> At Mon, 18 May 2015 09:46:25 +0800,
> Liu Yuan wrote:
> >
> > On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> > > At Thu, 14 May 2015 00:57:24 +0800,
> > > Liu Yuan wrote:
> > > >
> > > > Hi y'all,
> > > >
> > > > Based on some recent frustrating[1] debugging on our test cluster, I'd like to
> > > > propose something that might look very radical to you: we should remove struct
> > > > vdi_state completely from the sheepdog code.
> > > >
> > > > Let me give some background on how it was introduced. It was introduced by
> > > > Leven Li with the idea of providing per-volume redundancy, which amazed me at
> > > > the time. To implement per-volume redundancy, the natural way is to associate a
> > > > runtime state with each vdi. It was born as
> > > >
> > > > struct vdi_copy {
> > > >         uint32_t vid;
> > > >         uint32_t nr_copies;
> > > > };
> > > >
> > > > There is no central place to store these runtime states, so every node first
> > > > generates the vdi states on its own at startup and then exchanges them with
> > > > every other node. This kind of sync-up, in hindsight, is the root of the evil.
> > > > vdi_copy evolved into vdi_state, and by now a single vdi state entry is more
> > > > than 1k. It serves more than per-volume redundancy, e.g.
> > > >
> > > > a) it allows fast lookup of whether a vdi is a snapshot or not
> > > > b) it allows lock/unlock/share for tgtd
> > > >
> > > > Since it was born, we have been hassled by it, if you remember well. As far as I
> > > > remember, in some corner cases the vdi state is not synced as we expect, and an
> > > > annoying bug shows up: a vdi can't find its copy number and sometimes has to
> > > > fall back to the global copy number. This problem, unfortunately, has never been
> > > > resolved because it is too hard for developers to reproduce. More importantly, I
> > > > have never seen real use of per-volume redundancy. Everyone I know just uses the
> > > > global copy number for all the vdis created for production.
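
To illustrate the failure mode: when the synced entry never arrives on a node, the
lookup can do nothing better than fall back, roughly like the sketch below. The
names are made up for illustration; this is not the exact sheep code.

        #include <stdint.h>

        struct synced_vdi_state {               /* stand-in for struct vdi_state */
                uint32_t vid;
                uint8_t nr_copies;              /* per-volume redundancy */
        };

        static uint8_t cluster_default_copies = 3;     /* the global copy number */

        static uint8_t get_copy_number(const struct synced_vdi_state *state)
        {
                if (!state || state->nr_copies == 0)
                        /* the state never made it to this node, so the per-volume
                         * setting is silently replaced by the global one */
                        return cluster_default_copies;

                return state->nr_copies;
        }
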
> > > >
> > > > Leaving aside the instability, which might be solved after a long period of
> > > > test & dev, scalability with many vdis is a real problem inherent in vdi state:
> > > >
> > > > 1. The bloated size will cause lethal network traffic for sync-up.
> > > > Suppose we have 10,000 vdis; the vdi states will then occupy 1GB of memory,
> > > > meaning that a sync-up between two nodes needs to transfer 1GB of data. A node
> > > > joining a 30-node cluster will need 30GB to sync up! At startup, a sync-up
> > > > storm might also cause problems, since every node needs to sync up with every
> > > > other one. This means we can only support at most around 10,000 vdis. This is
> > > > a really big scalability problem that keeps users away from sheepdog.
> > >
> > > The current implementation of syncing the vdi bitmap and vdi state is clearly
> > > inefficient. The amount of data can be reduced when a cluster is already in
> > > SD_STATUS_OK (simply copying it from an existing node to the newly joining one
> > > is enough). I'm already working on it.
> > >
> > > >
> > > > 2. The current vdi state sync-up is more complex than you think. It serves too
> > > > many features: tgtd locks, information lookup, the inode coherence protocol,
> > > > family trees and so on. The complexity of the sync-up algorithm is worsened by
> > > > the distributed nature of sheepdog. It is very easy to panic sheep if state is
> > > > not synced as expected. Just see how many times panic() is called in these
> > > > algorithms! We have actually already seen many of these panics happen.
> > > >
> > > > So, if we remove vdi state, what about the benefits we get from it, as
> > > > mentioned above?
> > > >
> > > > 1. For the information cache, we don't really need it. Yes, it speeds up the
> > > > lookup. But the lookup isn't in the critical path, so we can live without it
> > > > and just read the inode and operate on it. It is way slower for some
> > > > operations, but it doesn't hurt much.
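
To be concrete about what "just read the inode" means for lookup a): read the inode
object and inspect it directly, roughly as below. read_vdi_inode() is a hypothetical
helper standing in for whatever object-read path sheep uses, and the struct is only a
minimal stand-in for sd_inode.

        #include <stdbool.h>
        #include <stdint.h>

        struct mini_inode {             /* minimal stand-in for struct sd_inode */
                uint64_t snap_ctime;    /* non-zero once the vdi became a snapshot */
                uint8_t nr_copies;      /* per-volume redundancy, kept on disk anyway */
        };

        /* hypothetical: read the inode object of 'vid' from the object store */
        int read_vdi_inode(uint32_t vid, struct mini_inode *inode);

        static bool vdi_is_snapshot_slow(uint32_t vid)
        {
                struct mini_inode inode;

                if (read_vdi_inode(vid, &inode) != 0)
                        return false;   /* error handling elided in this sketch */

                return inode.snap_ctime != 0;
        }
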
> > >
> > > No. The vdi state is used in a hot path. oid_is_readonly() is called in the
> > > path of handling COW requests.
> >
> > I noticed that, and I think it can be handled by a private state that doesn't
> > need to be synced up.
> >
> > oid_is_readonly() is used to implement online snapshots. When the client (VM)
> > issues a 'snapshot' to the connected sheep, the sheep makes the working vid the
> > parent of the new vdi and then marks the parent vdi as read-only so that the
> > client refreshes its inode. For this logic, we only need to store this kind of
> > vdi state in the private state for the connected vdi.
>
> Do you mean qemu daemons need to process snapshot requests?
No modification to the QEMU daemon is needed; it will refresh the inode if it
receives SD_RES_READONLY. With a private vdi state on the connected sheep, we
still return SD_RES_READONLY.
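
Roughly, the private state I mean is just node-local bookkeeping that is never
exchanged with other nodes. A rough sketch (names are made up for illustration,
nothing here is existing sheep code):

        #include <stdbool.h>
        #include <stdint.h>

        #define MAX_LOCAL_VDIS 1024     /* arbitrary for this sketch */

        /* vids this sheep has turned into snapshot parents; purely local,
         * so there is nothing to sync and nothing that can get out of sync */
        static uint32_t readonly_vids[MAX_LOCAL_VDIS];
        static int nr_readonly_vids;

        /* called locally when this sheep snapshots 'vid' */
        static void mark_vdi_readonly(uint32_t vid)
        {
                if (nr_readonly_vids < MAX_LOCAL_VDIS)
                        readonly_vids[nr_readonly_vids++] = vid;
        }

        static bool vdi_is_readonly_local(uint32_t vid)
        {
                for (int i = 0; i < nr_readonly_vids; i++)
                        if (readonly_vids[i] == vid)
                                return true;
                return false;
        }

        /*
         * The write path would then do something like:
         *
         *      if (vdi_is_readonly_local(vid))
         *              return SD_RES_READONLY;
         *
         * and the client refreshes its inode exactly as it does today.
         */
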
>
> >
> > >
> > > >
> > > > 2. For the tgtd lock stuff, since these are distributed locks, I'd suggest we
> > > > make use of zookeeper to implement them. We already have cluster->lock/unlock
> > > > implemented, and I think adding a shared state is not so hard. With zookeeper,
> > > > we don't need to sync up any lock state at all between sheep nodes; zookeeper
> > > > stores it for us as a central system.
> > >
> > > Using zookeeper as a metadata server conflicts with the principle of
> > > sheepdog. Every node should have the same role. Zookeeper-based locking
> > > must be avoided in the case of the block storage feature.
> >
> > It might look attractive from the perspective of the design principle, but for
> > practical use, zookeeper-like software is the only way to achieve scalability.
> > Corosync-like software can only support about 10 nodes at most because of the
> > huge node sync-up traffic.
>
> In our past benchmarking, corosync could support 100 nodes.
I guess you use a 10Gb network or even better. We use a 1Gb network and found 10
to be the top number.
By the way, did you test stability with 100 nodes? Stress the cluster and run
tests for days with thousands of VDIs, not just a benchmark. We ran Corosync
(1.4.x) in production for several weeks with ten nodes. It broke several times
and was less stable than zookeeper, which we have run for half a year.
>
> > If we scale out with zookeeper, why not base other features on it as well? It
> > removes the need for syncs between nodes and indeed looks more attractive than
> > 'fully symmetric' in terms of scalability and stability.
> >
> > For a distributed system to win over other distributed systems, the tuple
> > {scalability, stability, performance} is the top factor. 'vdi state' has
> > scalability and stability problems and would stop people from making use of
> > sheepdog, I'm afraid.
>
> Symmetry contributes to simplicity of management. If sheepdog uses
> zookeeper as a metadata server, it will require more complicated
> capacity planning and a change in the way it is operated.
There are always two sides to this. From the perspective of admins, setting up a
zk cluster is not necessarily harder than setting up corosync on every node.
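
To make the tgtd lock idea above a bit more concrete, here is a rough sketch on top
of the zookeeper C client. The znode layout and function names are only for
illustration, and error/session handling is omitted; it is a sketch, not a finished
design:

        #include <stdint.h>
        #include <stdio.h>
        #include <zookeeper/zookeeper.h>

        /* Try to take the lock for 'vid' by creating an ephemeral znode.
         * Ephemeral means the lock vanishes with the holder's session, so no
         * lock state ever has to be synced between sheep nodes. */
        static int zk_lock_vdi(zhandle_t *zh, uint32_t vid)
        {
                char path[64];
                int rc;

                snprintf(path, sizeof(path), "/sheep/vdi_lock/%08x", vid);
                rc = zoo_create(zh, path, "", 0, &ZOO_OPEN_ACL_UNSAFE,
                                ZOO_EPHEMERAL, NULL, 0);
                if (rc == ZNODEEXISTS)
                        return -1;      /* someone else holds the lock */

                return rc == ZOK ? 0 : -1;
        }

        static int zk_unlock_vdi(zhandle_t *zh, uint32_t vid)
        {
                char path[64];

                snprintf(path, sizeof(path), "/sheep/vdi_lock/%08x", vid);
                return zoo_delete(zh, path, -1) == ZOK ? 0 : -1;
        }

The parent znodes would be created once at setup, and shared state for tgtd could
live in the znode data in the same way.
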
Thanks,
Yuan