[sheepdog] A radical rethinking on struct vdi_state

Liu Yuan namei.unix at gmail.com
Mon May 18 07:50:34 CEST 2015

On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> At Thu, 14 May 2015 00:57:24 +0800,
> Liu Yuan wrote:
> > 
> > Hi y'all,
> > 
> > Based on a recent frustrating[1] debugging session on our test cluster, I'd like
> > to propose, and this might look very radical to you, that we remove struct
> > vdi_state completely from the sheepdog code.
> > 
> > Let me sketch the background of how it was introduced. Leven Li introduced it
> > with the idea of providing per-volume redundancy, which amazed me at the time.
> > To implement per-volume redundancy, the natural way is to associate a runtime
> > state with each vdi. It was born as
> > 
> > struct vdi_copy {
> > 	uint32_t vid;
> > 	uint32_t nr_copies;
> > };
> > 
> > There is no central place to store these runtime states, so every node first
> > generates the vdi states on its own at startup and then exchanges them with
> > every other node. This kind of sync-up, in hindsight, is the root of the evil.
> > vdi_copy evolved into vdi_state, and by now a single vdi state entry takes more
> > than 1k. It serves more than per-volume redundancy, e.g.
> > 
> > a) it allows fast lookup of whether a vdi is a snapshot or not
> > b) it allows lock/unlock/share for tgtd
> > 
> > Since it was born, it has hassled us, if you remember well. As far as I
> > remember, in some corner cases the vdi state is not synced as we expect and an
> > annoying bug shows up: a vdi can't find its copy number and sometimes has to
> > fall back to the global copy number. This problem, unfortunately, has never been
> > resolved because it is too hard for developers to reproduce. More importantly, I
> > have never seen real use of per-volume redundancy. Everyone I know just uses the
> > global copy number for all vdis created for production.
> > 
> > Instability aside, which might be solved after a long period of test & dev,
> > scalability is a real problem inherent in vdi state:
> > 
> > 1. The bloated size will cause lethal network traffic for sync-up.
> >    Suppose we have 10,000 vdis; then the vdi states will occupy 1GB of memory,
> >    meaning that a sync-up between two nodes will need to transfer 1GB of data. A
> >    node joining a 30-node cluster will need 30GB to sync up! At startup, a
> >    sync-up storm might also cause problems, since every node needs to sync up
> >    with every other node. This means we can support at most about 10,000 vdis.
> >    This is a really big scalability problem that keeps users away from sheepdog.
> The current implementation of syncing the vdi bitmap and vdi state is clearly
> inefficient. The amount of data can be reduced when a cluster is already in
> SD_STATUS_OK (simply copying it from an existing node to the newly joining
> one is enough). I'm already working on it.

This is just one case. How about when the cluster is not in status okay? For a
cluster large enough, the cluster would be in recovery all day long, with nodes
leaving and joining. This might be a problem only for a huge cluster, which we
might not care about for now. But a sync storm at startup would easily cause
problems: every node has partial state, so every node needs to sync with every
other node. How do you handle this case?

Actually, I'm more concerned about stability while the code keeps bloating.
Distributed algorithms are notoriously hard to implement *right*. Take
agreement between nodes as an example: it might need the Paxos protocol, which
is very complex and hard to implement correctly without bugs. If we write so
many algorithms (lock, inode coherence, family tree, etc.) in a truly
distributed manner, I'm afraid it would take years to mature or even, in the
worst case, we'll never see that day come.
