> > > For example:
> > >
> > > 1. There are two nodes, A and B.
> > > 2. Node C joins Sheepdog, and journal data is written on node C until
> > >    it finishes recovery.
> > > 3. If node D joins Sheepdog before node C finishes recovery, node D
> > >    reads actual data from nodes A and B, and journal data from node C.
> > >    At the same time, node C also needs to write journal data locally
> > >    to handle write requests.
> > > 4. If node E joins Sheepdog before nodes C and D finish recovery, node
> > >    E needs to read journal data from nodes C and D. Node E needs to
> > >    know which journal is newer to apply the journals in the correct order.
> >
> > The real problem is that Sheepdog changes the node mapping as soon as a
> > new node joins.
>
> Yes, so I suggested delaying recovery of objects which are not accessed by
> VMs to avoid redundant object moves. I thought that it looks much simpler
> than changing the recovery algorithm. Are there any problems with it?

Not really. It just seems more natural to me, because you can then
manually trigger a re-balance. You simply have better control.

> > For me, it seems safer to keep the current mapping until all new nodes
> > are in sync.
> >
> > One can implement that by tracking the node status together with the epoch.
> > A node can be DOWN, UP (but not synced), or UP_SYNCED.
> >
> > During writes, we consider 2 mappings: one using only UP_SYNCED nodes,
> > the second considering UP and UP_SYNCED nodes. We write to all those
> > nodes. For reads we only consider nodes in status UP.
> >
> > That would avoid the above error case?
>
> Maybe it would work, but it looks complicated to me. Doesn't it need many
> changes to the current code?

I am quite new to the project, so I can't really tell.

- Dietmar
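
To make the status-tracking idea above concrete, here is a minimal sketch in Python. It is not Sheepdog code. The names `Status`, `place`, `write_targets` and `read_targets` are hypothetical, and `place` is a simple hash ranking standing in for whatever placement Sheepdog really uses. One more assumption: reads here use only the UP_SYNCED mapping, on the reasoning that an unsynced node may not hold the object yet. That is an interpretation of the read rule, not something confirmed in the thread.

```python
import hashlib
from enum import Enum


class Status(Enum):
    DOWN = 0
    UP = 1          # joined, but recovery has not finished
    UP_SYNCED = 2   # joined and fully recovered


def place(oid, nodes, copies):
    """Hypothetical placement: rank nodes by a hash of (node, oid).

    Stand-in only; this is NOT Sheepdog's real mapping algorithm.
    """
    def score(node):
        return hashlib.sha1(f"{node}:{oid}".encode()).hexdigest()
    return sorted(nodes, key=score)[:copies]


def write_targets(oid, status, copies):
    """Write to the union of the old and the new mapping."""
    synced = [n for n, s in status.items() if s is Status.UP_SYNCED]
    joined = [n for n, s in status.items() if s is not Status.DOWN]
    old_map = place(oid, synced, copies)   # mapping before the joins
    new_map = place(oid, joined, copies)   # mapping once everyone is synced
    # dict.fromkeys removes duplicates while keeping order
    return list(dict.fromkeys(old_map + new_map))


def read_targets(oid, status, copies):
    """Assumption: read only from nodes that have finished recovery."""
    synced = [n for n, s in status.items() if s is Status.UP_SYNCED]
    return place(oid, synced, copies)


if __name__ == "__main__":
    status = {
        "A": Status.UP_SYNCED,
        "B": Status.UP_SYNCED,
        "C": Status.UP,      # still recovering
        "D": Status.DOWN,
    }
    print("read from:", read_targets("obj1", status, copies=2))
    print("write to: ", write_targets("obj1", status, copies=2))
```

With this scheme a joining node never serves reads until it is marked UP_SYNCED, yet it already receives every write that the new mapping would send to it, so no journal ordering between recovering nodes is needed. The cost is extra write traffic while nodes are recovering.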