At Wed, 25 Jul 2012 20:20:31 +0000, Dietmar Maurer wrote:
>
> > For example:
> >
> > 1. There are two nodes, A and B.
> > 2. Node C joins Sheepdog, and journal data is written on node C until
> >    it finishes recovery.
> > 3. If node D joins Sheepdog before node C finishes recovery, node D
> >    reads actual data from nodes A and B, and journal data from node C.
> >    At the same time, node C also needs to write journal data locally
> >    to handle write requests.
> > 4. If node E joins Sheepdog before nodes C and D finish recovery, node
> >    E needs to read journal data from nodes C and D. Node E needs to
> >    know which journal is newer to apply the journals in the correct order.
>
> The real problem is that Sheepdog changes the node mapping as soon
> as a new node joins.

Yes, that is why I suggested delaying the recovery of objects that are
not accessed by VMs, to avoid redundant object moves. I think that is
much simpler than changing the recovery algorithm. Are there any
problems with it?

> For me, it seems safer to keep the current
> mapping until all new nodes are in sync.
>
> One can implement that by tracking the node status together with the epoch.
> A node can be DOWN, UP (but not synced), or UP_SYNCED.
>
> During writes, we consider two mappings: one using only UP_SYNCED nodes, and
> a second considering both UP and UP_SYNCED nodes. We write to all of those
> nodes. For reads we only consider nodes in status UP.
>
> Would that avoid the above error case?

Maybe it would work, but it looks complicated to me. Wouldn't it
need many changes to the current code?

Thanks,

Kazutaka
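For illustration only, here is a minimal sketch of the node-status scheme Dietmar describes in the quoted text, written in Python rather than Sheepdog's C. Everything here is hypothetical: `place()` is a simple hash ring standing in for Sheepdog's real object-to-node mapping, and `read_targets()`/`write_targets()` are invented names. Writes go to the union of the UP_SYNCED-only mapping and the UP-plus-UP_SYNCED mapping. The quoted mail says reads consider nodes in status UP; this sketch instead assumes reads use the UP_SYNCED mapping, on the assumption that only fully synced nodes are guaranteed to hold current data.

```python
import hashlib
from enum import Enum


class NodeStatus(Enum):
    DOWN = 0
    UP = 1          # joined, but recovery not finished yet
    UP_SYNCED = 2   # joined and fully recovered


def _hash(key):
    # Stable 64-bit hash (hypothetical; not Sheepdog's actual hash function).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def place(oid, nodes, copies):
    """Pick up to `copies` distinct nodes for object `oid` on a simple hash ring."""
    ring = sorted(nodes, key=_hash)
    if not ring:
        return []
    h = _hash(str(oid))
    # First node clockwise from the object's hash, wrapping around to index 0.
    start = next((i for i, n in enumerate(ring) if _hash(n) >= h), 0)
    return [ring[(start + i) % len(ring)] for i in range(min(copies, len(ring)))]


def _nodes_with(status, allowed):
    # Names of nodes whose status is in the `allowed` set.
    return [n for n, s in status.items() if s in allowed]


def read_targets(oid, status, copies):
    # Assumption: read only from the mapping built over fully synced nodes.
    return place(oid, _nodes_with(status, {NodeStatus.UP_SYNCED}), copies)


def write_targets(oid, status, copies):
    old = place(oid, _nodes_with(status, {NodeStatus.UP_SYNCED}), copies)
    new = place(oid, _nodes_with(status, {NodeStatus.UP, NodeStatus.UP_SYNCED}), copies)
    # Write to the union of both mappings, keeping order and dropping duplicates.
    return list(dict.fromkeys(old + new))


# Example: A and B are synced, C is still recovering, D is down.
example_status = {
    "A": NodeStatus.UP_SYNCED,
    "B": NodeStatus.UP_SYNCED,
    "C": NodeStatus.UP,
    "D": NodeStatus.DOWN,
}
```

Because every write also lands on the old (synced) placement, a read served from that placement sees the latest data while new nodes are still recovering. The cost is a write fan-out of up to twice the number of copies during recovery.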