At Fri, 31 May 2013 21:50:56 +0800,
Liu Yuan wrote:
>
> On 05/31/2013 08:55 PM, MORITA Kazutaka wrote:
> > To reduce the risk of data loss, we shouldn't remove stale objects if
> > there are some sheeps who failed to recover objects.
>
> So once it was set true, we'll never get a chance to purge stale
> objects? This looks kind of unacceptable to me.
>
> I think we should only stop notification for this very recovery only.
> And next recovery should work as normal.

Once object recovery has failed, we cannot assure that all the objects
in the working directory are not stale, even if the next recovery
succeeds. For example,

 - epoch 1, node [A, B]
   Node A has object o.

 - epoch 2, node [A, B, C]
   Object o is moved to node C, and o is updated to o'.

 - epoch 3, node [A, B, C, D]
   Node D tries to recover the object o' from C but fails. Then node C
   reads the object o (stale) from the node A. (Node D is in safe mode.)

 - epoch 4, node [A, B, C, D, E]
   Node E reads the object o (stale) from node D.

After all the nodes finish object recovery, node C removes the latest
object o' in the stale directory if we allow node D to notify recovery
completion.

I think there is no way to recover automatically from the safe_mode
state. The risk of data loss is not acceptable to me. As long as the
risk exists, we must not remove the stale directory. In this case, the
user has to look into why it happens and restart the sheep daemon after
the problem is fixed.

Thanks,

Kazutaka
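
For illustration only, here is a minimal Python sketch of the sticky
behaviour argued for above. It is not sheepdog code: NodeRecovery,
finish_epoch and may_purge_stale are hypothetical names, and the model
assumes that stale objects are purged only when every node has notified
recovery completion.

```python
class NodeRecovery:
    """Hypothetical per-node recovery state; not sheepdog's real data structure."""

    def __init__(self, name):
        self.name = name
        self.safe_mode = False  # sticky: never cleared while the daemon runs

    def finish_epoch(self, recovered_ok):
        """Record one recovery round; return True if completion may be notified."""
        if not recovered_ok:
            self.safe_mode = True
        return not self.safe_mode


def may_purge_stale(notified):
    """Assumed rule: purge stale objects only if every node notified completion."""
    return all(notified.values())


nodes = {n: NodeRecovery(n) for n in "ABCD"}

# Epoch 3: node D fails to recover o' from C and enters safe mode.
epoch3 = {n: nodes[n].finish_epoch(n != "D") for n in nodes}

# Epoch 4: node E joins and every recovery succeeds, including on D.
nodes["E"] = NodeRecovery("E")
epoch4 = {n: node.finish_epoch(True) for n, node in nodes.items()}
```

In this model node D still withholds its notification in epoch 4, so
may_purge_stale(epoch4) stays false and o' survives in the stale
directory. Restarting the daemon after the problem is fixed corresponds
to replacing nodes["D"] with a fresh NodeRecovery("D").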