[sheepdog] [PATCH] recovery: notify completion only when all objects are fresh

MORITA Kazutaka morita.kazutaka at gmail.com
Sat Jun 1 21:09:17 CEST 2013


At Sat, 01 Jun 2013 02:19:05 +0800,
Liu Yuan wrote:
> 
> On 05/31/2013 10:10 PM, MORITA Kazutaka wrote:
> > At Fri, 31 May 2013 21:50:56 +0800,
> > Liu Yuan wrote:
> >>
> >> On 05/31/2013 08:55 PM, MORITA Kazutaka wrote:
> >>> To reduce the risk of data loss, we shouldn't remove stale objects if
> >>> there are some sheep that failed to recover objects.
> >>
> >> So once it is set to true, we'll never get a chance to purge stale
> >> objects?  This looks kind of unacceptable to me.
> >>
> >> I think we should stop notification for this very recovery only,
> >> and the next recovery should work as normal.
> > 
> > Once we have failed in object recovery, we cannot assure that none of
> > the objects in the working directory are stale, even if we succeed in
> > the next recovery.  For example,
> > 
> > - epoch 1, node [A, B]
> >   Node A has object o.
> > 
> > - epoch 2, node [A, B, C]
> >   Object o is moved to Node C, and o is updated to o'.
> > 
> > - epoch 3, node [A, B, C, D]
> >   Node D tries to recover the object o' from C but fails.  Then node D
> >   reads the stale object o from node A.  (node D is in safe mode)
> > 
> 
> Why did the recovery of o' from C fail?  So with multiple node events,
> won't stop_notify

E.g. a network timeout.  But in such a case the epoch would likely be
incremented, so this example was not a good one, sorry.  I've added a test
script with a better example at the end of this mail.

> easily turn into a false positive?

I don't think we can ignore the alarm even if it is a false positive in
most cases, and that's why I added printf messages with the SDOG_ALERT
priority (action must be taken immediately) in commit 235a21bd.  If
stop_notify can easily be set to true in a user's environment, we really
must reconsider the current recovery algorithm.
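
For example, an operator could watch sheep.log for those alert messages
with something like the sketch below.  This is only an illustration; the
log path and the keyword to match are assumptions about a particular
setup, not something this patch defines.

  #!/bin/bash
  # Hypothetical watcher: follow a sheep.log and flag ALERT-priority lines
  # so a recovery failure is noticed before anyone purges stale directories.
  # The default path below is just an assumption about where sheep.log lives.
  LOG=${1:-/var/lib/sheepdog/sheep.log}

  tail -F "$LOG" | while read -r line; do
      case "$line" in
      *ALERT*)
          # e.g. mail the administrator or page whoever operates the cluster
          echo "sheep reported a recovery problem: $line" >&2
          ;;
      esac
  done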

> 
> > - epoch 4, node [A, B, C, D, E]
> >   Node E reads the stale object o from node D.  After all the nodes
> >   finish object recovery, node C removes the latest object o' from its
> >   stale directory if we allow node D to notify recovery completion.
> >
> > I think there is no way to recover automatically from the safe_mode
> > state.  The risk of data loss is not acceptable to me.  As long as the
> > risk exists, we must not remove the stale directory.  In this case, the
> > user has to look into why it happened and restart the sheep daemon
> > after the problem is fixed.
> > 
> 
> Even if you don't remove stale objects, it is not easy for users to
> recover the *right* objects.  How can users tell which is the right
> one?  This

For manual recovery, we have to read sheep.log carefully and determine
which of the stale objects is the correct one.  I actually did this
several months ago on a user's environment.  At that time I could
recover the objects because their sheepdog had crashed and the stale
directories had not been cleaned up.  The cause of that crash is fixed
in the current sheepdog, so if they had been running the latest version
of sheepdog, I couldn't have fixed their environment.
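
To give a rough idea, the inspection I did back then looked something
like the sketch below.  The store layout ($STORE/<id>/obj with a .stale
subdirectory) and the stale file naming follow my test environment and
are assumptions; the final call on which copy is correct still comes
from reading sheep.log.

  #!/bin/bash
  # Hypothetical helper: list every copy of one object, including stale
  # ones, with checksum and mtime, so a human can decide which is latest.
  # Both the store layout and the stale file naming are assumptions here.
  STORE=${STORE:-/tmp/sheepdog}
  OID=$1                      # object id as it appears in the store
  [ -n "$OID" ] || { echo "usage: $0 <oid>" >&2; exit 1; }

  for dir in "$STORE"/*/obj; do
      for f in "$dir/$OID" "$dir"/.stale/"$OID"*; do
          [ -f "$f" ] || continue
          printf '%s  %s  %s\n' "$(md5sum < "$f" | cut -d' ' -f1)" \
                 "$(stat -c %y "$f")" "$f"
      done
  done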

What this patch tries to do is simply give that chance to users who have
deep knowledge of sheepdog object recovery.  If you think we shouldn't
include such a feature, it's okay for me to keep this change in my own
tree.  Perhaps what I should add instead is documentation about the risk
of data loss.

> will just throw users more problems than it solves.
> 
> Most users will simply use 'vdi check' to restore consistency.
> Expecting manual recovery is not feasible for ordinary users.  What we
> really need is, IMHO, to teach 'vdi check' to recover the latest objects
> as hard as possible.  By the way, with multiple copies, it seems very

I agree with that, but I think it is difficult to implement that feature
in the near future.
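
For now, the normal path really is 'vdi check' run over each vdi after a
node event.  The loop below is only a sketch; parsing the second field
of 'vdi list -r' for the vdi name is an assumption about its raw output
format.

  #!/bin/bash
  # Run 'vdi check' on every vdi; with a known vdi name you can of course
  # call it directly, as the test below does.
  # NOTE: taking the name from the second field of the raw listing is an
  # assumption about the output format.
  COLLIE=${COLLIE:-collie}

  $COLLIE vdi list -r | while read -r _flag name _rest; do
      $COLLIE vdi check "$name"
  done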

> unlikely to hit an inconsistent case for one object just because of
> recovery.  I'd like to see a test case that demonstrates a real case
> (not a single copy) and to consider solutions for it, not one targeted
> at an imaginary case.

Here is an example.  This test causes a lot of recovery failures, and it
sometimes fails when sheep cannot recover the latest object.  With my
patch, we can at least find the latest objects in the stale directories
in the worst case.


diff --git a/tests/065 b/tests/065
new file mode 100755
index 0000000..d8a212e
--- /dev/null
+++ b/tests/065
@@ -0,0 +1,57 @@
+#!/bin/bash
+
+# Test object recovery while nodes repeatedly join and leave during writes
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1        # failure is the default!
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_cleanup
+
+for i in `seq 0 1`; do
+    _start_sheep $i
+done
+
+_wait_for_sheep 2
+
+$COLLIE cluster format -c 2
+sleep 1
+
+$COLLIE vdi create test 100M -P
+
+for n in `seq 5`; do
+    for i in `seq 2 9`; do
+	_start_sheep $i
+	sleep 0.1
+    done
+
+    for i in `seq 2 9`; do
+	_kill_sheep $i
+	sleep 0.1
+    done
+done &
+
+pid=$!
+
tr '\0' '\377' < /dev/zero | $COLLIE vdi write test
+
+kill $pid 2>/dev/null
+wait $pid 2>/dev/null
+
+for i in `seq 2 9`; do
+    pgrep -f "$SHEEP_PROG $STORE/$i" > /dev/null
+    if [ $? != 0 ]; then
+	_start_sheep $i
+    fi
+done
+
+_wait_for_sheep 10
+
+$COLLIE vdi check test > /dev/null 2>&1
+$COLLIE vdi read test | hd
diff --git a/tests/065.out b/tests/065.out
new file mode 100644
index 0000000..8a4f80e
--- /dev/null
+++ b/tests/065.out
@@ -0,0 +1,5 @@
+QA output created by 065
+using backend plain store
+00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+*
+06400000
diff --git a/tests/group b/tests/group
index d8824e3..02011ff 100644
--- a/tests/group
+++ b/tests/group
@@ -78,3 +78,4 @@
 062 auto quick cluster md
 063 auto quick cluster md
 064 auto quick cluster md
+065 auto cluster md
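
To try it, run the new test from the tests directory; the invocation
below assumes the usual ./check harness treats 065 like the other
numbered tests.

  # assumes the standard tests/check harness used by the other numbered tests
  cd tests && ./check 065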


