[sheepdog] [PATCH] sheep: fix a dead-lock bug in screen_object_list()
levin li
levin108 at gmail.com
Mon Nov 26 06:48:50 CET 2012
From: levin li <xingke.lwp at taobao.com>
The bug is introduced in the patch 97ccd87ea15e606b6ec9fecb54f5de453f9c5c1f
which causes the cluster hangs in recovery, this patch is like a revert patch
except that I add comment to explain why we should not ignore the objects
whose nr_objs are zero in screen_object_list().
screen_object_list() is called in recovery, if a VDI creation is in progress,
then the VDI creation operation is blocked by gateway to wait for completion of
recovery, and if we make screen_object_list() to retry infinitely to wait the
nr_objs to be not zero, which also means it waits for VDI creation to complete,
so, both VDI creation and recovery can not finish, it's a dead-lock.
As a fact, we don't need to recovery the objects whose nr_objs are zero, these
objects can only be VDI objects which has not been created completely because
of a recovery, after the recovery finishes, gateway would retry to create the
objects and put it into correct location.
Signed-off-by: levin li <xingke.lwp at taobao.com>
---
sheep/recovery.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 9c5c712..bb71cba 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -539,17 +539,18 @@ static void screen_object_list(struct recovery_work *rw,
int i, j;
for (i = 0; i < nr_oids; i++) {
-again:
nr_objs = get_obj_copy_number(oids[i], rw->cur_vinfo->nr_zones);
if (!nr_objs) {
dprintf("can not find copy number for object %" PRIx64
"\n", oids[i]);
- dprintf("probably, vdi was created but "
- "post_cluster_new_vdi() is not called yet\n");
- /* FIXME: can we wait for post_cluster_new_vdi
- * with a better way? */
- sleep(1);
- goto again;
+ /*
+ * If nr_objs is zero, then the object is a VDI object,
+ * and the creation of the VDI is in progress, then we
+ * don't need to recover this object, as after recovery,
+ * the gateway would retry to complete the creation by
+ * putting the objects of the VDI into corrent location.
+ */
+ continue;
}
oid_to_vnodes(rw->cur_vinfo->vnodes, rw->cur_vinfo->nr_vnodes,
oids[i], nr_objs, vnodes);
--
1.7.9.5
More information about the sheepdog
mailing list