[sheepdog] Fw: Fixing method for data wipe bug in the recovery

张扬 3100100878 at zju.edu.cn
Tue Dec 2 15:25:30 CET 2014


Hi,


I believe the cause of this issue is the following:


1. Precondition: only gateway nodes are left in the cluster, so the gateway nodes hold the latest epoch. Because gateway nodes are not counted in nr_zones, nr_zones is 0 for that epoch.
2. The nodes in the cluster are restarted and start to recover from the latest epoch, in which nr_zones is 0. The function recover_object_from_replica then skips the whole recovery process (because nr_copies is 0) but still returns success (because ret was initialized to SD_RES_SUCCESS):
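static int recover_object_from_replica()
{
    int nr_copies, ret = SD_RES_SUCCESS, start = 0;
    ...
    /* nr_copies ends up 0 when old->nr_zones is 0 */
    nr_copies = get_obj_copy_number(oid, old->nr_zones);

    /* neither loop body runs when nr_copies == 0 */
    for (int i = 0; i < nr_copies; i++) {
        ...
    }

    for (int i = 0; i < nr_copies; i++) {
        ...
    }

    if (fully_replicated && ret != SD_RES_SUCCESS)
        ret = SD_RES_STALE_OBJ;

    /* still the initial SD_RES_SUCCESS if nothing was attempted */
    return ret;
}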
3. The node believes it fetched the latest object successfully and deletes the old version in .stale, although no object was actually recovered.
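
The control flow in the steps above can be reproduced in isolation. The following is only a minimal sketch, not sheepdog code: the RES_* values, fetch_replica() and recover_from_replicas() are hypothetical stand-ins, and the initial value of ret is passed as a parameter so both behaviors can be compared.

```c
/* Hypothetical stand-ins for sheepdog's result codes; values are illustrative. */
enum { RES_SUCCESS = 0, RES_NO_OBJ = 1 };

/* Hypothetical stub: pretend that reading any replica succeeds. */
static int fetch_replica(int idx)
{
    (void)idx;
    return RES_SUCCESS;
}

/*
 * Simplified model of the recovery loop: try each replica in turn and
 * stop at the first success. With nr_copies == 0 the loop body never
 * runs, so the function returns whatever ret was initialized to.
 */
static int recover_from_replicas(int nr_copies, int initial_ret)
{
    int ret = initial_ret;

    for (int i = 0; i < nr_copies; i++) {
        ret = fetch_replica(i);
        if (ret == RES_SUCCESS)
            break;
    }
    return ret;
}
```

With nr_copies == 0, an initial value of RES_SUCCESS reports a recovery that never happened, while any non-success initial value makes the empty case visible to the caller as a failure.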


In this case, the epoch to recover from should be the latest one that contains at least one node that is not in gateway mode.
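
As a rough illustration of that selection rule, here is a sketch under my own assumptions: struct epoch_info and latest_recoverable_epoch() are hypothetical and do not exist in sheepdog, and I assume epoch numbers start at 1 so that 0 can mean "no usable epoch".

```c
#include <stddef.h>

/* Hypothetical record of one epoch: its number and how many zones
 * (i.e. non-gateway nodes' zones) it contained. */
struct epoch_info {
    unsigned int epoch;
    int nr_zones;
};

/*
 * Return the highest epoch number whose nr_zones is greater than 0,
 * or 0 if no such epoch exists (assumes real epochs start at 1).
 * The log does not need to be sorted.
 */
static unsigned int latest_recoverable_epoch(const struct epoch_info *log,
                                             size_t n)
{
    unsigned int best = 0;

    for (size_t i = 0; i < n; i++) {
        if (log[i].nr_zones > 0 && log[i].epoch > best)
            best = log[i].epoch;
    }
    return best;
}
```

For a log with epochs 1 to 4 where only epochs 1 and 2 have nr_zones > 0, this picks epoch 2 instead of the gateway-only epoch 4.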


Is it possible to let gateway-mode nodes run without tracking epoch info? That way the latest epoch would always include a non-gateway node.


Thanks,
Yang




-----Original Message-----
From: "张扬" <3100100878 at zju.edu.cn>
Sent: 2014-11-29 17:44:31 (Saturday)
To: sheepdog at lists.wpkg.org
Cc:
Subject: Fixing method for data wipe bug in the recovery

Hi,


We recently encountered the bug that was reported here:
https://bugs.launchpad.net/sheepdog-project/+bug/1327037


I'm not sure whether this is the root cause, but by changing the initial value of ret in the function recover_object_from_replica (in recovery.c) from
ret = SD_RES_SUCCESS
to
ret = SD_RES_NO_OBJ (actually anything but SD_RES_SUCCESS), I'm able to avoid the loss of data.


The reason is that when the node tries to recover itself to match the newest epoch, nr_copies in recover_object_from_replica is zero, so the function returns ret == SD_RES_SUCCESS even though it skipped all the recovery work in the loops. I don't know why nr_copies is zero, though.


Thanks,
Yang