Hi,

I am writing a tool that gets the object distribution statistics for a specified image, like this:

    dev@taobao:~/sheepdog$ collie/collie vdi object tailai.ly --stat
    node                 number of objects
    192.168.0.1:7000     96
    192.168.0.2:7000     95
    192.168.0.3:7000     97
    ....

In the process, I found a bug in get_obj_list() that makes sheep abort when handling SD_OP_GET_OBJ_LIST.

I traced it and found that the culprit is 'buf', the buffer for the object list, which is allocated with zalloc() from sheep's heap. The metadata that glibc's malloc implementation keeps for 'buf' sometimes gets corrupted, and the following 'free(buf)' then fails with

    *** glibc detected *** sheep/sheep: double free or corruption (out)

or a similar error, and the sheep process terminates.

From my understanding of the code, get_obj_list() returns a list of *targeted* objects to the requester. The patch [sheep: remove object list file] changed its logic a bit: there is now a loop that iterates from epoch 1 to epoch n and merges all the objects it finds.

I am not sure which line of code overruns 'buf', but when I remove the for loop and return the object list from just the one targeted epoch, I no longer see the problem.

So my question is: what is the idea behind the for loop? An SD_OP_GET_OBJ_LIST request is served only when the node is active (it agrees on the epoch that the other nodes see), so the targeted epoch exists for sure when the request is served.

In my opinion, old objects in old epochs need to be cleaned up anyway. So why bother searching them and getting the list from them? I think they are simply stale hardlinks.

Thanks,
Yuan
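
P.S. For illustration only: I have not located the overrunning line, but a merge across epochs can only stay inside 'buf' if every merge step checks the remaining capacity first. Below is a minimal sketch of such a bounded merge. It is not sheep's code: merge_objlist() and its parameters are hypothetical names, and it assumes each per-epoch list is sorted and free of duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the actual sheep code: merge the sorted,
 * duplicate-free list 'src' into the sorted, duplicate-free list 'dst'.
 * 'dst' currently holds 'nr_dst' entries and has room for 'cap' entries.
 * Returns the new entry count, or -1 if the merged list would not fit,
 * in which case 'dst' is left unchanged.
 */
static int merge_objlist(uint64_t *dst, size_t nr_dst, size_t cap,
                         const uint64_t *src, size_t nr_src)
{
    size_t i = 0, j = 0, nr_new = 0, out;

    assert(nr_dst <= cap);

    /* First pass: count the entries of src that are missing from dst. */
    while (j < nr_src) {
        if (i < nr_dst && dst[i] < src[j]) {
            i++;
        } else if (i < nr_dst && dst[i] == src[j]) {
            i++;
            j++;
        } else {
            nr_new++;
            j++;
        }
    }

    /* Refuse to merge instead of writing past the end of dst. */
    if (nr_new > cap - nr_dst)
        return -1;

    /* Second pass: merge from the back so unread entries are never clobbered. */
    out = nr_dst + nr_new;
    i = nr_dst;
    j = nr_src;
    while (j > 0) {
        if (i > 0 && dst[i - 1] > src[j - 1]) {
            dst[--out] = dst[--i];
        } else if (i > 0 && dst[i - 1] == src[j - 1]) {
            dst[--out] = dst[--i];
            j--;
        } else {
            dst[--out] = src[--j];
        }
    }

    return (int)(nr_dst + nr_new);
}
```

With a check like this, a loop over epochs would get -1 back as soon as the accumulated list no longer fits, rather than corrupting the heap metadata next to the buffer.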