[Sheepdog] Question about get_obj_list()
Liu Yuan
namei.unix at gmail.com
Wed Sep 14 08:32:39 CEST 2011
Hi,
I am writing something that gets object distribution statistics for a
specified image, like
dev at taobao:~/sheepdog$ collie/collie vdi object tailai.ly --stat
node                number of objects
192.168.0.1:7000    96
192.168.0.2:7000    95
192.168.0.3:7000    97
....
In the process, I found a bug in get_obj_list() that results in sheep
aborting when handling SD_OP_GET_OBJ_LIST. I traced it and found that the
culprit is 'buf', the buffer for the object list, which is zalloc'ed from
sheep's heap. The problem is that the metadata that glibc's malloc
implementation keeps for 'buf' sometimes gets corrupted, and the
following 'free(buf)' then causes
*** glibc detected *** sheep/sheep: double free or corruption (out)
or a similar error, and the sheep process terminates.
From my understanding of the code, get_obj_list() returns a list of
*targeted* objects to the requester. The patch [sheep: remove object list
file] changed its logic a bit, and there is now a loop that iterates from
epoch 1 to epoch n and merges all the objects it finds.
I am not sure which line of code overruns 'buf', but when I remove the
for loop and just return the object list from the one targeted epoch, I
no longer see the problem.
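To illustrate the kind of bound I would expect on such a merge, here is a
minimal sketch. It is not the actual get_obj_list() code: merge_oid(),
merge_epochs() and the in-memory 'epochs'/'counts' arrays are names made up
for illustration, and it assumes the merged list is kept as a sorted array
of 64-bit object IDs in a buffer of fixed capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: insert 'oid' into the sorted array 'list'
 * ('nr' entries used, room for 'cap') unless it is already present.
 * Returns the new entry count, or -1 if the buffer is full.
 */
static int merge_oid(uint64_t *list, int nr, int cap, uint64_t oid)
{
	int lo = 0, hi = nr;

	assert(list != NULL && nr >= 0 && nr <= cap);

	/* Binary search for the insertion point. */
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (list[mid] < oid)
			lo = mid + 1;
		else
			hi = mid;
	}

	if (lo < nr && list[lo] == oid)
		return nr;	/* already merged from an earlier epoch */
	if (nr >= cap)
		return -1;	/* one more entry would overrun the buffer */

	/* Shift the tail up by one slot and store the new ID. */
	memmove(&list[lo + 1], &list[lo], (size_t)(nr - lo) * sizeof(*list));
	list[lo] = oid;
	return nr + 1;
}

/*
 * Hypothetical driver: merge the per-epoch ID arrays into 'buf'.
 * 'epochs[e]' and 'counts[e]' stand in for whatever scanning one
 * epoch yields. Returns the merged count, or -1 if 'buf' is too small.
 */
static int merge_epochs(uint64_t *buf, int cap,
			const uint64_t *const *epochs, const int *counts,
			int nr_epochs)
{
	int nr = 0;

	for (int e = 0; e < nr_epochs; e++) {
		for (int i = 0; i < counts[e]; i++) {
			nr = merge_oid(buf, nr, cap, epochs[e][i]);
			if (nr < 0)
				return -1;
		}
	}
	return nr;
}
```

With a check like the 'nr >= cap' test, a merge across many epochs that
finds more distinct objects than 'buf' can hold fails cleanly instead of
writing past the end of the allocation.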
So my question is: what is the idea behind the for loop? An
SD_OP_GET_OBJ_LIST request is served when the node is active (it agrees
on the epoch that the other nodes see), so the targeted epoch exists for
sure when serving the request. In my opinion, old objects in old epochs
need to be cleaned up anyway. So why bother searching them and getting
the list from them? I think they are simply stale hard links.
Thanks,
Yuan