On 09/15/2011 12:23 PM, MORITA Kazutaka wrote:
> This also fixes a buffer overflow problem which occurs when there are
> many epochs.
>
> Signed-off-by: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> ---
> Hi Yuan,
>
> I guess this will fix the problem you pointed out.
>
> Thanks,
>
> Kazutaka
>

Hi,

Actually, I had already changed it myself in my environment before running my code. I found out that the buffer overflow was caused by my collie client (vid object stat), which sent a wrong data_length. Sorry for the noise.

Since this brings us to sheepdog's handling of old epochs, I'd like to continue our discussion and find the right way to deal with 'objects in the old epoch'.

Objects that stay in the old epoch directories really cause a problem in my environment, which has more than 100 epochs. I find that there are *redundant* objects stored on each node after a couple of node changes. For example, I have an image of about 700M in a 20G cluster, and after, say, 50 node changes (hence 50 epochs), there was no space left on some of the nodes in the cluster because of the objects transferred during the recovery stage. I had to manually remove old and stale objects from the old epochs to get the cluster working again.

So, in the first place, I think the objects in the old epochs should be processed, or even removed, at some later point. I'll leave that for future discussion.

Right now get_obj_list() relies on the objects in the old epochs, which I don't think is a good approach. Those objects might eventually be removed, and iterating over all the historical epochs is way too heavy (it takes a long time and is metadata intensive). We do this iteration in order to handle multiple node failures.

I find that the current code deals with multiple node failures this way:

    node change event A
            |
    start recovery A'
            |
            V  <-- node change event B
            |
    stop recovery A'
            |
    start recovery B'

On the assumption that we'll deal with objects in the old epochs later (probably by removing them), we need to change the recovery logic. So how about serializing the recovery process, like the following:

    node change event A
            |
    start recovery A'
            |
            V  store the requests <-- node change event B
            |
    finished recovery A'
            |
    start recovery B'

I noticed that SD_OP_READ_OBJ in recovery only tries to read two epochs: tgt_epoch and tgt_epoch - 1. So getting the object list from every epoch, from 1 up to the current epoch, would be unnecessary.

I'd like to cook up a patch for the recovery stuff and clean up get_obj_list() once we reach an agreement.

Thanks,
Yuan
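
P.S. To make the proposal a bit more concrete, here is a rough sketch of the serialized recovery idea. It is a toy model in Python rather than sheepdog's C, and every name in it (SerializedRecovery, on_node_change, on_recovery_done, epochs_to_read) is made up for illustration; none of them are existing sheepdog functions. It only shows the ordering: a node change that arrives while a recovery is running is stored instead of stopping that recovery, and a recovery read only looks at tgt_epoch and tgt_epoch - 1.

```python
from collections import deque


class SerializedRecovery:
    """Toy model: one recovery at a time, later node changes wait in a queue."""

    def __init__(self):
        self.running = None     # epoch currently being recovered, or None
        self.pending = deque()  # epochs whose recovery has been deferred
        self.log = []           # trace of what happened, for inspection

    def on_node_change(self, epoch):
        if self.running is None:
            self._start(epoch)
        else:
            # Do not stop the running recovery; store the request instead.
            self.pending.append(epoch)
            self.log.append(f"store {epoch}")

    def on_recovery_done(self):
        if self.running is None:
            return  # nothing was running, nothing to finish
        self.log.append(f"finish {self.running}")
        self.running = None
        if self.pending:
            # Start the oldest stored request next (FIFO order).
            self._start(self.pending.popleft())

    def _start(self, epoch):
        self.running = epoch
        self.log.append(f"start {epoch}")


def epochs_to_read(tgt_epoch):
    """Epochs a recovery read would try: tgt_epoch, then tgt_epoch - 1.

    Assumes epochs are numbered from 1, so epoch 0 is never returned.
    """
    return [e for e in (tgt_epoch, tgt_epoch - 1) if e >= 1]
```

With events A (epoch 2) and B (epoch 3), calling on_node_change(2), on_node_change(3) and then on_recovery_done() twice gives the order from the second diagram: recovery for epoch 2 starts, the request for epoch 3 is stored, epoch 2 finishes, and only then does epoch 3 start.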