[Sheepdog] [PATCH] sheep: list local objects efficiently

Thu Sep 15 08:39:49 CEST 2011

On 09/15/2011 02:23 PM, Liu Yuan wrote:
> On 09/15/2011 12:23 PM, MORITA Kazutaka wrote:
>> This also fixes a buffer overflow problem which occurs when there are
>> many epochs.
>>
>> Signed-off-by: MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
>> ---
>> Hi Yuan,
>>
>> I guess this will fix the problem you pointed out.
>>
>> Thanks,
>>
>> Kazutaka
>>
>>
> Hi,
>     Actually, I had already changed it myself in my environment before
> running my code. I found out that the buffer overflow was caused by my
> collie client ('vdi object stat'), which sent a wrong data_length.
> Sorry for the noise.
>
>   That brings us to old-epoch handling in sheepdog. I'd like to
> continue our discussion and find the right way to deal with 'objects
> in the old epoch'.
>
>     Objects that stay in the old epoch directories really cause a
> problem in my environment, which has more than 100 epochs. I find that
> there are *redundant* objects stored on each node after a couple of
> node changes. For example, I have an image of about 700M in a 20G
> cluster, and after, say, 50 node changes (hence 50 epochs), some nodes
> in the cluster had no space left because of the object transfers in
> the recovery stage. I had to manually remove the old, stale objects in
> the old epochs to get the cluster working again.
>
>     So, in the first place, I think the objects in the old epochs
> should be processed, or even removed, at some later point. I'll leave
> that for future discussion.
>
>     Now get_obj_list() relies on the objects in the old epochs, and I
> don't think that is a good approach. Those objects might eventually be
> removed, and iterating over the whole epoch history is way too heavy
> (it takes a lot of time and is metadata-intensive).
>
>     We do this iteration to handle multiple node failures. I find that
> the current code deals with multiple node failures this way:
>      node change  event A
>                  |
>      start recovery A'
>                  |
>                 V               < -- node change event B
>                                             |
>                                 stop recovery A'
>                                             |
>                                 start recovery B'
>
>     On the assumption that we'll deal with the objects in the old
> epochs later (probably by removing them), we need to change the
> recovery logic. How about serializing the recovery process as follows:
>
>      node change  event A
>                  |
>      start recovery A'
>                  |
>                 V
>      store the requests < -- node change event B
>                  |
>      finished recovery A'
>                  |
>      start recovery B'
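
To make the serialized flow above concrete, here is a minimal sketch in
C. It is not sheep code: struct recovery_state, on_node_change(),
on_recovery_done() and MAX_PENDING are hypothetical names, and the queue
is a plain fixed-size ring buffer. A node change that arrives while a
recovery is running is stored, and its recovery only starts after the
running one has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64  /* hypothetical bound on queued node changes */

/* Hypothetical recovery state; none of these names come from sheep. */
struct recovery_state {
	bool     running;              /* is a recovery in progress? */
	uint32_t current_epoch;        /* epoch being recovered, if running */
	uint32_t pending[MAX_PENDING]; /* queued epochs, FIFO ring buffer */
	size_t   head, count;
};

/*
 * Node change event for `epoch`.
 * Returns true if its recovery starts right away, false if it was queued.
 */
static bool on_node_change(struct recovery_state *rs, uint32_t epoch)
{
	assert(rs != NULL);

	if (!rs->running) {
		rs->running = true;
		rs->current_epoch = epoch;
		return true;
	}

	/* A recovery is in flight: store the request instead of stopping it. */
	assert(rs->count < MAX_PENDING); /* sketch only: no overflow handling */
	rs->pending[(rs->head + rs->count) % MAX_PENDING] = epoch;
	rs->count++;
	return false;
}

/*
 * The running recovery has finished.
 * Returns true if a queued recovery was started, false if we went idle.
 */
static bool on_recovery_done(struct recovery_state *rs)
{
	assert(rs != NULL && rs->running);

	if (rs->count == 0) {
		rs->running = false;
		return false;
	}

	rs->current_epoch = rs->pending[rs->head];
	rs->head = (rs->head + 1) % MAX_PENDING;
	rs->count--;
	return true;
}
```

With a zero-initialized state, event A starts recovery A' immediately,
event B is queued while A' runs, and recovery B' starts when
on_recovery_done() is called for A'.
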
>
>     I noticed that SD_OP_READ_OBJ in recovery only tries to read from
> two epochs: tgt_epoch and tgt_epoch - 1. So getting the object lists
> for every epoch from 1 up to the current epoch would be unnecessary.
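
The two-epoch lookup described above can be sketched roughly like this.
It is only an illustration under assumptions: read_at_epoch_fn,
read_obj_for_recovery() and demo_read() are made-up names, and the real
SD_OP_READ_OBJ handling may differ. Since epochs start at 1, the sketch
uses 0 to mean "not found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reader callback: returns 0 if the object could be read
 * from the given epoch, non-zero otherwise.
 */
typedef int (*read_at_epoch_fn)(uint64_t oid, uint32_t epoch, void *ctx);

/*
 * Try tgt_epoch first, then tgt_epoch - 1, and nothing older.
 * Returns the epoch the object was read from, or 0 if neither has it.
 */
static uint32_t read_obj_for_recovery(uint64_t oid, uint32_t tgt_epoch,
				      read_at_epoch_fn read_fn, void *ctx)
{
	assert(read_fn != NULL);

	if (tgt_epoch == 0)
		return 0; /* epochs start at 1 */
	if (read_fn(oid, tgt_epoch, ctx) == 0)
		return tgt_epoch;
	if (tgt_epoch > 1 && read_fn(oid, tgt_epoch - 1, ctx) == 0)
		return tgt_epoch - 1;
	return 0;
}

/* Demo callback: ctx points to the only epoch that holds the object. */
static int demo_read(uint64_t oid, uint32_t epoch, void *ctx)
{
	(void)oid;
	return epoch == *(const uint32_t *)ctx ? 0 : -1;
}
```

If only these two epochs are ever consulted, an object list built from
the same two epochs would already cover everything recovery can read.
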
>
>      I'd like to cook up a patch for the recovery changes and clean up
> get_obj_list() once we reach agreement.
>
> Thanks,
> Yuan

     If get_obj_list() cannot be used to get the object list from just
the latest epoch, I'll have to write my own function that gets the
object list from the latest epoch, no more and no less, to implement the
'vdi object stat' command. That would be duplicated effort and looks
awkward.
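
For illustration, a "latest epoch only" listing could look roughly like
the sketch below. It assumes that each epoch has its own directory and
that object files are named by a 16-digit lowercase hex object ID; both
are assumptions about the on-disk layout, and list_epoch_objects() is a
hypothetical name, not an existing sheep function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * List the object IDs found in one epoch directory only.
 * Assumption: object files are named as exactly 16 lowercase hex
 * digits; every other entry (".", "..", temp files) is skipped.
 * Returns the number of IDs stored in `oids` (at most `max`),
 * or -1 if the directory cannot be opened.
 */
static int list_epoch_objects(const char *epoch_dir, uint64_t *oids, int max)
{
	DIR *dir;
	struct dirent *d;
	int n = 0;

	assert(epoch_dir != NULL && oids != NULL && max >= 0);

	dir = opendir(epoch_dir);
	if (!dir) {
		fprintf(stderr, "cannot open %s: %s\n", epoch_dir,
			strerror(errno));
		return -1;
	}

	while (n < max && (d = readdir(dir)) != NULL) {
		const char *name = d->d_name;

		if (strlen(name) != 16 ||
		    strspn(name, "0123456789abcdef") != 16)
			continue; /* not an object file name */

		/* 16 hex digits always fit in 64 bits, so no overflow. */
		oids[n++] = strtoull(name, NULL, 16);
	}

	closedir(dir);
	return n;
}
```

The caller would pass the directory of the latest epoch, so the cost
depends on the number of objects in that one directory and not on the
number of epochs.
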

Thanks,
Yuan