[Sheepdog] [PATCH] sheep: list local objects efficiently

Thu Sep 15 08:23:50 CEST 2011

On 09/15/2011 12:23 PM, MORITA Kazutaka wrote:
> This also fixes a buffer overflow problem which occurs when there are
> many epochs.
>
> Signed-off-by: MORITA Kazutaka<morita.kazutaka at lab.ntt.co.jp>
> ---
> Hi Yuan,
>
> I guess this will fix the problem you pointed out.
>
> Thanks,
>
> Kazutaka
>
>
Hi,
     Actually, I had already made this change myself in my environment
before running my code. I found out that the buffer overflow was caused
by my collie client (vid object stat), which sent a wrong data_length.
Sorry for the noise.

     Since this brings us to old-epoch handling in sheepdog, I'd like to
continue our discussion and find the right way to deal with 'objects in
the old epoch'.

     Objects that stay in the old epoch directories really cause a
problem in my environment, which has more than 100 epochs. I find that
*redundant* objects are stored on each node after a couple of node
changes. For example, I have an image of about 700 MB in a 20 GB
cluster, and after, say, 50 node changes (hence 50 epochs), some of the
nodes in the cluster had no space left because of object transfers in
the recovery stage. I had to manually remove old and stale objects in
the old epochs to get the cluster working again.

     So, in the first place, I think those objects in the old epochs
should be processed, or even removed, some time later. I'll leave that
for future discussion.

     Now, get_obj_list() relies on the objects in the old epochs, and I
don't think that is a good approach. Those objects might eventually be
removed, and iterating over all the history epochs is far too heavy (it
takes a lot of time and is metadata intensive).

     We do this iteration to handle multiple node failures. As far as I
can see, the current code deals with multiple node failures this way:
      node change event A
                |
      start recovery A'
                |
                V
                  <-- node change event B
                                |
                        stop recovery A'
                                |
                        start recovery B'

     On the assumption that we'll deal with objects in the old epochs
later (probably by removing them), we need to change the recovery
logic. How about serializing the recovery process as follows:

      node change event A
                |
      start recovery A'
                |
                V
      store the requests <-- node change event B
                |
      finished recovery A'
                |
      start recovery B'
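
     To make the serialized flow concrete, here is a minimal sketch in
C. It is not taken from the sheep code: struct recovery_state,
on_node_change(), on_recovery_done() and MAX_PENDING are hypothetical
names used only for illustration, and the fixed-size queue is an
assumption. A node change that arrives while a recovery is running is
stored instead of stopping the running recovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16  /* hypothetical bound on queued node change events */

/* Hypothetical bookkeeping for one node; not sheep's real data structure. */
struct recovery_state {
	bool     running;              /* is a recovery in progress? */
	uint32_t cur_epoch;            /* epoch currently being recovered */
	uint32_t pending[MAX_PENDING]; /* FIFO ring of epochs waiting their turn */
	size_t   head, count;          /* ring start index and fill level */
};

/*
 * Node change event: start recovery at once if idle, otherwise store the
 * request so the running recovery can finish first.
 * Returns false only when the pending queue is full.
 */
static bool on_node_change(struct recovery_state *rs, uint32_t epoch)
{
	if (!rs->running) {
		rs->running = true;
		rs->cur_epoch = epoch;
		return true;
	}
	if (rs->count == MAX_PENDING)
		return false;
	rs->pending[(rs->head + rs->count) % MAX_PENDING] = epoch;
	rs->count++;
	return true;
}

/* The running recovery finished: start the oldest stored request, if any. */
static void on_recovery_done(struct recovery_state *rs)
{
	assert(rs->running);
	if (rs->count == 0) {
		rs->running = false;
		return;
	}
	rs->cur_epoch = rs->pending[rs->head];
	rs->head = (rs->head + 1) % MAX_PENDING;
	rs->count--;
}
```

     With this shape, event B during recovery A' only appends to the
queue, and recovery B' starts from on_recovery_done(). What to do when
the queue is full, and whether several stored events could be merged
into one recovery, is left open here.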

     I noticed that SD_OP_READ_OBJ in recovery only tries to read two
epochs: tgt_epoch and tgt_epoch - 1. So getting the whole object list
from epoch 1 to the current epoch would be unnecessary.
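
     As a rough illustration of that bound, the sketch below lists the
epochs a recovery read would need to consult. epochs_to_scan() is a
hypothetical helper, not an existing sheep function, and it assumes
epochs are numbered from 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill out[] with the epochs to consult, newest first: tgt_epoch, then
 * tgt_epoch - 1 when that epoch exists. Returns the number of entries
 * written (0, 1 or 2). Assumes epoch numbering starts at 1.
 */
static size_t epochs_to_scan(uint32_t tgt_epoch, uint32_t out[2])
{
	size_t n = 0;

	assert(out != NULL);
	if (tgt_epoch == 0)
		return 0;                 /* no valid epoch yet */
	out[n++] = tgt_epoch;
	if (tgt_epoch > 1)
		out[n++] = tgt_epoch - 1; /* previous epoch, if there is one */
	return n;
}
```

     With 100 epochs this means looking at two epoch directories
instead of 100, and the cost no longer grows with the number of node
changes.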

     I'd like to cook up a patch for the recovery code and clean up
get_obj_list() once we reach agreement.

Thanks,
Yuan