[Sheepdog] [PATCH] sheep: list local objects efficently
Liu Yuan
namei.unix at gmail.com
Thu Sep 15 08:23:50 CEST 2011
On 09/15/2011 12:23 PM, MORITA Kazutaka wrote:
> This also fixes a buffer overflow problem which occurs when there are
> many epochs.
>
> Signed-off-by: MORITA Kazutaka<morita.kazutaka at lab.ntt.co.jp>
> ---
> Hi Yuan,
>
> I guess this will fix the problem you pointed out.
>
> Thanks,
>
> Kazutaka
>
>
Hi,
I had actually already made that change myself in my environment before
running my code. It turned out that the buffer overflow was caused by my
collie client (vid object stat), which sent a wrong data_length. Sorry for
the noise.
Since this brings us to old epoch handling in sheepdog, I'd like to
continue our discussion and find the right way to deal with 'objects in
the old epoch'.
Objects that stay in the old epoch directories really cause a problem
in my environment, which has more than 100 epochs. I find that
*redundant* objects are stored on each node after a couple of node
changes. For example, I have an image of about 700 MB in a 20 GB cluster,
and after, say, 50 node changes (hence 50 epochs), some nodes in the
cluster have no space left because of object transfers in the recovery
stage. I have to manually remove old and stale objects in the old epochs
to get the cluster working again.
So, in the first place, I think those objects in the old epochs should
be processed or even removed at some later time. I'll leave that for
future discussion.
Right now get_obj_list() relies on the objects in the old epochs, and I
don't think that is a good approach. Those objects might eventually be
removed, and iterating over all the history epochs is far too heavy (it
takes a lot of time and is metadata intensive).
We do this iteration to handle multiple node failures. I find that the
current code deals with multiple node failures this way:
   node change event A
            |
    start recovery A'
            |
            V
            <-- node change event B
            |
    stop recovery A'
            |
    start recovery B'
On the assumption that we'll deal with objects in the old epochs later
(probably by removing them), we need to change the recovery logic. How
about serializing the recovery process as follows:
   node change event A
            |
    start recovery A'
            |
            V
 store the requests <-- node change event B
            |
   finish recovery A'
            |
    start recovery B'
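To make the serialized scheme concrete, here is a minimal sketch of the
bookkeeping it would need. It is not taken from the sheep source: the
names recovery_queue, on_node_change(), on_recovery_done() and
MAX_PENDING are all hypothetical, and it assumes epoch numbers start at 1
so that 0 can mean "nothing to start".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 64 /* hypothetical bound on deferred node change events */

/* Hypothetical state, not Sheepdog's actual recovery structures. */
struct recovery_queue {
	bool running;                  /* is a recovery currently in progress? */
	uint32_t current;              /* epoch being recovered, valid if running */
	uint32_t pending[MAX_PENDING]; /* FIFO of epochs waiting for recovery */
	int head, count;
};

/*
 * Called on a node change event.  Returns the epoch whose recovery
 * starts right away, or 0 if the request was stored for later.
 */
static uint32_t on_node_change(struct recovery_queue *q, uint32_t epoch)
{
	if (!q->running) {
		q->running = true;
		q->current = epoch;
		return epoch;
	}
	/* A recovery is running: store the request instead of stopping it. */
	assert(q->count < MAX_PENDING);
	q->pending[(q->head + q->count) % MAX_PENDING] = epoch;
	q->count++;
	return 0;
}

/*
 * Called when the current recovery has finished.  Returns the next
 * epoch to recover, or 0 if nothing is pending.
 */
static uint32_t on_recovery_done(struct recovery_queue *q)
{
	if (q->count == 0) {
		q->running = false;
		return 0;
	}
	q->current = q->pending[q->head];
	q->head = (q->head + 1) % MAX_PENDING;
	q->count--;
	return q->current;
}
```

With this, event B arriving during recovery A' is only queued, and
recovery B' starts from on_recovery_done() once A' has completed.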
I noticed that SD_OP_READ_OBJ in recovery only tries to read two epochs:
tgt_epoch and tgt_epoch - 1. So getting the whole object list from epoch 1
to the current epoch would be unnecessary.
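As an illustration of that two-epoch lookup, here is a rough sketch. It
is not the actual SD_OP_READ_OBJ code path: find_recovery_source(), the
obj_exists_fn callback and the demo_exists() stand-in store are all
hypothetical, and it again assumes that epochs start at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: does object 'oid' exist in the given epoch? */
typedef bool (*obj_exists_fn)(uint32_t epoch, uint64_t oid);

/*
 * Look for the object in tgt_epoch first, then in tgt_epoch - 1.
 * Returns the epoch it was found in, or 0 if it is in neither.
 * Older epochs are never consulted.
 */
static uint32_t find_recovery_source(uint32_t tgt_epoch, uint64_t oid,
				     obj_exists_fn exists)
{
	assert(exists != NULL);
	if (exists(tgt_epoch, oid))
		return tgt_epoch;
	if (tgt_epoch > 1 && exists(tgt_epoch - 1, oid))
		return tgt_epoch - 1;
	return 0;
}

/* Stand-in object store for illustration: object 42 lives only in epoch 3. */
static bool demo_exists(uint32_t epoch, uint64_t oid)
{
	return epoch == 3 && oid == 42;
}
```

If recovery only ever looks this far back, an object list covering
tgt_epoch and tgt_epoch - 1 would be enough for it.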
I'd like to cook up a patch for the recovery logic and clean up
get_obj_list() once we reach agreement.
Thanks,
Yuan