[Sheepdog] [PATCH] sheep: list local objects efficiently

Fri Sep 16 06:09:47 CEST 2011

At Thu, 15 Sep 2011 14:23:50 +0800,
Liu Yuan wrote:
> 
>      Objects that stay in the old epoch directories really cause a 
> problem in my environment, which has more than 100 epochs. I find that 
> there are *redundant* objects stored on each node after a couple of 
> node changes. For example, I have an image of about 700M in a 20G 
> cluster, and after, say, 50 node changes (hence 50 epochs), I found 
> that there was no space left on some of the nodes in the cluster due to 
> object transfers in the recovery stage. I had to manually remove old 
> and stale objects in the old epochs to get the cluster working again.
> 
>      So in the first place, I think those objects in the old epochs 
> should be processed or even removed some time later. I'll leave 
> that for future discussion.

Yes, this is an important issue we need to solve.

> 
>      Now get_obj_list() relies on the objects in the old epochs, and I 
> don't think that is a good approach. Those objects might eventually be 
> removed, and iterating over all the history epochs is way too heavy (it 
> takes much time and is metadata intensive).

To be more precise, we don't always need the whole history.  For
example, after all the other nodes finish object recovery, we can
remove old epoch directories safely.  The old epochs are needed only
when the newer epochs don't have a correct object list due to
multiple node failures.

A problem with this approach is how to notify other nodes that they can
remove old epochs.  I think using a corosync multicast is a simple
way.
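
As a rough illustration of the bookkeeping this would need, here is a
minimal sketch in Python (not sheep's C code).  The class and method
names are hypothetical, nothing here is an existing Sheepdog or
corosync API, and it assumes a fixed member list, which a real cluster
would not have.  Each node records the "recovery done" notices it
receives and treats epochs older than E as removable only once every
member has reported finishing recovery for E.

```python
# Hypothetical sketch: RecoveryTracker and its methods are illustrative
# names, not part of Sheepdog. Membership is assumed fixed for simplicity.
class RecoveryTracker:
    def __init__(self, members):
        self.members = frozenset(members)
        # epoch -> set of nodes that reported finishing recovery for it
        self.finished = {}

    def notify_done(self, node, epoch):
        """Record a completion notice (e.g. one delivered by multicast)."""
        self.finished.setdefault(epoch, set()).add(node)

    def removable_before(self):
        """Return the newest epoch E that every member has finished
        recovering, or None. Directories of epochs older than E would
        then be candidates for removal."""
        complete = [epoch for epoch, nodes in self.finished.items()
                    if self.members <= nodes]
        return max(complete, default=None)
```

With three members, the result stays None until the third notice for
the same epoch arrives.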

> 
>      So we do this iteration to handle multiple node failures. I find 
> that the current code deals with multiple node failures this way:
>       node change event A
>                  |
>       start recovery A'
>                  |
>                  V
>                  < -- node change event B
>                                  |
>                          stop recovery A'
>                                  |
>                          start recovery B'
> 
>      On the assumption that we'll deal with objects in the old epochs 
> later (probably remove them), we need to change the recovery logic, so 
> how about serializing the recovery process like the following:
> 
>       node change  event A
>                   |
>       start recovery A'
>                   |
>                  V
>       store the requests< -- node change event B

What is actually done in "store the requests"?  Can this approach
handle failures of three or more nodes?

>                   |
>       finished recovery A'
>                   |
>       start recovery B'
> 
>      I noted that SD_OP_READ_OBJ in recovery just tries to read two 
> epochs: tgt_epoch and tgt_epoch - 1. So getting the whole object list 
> from epoch 1 to the current epoch would be unnecessary.

A recovery process recovers objects from older epochs when the
objects are not found in (tgt_epoch - 1).
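
The fallback can be pictured with a small Python sketch (again not the
actual sheep code).  The function name and the dictionary standing in
for the per-epoch object directories are assumptions made for
illustration only.  The search starts at tgt_epoch - 1 and walks back
toward epoch 1 until the object is found.

```python
# Hypothetical sketch: find_object_epoch and objects_by_epoch are
# illustrative names; the dict stands in for per-epoch directories.
def find_object_epoch(oid, tgt_epoch, objects_by_epoch):
    """Return the newest epoch older than tgt_epoch that holds oid,
    or None if no older epoch has it."""
    for epoch in range(tgt_epoch - 1, 0, -1):
        if oid in objects_by_epoch.get(epoch, ()):
            return epoch
    return None
```

This is why the older directories still matter in the sketch: an object
missing from (tgt_epoch - 1) is only recoverable if an earlier epoch
still holds it.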

Thanks,

Kazutaka

> 
>       I'd like to cook up a patch for the recovery stuff and clean up 
> get_obj_list() once we reach agreement.
> 
> Thanks,
> Yuan