[sheepdog] [PATCH] sheep: fix oid scheduling in recovery

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Tue Jun 5 05:40:09 CEST 2012


At Mon, 04 Jun 2012 20:06:54 +0800,
Liu Yuan wrote:
> 
> On 06/04/2012 06:00 PM, Christoph Hellwig wrote:
> 
> > I think the right fix is to simply give each recover_object_work() call
> > its own work_struct in a structure also containing the oid.  While
> > this means a memory allocation per object to be recovered, it also means
> > complete independence between recovery operations, including kicking off
> > ones that have I/O pending ASAP and allowing multiple recoveries in
> > parallel.  I'm about to leave for a long haul flight and will try to
> > implement this solution while I'm on the plane.
> 
> 
> Parallel recovery looks attractive.  Per-object allocation is pretty big
> under the current scheme (every object in the list is considered for
> recovery), but we actually only need to recover the objects that are
> supposed to be migrated in (a much smaller set), so we only need to
> allocate memory for those objects.
> 
> But if we really want a better recovery process, we need to take the
> bigger picture into consideration:
> 
> 1: the biggest overhead is actually prepare_object_list(), which tries
> to connect to *all* the nodes in the cluster.
> 
> 2: the number of objects to be recovered from other nodes is quite small
> for already joined nodes, but quite big for a newly joining node.
> 
> Actually, I have long been thinking of a *reverse* recovery process:
> each node scans its back-end store and actively sends copies to the
> target nodes that need to recover them.  This would be much more
> scalable and efficient.
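
(For reference, my understanding of Christoph's per-object idea is
something like the untested sketch below; apart from
recover_object_work(), every name here is a placeholder rather than
actual sheep code.)

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* placeholder for sheep's generic work item */
struct work {
	void (*fn)(struct work *);
	void (*done)(struct work *);
};

/* one allocation per object to recover: the work item plus its oid */
struct object_recovery_work {
	struct work work;
	uint64_t oid;
};

static void recover_object_work(struct work *w)
{
	struct object_recovery_work *row =
		container_of(w, struct object_recovery_work, work);

	/* pull row->oid from a remote node here */
	(void)row;
}

static void recover_object_done(struct work *w)
{
	/* each recovery completes independently; free its own state */
	free(container_of(w, struct object_recovery_work, work));
}

/* queue_fn stands in for the real work-queue entry point */
static void queue_object_recovery(uint64_t oid,
				  void (*queue_fn)(struct work *))
{
	struct object_recovery_work *row = malloc(sizeof(*row));

	if (!row)
		return;
	row->oid = oid;
	row->work.fn = recover_object_work;
	row->work.done = recover_object_done;
	queue_fn(&row->work);
}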

Currently, SD_OP_GET_OBJ_LIST requests the whole list of object ids,
even though most of them are discarded in screen_obj_list().  Can we
optimize that and reduce the overhead dramatically?  Changing the
recovery algorithm from pull-style to push-style would be quite a big
change, and I'd like to avoid it if possible.
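
If we keep the pull-style algorithm, one possible optimization is to
filter the object list on the sending side instead of throwing most of
it away on the receiver.  A rough, untested sketch (oid_belongs_to_node()
and filter_obj_list() are hypothetical helpers standing in for the
membership test screen_obj_list() does today):

#include <stdbool.h>
#include <stdint.h>

/* placeholder for the placement check screen_obj_list() applies now */
static bool oid_belongs_to_node(uint64_t oid, int requester_idx)
{
	return (int)(oid % 8) == requester_idx;
}

/* copy only the oids the requester will actually keep, so the
 * SD_OP_GET_OBJ_LIST reply shrinks to the objects that matter */
static int filter_obj_list(const uint64_t *all, int nr_all,
			   uint64_t *out, int requester_idx)
{
	int nr = 0;

	for (int i = 0; i < nr_all; i++)
		if (oid_belongs_to_node(all[i], requester_idx))
			out[nr++] = all[i];

	return nr;
}

That would shrink both the reply payload and the work discarded later,
without touching the overall pull-style recovery flow.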

Thanks,

Kazutaka


