At Mon, 04 Jun 2012 20:06:54 +0800, Liu Yuan wrote:
> 
> On 06/04/2012 06:00 PM, Christoph Hellwig wrote:
> > I think the right fix is to simply give each recover_object_work()
> > call its own work_struct in a structure also containing the oid.
> > While this means a memory allocation per object to be recovered, it
> > also means complete independence between recovery operations,
> > including kicking off ones that have I/O pending ASAP and allowing
> > multiple recoveries in parallel.  I'm about to leave for a long haul
> > flight and will try to implement this solution while I'm on the
> > plane.
> 
> Parallel recovery looks attractive.  The per-object allocation is
> pretty big for the current scheme (every object in the list is
> considered to be recovered), but actually we only need to recover
> those that are supposed to be migrated in (a much smaller set), so we
> only need to allocate memory for those objects.
> 
> But if we really want a better recovery process, we need to take the
> bigger picture into consideration:
> 
> 1: the biggest overhead is actually prepare_object_list(), which
> tries to connect to *all* the nodes in the cluster.
> 
> 2: the number of objects to be recovered from other nodes is quite
> small for already joined nodes, but quite big for a newly joining
> node.
> 
> Actually, I have long been thinking of a *reverse* recovery process:
> each node scans its back-end store and actively sends copies to the
> target nodes that need to recover them.  This would be much more
> scalable and efficient.

Currently, SD_OP_GET_OBJ_LIST requests all object ids, though most of
them are discarded in screen_obj_list().  Can we optimize it and reduce
the overhead dramatically?

It is quite a big change to modify the recovery algorithm from
pull-style to push-style.  If possible, I'd like to avoid it.

Thanks,

Kazutaka
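
For reference, a rough sketch of the per-object work item Christoph
describes could look like the sketch below.  This is only an
illustration under assumed names: struct object_rw, recover_object_fn(),
and the struct work / queue_work() declarations stand in for whatever
the work queue layer actually provides, and are not the existing
sheepdog code.

    /*
     * Illustrative sketch only: names are assumptions, not the actual
     * sheepdog code.  The idea is one independently queueable work
     * item per object id, instead of a single shared work_struct.
     */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct work;
    typedef void (*work_func_t)(struct work *);

    struct work {
            work_func_t fn;     /* runs in a worker thread */
            work_func_t done;   /* runs back in the main thread */
    };

    /* assumed to be provided by the work queue layer */
    struct work_queue;
    void queue_work(struct work_queue *wq, struct work *work);

    /* hypothetical per-object recovery item: its own work plus the oid */
    struct object_rw {
            uint64_t oid;
            struct work work;
    };

    static void recover_object_fn(struct work *work)
    {
            struct object_rw *rw = container_of(work, struct object_rw, work);

            /* pull rw->oid from a node that holds a copy ... */
    }

    static void recover_object_done(struct work *work)
    {
            struct object_rw *rw = container_of(work, struct object_rw, work);

            free(rw);           /* one allocation per recovered object */
    }

    /* queue one work item per object; error handling omitted */
    static void queue_recovery(struct work_queue *wq, uint64_t *oids, int nr)
    {
            for (int i = 0; i < nr; i++) {
                    struct object_rw *rw = malloc(sizeof(*rw));

                    rw->oid = oids[i];
                    rw->work.fn = recover_object_fn;
                    rw->work.done = recover_object_done;
                    queue_work(wq, &rw->work);
            }
    }

The point of this shape is that each oid carries its own work item, so
an object with pending I/O can be scheduled immediately, and several
recoveries can run in parallel, without waiting on a shared, serialized
recovery list.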