On 06/04/2012 06:00 PM, Christoph Hellwig wrote:
> I think the right fix is to simply give each recover_object_work() call
> it's own work_struct in a structure also containing the oid. While
> this means a memory allocation per object to be recovered it also means
> complete independence between recovery operations, including kicking off
> onces that have I/O pending ASAP and allowing multiple recoveries in
> parallel. I'm about to leave for a long haul flight and will try to
> implement this solution while I'm on the plane.

Parallel recovery looks attractive. A per-object allocation is fairly
large for the current scheme, because every object in the list is
treated as one to be recovered. In fact we only need to recover the
objects that are supposed to be migrated in, which is a much smaller
set, so we only need to allocate memory for those objects.

But if we really want a better recovery process, we need to take the
bigger picture into consideration:

1: the biggest overhead is actually prepare_object_list(), which tries
   to connect to *all* the nodes in the cluster.

2: the number of objects to be recovered from other nodes is quite
   small for nodes that have already joined, but quite big for a newly
   joining node.

Actually, I have long been thinking of a *reverse* recovery process:
each node scans its back-end store and actively sends copies to the
target nodes that need to recover them. This would be much more
scalable and efficient.

I hope you can take this into consideration.

Thanks,
Yuan
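
For reference, a minimal sketch of the per-object work item Christoph
describes above. The struct layout, the recovery_work/queue_recovery
names, and the commented-out queue_work() call are assumptions made for
illustration only, not the actual sheep code:

    /* Illustrative sketch of a per-object recovery work item. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct work;
    typedef void (*work_func_t)(struct work *);

    struct work {
            work_func_t fn;         /* runs in a worker thread */
            work_func_t done;       /* runs back in the main thread */
    };

    /* one allocation per object to be recovered */
    struct recovery_work {
            uint64_t oid;           /* object to pull from its current holder */
            struct work work;
    };

    static void recover_object_work(struct work *w)
    {
            struct recovery_work *rw = container_of(w, struct recovery_work, work);

            /* read rw->oid from the node that currently holds it,
             * then write it to the local back-end store */
            (void)rw->oid;
    }

    static void recover_object_done(struct work *w)
    {
            struct recovery_work *rw = container_of(w, struct recovery_work, work);

            /* mark rw->oid as recovered, then drop the per-object allocation */
            free(rw);
    }

    /* kick off one independent recovery per oid; objects with pending I/O
     * can be queued first, and several can run in parallel */
    static void queue_recovery(uint64_t oid)
    {
            struct recovery_work *rw = malloc(sizeof(*rw));

            if (!rw)
                    return;
            rw->oid = oid;
            rw->work.fn = recover_object_work;
            rw->work.done = recover_object_done;
            /* queue_work(recovery_wqueue, &rw->work);  -- assumed work-queue API */
    }

With this scheme, each recovery is self-contained, so only the objects
that actually need to be migrated in would ever get an allocation.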