[Sheepdog] support object recovery

Piavlo piavka at cs.bgu.ac.il
Tue Feb 16 12:58:57 CET 2010


 Hi,

Thanks for clarifying the vdi directory/object difference.

AFAIU the vdi directory does not contain information about which node holds
the vdi object, only the vdi object id, and thanks to consistent hashing
over the vdi object id and the node ids we can derive on which nodes the
vdi object is actually stored?

Does the vdi object contain only the data object ids (since, again, a data
object id plus the node ids is enough to derive on which nodes the data
object is stored), or does the vdi object also explicitly contain, for each
data object id, a list of the nodes where that data object is actually stored?
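If placement really is derived purely from object ids and the current node list, the lookup for both vdi and data objects could be sketched like this (a minimal consistent-hashing illustration; the hash function, id values and node names are assumptions, not Sheepdog's actual scheme):

```python
import hashlib

def _hash(key: str) -> int:
    # Map a key onto a 64-bit ring position.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

def object_nodes(object_id: int, nodes: list[str], copies: int) -> list[str]:
    """Derive the nodes storing an object from its id and the node list alone.

    Nodes are placed on a hash ring; the object's replicas go to the first
    `copies` distinct nodes clockwise from the object's ring position.
    """
    ring = sorted((_hash(n), n) for n in nodes)
    pos = _hash(f"{object_id:016x}")
    # Start at the first node whose ring position is >= pos, wrapping around.
    start = next((i for i, (h, _) in enumerate(ring) if h >= pos), 0)
    ordered = [n for _, n in ring[start:] + ring[:start]]
    return ordered[:copies]

nodes = [f"node{i}" for i in range(10)]
# A vdi object and a data object are located by the same derivation:
print(object_nodes(0x40000, nodes, copies=3))
print(object_nodes(0x40000001, nodes, copies=3))
```

If this holds, nothing beyond the id and the node list needs to be stored in either the vdi directory or the vdi object to find replicas.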

>>>
>>> I think the servers with the vdi metadata directory do not become the
>>> bottleneck, because the directory contains only the list of vdi
>>> names, creation times, redundancies, etc.,
In my case I see only a single zero-size file
/sheepdog/0/vdi/vdiname/0000000000040000
from which one can derive only the vdi name, vdi object id and vdi
creation time, but not the redundancy and the rest?

What info does /{store_dir}/epoch/0000000X contain, and when is it
accessed?
Also, how is /{store_dir}/obj/0000000(X+1) derived from the previous epoch
when a new node joins the cluster (rather than some node failing)? Is this
implemented by a simple copy-on-write of the directory, and does it
therefore depend on btrfs?
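Assuming the placement for each epoch is recomputed from that epoch's node list, the recovery work caused by a join could be estimated by comparing placements across the two epochs. A hedged sketch, using rendezvous hashing as a stand-in for whatever hash Sheepdog actually uses:

```python
import hashlib

def placement(object_id: int, nodes: list[str], copies: int) -> set[str]:
    # Toy stand-in for the real placement: rank nodes by hash(object, node)
    # and keep the top `copies` (rendezvous hashing). The epoch comparison
    # below works with any deterministic placement function.
    ranked = sorted(nodes,
                    key=lambda n: hashlib.sha1(f"{object_id}:{n}".encode()).digest())
    return set(ranked[:copies])

def objects_to_move(object_ids, old_nodes, new_nodes, copies=3):
    """Objects whose replica set changes between epoch N and epoch N+1."""
    moves = {}
    for oid in object_ids:
        old = placement(oid, old_nodes, copies)
        new = placement(oid, new_nodes, copies)
        if new - old:
            moves[oid] = (old, new)
    return moves

old = [f"node{i}" for i in range(9)]
new = old + ["node9"]            # a node joins; no node fails
moves = objects_to_move(range(1000), old, new)
print(f"{len(moves)} of 1000 objects need copying for the new epoch")
```

With a scheme like this, only the objects whose replica set gained the new node need to be copied into the next epoch's directory; everything else could stay, which is where copy-on-write (or hard links) would help.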

Also, to avoid the block indirection in the /{store_dir}/obj/{epoch}
directory (which hurts performance with a large number of vdis),
wouldn't it be better to store data objects in the following structure:
 /{store_dir}/obj/{epoch}/{vdi_object_id}/{data_object_id}
instead of:
 /{store_dir}/obj/{epoch}/{data_object_id}
?
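The two layouts can be compared side by side (a sketch; the zero-padding widths are assumptions based on the file names above, not necessarily Sheepdog's actual formatting):

```python
def flat_path(store_dir: str, epoch: int, data_object_id: int) -> str:
    # Current layout: every data object of every vdi in one directory,
    # so the directory grows with (number of vdis) x (objects per vdi).
    return f"{store_dir}/obj/{epoch:08d}/{data_object_id:016x}"

def nested_path(store_dir: str, epoch: int,
                vdi_object_id: int, data_object_id: int) -> str:
    # Proposed layout: one subdirectory per vdi, so each directory holds
    # only that vdi's objects and stays small.
    return f"{store_dir}/obj/{epoch:08d}/{vdi_object_id:016x}/{data_object_id:016x}"

print(flat_path("/sheepdog/0", 1, 0x40000001))
print(nested_path("/sheepdog/0", 1, 0x40000, 0x40000001))
```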

>>>  and VMs don't access the vdi
>>> directory at all after they open the vdi.
>>>
>>> The data object IDs of each vdi are stored to the vdi object. The vdi
>>> object is accessed each time the VM allocate a new data object, but
>>> there is no bottleneck server. It is because the vdi objects are
>>> distributed over all the nodes like data objects.
>>>       
>> That's not what I observed: all vdi metadata objects currently end
>> up on the very same set of servers.
>> So if I have 100 nodes and use --copies=3, there will CURRENTLY be just
>> 3 vdi metadata servers for all VMs.
>>
>>     
>
> The node of the vdi directory is fixed, but the vdi object may be
> created in the other nodes like data objects.
>   
  Since the vdi directory nodes are fixed, how is this info
maintained on the master server,
and more importantly, how does this info survive when the master server
dies or when the cluster is rebooted?

AFAIU, since the vdi directory nodes are fixed and a single master
server (even though it is automatically elected) coordinates access to
the vdi metadata servers, sheepdog is not a fully distributed storage
system.
Maybe if the vdi metadata directory were stored in a distributed hash
across the sheepdog servers, and any server could be contacted to
read/write the vdi directory information, there would be no need for a
coordinating master server.
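The suggestion could look something like this: locate a vdi's directory entry by hashing the vdi name against the node list, so every node computes the same answer locally and no coordinating master is required (a sketch; the function name and parameters are hypothetical, not an existing Sheepdog API):

```python
import hashlib

def directory_nodes(vdi_name: str, nodes: list[str], copies: int) -> list[str]:
    # Rendezvous hashing: any node can compute the same answer locally,
    # so there is no fixed set of vdi-directory servers and no master
    # is needed to coordinate reads/writes of directory entries.
    ranked = sorted(nodes,
                    key=lambda n: hashlib.sha1(f"{vdi_name}@{n}".encode()).digest())
    return ranked[:copies]

nodes = [f"node{i}" for i in range(100)]
print(directory_nodes("vdiname", nodes, copies=3))
```

Adding or removing a node would then move only the directory entries whose top-`copies` ranking changes, rather than relocating the whole directory.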


Thanks
Alex


