[Sheepdog] support object recovery

Mon Feb 15 10:37:25 CET 2010

Hi,

I think the terms we are using are a bit confusing, sorry.

** vdi directory (super object)
The vdi directory is globally shared data and contains the list of VDIs
and each vdi object id. Each entry is stored in the following path.

 /{store_dir}/vdi/{vdi_name}/{vdi_object_id}

** vdi object
The vdi object (not the directory) contains the list of data object ids
for the VM, and is stored in the following path (same as data objects).

 /{store_dir}/obj/{epoch}/{vdi_object_id}

** data object
The data object is the actual data the VM accesses, and is stored in the
following path.

 /{store_dir}/obj/{epoch}/{data_object_id}

For example, suppose the following entry is included in the vdi
directory.

 /{store_dir}/vdi/linux/0000000000040000

If we open the image `linux', we get the vdi object id
0000000000040000, and read the vdi object from the following path

 /{store_dir}/obj/{epoch}/0000000000040000

on the target node.
The vdi object contains the list of data object ids, and it shows where
each data object is stored.
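
The lookup above can be sketched in a few lines of Python. This is only
an illustration of the path layout described in this mail, not the
actual sheepdog code: the function names are made up, the store
directory "/store" and epoch 1 are example values, and writing the
epoch as a plain decimal number is an assumption.

```python
import posixpath


def vdi_dir_entry(store_dir, vdi_name, vdi_object_id):
    """Path of one vdi directory entry: /{store_dir}/vdi/{vdi_name}/{vdi_object_id}."""
    return posixpath.join(store_dir, "vdi", vdi_name, vdi_object_id)


def object_path(store_dir, epoch, object_id):
    """Path of a vdi object or data object: /{store_dir}/obj/{epoch}/{object_id}.

    Assumption: the epoch is rendered as a plain decimal string.
    """
    return posixpath.join(store_dir, "obj", str(epoch), object_id)


def lookup_vdi_object_id(dir_entries, vdi_name):
    """Find the vdi object id for vdi_name among vdi directory entry paths.

    Returns None when the vdi directory has no entry for that name.
    """
    for entry in dir_entries:
        parts = entry.split("/")
        if len(parts) >= 3 and parts[-3] == "vdi" and parts[-2] == vdi_name:
            return parts[-1]
    return None


# Hypothetical vdi directory with the single entry from the example above.
entries = [vdi_dir_entry("/store", "linux", "0000000000040000")]

# Step 1: open the image `linux' and get its vdi object id.
vdi_object_id = lookup_vdi_object_id(entries, "linux")

# Step 2: the vdi object is read from the obj directory of the target node.
vdi_object_file = object_path("/store", 1, vdi_object_id)
```

The point is that the vdi directory is only consulted in step 1; after
that, the vdi object and the data objects are found by object id alone.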

On Mon, Feb 15, 2010 at 4:21 PM, Piavlo <piavka at cs.bgu.ac.il> wrote:
> MORITA Kazutaka wrote:
>>> Also, with current implementation,  the servers with vdi metadata
>>> (meta-servers) may become the bottleneck.
>>> For example say I have 100 nodes cluster with --copies=3, with current
>>> implementation
>>> just 3 nodes will be the meta-servers (each one with exactly same metadata).
>>> AFAIU this means that all sheepdog queries will have to go through these
>>> meta-servers, which besides
>>> the metadata are also used for vm block storage serving.
>>>
>>
>> I think the servers with the vdi metadata directory do not become the
>> bottleneck. This is because the directory contains only the list of vdi
>> names, creation times, redundancies, etc., and VMs don't access the vdi
>> directory at all after they open the vdi.
>>
>> The data object IDs of each vdi are stored in the vdi object. The vdi
>> object is accessed each time the VM allocates a new data object, but
>> there is no bottleneck server. This is because the vdi objects are
>> distributed over all the nodes, like data objects.
> That's not what I observed, all vdi metadata objects are currently ending
> up on the very same set of servers.
> So if I have 100 nodes and used --copies=3 CURRENTLY there will be just
> 3 vdi metadata servers for all VMs.
>

The node holding the vdi directory is fixed, but the vdi object may be
created on other nodes, like data objects.

> Just to be sure, is the following a vdi metadata object ?
> fire-srv3 0 # ls -la vdi/pacemaker1.dmz.cs.bgu.ac.il/0000000000040000
> -rw-r----- 1 root root 0 2010-02-12 20:33
> vdi/pacemaker1.dmz.cs.bgu.ac.il/0000000000040000
> fire-srv3 0 #
> So vdi object has no data , but it's name just indicates with which
> prefix (000000000004 in this case) to store the vdi data blocks.
>

It is an entry of the vdi directory, not the vdi object.

>>  The vdi object
>> recovery is also done in the same process as data objects recovery.
>>
>>
>  Now I'm confused, since in the previous mail you've told that vdi
> object recovery is not implemented yet.

The vdi object recovery is implemented because it is the same as data
object recovery. But the vdi directory recovery is not supported yet.

>>> Maybe we can get with following:
>>> Once a specific sheepdog server node receives a request to store a vm
>>> block - and this is the first block for that vm - it will also request
>>> and create vdi metadata for that vm (this should be neglectable overhead).
>>> This way each node will have vm metadata of each vm it stores block for.
>>> So vdi metadata recovery will be piggy backed as part of vm block
>>> recovery, and would not require any special
>>> implementation. Of course before sheepdog cluster starts distributing
>>> block for a new vm in the cluster, it needs first to create the vdi vm
>>> metadata on just the --copies of nodes (which the current implementation
>>> already does anyway).
>>> So this looks ,to me, like a one little change in order to avoid the
>>> need for metadata recovery. What do you think?
>>>
>>>
>>
>> In the initial release of sheepdog, we stored the vdi directory
>> information to the one object - we called it a super object - but
>> to get rid of btrfs dependency, we changed it to the current style.
>>
>>
>  Sorry ,if I'm not clear enough, i'm not suggesting changing the FORMAT
> of the vdi metadata objects.
> What I'm saying is that: if a node stores some specific VM data block
> (even just one data block), then this node would also store a vdi
> metadata object for that VM.
> So then the node receives a request to store the very first data block
> for specific VM, it would also pull the vdi metadata object for that vm.
> In this situation there is no need for separate vdi metadata recovery
> implementation, since it's guaranteed that once node has vm data it also
> has the vm vdi metadata object locally.
>

I think you are talking about the vdi directory. If so, how do we look
up a vdi object or data object with your approach? To look up objects,
we need to find the vdi directory first. With your suggestion, if we
don't know which node has the data object, we cannot find the vdi
directory, right?

Thanks,

Kazutaka Morita