[Sheepdog] support object recovery

Sun Feb 21 20:28:42 CET 2010

Sorry for the late reply.

On Tue, Feb 16, 2010 at 8:58 PM, Piavlo <piavka at cs.bgu.ac.il> wrote:
> AFAIU the vdi directory does not contain information on which node to
> find the vdi object, but just the vdi object id,
> and due to consistent hashing based on the vdi object id and node ids we
> can derive on which nodes the vdi object is actually stored?
>
Yes.

> Does the vdi object contain only data object id (since again data object
> id and node ids - is enough to derive on which nodes the data object is
> stored)
> or vdi object also explicitly contains, for each data object id, a list
> of nodes where data object is actually stored?
>
The vdi object contains only data object ids. Otherwise we would have to
update all the vdi objects whenever nodes are added (or removed),
which is not realistic if there are many vdis.
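
To illustrate why storing only ids is enough, here is a minimal sketch of
consistent hashing in Python. It is not sheepdog's actual code: the hash
function (SHA-1 truncated to 64 bits), the node names, and the helper
`locate` are all assumptions made for the example. The point is only that
the placement is recomputed from the object id and the current node list.

```python
import bisect
import hashlib


def _hash64(key: str) -> int:
    """Map a string to a 64-bit integer (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def locate(object_id: int, nodes: list[str], copies: int) -> list[str]:
    """Return the nodes that should hold object_id, walking the ring clockwise."""
    ring = sorted((_hash64(n), n) for n in nodes)
    points = [p for p, _ in ring]
    # First ring position strictly after the object's hash.
    start = bisect.bisect_right(points, _hash64(f"{object_id:016x}"))
    count = min(copies, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical node names; the object id is taken from the path quoted below.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = locate(0x40000, nodes, copies=3)

# A node joins: only the node list changes, the stored ids stay the same.
placement_after_join = locate(0x40000, nodes + ["node-e"], copies=3)
```

Nothing inside the vdi object has to be rewritten when `nodes` changes,
because the location is derived again on each lookup.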

>>>> I think the servers with vdi metadata directory does not become the
>>>> bottleneck. It is because the directory contains only the list of vdi
>>>> names, creation times, redundancies, etc,
> In my case i see only single zero size file
> /sheepdog/0/vdi/vdiname/0000000000040000
> from this one can only derive the vdi name ,vdi object id and vdi
> creation time, but not redundancies and etceteras ?
>
Sorry, I made a mistake. Redundancies and other stuff are stored in
the vdi object.

> What info does the /{store_dir}/epoch/0000000X contain and when it is
> accessed?

It contains the list of nodes at that epoch and is used to access
objects in /{store_dir}/obj/0000000X/. After sheepdog finishes
moving objects based on the current node list, the old epochs in
/{store_dir}/epoch/ are no longer used.

> Also how is the /{store_dir}/obj/0000000(X+1) derived from the previous
> epoch
> then a new node joins the cluster (and not some node fails)?  Is this
> implemented by simple copy on write of the directory and thus depends on
> btrfs?
>

If the target objects are in the local /{store_dir}/obj/0000000X,
sheepdog creates hard links to them in /{store_dir}/obj/0000000(X+1).
If the target objects are on remote hosts, sheepdog receives
the objects from those hosts.
Sheepdog does not depend on btrfs now.
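
A rough Python sketch of that step, under stated assumptions: the function
name `build_next_epoch`, the callback `fetch_remote`, and the file naming
(8 hex digits for the epoch directory, 16 for the object) are made up for
illustration and are not taken from the sheepdog source.

```python
import os
import tempfile


def build_next_epoch(store_dir, epoch, wanted_ids, fetch_remote):
    """Populate obj/<epoch+1> with the objects this node should hold."""
    old_dir = os.path.join(store_dir, "obj", f"{epoch:08x}")
    new_dir = os.path.join(store_dir, "obj", f"{epoch + 1:08x}")
    os.makedirs(new_dir, exist_ok=True)
    for oid in wanted_ids:
        name = f"{oid:016x}"
        src = os.path.join(old_dir, name)
        dst = os.path.join(new_dir, name)
        if os.path.exists(dst):
            continue  # already recovered
        if os.path.exists(src):
            os.link(src, dst)  # local object: hard link, no data is copied
        else:
            with open(dst, "wb") as f:  # remote object: receive its contents
                f.write(fetch_remote(oid))
    return new_dir


# Demo: one object exists locally in epoch 1, the other must be fetched.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "obj", "00000001"))
with open(os.path.join(store, "obj", "00000001", f"{0x40000:016x}"), "wb") as f:
    f.write(b"local")
new_dir = build_next_epoch(store, 1, [0x40000, 0x40001], lambda oid: b"remote")
```

Hard links work on any POSIX file system, which is why copy-on-write
directories (and therefore btrfs) are not needed for this.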

> Also to avoid the /{store_dir}/obj/{epoch} directory block indirection (which hurts performance) with large number of vdis,
> isn't it better to store data object in the following structure:
>  /{store_dir}/obj/{epoch}/{vdi_object_id}/{data_object_id}
> instead of:
>  /{store_dir}/obj/{epoch}/{data_object_id}
> ?
>

Using the prefix of the object id may be better.

e.g.
/{store_dir}/obj/{epoch}/12/345678
instead of
/{store_dir}/obj/{epoch}/12345678

This is because, in some cases, collie needs to get the list of stored
object ids by specifying an id range.
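
As a sketch of how a prefix layout could still serve id-range queries
(Python, illustrative only: `object_path`, `list_range`, and the two-digit
prefix length are assumptions, and ids are treated as fixed-width strings
so that string order matches id order):

```python
import os
import tempfile


def object_path(store_dir, epoch, object_id, prefix_len=2):
    """Split the id into <prefix>/<rest>, e.g. 12345678 -> 12/345678."""
    return os.path.join(store_dir, "obj", f"{epoch:08x}",
                        object_id[:prefix_len], object_id[prefix_len:])


def list_range(epoch_dir, low, high, prefix_len=2):
    """Return stored ids with low <= id <= high, scanning only the
    prefix directories that can contain ids in that range."""
    found = []
    for prefix in sorted(os.listdir(epoch_dir)):
        if not (low[:prefix_len] <= prefix <= high[:prefix_len]):
            continue  # the whole directory is outside the range
        for rest in os.listdir(os.path.join(epoch_dir, prefix)):
            oid = prefix + rest
            if low <= oid <= high:
                found.append(oid)
    return sorted(found)


# Demo with made-up ids in a temporary epoch directory.
epoch_dir = tempfile.mkdtemp()
for oid in ["12345678", "12345679", "13000000", "20000000"]:
    path = os.path.join(epoch_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "wb").close()
in_range = list_range(epoch_dir, "12345679", "13ffffff")
```

With a per-vdi directory instead, a range query would have to look into
every vdi directory, since the range is expressed in object ids.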

>>>>  and VMs don't access the vdi
>>>> directory at all after they open the vdi.
>>>>
>>>> The data object IDs of each vdi are stored to the vdi object. The vdi
>>>> object is accessed each time the VM allocate a new data object, but
>>>> there is no bottleneck server. It is because the vdi objects are
>>>> distributed over the all nodes like data objects.
>>>>
>>> Thats not that i observed, all vdi metadata objects are currently ending
>>> up on the very same set of servers.
>>> So if I have 100 nodes and used --copies=3 CURRENTLY there will be just
>>> 3 vdi metadata servers for all VMs.
>>>
>>>
>>
>> The node of the vdi directory is fixed, but the vdi object may be
>> created in the other nodes like data objects.
>>
>  Since the vdi directory nodes are fixed, then how is this info
> maintained on master server
> and more importantly how does this info survive then the master server
> dies or then cluster is rebooted?
>
Sorry, my explanation assumed that the node membership does not
change. Sheepdog handles the vdi directory like the object
with id zero.

> AFAIU since the vdi directory nodes are fixed and there is a single
> master server (even that it is automatically elected)
> coordinating access to vdi metadata servers - sheepdog is not fully
> distributed storage.
> Maybe if the vdi metadata directory was stored in a distributed hash
> across the
> sheepdog servers,

That is what sheepdog does.

> and any server could be contacted to read/write the vdi
> directory information,  there would
> be no need for coordinating master server.
>
But that is not so. Updating replicas with multiple writers is
fundamentally difficult if there is no coordinator. Sheepdog elects
the master by using corosync.

I think sheepdog is fully symmetric from the administration viewpoint,
because the administrator does not need to care which server is the
special one, for the following reasons:

 - There are no special requirements for the master server.
 - Even if the master server goes down, a new master server is
   automatically elected.
 - The vdi directory is accessed only when manipulating vdi objects,
   so the load on the master server is light.

Thanks,

Kazutaka Morita