[Sheepdog] support object recovery

Mon Feb 15 05:37:05 CET 2010

On Sat, Feb 13, 2010 at 6:49 AM, Piavlo <piavka at cs.bgu.ac.il> wrote:
> MORITA Kazutaka wrote:
>> On Fri, Jan 22, 2010 at 11:01 AM, Piavlo <piavka at cs.bgu.ac.il> wrote:
>>
>>>> Yes, the vdi directory is like metadata, which we called the `super object' before.
>>>> Currently, the redundancy of vdi directory is same as objects redundancy.
>>>> Perhaps, we should change it because vdi directory is more important than
>>>> data objects.
>>>>
>>>>
>>> Since this vdi metadata overhead is very small, it is probably
>>> reasonable to store the vdi metadata for all vm images in the sheepdog
>>> cluster
>>> on each storage node. This would not require any metadata recovery if
>>> some node goes offline.
>>>
>>
>> Probably, the cost of updating the vdi directory on every node is not cheap
>> if the number of nodes is large, because each node must return the ack
>>
> Return the ack to where exactly? I mean there is no head node in the
> cluster - so who will wait for these acks?
> The node on which vm image creation was invoked? As we discussed earlier
> it should be possible in the future implementations to separate the
> sheepdog client from the server.

Strictly speaking, there is a master server, which is automatically
elected from among the sheepdog nodes. The master server is marked with
an asterisk in the output of `shepherd info -t dog`. Write requests to
the vdi directory are issued by the master server, and it receives the
acks.
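
To illustrate why the update cost grows with the number of nodes that
hold the directory, here is a minimal sketch in Python. It is not
sheepdog code: the Node class, apply_update() and master_update() are
hypothetical names, and the network round trip is replaced by a plain
function call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node holding a local copy of the vdi directory."""
    name: str
    vdi_directory: dict = field(default_factory=dict)

    def apply_update(self, vdi_name: str, entry: dict) -> bool:
        # Update the local directory copy, then acknowledge.
        self.vdi_directory[vdi_name] = entry
        return True


def master_update(nodes: list, vdi_name: str, entry: dict) -> int:
    """The master sends the write to each directory holder and counts acks."""
    acks = 0
    for node in nodes:
        if node.apply_update(vdi_name, entry):
            acks += 1
    if acks != len(nodes):
        raise RuntimeError("not every node acknowledged the update")
    return acks


# Directory kept on --copies=3 nodes versus on all 100 nodes.
copies_only = [Node(f"node{i}") for i in range(3)]
every_node = [Node(f"node{i}") for i in range(100)]
entry = {"copies": 3}
print(master_update(copies_only, "vm0", entry))
print(master_update(every_node, "vm0", entry))
```

In this model the master waits for one ack per directory holder, so
keeping the directory on every node means one ack per cluster node for
each directory update.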

>> after updating local vdi directory.
>>
>  Also, with current implementation,  the servers with vdi metadata
> (meta-servers) may become the bottleneck.
> For example, say I have a 100-node cluster with --copies=3; with the current
> implementation
> just 3 nodes will be the meta-servers (each one with exactly same metadata).
> AFAIU this means that all sheepdog queries will have to go through these
> meta-servers, which besides
> the metadata are also used for vm block storage serving.

I don't think the servers holding the vdi directory become a
bottleneck. The directory contains only the list of vdi names,
creation times, redundancy levels, etc., and VMs don't access the vdi
directory at all after they open the vdi.

The data object IDs of each vdi are stored in the vdi object. The vdi
object is accessed each time the VM allocates a new data object, but
no single server becomes a bottleneck, because the vdi objects are
distributed over all the nodes, just like data objects. Vdi object
recovery is also done by the same process as data object recovery.
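
As a rough illustration of how the vdi objects end up spread over the
cluster rather than on a few fixed servers, here is a small Python
sketch. The placement rule (hash the object ID, then take the next
`copies` nodes) is an assumption made for this example only; this mail
does not describe sheepdog's actual placement algorithm, and
nodes_for_object() is a hypothetical helper.

```python
import hashlib


def nodes_for_object(object_id: int, nodes: list, copies: int) -> list:
    """Pick `copies` nodes for an object by hashing its ID (assumed scheme)."""
    digest = hashlib.sha1(str(object_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


nodes = [f"node{i}" for i in range(100)]
used = set()
for oid in range(1000):  # 1000 hypothetical vdi object IDs
    used.update(nodes_for_object(oid, nodes, copies=3))

# Number of distinct nodes that hold at least one vdi object.
print(len(used))
```

Each vdi object still lives on only `copies` nodes, but different vdis
land on different nodes, so the accesses are spread over the cluster.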

> Maybe we can get with following:
> Once a specific sheepdog server node receives a request to store a vm
> block - and this is the first block for that vm - it will also request
> and create vdi metadata for that vm (this should be negligible overhead).
> This way each node will have vm metadata of each vm it stores block for.
> So vdi metadata recovery will be piggy backed as part of vm block
> recovery, and would not require any special
> implementation. Of course before sheepdog cluster starts distributing
> block for a new vm in the cluster, it needs first to create the vdi vm
> metadata on just the --copies of nodes (which the current implementation
> already does anyway).
> So this looks, to me, like one small change in order to avoid the
> need for metadata recovery. What do you think?
>

In the initial release of sheepdog, we stored the vdi directory
information in a single object, which we called the super object, but
we changed to the current style to get rid of the btrfs dependency.

The related discussion is here:
  http://lists.wpkg.org/pipermail/sheepdog/2009-December/000076.html

Thanks,

Kazutaka Morita