[sheepdog-users] node recovery after reboot, virtual machines slow to boot

Wed Oct 8 08:06:03 CEST 2014

On Wed, Oct 08, 2014 at 02:00:01PM +0800, Liu Yuan wrote:
> On Sun, Oct 05, 2014 at 03:09:10PM +0900, Hitoshi Mitake wrote:
> > On Fri, Oct 3, 2014 at 3:49 PM, Valerio Pachera <sirio81 at gmail.com> wrote:
> > > Are you able to tell us if the node was accessing the disks mostly reading
> > > or writing?
> > > You can see that by atop during recovery.
> > >
> > > @Hitoshi, I remember sheepdog was transferring all data during recovery in
> > > old versions.
> > > Later on checksumming was introduced.
> > > Does 0.7.5 already use checksum?
> > 
> > It wouldn't be related to checksumming. 0.7.x tends to transfer a larger
> > amount of data than 0.8.x. But stopping requests from VMs completely
> > is really strange because sheepdog prioritizes the requests from VMs
> > (as Andrew mentioned).
> 
> This is the very reason why I added multi-threaded recovery. The problem looks
> very close to the one that we found in the past.
> 
> "
> 	 * Rationale for multi-threaded recovery:
> 	 * 1. If one node is added, we find that all the VMs on other nodes will
> 	 *    get noticeably affected until 50% data is transferred to the new
> 	 *    node.
> "
> 
> Especially when adding a new node, there is a big chance that all the VMs have
> part of their data scattered onto the new node, and there is only a single
> thread for data recovery.
> 
> As you noticed, we prioritize VM requests over recovery requests, but for this
> case (adding a new node), imagine that
> 
> 1 there is no data on the new node
> 2 all the VMs try to read/write their slice of data on the new node; there
>   might be thousands of requests, proportional to the number of VMs.
> 3 before we can read/write the targeted objects, we need to transfer/rebuild
>   them first.
> 4 so the bottleneck is how fast we can transfer/rebuild the targeted objects on
>   this new node.
> 5 unfortunately, there is a single thread to recover the targeted objects,
>   because this is one of the means by which we prioritize VM IO over recovery
>   requests.
> 
> For now only master and 0.9 series will have this feature.
> 
> If we have old data on the joining node, we also need to first check whether
> the existing data is stale or not. The checking process is part of the recovery
> algorithm and the check speed can only be accelerated by multi-threaded recovery.
> 
> We statically allocate 2 threads per disk device, so make sure you have the
> sheep md feature enabled and use multiple disks, so that we have enough threads
> to handle the flood of requests at the early stage of the recovery process and
> VMs will be more responsive.

We prioritize VM IO, meaning that if the target objects the VMs access already
exist on the targeted node, we make sure to execute the VMs' requests and block
sheep internal requests that try to recover other objects.

The problem is that the target objects are unfortunately the very objects we
are trying to recover, and *there is only one thread to recover the objects*,
which blocks the VM IO in return.
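
To put rough numbers on this, here is a back-of-the-envelope sketch in Python.
It is not sheep code: the function names, the 1000 blocked objects, the 50 ms
per object and the 4 disks are all made up for illustration, and it assumes
every object takes the same time to recover. The only figure taken from this
thread is the 2 recovery threads per disk device.

```python
import math

# From this thread: multi-threaded recovery statically allocates 2 threads per disk.
THREADS_PER_DISK = 2


def recovery_threads(num_disks, multi_threaded=True):
    """Recovery workers available on a node (hypothetical helper).

    Single-threaded recovery has one worker in total; multi-threaded
    recovery has THREADS_PER_DISK workers for each disk device.
    """
    return THREADS_PER_DISK * num_disks if multi_threaded else 1


def worst_case_wait_ms(blocked_objects, ms_per_object, threads):
    """Time until the last blocked VM request can proceed.

    Assumes each blocked request needs a distinct object recovered first and
    that every object takes the same time (a simplification).
    """
    return math.ceil(blocked_objects / threads) * ms_per_object


# Hypothetical workload: 1000 objects wanted by VMs, 50 ms each, 4 disks.
single = worst_case_wait_ms(1000, 50, recovery_threads(4, multi_threaded=False))
multi = worst_case_wait_ms(1000, 50, recovery_threads(4))
print(single, multi)  # 50000 6250
```

Under these assumptions the last VM request waits 50 s with a single recovery
thread and about 6 s with 8 threads, which is why using more disks with md
helps during the early stage of recovery.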

Thanks,
Yuan