[sheepdog-users] node recovery after reboot, virtual machines slow to boot

Liu Yuan namei.unix at gmail.com
Wed Oct 8 08:00:01 CEST 2014


On Sun, Oct 05, 2014 at 03:09:10PM +0900, Hitoshi Mitake wrote:
> On Fri, Oct 3, 2014 at 3:49 PM, Valerio Pachera <sirio81 at gmail.com> wrote:
> > Are you able to tell us if the node was accessing the disks mostly reading
> > or writing?
> > You can see that by atop during recovery.
> >
> > @Hitoshi, I remember sheepdog was transferring all data during recovery in
> > old versions.
> > Later on checksumming was introduced.
> > Does 0.7.5 already use checksum?
> 
> It wouldn't be related to checksumming. 0.7.x tends to transfer a larger
> amount of data than 0.8.x. But stopping requests from VMs completely
> is really strange because sheepdog prioritizes the requests from VMs
> (as Andrew mentioned).

This is the very reason why I added multi-threaded recovery. The problem looks
very close to the one that we found in the past.

"
	 * Rationale for multi-threaded recovery:
	 * 1. If one node is added, we find that all the VMs on other nodes will
	 *    get noticeably affected until 50% data is transferred to the new
	 *    node.
"

Especially when adding a new node, there is a big chance that all the VMs have
their data scattered onto the new node, while there is only a single thread for
data recovery.

As you noticed, we prioritize VM requests over recovery requests, but for this
case (adding a new node), imagine that:

1 there is no data on the new node
2 all the VMs try to read/write their slices of data on the new node; there
  might be thousands of requests, proportional to the number of VMs.
3 before we can read/write the targeted objects, we need to transfer/rebuild
  them first.
4 so the bottleneck is how fast we can transfer/rebuild the targeted objects on
  this new node.
5 unfortunately, there is only a single thread to recover the targeted objects,
  because that is one of the means by which we prioritize VM I/O over recovery
  requests (a rough sketch of this bottleneck follows below).
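
To make point 5 concrete, here is a minimal, self-contained sketch. This is not
sheepdog code; the queue layout, the oids, the thread count and the delays are
all invented for illustration. It only shows the shape of the problem: a VM
request can pull its object to the head of the recovery queue, but the queue is
still drained by a fixed pool of recovery threads (a single one in the versions
without multi-threaded recovery).

/* Build: cc -o recovery_sketch recovery_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_OBJS    64	/* objects that must be rebuilt on the joining node */
#define NR_WORKERS 1	/* 1 = old behaviour, >1 = multi-threaded recovery */

static int queue[NR_OBJS];	/* oids waiting for recovery, head at 'front' */
static int front, tail = NR_OBJS;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * A VM request on 'oid' cannot be served until the object exists locally,
 * so the object is pulled to the head of the queue ("prioritized").
 */
static void prioritize_oid(int oid)
{
	pthread_mutex_lock(&lock);
	for (int i = front; i < tail; i++)
		if (queue[i] == oid) {
			int tmp = queue[front];
			queue[front] = queue[i];
			queue[i] = tmp;
			break;
		}
	pthread_mutex_unlock(&lock);
}

/* Each recovery worker pops one oid at a time and "transfers" it. */
static void *recover_worker(void *arg)
{
	(void)arg;
	for (;;) {
		int oid;

		pthread_mutex_lock(&lock);
		if (front == tail) {
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		oid = queue[front++];
		pthread_mutex_unlock(&lock);

		usleep(10 * 1000);	/* pretend to copy the object */
		printf("recovered oid %d\n", oid);
	}
}

int main(void)
{
	pthread_t workers[NR_WORKERS];

	for (int i = 0; i < NR_OBJS; i++)
		queue[i] = i;

	/* VMs immediately touch a few objects: prioritized, but still served
	 * by only NR_WORKERS recovery thread(s). */
	prioritize_oid(42);
	prioritize_oid(7);

	for (int i = 0; i < NR_WORKERS; i++)
		pthread_create(&workers[i], NULL, recover_worker, NULL);
	for (int i = 0; i < NR_WORKERS; i++)
		pthread_join(workers[i], NULL);

	return 0;
}

With NR_WORKERS at 1, even the prioritized objects come out strictly one at a
time, which is roughly what the VMs experience right after a node joins; raising
it is exactly the effect multi-threaded recovery is after.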

For now, only master and the 0.9 series will have this feature.

If we have old data on the joining node, we also need to first check whether the
existing data is stale or not. The checking process is part of the recovery
algorithm, and the check speed can only be accelerated by multi-threaded recovery.
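
To picture what that stale check means, here is a tiny illustrative example. The
struct, the epoch values and the decision rule are invented for this sketch and
are not sheep's actual algorithm; the point is only that every pre-existing local
object has to be inspected before it can be trusted, and that inspection is done
by the same recovery threads.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: pretend each local copy remembers when it was written. */
struct local_obj {
	uint64_t oid;
	uint32_t stored_epoch;	/* epoch at the time this copy was written */
};

/* Anything older than the current cluster epoch is treated as possibly stale
 * and must be re-checked/re-fetched during recovery. */
static bool obj_needs_check(const struct local_obj *obj, uint32_t current_epoch)
{
	return obj->stored_epoch < current_epoch;
}

int main(void)
{
	struct local_obj objs[] = {
		{ .oid = 1, .stored_epoch = 9 },
		{ .oid = 2, .stored_epoch = 12 },
	};
	uint32_t current_epoch = 12;

	for (unsigned i = 0; i < sizeof(objs) / sizeof(objs[0]); i++)
		printf("oid %llu: %s\n",
		       (unsigned long long)objs[i].oid,
		       obj_needs_check(&objs[i], current_epoch) ?
		       "possibly stale, verify during recovery" : "up to date");
	return 0;
}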

We statically allocate 2 threads per disk device, so make sure you have the
sheep md (multi-disk) feature enabled and use multiple disks, so that we have
enough threads to handle the flood of requests at the early stage of the
recovery process and the VMs will be more responsive.
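
Just to make the arithmetic of the last paragraph explicit (only the
2-threads-per-disk ratio comes from the text above; the helper below is made up):

#include <stdio.h>

/* Illustrative only: recovery worker pool sized as 2 threads per data disk. */
static int recovery_threads_for(int nr_disks)
{
	return 2 * nr_disks;
}

int main(void)
{
	for (int disks = 1; disks <= 4; disks++)
		printf("%d disk(s) -> %d recovery threads\n",
		       disks, recovery_threads_for(disks));
	return 0;
}

So with a single disk you get only 2 recovery threads, while e.g. 4 disks under
md give you 8, which matters most right after a node joins and thousands of
requests are waiting on objects that still have to be rebuilt.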

Thanks
Yuan
