[sheepdog-users] node recovery after reboot, virtual machines slow to boot
Andrew J. Hobbs
ajhobbs at desu.edu
Fri Oct 3 17:32:19 CEST 2014
Sheepdog prioritizes client requests over recovery. I have nodes with VMs
flagged for autostart that fire automatically when the server reboots.
They run a bit slower during startup, but the node does not need to
finish recovery for them to function.
As a test, I just rebooted a non-critical server running three VMs. The
VMs were flagged for autostart via virsh autostart <vm name>; we are
running sheepdog 0.8.3 and the 14.04 LTS versions of virsh/qemu. The VMs
started responding to pings (i.e. were booted) within 20 seconds of the
host server responding to pings. Web services are running and responsive
on those VMs.
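For reference, flagging guests this way is a one-time step per domain (the
domain names below are hypothetical):

```shell
# Mark guests to start automatically when libvirtd comes up
virsh autostart webvm1
virsh autostart webvm2

# Confirm which domains carry the autostart flag
virsh list --all --autostart
```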
dog node info:
14 TB total, 2.0 TB used, 12 TB available (14% used); total virtual
image size 3.4 TB.
Sheepdog recovery started Oct 03 10:02:25
Sheepdog recovery ended Oct 03 11:30:40 epoch 200 (I've been upgrading
machines one at a time recently)
Total recovery time: 1:28:15
I build a custom dpkg of sheepdog for our servers, simply using the
make debian target.
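A sketch of that build, assuming a sheepdog source checkout and the usual
autotools bootstrap (repository URL and configure flags shown are what I'd
expect; verify against your tree):

```shell
# Build a .deb from source using the make debian target
git clone https://github.com/sheepdog/sheepdog.git
cd sheepdog
./autogen.sh
./configure --enable-zookeeper   # zookeeper as the cluster driver
make debian
sudo dpkg -i ../sheepdog_*.deb
```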
Some important notes I've collected over the time we've run Sheepdog:
1) Do not, in a production system, use anything but zookeeper.
2) While it's attractive to look at btrfs/xfs for the backing
filesystem, ext4 is the best choice for stability and throughput.
3) Use a dedicated network for sheepdog and zookeeper (most servers have
2-4 ports). During recovery, especially, you can get saturation which
will kill performance.
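As a sketch of the dedicated-network setup, you can bind sheep's cluster
traffic and the zookeeper ensemble to a private segment (all addresses
below are hypothetical; check the option names against your sheep(8)):

```shell
# -y sets the address sheep advertises to the cluster;
# -c points at the zookeeper ensemble on the same private network.
sheep -y 10.0.0.11 \
      -c zookeeper:10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2
```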
4) Use Sheepdog md over hardware RAID. In the event of a failure, the
disks rebalance much faster than a RAID array rebuilds (note that this
is specific to our hardware, Dell servers with up to 6 drives on a
Perc 7 controller, and may vary depending on yours). Our nodes use
RAID 1 for the OS and sheepmeta; each remaining drive is exposed by the
Perc 7 as a single-disk RAID 0 and formatted ext4 with -m 0. It's a
kludge, but Perc 7s won't work as JBODs. The disk line looks something
like /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2... We
got to this point after hitting issues during rebuild where the Perc 7
+ btrfs on RAID 5 would kernel panic due to timeouts. Sheepdog md +
ext4 has proven more robust in our environment.
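The per-drive prep under those assumptions looks roughly like this (device
names and mount points are hypothetical; each drive is already exposed by
the controller as a single-disk RAID 0):

```shell
# Format each data drive ext4 with no reserved blocks (-m 0)
mkfs.ext4 -m 0 /dev/sdb
mkfs.ext4 -m 0 /dev/sdc
mount /dev/sdb /var/lib/sheepdisk1
mount /dev/sdc /var/lib/sheepdisk2

# First path holds metadata, the rest are md data disks:
sheep -c zookeeper:10.0.0.1:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2
```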
5) Journaling is useful only in corner cases and adds complexity that
may not make sense. It basically helps only VMs that constantly write
small amounts: mail servers, database images, etc. We currently do not
have any journal-enabled nodes.
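If you do have one of those corner cases, the journal is enabled when the
daemon starts; the size and path below are examples, and the option syntax
is from 0.8-era documentation, so verify against your build:

```shell
# Put the journal on fast storage, sized for the small-write burst
sheep -j dir=/var/lib/sheepjournal,size=256M \
      -c zookeeper:10.0.0.1:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1
```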
6) Object cache. While it can help performance, it has not been worth it
for our purposes. Even without it, I have an NFS home-directory server
that can send data to clients at near wire speed (over 100 MB/s on a
gigabit connection).
7) I experimented with directio; letting Linux do what it does best
turned out best for us. Your mileage may vary. That means no direct I/O
for us; the Linux kernel can do the caching.
8) We have soft and hard nofile limits set to 1,000,000. In our
monitoring, no sheepdog daemon currently uses more than 800.
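Raising the limits is a pair of limits.conf lines, and the usage is easy to
spot-check via /proc (the daemon name below assumes the sheep process):

```shell
# /etc/security/limits.conf entries (values from the setup above):
#   *  soft  nofile  1000000
#   *  hard  nofile  1000000

# Show the current soft limit for this shell
ulimit -n

# Count a running sheep daemon's open descriptors (prints 0 if not running)
ls /proc/$(pidof sheep)/fd 2>/dev/null | wc -l
```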
9) Not directly related, but VM performance is helped by adding
transparent_hugepage=always to the kernel command line in grub.
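On Ubuntu that means a grub default plus a regenerate (fragment below; the
existing contents of your GRUB_CMDLINE_LINUX_DEFAULT will differ):

```shell
# /etc/default/grub (fragment):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=always"

# Apply and verify after reboot:
update-grub
cat /sys/kernel/mm/transparent_hugepage/enabled   # [always] madvise never
```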
10) Ensure servers are time-synced using NTP.
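On 14.04 that's the ntp package; a quick check that peers are reachable and
the node is actually synchronized:

```shell
apt-get install ntp

# An asterisk in the first column marks the selected sync peer
ntpq -p
```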
On 10/02/2014 12:08 PM, Philip Crotwell wrote:
> I have a small sheepdog cluster and when I reboot a node, for example
> after applying security patches, it takes a long time to recover
> afterwards. The data volume is not that large, 545 GB, but it takes
> close to an hour to finish the recovery. The problem with this is that
> virtual machines on the node that was rebooted do not themselves boot
> until after the recovery finishes, meaning that for a node reboot that
> takes maybe 2 minutes, I have an hour of downtime for the virtual
> machines. Virsh itself even locks up during the recovery process as
> well, so you can't even do "virsh list".
> It seems like qemu/libvirt on the node should continue to function
> during the recovery process by making use of the other nodes that are
> up and functional. Is this possible? Is there any other way to make it
> so the virtual machines can start up before the recovery process is
> finished? Or to reduce the time it takes to do the recovery process?
> This is on ubuntu trusty (14.04) so sheepdog 0.7.5-1. Is this is
> improved in 0.8 which will be in 14.10 later this month?
> Here is an example libvirt device for a sheepdog disk:
> <disk type='network' device='disk'>
> <driver name='qemu'/>
> <source protocol='sheepdog' name='xxxxxx'/>
> <target dev='hda' bus='ide'/>
> I do not have <host> elements, would explicitly adding multiple hosts help?