[sheepdog-users] node recovery after reboot, virtual machines slow to boot
Andrew J. Hobbs
ajhobbs at desu.edu
Fri Oct 3 17:32:19 CEST 2014
Sheepdog prioritizes client requests over recovery. I have nodes with VMs
flagged for autostart that fire automatically when the server reboots.
They run a bit slower during startup, but the node does not need to
finish recovery for them to function.
As a test, I just rebooted a non-critical server running three VMs. The
VMs were flagged for autostart via virsh autostart <vm name>; we are
running sheepdog 0.8.3 and the 14.04 LTS versions of virsh/qemu. The VMs
started responding to pings (i.e. were booted) within 20 seconds of the
host server responding to pings. Web services are running and responsive
on those VMs.
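For reference, flagging guests this way is a one-time step per domain (the
domain names below are hypothetical):

```shell
# Mark guests to start automatically when libvirtd comes up
virsh autostart webvm1
virsh autostart webvm2

# Confirm which domains carry the autostart flag
virsh list --all --autostart
```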
dog node info:
14 TB total, 2.0 TB used, 12 TB available (14% used); total virtual
image size 3.4 TB.
Sheepdog recovery started Oct 03 10:02:25
Sheepdog recovery ended Oct 03 11:30:40 epoch 200 (I've been upgrading
machines one at a time recently)
Total recovery time: 1:28:15
I build a custom dpkg of sheepdog for our servers, simply using the
make debian target.
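A sketch of that build, assuming a sheepdog source checkout and the usual
autotools bootstrap (repository URL and configure flags shown are what I'd
expect; verify against your tree):

```shell
# Build a .deb from source using the make debian target
git clone https://github.com/sheepdog/sheepdog.git
cd sheepdog
./autogen.sh
./configure --enable-zookeeper   # zookeeper as the cluster driver
make debian
sudo dpkg -i ../sheepdog_*.deb
```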
Some important notes I've collected over the time we've run Sheepdog:
1) Do not, in a production system, use anything but zookeeper.
2) While it's attractive to look at btrfs/xfs for the backing
filesystem, ext4 is the best choice for stability and throughput.
3) Use a dedicated network for sheepdog and zookeeper (most servers have
2-4 ports). During recovery, especially, you can get saturation which
will kill performance.
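As a sketch of the dedicated-network setup, you can bind sheep's cluster
traffic and the zookeeper ensemble to a private segment (all addresses
below are hypothetical; check the option names against your sheep(8)):

```shell
# -y sets the address sheep advertises to the cluster;
# -c points at the zookeeper ensemble on the same private network.
sheep -y 10.0.0.11 \
      -c zookeeper:10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2
```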
4) Use Sheepdog md over hardware RAID. In the event of a failure, the
disks rebalance much faster than a RAID array rebuilds (note that this
is specific to our hardware, Dell servers with up to 6 drives on a
Perc 7 controller, and may vary depending on yours). Our nodes use
RAID 1 for the OS and sheepmeta; each remaining drive is exposed by the
Perc 7 as a single-disk RAID 0 and formatted ext4 with -m 0. It's a
kludge, but Perc 7s won't work as JBODs. The disk line looks something
like /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2... We
got to this point after hitting issues during rebuild where the Perc 7
+ btrfs on RAID 5 would kernel panic due to timeouts. Sheepdog md +
ext4 has proven more robust in our environment.
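The per-drive prep under those assumptions looks roughly like this (device
names and mount points are hypothetical; each drive is already exposed by
the controller as a single-disk RAID 0):

```shell
# Format each data drive ext4 with no reserved blocks (-m 0)
mkfs.ext4 -m 0 /dev/sdb
mkfs.ext4 -m 0 /dev/sdc
mount /dev/sdb /var/lib/sheepdisk1
mount /dev/sdc /var/lib/sheepdisk2

# First path holds metadata, the rest are md data disks:
sheep -c zookeeper:10.0.0.1:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2
```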
5) Journaling is useful only in corner cases and adds complexity that
may not make sense. It basically helps only VMs that constantly write
small amounts: mail servers, database images, etc. We currently do not
have any journal-enabled nodes.
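If you do have one of those corner cases, the journal is enabled when the
daemon starts; the size and path below are examples, and the option syntax
is from 0.8-era documentation, so verify against your build:

```shell
# Put the journal on fast storage, sized for the small-write burst
sheep -j dir=/var/lib/sheepjournal,size=256M \
      -c zookeeper:10.0.0.1:2181 \
      /var/lib/sheepmeta,/var/lib/sheepdisk1
```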
6) Object cache. While it can help performance, it has not been worth it
for our purposes. Even without it, I have an NFS home-directory server
that can send data to clients at near wire speed (over 100 MB/s on a
gigabit connection).
7) I experimented with directio; letting Linux do what it does best
turned out best for us. Your mileage may vary. That means no direct I/O
for us; the Linux kernel can do the caching.
8) We have soft and hard nofile limits set to 1,000,000. In our
monitoring, no sheepdog daemon currently uses more than 800.
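Raising the limits is a pair of limits.conf lines, and the usage is easy to
spot-check via /proc (the daemon name below assumes the sheep process):

```shell
# /etc/security/limits.conf entries (values from the setup above):
#   *  soft  nofile  1000000
#   *  hard  nofile  1000000

# Show the current soft limit for this shell
ulimit -n

# Count a running sheep daemon's open descriptors (prints 0 if not running)
ls /proc/$(pidof sheep)/fd 2>/dev/null | wc -l
```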
9) Not directly related, but VM performance is helped by adding
transparent_hugepage=always to the kernel command line in grub.
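On Ubuntu that means a grub default plus a regenerate (fragment below; the
existing contents of your GRUB_CMDLINE_LINUX_DEFAULT will differ):

```shell
# /etc/default/grub (fragment):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=always"

# Apply and verify after reboot:
update-grub
cat /sys/kernel/mm/transparent_hugepage/enabled   # [always] madvise never
```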
10) Ensure servers are time-synced using NTP.
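On 14.04 that's the ntp package; a quick check that peers are reachable and
the node is actually synchronized:

```shell
apt-get install ntp

# An asterisk in the first column marks the selected sync peer
ntpq -p
```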
On 10/02/2014 12:08 PM, Philip Crotwell wrote:
> I have a small sheepdog cluster and when I reboot a node, for example
> after applying security patches, it takes a long time to recover
> afterwards. The data volume is not that large, 545 GB, but it takes
> close to an hour to finish the recovery. The problem with this is that
> virtual machines on the node that was rebooted do not themselves boot
> until after the recovery finishes, meaning that for a node reboot that
> takes maybe 2 minutes, I have an hour of downtime for the virtual
> machines. Virsh itself even locks up during the recovery process as
> well, so you can't even do "virsh list".
> It seems like qemu/libvirt on the node should continue to function
> during the recovery process by making use of the other nodes that are
> up and functional. Is this possible? Is there any other way to make it
> so the virtual machines can start up before the recovery process is
> finished? Or to reduce the time it takes to do the recovery process?
> This is on ubuntu trusty (14.04) so sheepdog 0.7.5-1. Is this is
> improved in 0.8 which will be in 14.10 later this month?
> Here is an example libvirt device for a sheepdog disk:
> <disk type='network' device='disk'>
> <driver name='qemu'/>
> <source protocol='sheepdog' name='xxxxxx'/>
> <target dev='hda' bus='ide'/>
> I do not have <host> elements, would explicitly adding multiple hosts help?