[sheepdog-users] node recovery after reboot, virtual machines slow to boot

Fri Oct 3 17:45:04 CEST 2014

Thanks for the info. One thing that jumped out, and that I did not
mention but that might be important, is that I am using corosync rather
than zookeeper. No good reason for that other than that it was the
default when I set up sheepdog. Maybe I should try switching to
zookeeper to see if that makes a difference.

Thanks,
Philip

On Fri, Oct 3, 2014 at 11:32 AM, Andrew J. Hobbs <ajhobbs at desu.edu> wrote:
> Sheepdog prioritizes requests over recovery.  I have nodes with VMs
> flagged for autostart that will automatically fire on rebooting the
> server.  They will run a bit slower during startup, but the node does
> not need to be recovered for them to function.
>
> As a test, I just rebooted a non-critical server running three VMs.  The
> VMs were flagged autostart via virsh autostart <vm name>; we are running
> sheepdog 0.8.3 and the 14.04 LTS version of virsh/qemu.  The VMs started
> responding to pings (were booted) within 20 seconds of the host server
> responding to pings.  Web services are running and responsive on those VMs.
>
> dog node info:
> 14TB total, 2.0TB used, 12TB available, 14% used, total virtual
> image size 3.4TB
>
> Sheepdog recovery started Oct 03 10:02:25
> Sheepdog recovery ended Oct 03 11:30:40 epoch 200 (I've been upgrading
> machines one at a time recently)
> Total recovery time: 1:28:15
>
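
The recovery duration and usage figures quoted above can be checked with a few lines of Python. This is only a sketch: the year 2014 is taken from the mail date, and it assumes the first figure in the dog node info line is the total capacity (14TB).

```python
from datetime import datetime

# Timestamps copied from the recovery log lines quoted above.
# Assumption: both are on 2014-10-03 (the year comes from the mail date).
started = datetime(2014, 10, 3, 10, 2, 25)
ended = datetime(2014, 10, 3, 11, 30, 40)

recovery = ended - started
print(recovery)  # 1:28:15

# Assumption: 14TB is the total capacity and 2.0TB is the used space.
used_tb, total_tb = 2.0, 14.0
percent_used = round(100 * used_tb / total_tb)
print(percent_used)  # 14
```
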
> I build a custom dpkg for sheepdog for our servers, simply using the
> make debian target.
>
> Some important notes I've collected over the time we've run Sheepdog:
> 1) Do not, in a production system, use anything but zookeeper.
> 2) While it's attractive to look at btrfs/xfs for the backing
> filesystem, ext4 is the best choice for stability and throughput.
> 3) Use a dedicated network for sheepdog and zookeeper (most servers have
> 2-4 ports).  During recovery, especially, you can get saturation which
> will kill performance.
> 4) Use Sheepdog md over hardware raid.  The disks will balance much
> faster than a raid array will rebuild in the event of failure (note that
> this is specific to our hardware, Dell servers with up to 6 drives on a
> Perc 7 controller and may vary depending on your hardware).  Our nodes
> are raid 1 for OS and sheepmeta, ext4 with -m 0 during format for each
> drive exposed by the perc 7 as a raid 0 single disk.  It's a kludge, but
> Perc 7s won't work as JBODs. The disk line looks something like
> /var/lib/sheepmeta,/var/lib/sheepdisk1,/var/lib/sheepdisk2...  Got to
> this point after having issues during rebuild where the Perc 7 + btrfs
> on Raid 5 would cause a kernel panic due to timeouts. Sheepdog md +
> ext4 has proven to be more robust in our environment.
> 5) Journalling is useful only in corner cases, and adds complexity that
> may not make sense.  Basically, it helps only VMs that constantly write
> small amounts: mail servers, database images, etc.  We currently do not
> have any journal-enabled nodes.
> 6) Object cache.  While it can help performance, it's not been worth it
> for our purposes.  I still have an NFS home directory server that can send
> data to clients at near-line speed (over 100MB/s, gigabit connection).
> 7) I experimented with directio; allowing Linux to do what it does best
> works best for us.  Your mileage may vary.  That means no direct io for
> us, the Linux kernel can cache for us.
> 8) We have soft and hard nofile limits set to 1,000,000.  No sheepdog
> daemon is currently using over 800 during monitoring.
> 9) Not directly related, but performance is helped on VMs by enabling
> transparent_hugepage=always on the grub boot line.
> 10) Ensure servers are time sync'd using ntp.
>
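
Notes 8 and 9 above can be spot-checked on a Linux host with a short script. The following is a minimal sketch, not part of sheepdog: the helper names are made up for illustration, and it assumes the kernel exposes the transparent hugepage setting at /sys/kernel/mm/transparent_hugepage/enabled.

```python
import resource
from pathlib import Path

# Assumption: standard sysfs location of the transparent hugepage setting.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed mode from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def nofile_limits():
    """Return (soft, hard) nofile limits of the *current* process.

    For a running sheep daemon, read /proc/<pid>/limits instead.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(list(Path(f"/proc/{pid}/fd").iterdir()))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard}")
    if THP_PATH.exists():
        print("THP mode:", active_thp_mode(THP_PATH.read_text()))
```

Calling open_fd_count() with the pid of a sheep daemon (usually as root) gives the number to compare against the nofile limit.
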
> On 10/02/2014 12:08 PM, Philip Crotwell wrote:
>> Hi
>>
>> I have a small sheepdog cluster and when I reboot a node, for example
>> after applying security patches, it takes a long time to recover
>> afterwards. The data volume is not that large, 545GB, but it takes
>> close to an hour to finish the recovery. The problem with this is that
>> virtual machines on the node that was rebooted do not themselves boot
>> until after the recovery finishes, meaning that for a node reboot that
>> takes maybe 2 minutes, I have an hour of downtime for the virtual
>> machines. Virsh itself even locks up during the recovery process as
>> well, so you can't even do "virsh list".
>>
>> It seems like qemu/libvirt on the node should continue to function
>> during the recovery process by making use of the other nodes that are
>> up and functional. Is this possible? Is there any other way to make it
>> so the virtual machines can start up before the recovery process is
>> finished? Or to reduce the time it takes to do the recovery process?
>>
>> This is on ubuntu trusty (14.04) so sheepdog 0.7.5-1. Is this
>> improved in 0.8 which will be in 14.10 later this month?
>>
>> Here is an example libvirt device for a sheepdog disk:
>>    <disk type='network' device='disk'>
>>        <driver name='qemu'/>
>>        <source protocol='sheepdog' name='xxxxxx'/>
>>        <target dev='hda' bus='ide'/>
>>      </disk>
>>
>> I do not have <host> elements, would explicitly adding multiple hosts help?
>>
>> Thanks,
>> Philip
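
For reference, the disk definition in the question could be generated with an explicit <host> element as sketched below. The host name is a placeholder, 7000 is the default sheep port, and the helper name is made up for illustration. As far as I know, libvirt accepts at most one <host> for the sheepdog protocol, so the sketch adds a single host rather than several. Whether this helps during recovery is not answered in this thread.

```python
import xml.etree.ElementTree as ET


def sheepdog_disk(vdi_name, host=None, port=7000):
    """Build a libvirt <disk> element for a sheepdog VDI.

    host is optional; without it the element matches the one in the
    question and qemu connects to the local sheep daemon.
    """
    disk = ET.Element("disk", {"type": "network", "device": "disk"})
    ET.SubElement(disk, "driver", {"name": "qemu"})
    source = ET.SubElement(
        disk, "source", {"protocol": "sheepdog", "name": vdi_name}
    )
    if host is not None:
        ET.SubElement(source, "host", {"name": host, "port": str(port)})
    ET.SubElement(disk, "target", {"dev": "hda", "bus": "ide"})
    return disk


# "sheep-node2.example.org" is a hypothetical host name.
disk = sheepdog_disk("xxxxxx", host="sheep-node2.example.org")
print(ET.tostring(disk, encoding="unicode"))
```
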