[sheepdog-users] qemu/KVM+sheepdog HA problem when one sheep leaves the cluster

Zhaohui Yang yezonghui at gmail.com
Mon Aug 17 03:57:00 CEST 2015


Thank you, Hitoshi. After testing the iSCSI solution we will try the
reconnection feature. For that test we'll have to trigger a "software
failure" and then reconnect in a reproducible way. Do you have any
suggestions on how to achieve that?
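The kind of test we have in mind is simply killing the local sheep daemon and
then restarting it (a sketch only; the store path below is the common default
from the sheepdog documentation, not necessarily our actual setup):

```shell
# Simulate a "software failure" on one node by killing the local
# sheep daemon; VMs on this node should see their VDI connections drop.
pkill sheep

# ... observe VM I/O behavior while the daemon is down ...

# Restart the daemon; with the reconnection feature, the QEMU block
# driver is expected to re-establish I/O on its own.
sheep /var/lib/sheepdog
```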

2015-08-14 22:19 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:

> On Fri, Aug 14, 2015 at 5:10 PM, Zhaohui Yang <yezonghui at gmail.com> wrote:
> > We have an openstack cluster whose compute nodes have sheepdog
> > installed and configured the "standard" way, as the sheepdog
> > documentation describes. That is, the qemu/KVM virtual machines use
> > sheepdog VDIs by attaching to the local sheepdog gateway through the
> > qemu block driver.
> >
> > When one sheep decides it has lost its connection with the rest of the
> > cluster nodes (caused by a software problem), all VMs on the same node
> > instantly "lose" their VDI and cannot work. In this regard the storage
> > layer is not fully HA. We would expect a better situation, where the
> > VMs keep working as long as the sheep daemons on other nodes are
> > functioning and the network is up.
>
> Currently, the qemu driver supports a reconnection feature. If you can
> simply restart the sheep daemon, QEMU VMs can resume their I/O.
> # BTW, the reconnection feature does not work well with VDI locking, so
> VDI locking is disabled in v0.9.2. Could you update sheepdog to v0.9.2?
>
> >
> > We found a "kind of" solution to this in the following article, where
> > the sheepdog storage cluster is deployed separately from the compute
> > cluster, connected via a switch, through an iSCSI interface exposed by
> > tgtd sitting in front of sheepdog. The client has to use a special
> > iSCSI multipath tool to be able to fail over to another tgtd upon
> > sheepdog node failure.
> >
> > http://events.linuxfoundation.org/sites/events/files/slides/COJ2015_Sheepdog_20150604.pdf
> >
> > However, this solution adds two more layers of complexity: the iSCSI
> > multipath tool on the client side and tgtd on the server side. The
> > performance will also degrade, since everything has to go through the
> > network and iSCSI emulation. We surely don't want to go in this
> > direction if there are simpler solutions we are not aware of, e.g.
> > patches to sheepdog or the qemu block driver that perform automatic
> > failover.
>
> Based on our performance evaluation, the iSCSI components (e.g. tgtd)
> become bottlenecks in some extreme cases, e.g. 10 Gbps networks with
> read-heavy traffic. Did you evaluate performance?
> # Of course, the management cost will also increase.
>
> Because of the design principle of sheepdog, the QEMU driver doesn't
> support HA. The VMs and the sheep process coexist on a single host, and
> they are assumed to live or die together, because a power or network
> failure can affect both of them. However, this principle doesn't account
> for software faults. For that problem, the reconnection feature
> mentioned above is provided. Could you test the reconnection feature?
>
> Thanks,
> Hitoshi
>
> >
> > Please point us to a link if such a solution exists, or share your
> > ideas on how to avoid "the VM losing its VDI when the sheep on the
> > same machine leaves the cluster". The version of sheepdog we are
> > using is 0.9.0.
> >
> >
> > Thanks and Regards,
> >
> > Yang, Zhaohui
> >
> > --
> > sheepdog-users mailing lists
> > sheepdog-users at lists.wpkg.org
> > https://lists.wpkg.org/mailman/listinfo/sheepdog-users
> >
>



-- 
Regards,

Yang, Zhaohui
