<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
span.E-MailFormatvorlage17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="DE" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-US">I have a relatively small cluster.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Debian Jessie, Sheepdog 0.8.3 with corosync (as packaged in the distribution)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">4 nodes with storage<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">2 nodes in gateway mode<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">About 12 active QEMU VMs (Windows &amp; Linux)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">After a while, the following errors start to show up in the logs:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">On nodes with storage and VMs:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Aug 10 12:13:00 ERROR [io 32282] err_to_sderr(110) Too many open files, oid=a3923c000010ba<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">On gateway nodes:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">ERROR [gway 3986] wait_forward_request(437) fail 56c6d200000342, Network error between sheep<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">This seems to be related to the load on all machines. By themselves, these errors don’t seem to have a noticeable impact on the operation or stability of the cluster.
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">But after a while the number of errors increases to hundreds per second, one node blocks, and the whole cluster stalls. After stopping the blocking node, the cluster continues to
work, but all VMs on that node had to be destroyed (forced off).<o:p>
</o:p></span></p>
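<p class="MsoNormal"><span lang="EN-US">To see how the error rate builds up before a freeze, the log lines can be bucketed per minute. The following is only a minimal sketch: it assumes every log line starts with a timestamp like the one in the storage-node excerpt above, and the log file path is passed on the command line (the function name errors_per_minute is made up for this example).<o:p></o:p></span></p>

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Assumed line layout, taken from the storage-node excerpt:
# "Aug 10 12:13:00 ERROR [io 32282] err_to_sderr(110) Too many open files, oid=..."
# Group 1 captures the timestamp truncated to the minute.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} ERROR \[")


def errors_per_minute(lines: Iterable[str]) -> Counter:
    """Count ERROR lines per minute; lines that do not match are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Each argument is the path of a log file to summarise.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            counts = errors_per_minute(log)
        # Counter keeps first-seen order, which is chronological for a log.
        for minute, count in counts.items():
            print(f"{minute}  {count}")
```

<p class="MsoNormal"><span lang="EN-US">Running it against the log of each node should show whether the rate climbs gradually or jumps at once, and on which node it starts.<o:p></o:p></span></p>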
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">The last time this happened, we also lost data (log from another node):<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">ALERT [rw] fetch_object_list(933) some objects may be not recovered at epoch 86<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">I tried raising the limit with ulimit -SHn 1048576, but this had no effect.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">lsof | wc -l returns values between 20000 and 150000.
<o:p></o:p></span></p>
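<p class="MsoNormal"><span lang="EN-US">A ulimit set in a shell applies only to that shell and to processes started from it, so it does not change the limit of a daemon that is already running. lsof | wc -l also counts system-wide and may list the same file more than once, so it is not a per-process descriptor count. The sketch below, assuming a Linux /proc filesystem, reads the limit a given process actually runs with and counts the descriptors it holds; the function names are made up for this example.<o:p></o:p></span></p>

```python
import os
import sys
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) from the text of /proc/<pid>/limits; None means unlimited."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Line layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line[len(prefix):].split()[:2]
            to_int = lambda v: None if v == "unlimited" else int(v)
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count descriptors held by a process (root is needed for other users' processes)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_report(pid: int, proc_root: str = "/proc") -> str:
    """Summarise descriptor usage against the limits the process really has."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        soft, hard = parse_max_open_files(f.read())
    used = count_open_fds(pid, proc_root)
    return f"pid {pid}: {used} open fds, soft limit {soft}, hard limit {hard}"


if __name__ == "__main__":
    # Each argument is a process ID to inspect.
    for arg in sys.argv[1:]:
        print(fd_report(int(arg)))
```

<p class="MsoNormal"><span lang="EN-US">Saved under a name of your choice and called with the PID of the sheep daemon (for example from pidof sheep), this shows whether the running process still has a low soft limit despite the ulimit call.<o:p></o:p></span></p>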
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Questions: Is this a known bug that has been fixed in a newer version? Are there any parameters I can tune to work around the problem?
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">I would like to keep using Sheepdog, but the freezes and the data loss are showstoppers.
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Oliver Günther<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
</div>
*---*<br>
Oliver Günther<br>
CS Computer &amp; Service GmbH<br>
Banksstrasse 4<br>
20097 Hamburg<br>
fon: +49 40 8818070<br>
fax: +49 40 88180717<br>
web: www.cs-gmbh.net<br>
mail: info@cs-gmbh.net<br>
<br>
Managing Director: Oliver Günther
</body>
</html>