[sheepdog-users] err_to_sderr(xxx) - Too many open files & ERROR [gway xxx] wait_forward_request(437) fail 56c6d200000342, Network error between sheep
Oliver Günther | CS Computer & Service GmbH
Oliver.Guenther at cs-gmbh.net
Wed Aug 10 12:47:30 CEST 2016
I have a relatively small cluster.
Debian Jessie, Sheepdog 0.8.4 with corosync (as packaged by the distribution)
4 Nodes with Storage
2 Nodes in Gateway Mode
About 12 Active QEMU VMs (Windows & Linux)
After a while, the following errors start to show up in the logs:
On Nodes with Storage and VMs:
Aug 10 12:13:00 ERROR [io 32282] err_to_sderr(110) Too many open files, oid=a3923c000010ba
On Gateway Nodes:
ERROR [gway 3986] wait_forward_request(437) fail 56c6d200000342, Network error between sheep
It seems to be related to the amount of load on all machines. On their own, these errors don't seem to have a noticeable impact on the operation and stability of the cluster.
But after a while the number of errors increases to hundreds per second, one node blocks, and the whole cluster is on hold. After stopping the one node that blocks everything, the cluster continues to work. But all VMs on that node had to be destroyed (forcibly powered off).
The last time this happened, we also had data loss (log from another node):
ALERT [rw] fetch_object_list(933) some objects may be not recovered at epoch 86
I played around with ulimit -SHn 1048576, but this didn't have any effect.
lsof | wc -l returns values between 20000 and 150000.
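For reference, two general points about the numbers above: ulimit -SHn only changes the limit of the shell it is run in and of processes started from that shell afterwards, and lsof | wc -l counts entries for the whole system rather than the descriptors of one process. A minimal sketch for comparing the per-process descriptor count of each daemon against its effective limit, read from /proc, could look like the following. It assumes a Linux /proc layout and that the daemon's process name is "sheep"; the function names are made up for this illustration, and it needs to run as root (or as the daemon's user) to read another process's fd directory.

```python
import os
from pathlib import Path


def find_pids(name, proc_root="/proc"):
    """Return sorted PIDs whose command name (/proc/<pid>/comm) equals `name`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no /proc available (not Linux, or not mounted)
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            comm = (Path(proc_root) / entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited meanwhile or is not readable
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)


def count_open_fds(pid, proc_root="/proc"):
    """Number of file descriptors currently open in process `pid`."""
    return len(os.listdir(Path(proc_root) / str(pid) / "fd"))


def _parse_limit(value):
    # The kernel prints "unlimited" for limits without a numeric bound.
    return None if value == "unlimited" else int(value)


def max_open_files(pid, proc_root="/proc"):
    """Return the (soft, hard) 'Max open files' limit from /proc/<pid>/limits."""
    with open(Path(proc_root) / str(pid) / "limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, <soft>, <hard>, files
                fields = line.split()
                return _parse_limit(fields[3]), _parse_limit(fields[4])
    raise ValueError("no 'Max open files' line found")


if __name__ == "__main__":
    # "sheep" as the process name is an assumption; adjust if it differs.
    for pid in find_pids("sheep"):
        soft, hard = max_open_files(pid)
        print(f"sheep pid={pid}: {count_open_fds(pid)} fds open, "
              f"soft limit {soft}, hard limit {hard}")
```

If the reported soft limit is still a small default rather than 1048576, the raised ulimit never reached the running daemon, and the limit would have to be set wherever the daemon is actually started.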
Question: Is this a known error that has been solved in a newer version? Do I have any options to solve the problem via parameters?
I would like to continue using sheepdog, but the freezes and the data loss are showstoppers.
CS Computer & Service GmbH
fon: +49 40 8818070
fax: +49 40 88180717
mail: info at cs-gmbh.net
Geschäftsführer: Oliver Günther