[sheepdog-users] Unexpeted freeze of sheep on one node

Valerio Pachera sirio81 at gmail.com
Wed Nov 19 00:12:04 CET 2014


This is my production cluster with 3 nodes:

192.168.6.44 (id0)
192.168.6.45 (id1)
192.168.6.23 (id2)

At about 23:00, node id0 stopped working correctly, till it got
disconnected from the cluster.
Sheep was still alive but dog node list wasn't working and kill didn't
work either.
I had to kill -9.

It seems some node started answering slow.
First I thought of a backup, but there were no backup running on any
of the nodes.
All nodes have 2 nics. From the log I see they were falling back on
the non I/O nic.
I still don't get why?
Please, have a look at the logs here and give me a hint if you have any.

Sheepdog version is 0.8.2.

This is what I see in sheep.log of 192.168.6.44:

Nov 18 19:42:49   WARN [gway 23177] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 19:42:49   WARN [gway 23164] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 19:42:49   WARN [gway 23162] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 20:42:28   WARN [gway 4270] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 20:42:28   WARN [gway 4254] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 20:42:28   WARN [gway 4272] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 20:42:28   WARN [gway 4271] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 21:12:28   WARN [gway 12084] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:43:17   WARN [gway 32062] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again

Nov 18 22:59:05  ERROR [main] zk_create_seq_node(250) failed,
path:/sheepdog/queue/, connection loss
Nov 18 22:59:07  ERROR [gway 1686] wait_forward_request(437) fail
2e52400000000, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1775] wait_forward_request(437) fail
2e52400000000, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1727] wait_forward_request(437) fail
2e52400000a00, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1780] wait_forward_request(437) fail
2e52400007921, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1787] wait_forward_request(437) fail
2e52400015000, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1772] wait_forward_request(437) fail
2e52400008e00, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1786] wait_forward_request(437) fail
2e5240001514c, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1774] wait_forward_request(437) fail
2e52400016600, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1776] wait_forward_request(437) fail
2e52400003200, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1752] wait_forward_request(437) fail
2e52400002c01, Request has an old epoch
Nov 18 22:59:07  ERROR [gway 1757] wait_forward_request(437) fail
2e52400003400, Request has an old epoch
Nov 18 22:59:07  ERROR [main] zk_get_data(263) failed,
path:/sheepdog/queue/0000005171, session expired
Nov 18 22:59:07  ERROR [main] zk_find_seq_node(502) failed, session expired
Nov 18 22:59:07  ERROR [main] zk_queue_push(533) failed,
path:/sheepdog/queue/, connection loss
Nov 18 22:59:07  ERROR [main] add_event(681) failed, type: 3, connection loss
Nov 18 22:59:07  ERROR [main] zk_event_handler(1229) detect a session
timeout. reconnecting...
Nov 18 22:59:08   INFO [main] send_join_request(787) IPv4
ip:192.168.5.44 port:7000
Nov 18 22:59:08   INFO [main] zk_find_master(836) detect master leave,
start to compete master
Nov 18 22:59:10   INFO [main] zk_event_handler(1231) reconnected
Nov 18 22:59:12   WARN [gway 1775] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1774] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1772] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1686] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1787] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1727] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1780] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1752] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1757] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1776] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:12   WARN [gway 1786] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 22:59:13  ERROR [gway 1775] wait_forward_request(437) fail
2e52400000000, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1727] wait_forward_request(437) fail
2e52400000a00, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1752] wait_forward_request(437) fail
2e52400002c01, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1780] wait_forward_request(437) fail
2e52400007921, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1786] wait_forward_request(437) fail
2e5240001514c, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1774] wait_forward_request(437) fail
2e52400016600, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1772] wait_forward_request(437) fail
2e52400008e00, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1686] wait_forward_request(437) fail
2e52400000000, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1757] wait_forward_request(437) fail
2e52400003400, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1776] wait_forward_request(437) fail
2e52400003200, Request has an old epoch
Nov 18 22:59:13  ERROR [gway 1787] wait_forward_request(437) fail
2e52400015000, Request has an old epoch


On 19.168.6.45
<cut>
Nov 18 23:00:51   WARN [gway 19852] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 23:00:51   WARN [gway 19850] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 23:00:53  ERROR [main] check_request_epoch(151) old node
version 23, 22 (WRITE_PEER)
Nov 18 23:00:58   WARN [gway 19860] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 23:00:58   WARN [gway 19862] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
<cut>
Nov 18 23:01:22  ERROR [block] do_read(236) failed to read from
socket: -1, Resource temporarily unavailable
Nov 18 23:01:22  ERROR [block] exec_req(347) failed to read a response
Nov 18 23:01:22  ALERT [block] do_get_vdis(499) failed to get vdi
bitmap from IPv4 ip:192.168.5.44 port:7000
<cut>
Nov 18 23:01:45  ERROR [gway 19842] connect_to(193) failed to connect
to 192.168.5.44:3333: Operation now in progress
Nov 18 23:01:45  ERROR [gway 19842] sockfd_cache_get_long(348)
fallback to non-io connection
Nov 18 23:01:45  ERROR [gway 19863] connect_to(193) failed to connect
to 192.168.5.44:3333: Operation now in progress
Nov 18 23:01:45   WARN [gway 19879] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 23:01:45  ERROR [gway 19863] sockfd_cache_get_long(348)
fallback to non-io connection
<cut>
Nov 18 23:03:23  ERROR [io 19829] do_read(236) failed to read from
socket: -1, Resource temporarily unavailable
Nov 18 23:03:23  ERROR [io 19829] exec_req(347) failed to read a response
Nov 18 23:03:23  ERROR [io 19829] sheep_exec_req(1096) failed Failed
to find requested tag
Nov 18 23:03:23  ERROR [rw] do_read(236) failed to read from socket:
-1, Resource temporarily unavailable
Nov 18 23:03:23  ERROR [rw] exec_req(347) failed to read a response
Nov 18 23:03:23  ALERT [rw] fetch_object_list(931) cannot get object
list from 192.168.5.44:7000
Nov 18 23:03:23  ALERT [rw] fetch_object_list(933) some objects may be
not recovered at epoch 23
Nov 18 23:03:28   INFO [main] recover_object_main(863) object recovery
progress   1%
Nov 18 23:03:30   INFO [main] recover_object_main(863) object recovery
progress   2%
Nov 18 23:03:32   INFO [main] recover_object_main(863) object recovery
progress   3%
Nov 18 23:03:33   INFO [main] recover_object_main(863) object recovery
progress   4%

on 192.168.6.23
Kind of the same

Nov 18 23:00:03  ERROR [main] check_request_epoch(151) old node
version 22, 21 (WRITE_PEER)
Nov 18 23:00:03  ERROR [main] check_request_epoch(151) old node
version 22, 21 (WRITE_PEER)
Nov 18 23:00:05  ERROR [gway 25655] wait_forward_request(437) fail
e18f0400000693, Request has an old epoch
Nov 18 23:00:05  ERROR [main] check_request_epoch(151) old node
version 23, 22 (WRITE_PEER)
Nov 18 23:00:05  ERROR [main] check_request_epoch(151) old node
version 23, 22 (WRITE_PEER)
Nov 18 23:00:05  ERROR [gway 25654] wait_forward_request(437) fail
e18f0400000693, Request has an old epoch
Nov 18 23:00:05  ERROR [gway 25655] wait_forward_request(437) fail
e18f0400000693, Request has an old epoch
Nov 18 23:00:10   WARN [gway 25656] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 18 23:00:10   WARN [gway 25657] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
<cut>
Nov 18 23:03:12  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:13  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:13  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:13  ERROR [rw] do_read(236) failed to read from socket:
-1, Resource temporarily unavailable
Nov 18 23:03:13  ERROR [rw] exec_req(347) failed to read a response
Nov 18 23:03:13  ALERT [rw] fetch_object_list(931) cannot get object
list from 192.168.5.44:7000
Nov 18 23:03:13  ALERT [rw] fetch_object_list(933) some objects may be
not recovered at epoch 23
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [main] check_request_epoch(151) old node
version 24, 23 (WRITE_PEER)
Nov 18 23:03:15  ERROR [gway 25588] do_write(281) failed to write to
socket: Resource temporarily unavailable
Nov 18 23:03:15  ERROR [gway 25588] send_req(319) failed to send
request a5, 53248: Resource temporarily unavailable
Nov 18 23:03:33   INFO [main] recover_object_main(863) object recovery
progress   1%
Nov 18 23:03:45   INFO [main] recover_object_main(863) object recovery
progress   2%
Nov 18 23:03:54   INFO [main] recover_object_main(863) object recovery
progress   3%
<cut>



More information about the sheepdog-users mailing list