[sheepdog-users] Unexpected freeze of sheep on one node

Valerio Pachera sirio81 at gmail.com
Wed Nov 19 10:32:03 CET 2014


Last night I re-inserted node id0 into the cluster (without removing its metadata).
Recovery took a very long time, until 8:49 this morning.
Once it was done, sheep froze again.
After 10 minutes I had to kill it.

On node id0, sheep.log contains no useful messages:
Nov 19 09:37:40   INFO [main] recover_object_main(863) object recovery
progress  98%
Nov 19 09:43:59   INFO [main] recover_object_main(863) object recovery
progress  99%
Nov 19 09:49:54 NOTICE [main] cluster_recovery_completion(703) all
nodes are recovered, epoch 25

On node id1 I see a huge number of these messages:
Nov 19 09:58:33  ERROR [gway 8476] sockfd_cache_get_long(348) fallback
to non-io connection
Nov 19 09:58:33  ERROR [gway 8628] connect_to(193) failed to connect
to 192.168.5.44:7000: Connection refused
Nov 19 09:58:33  ERROR [gway 8630] connect_to(193) failed to connect
to 192.168.5.44:7000: Connection refused
Nov 19 09:58:33  ERROR [gway 6514] connect_to(193) failed to connect
to 192.168.5.44:3333: Connection refused
Nov 19 09:58:33  ERROR [gway 8628] connect_to(193) failed to connect
to 192.168.5.44:7000: Connection refused

With these 'Connection refused' messages filtered out, I see the
poll-wait warnings and 'failed to connect' errors repeating until I
killed the node:
grep 'Nov 19' sheep.log | grep -v 'Connection refused' | grep -v 'fallback to non-io connection'
<cut>
Nov 19 09:45:04  ERROR [io 7515] sheep_exec_req(1096) failed Failed to
find requested tag
Nov 19 09:45:04  ERROR [io 7515] sheep_exec_req(1096) failed Failed to
find requested tag
Nov 19 09:45:04  ERROR [io 7515] sheep_exec_req(1096) failed Failed to
find requested tag
Nov 19 09:49:54 NOTICE [main] cluster_recovery_completion(703) all
nodes are recovered, epoch 25
Nov 19 09:50:08   WARN [gway 8629] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 19 09:50:08   WARN [gway 8628] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
Nov 19 09:50:13   WARN [gway 8629] wait_forward_request(389) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait
again
<cut>
Nov 19 09:51:19  ERROR [gway 8630] connect_to(193) failed to connect
to 192.168.5.44:3333: Operation now in progress
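
Instead of filtering messages out one by one, a rough way to see which ones dominate is to count entries per log level and function. This is only a sketch: summarize_sheep_log is a hypothetical helper, not part of sheepdog, and it assumes each entry in the real sheep.log sits on a single line in the layout shown above ("Mon DD HH:MM:SS LEVEL [thread] function(line) message").

```shell
# Hypothetical helper (not part of sheepdog): count log entries per
# "LEVEL function" pair for one day.
# Assumption: one entry per line, laid out as
#   Mon DD HH:MM:SS LEVEL [thread id] function(line) message
summarize_sheep_log() {
    # $1 = date prefix such as 'Nov 19', $2 = path to the log file
    grep "^$1" "$2" | awk '
        {
            # The thread tag is "[main]" or "[gway 8628]", so it spans
            # one or two fields; advance to the field that closes it.
            i = 5
            while (i <= NF && $i !~ /\]$/) i++
            fn = $(i + 1)
            sub(/\(.*/, "", fn)      # drop the "(line)" suffix
            count[$4 " " fn]++       # $4 is the log level
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

It could be run as: summarize_sheep_log 'Nov 19' sheep.log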

I don't understand what is wrong with this node.


