[sheepdog-users] Cluster didn't blacklist faulty node

Fabian Zimmermann dev.faz at gmail.com
Sat Jun 28 14:56:18 CEST 2014


Hello,

I just had a strange problem with one of my nodes. I could still ping the
system, and it looks like ZooKeeper was still receiving updates, so the
node wasn't removed from the cluster. However, the other sheepdogs were
unable to connect to it, and even SSH was unreachable.

--
Jun 28 12:54:00  ERROR [gway 23523] do_write(281) failed to write to socket: Resource temporarily unavailable
Jun 28 12:54:00  ERROR [gway 23523] send_req(319) failed to send request a3, 466944: Resource temporarily unavailable
Jun 28 12:54:01  ERROR [gway 23477] connect_to(193) failed to connect to 192.168.20.24:7000: Operation now in progress
--
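
The state shown in the log (TCP connections to the sheep port failing
while ping still answers) can be checked from another machine with a
small probe. The following is only a sketch under assumptions: the
function names tcp_port_reachable and unreachable_nodes and the 3-second
timeout are made up for illustration; only the address
192.168.20.24:7000 comes from the log above. It does not change how
sheepdog or ZooKeeper decide cluster membership; it only reports nodes
whose port cannot be reached.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection performs the full TCP handshake, unlike ping,
        # so a node that answers ICMP but cannot accept connections fails here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def unreachable_nodes(nodes, timeout=3.0):
    """Return the (host, port) pairs from nodes that did not accept a connection."""
    return [(h, p) for h, p in nodes if not tcp_port_reachable(h, p, timeout)]
```

Calling unreachable_nodes([("192.168.20.24", 7000)]) from a monitoring
host would return that pair while the node is in the state described
above, and an empty list once it accepts connections again. What to do
with the result (alerting, fencing the node) is left open here.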

Sheepdog became very slow: "dog vdi list" did not complete even after
several minutes. I had to hard-reset the faulty node to get performance
back.

Is there anything I can do to handle such failures better, for example
so that the cluster excludes an unresponsive node on its own?


Thanks a lot,

Fabian


