[sheepdog-users] Cluster didn't blacklist faulty node

Liu Yuan namei.unix at gmail.com
Mon Jun 30 08:11:44 CEST 2014


On Sat, Jun 28, 2014 at 02:56:18PM +0200, Fabian Zimmermann wrote:
> Hello,
> 
> just had a strange problem with one of my nodes. I was still able to
> ping the system and it looks like zookeeper still got updates, so the
> node wasn't removed out of the cluster, but other sheepdogs were unable
> to connect to the system - even SSH was unreachable.
> 
> --
> Jun 28 12:54:00  ERROR [gway 23523] do_write(281) failed to write to
> socket: Resource temporarily unavailable
> Jun 28 12:54:00  ERROR [gway 23523] send_req(319) failed to send request
> a3, 466944: Resource temporarily unavailable
> Jun 28 12:54:01  ERROR [gway 23477] connect_to(193) failed to connect to
> 192.168.20.24:7000: Operation now in progress
> --
> 
> Sheepdog was really slow - "dog vdi list" didn't complete within
> minutes. I had to hard-reset the faulty node to get performance back.
> 
> Is there anything I can do to handle such problems a bit smarter?

For now you need to unplug the network cable of the faulty node or power it
down. You can file a bug requesting a way to forcibly kick out a node that is
still alive but cannot be reached by admins.
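
Until something like that exists, one way to at least notice such a node early
is to probe the sheep port directly instead of relying on ping. The following
is only a rough sketch, not part of sheepdog: the function names are made up
for illustration, the port 7000 and the address are taken from the log above,
and the 3 second timeout is an arbitrary assumption. It only detects the
problem; removing the node still requires the manual steps described above.

```python
import socket


def sheep_port_reachable(host, port=7000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a node can answer ping while its sheep port
    no longer accepts connections, so we test the TCP port itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def unreachable_nodes(nodes, timeout=3.0):
    """Return the (host, port) pairs from nodes that fail the TCP probe."""
    return [(h, p) for h, p in nodes if not sheep_port_reachable(h, p, timeout)]


if __name__ == "__main__":
    # Address taken from the log excerpt; replace with your own node list.
    for host, port in unreachable_nodes([("192.168.20.24", 7000)]):
        print("sheep port not reachable: %s:%d" % (host, port))
```

Run from cron or a monitoring system on another machine, this would flag a
node that is pingable but no longer accepts connections, so an admin can
power it down before the whole cluster slows.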

Thanks
Yuan


