[sheepdog-users] Cluster hung...

Fri Jul 27 04:06:07 CEST 2012

On 07/26/2012 09:33 PM, Bastian Scholz wrote:
> Hi List,
> 
> I have a small cluster of 3 nodes, with 1 gateway on each node,
> only one working sheep on one node, and three working sheep on
> the other two nodes...
> 
> When a node fails, the recovery process starts as expected, but
> when the failed node joins again, the cluster hangs for a long
> time and stops responding to many collie commands...
> collie node info and collie node recovery don't give an answer
> for at least 20 minutes.
> 
> The connected kvm guests can't access the VDIs during this time,
> and the Windows guests don't survive it...
> 
> I am using sheepdog from sheepdog_0.4.0-0+tek2b-7_amd64.deb...
> 
> Could someone briefly explain what happens here and whether I
> can avoid these hangs?
> 

I can't reproduce it on the latest master. I tried the following steps:

 1) start 5 sheep: node 0 as gateway only, nodes [1-4] as storage nodes.

 VM <---> g(0) <---> s(1,2,3,4)

 2) install a new OS
 3) during installation, I tried the following node failure simulations:
	a) kill -9 pid (one of node[1-4]), then join it back
	b) collie node kill node_id (one of [1-4]), then join it back

In both cases, the VM installation continued without any problem.
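
For anyone who wants to repeat step 3, here is a minimal Python sketch of
the two failure simulations. It only wraps the two commands quoted above
(kill -9 and collie node kill). The helper names and the rejoin_cmd
parameter are hypothetical: the exact command used to join the node back
is not given here, so pass whatever command you use to restart sheep on
that node.

```python
import shlex
import subprocess


def failure_commands(method, target):
    """Return the argv list for one of the two failure simulations."""
    if method == "sigkill":
        # a) kill -9 pid: target is the PID of one storage sheep
        return ["kill", "-9", str(target)]
    if method == "collie":
        # b) collie node kill node_id: target is the node id
        return ["collie", "node", "kill", str(target)]
    raise ValueError(f"unknown method: {method!r}")


def simulate_failure(method, target, rejoin_cmd, dry_run=True):
    """Kill one storage node, then run the caller-supplied rejoin command.

    rejoin_cmd is an assumption: it is whatever command restarts sheep
    on the killed node in your setup.
    """
    steps = [failure_commands(method, target), shlex.split(rejoin_cmd)]
    for argv in steps:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return steps
```

With dry_run=True (the default) the commands are only printed, so the
sequence can be checked before it is run against a live cluster.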

Thanks,
Yuan