On 07/26/2012 09:33 PM, Bastian Scholz wrote:
> Hi List,
>
> I have a small cluster, 3 nodes, with 1 gateway each and on one
> node only one working sheep, and three working sheeps on the
> other two nodes...
>
> When a node fails, the recovery process starts as expected, but
> when the failed node joins again, the cluster hangs for a long
> time without responding to a lot of collie commands...
> collie node info and collie node recovery don't give an answer
> for at least 20 minutes.
>
> The connected kvm guests can't access the VDIs during this time and
> the windows guests don't survive it...
>
> I am using sheepdog from sheepdog_0.4.0-0+tek2b-7_amd64.deb...
>
> Could someone briefly explain to me what happens here and whether I
> can avoid these hangs?
>

I can't reproduce it on the latest master. I tried the following steps:

1) start 5 sheeps, node 0 as gateway only, nodes [1-4] as storage nodes.

       VM <---> g(0) <---> s(1,2,3,4)

2) install a new OS
3) during installation, I tried the following node failure simulations:
   a) kill -9 pid (one of node[1-4]), then join it back
   b) collie node kill node_id (one of [1-4]), then join it back

In both cases, the VM installation continued without any problem.

Thanks,
Yuan