Hi Yuan,

On 2012-07-27 04:06, Liu Yuan wrote:
> I can't reproduce it on latest master. I have tried following steps:

Just for my information, while I try to find a test case that is easy to
reproduce: how big was your test setup? I tried it just now with a small
setup and don't see the problem, but on my working cluster I can
reproduce it.

I have no logs at the moment, so I can only describe it from memory:
when I rejoin the sheep, I can see that one or more of the non-failed
sheep need a lot of CPU time. After they finish, they log
"update_cluster_info"; only after that does the real recovery of objects
start and the cluster react normally again.

My working cluster:

# collie node list
> M   Id   Host:Port        V-Nodes        Zone
> -    0   10.0.1.61:7000         0  1023475722
> -    1   10.0.1.61:7001        64  1023475722
> -    2   10.0.1.62:7000         0  1040252938
> -    3   10.0.1.62:7001        64  1040252938
> -    4   10.0.1.62:7002        64  1040252938
> -    5   10.0.1.62:7003        64  1040252938
> -    6   10.0.1.63:7000         0  1057030154
> -    7   10.0.1.63:7001        64  1057030154
> -    8   10.0.1.63:7002        64  1057030154
> -    9   10.0.1.63:7003        64  1057030154

# collie node info
> Id      Size    Used  Use%
>  0    9.7 GB  0.0 MB    0%
>  1    541 GB   19 GB    3%
>  2    9.8 GB  0.0 MB    0%
>  3    549 GB   14 GB    2%
>  4    549 GB   11 GB    2%
>  5    549 GB   15 GB    2%
>  6    9.8 GB  0.0 MB    0%
>  7    549 GB   12 GB    2%
>  8    549 GB   14 GB    2%
>  9    549 GB   14 GB    2%
>
> Total   3.8 TB   99 GB    2%
>
> Total virtual image size: 256 GB

I killed all sheep in zone 1057030154, waited for recovery to finish,
and then rejoined the complete zone. After this, node 1 needed a lot of
CPU time for the whole duration of the hang.

When I had the 20-minute hang, I had around double the amount of data in
the cluster.

Thanks,
Bastian
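P.S. In case it helps with building a test case, here is a rough sketch of my
kill/rejoin procedure as a script. The store path, port layout, and the exact
sheep invocation are assumptions for illustration, not my exact commands; the
script only prints the commands so you can review and adapt them first.

```shell
#!/bin/sh
# Print (do not run) the commands to kill all sheep in one zone and rejoin
# them. Paths/ports are assumptions; adapt to your setup before piping to sh.

reproduce_cmds() {
    zone=$1; shift
    store=$1; shift   # assumed store layout: one subdirectory per port

    # 1. Kill every sheep daemon in the given zone.
    for p in "$@"; do
        printf 'pkill -f "sheep.*-p %s.*-z %s"\n' "$p" "$zone"
    done

    # 2. Wait here until object recovery on the surviving nodes finishes.
    echo '# wait for recovery on the surviving nodes to finish'

    # 3. Rejoin the complete zone.
    for p in "$@"; do
        printf 'sheep %s/%s -p %s -z %s\n' "$store" "$p" "$p" "$zone"
    done
}

# The zone I killed in the report above, with its four ports:
reproduce_cmds 1057030154 /var/lib/sheepdog 7000 7001 7002 7003
```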