At Fri, 27 Jul 2012 09:00:53 +0200,
Bastian Scholz wrote:
>
> Hi Yuan,
>
> On 2012-07-27 04:06, Liu Yuan wrote:
> > I can't reproduce it on latest master. I have tried following steps:
>
> Just for my information, since I am trying to find a test case that
> is easy to reproduce: how big was your test setup?
>
> I tried it just now with a small setup and don't see the problem,
> but on my working cluster I can reproduce it. I have no logs at the
> moment, so I can only describe it from memory...
>
> When I rejoin the sheeps, I can see that one or more of the
> non-failed sheeps need a lot of CPU time. After they finish, they
> log an ''update_cluster_info''; after that, the real recovery of
> objects starts and the cluster reacts normally again.
>
>
> My working cluster...
> # collie node list
> > M   Id   Host:Port         V-Nodes   Zone
> > -    0   10.0.1.61:7000          0   1023475722
> > -    1   10.0.1.61:7001         64   1023475722
> > -    2   10.0.1.62:7000          0   1040252938
> > -    3   10.0.1.62:7001         64   1040252938
> > -    4   10.0.1.62:7002         64   1040252938
> > -    5   10.0.1.62:7003         64   1040252938
> > -    6   10.0.1.63:7000          0   1057030154
> > -    7   10.0.1.63:7001         64   1057030154
> > -    8   10.0.1.63:7002         64   1057030154
> > -    9   10.0.1.63:7003         64   1057030154
> # collie node info
> > Id      Size      Used      Use%
> >  0      9.7 GB    0.0 MB     0%
> >  1      541 GB    19 GB      3%
> >  2      9.8 GB    0.0 MB     0%
> >  3      549 GB    14 GB      2%
> >  4      549 GB    11 GB      2%
> >  5      549 GB    15 GB      2%
> >  6      9.8 GB    0.0 MB     0%
> >  7      549 GB    12 GB      2%
> >  8      549 GB    14 GB      2%
> >  9      549 GB    14 GB      2%
> > Total   3.8 TB    99 GB      2%
> >
> > Total virtual image size    256 GB
>
> I killed all sheeps in zone 1057030154, waited for recovery to finish,
> and rejoined the complete zone. After this, node 1 needs a lot of
> CPU time for the whole duration of the hang...
> When I had the 20-minute hang, I had around double the amount of
> data in the cluster...

I encountered a similar problem with 6 servers today. I'll dig into
it this weekend.

Thanks,

Kazutaka