[sheepdog-users] Cluster hung...

Fri Jul 27 12:49:23 CEST 2012

At Fri, 27 Jul 2012 09:00:53 +0200,
Bastian Scholz wrote:
> 
> Hi Yuan,
> 
> On 2012-07-27 04:06, Liu Yuan wrote:
> > I can't reproduce it on latest master. I have tried the following steps:
> 
> Just for my information, since I am trying to find a test case that is
> easy to reproduce: how big was your test setup?
> 
> I tried it just now with a small setup and don't see the problem,
> but on my working cluster I can reproduce it. I have no logs at the
> moment, so I can only describe it from memory...
> 
> When I rejoin the sheep, I can see that one or more of the
> non-failed sheep use a lot of CPU time. After they finish, they
> log an "update_cluster_info"; after that, the actual recovery of
> objects starts and the cluster responds normally again.
> 
> 
> My working cluster...
> # collie node list
> > M   Id   Host:Port         V-Nodes       Zone
> > -    0   10.0.1.61:7000          0 1023475722
> > -    1   10.0.1.61:7001         64 1023475722
> > -    2   10.0.1.62:7000          0 1040252938
> > -    3   10.0.1.62:7001         64 1040252938
> > -    4   10.0.1.62:7002         64 1040252938
> > -    5   10.0.1.62:7003         64 1040252938
> > -    6   10.0.1.63:7000          0 1057030154
> > -    7   10.0.1.63:7001         64 1057030154
> > -    8   10.0.1.63:7002         64 1057030154
> > -    9   10.0.1.63:7003         64 1057030154
> 
> # collie node info
> > Id	Size	Used	Use%
> >  0	9.7 GB	0.0 MB	  0%
> >  1	541 GB	19 GB	  3%
> >  2	9.8 GB	0.0 MB	  0%
> >  3	549 GB	14 GB	  2%
> >  4	549 GB	11 GB	  2%
> >  5	549 GB	15 GB	  2%
> >  6	9.8 GB	0.0 MB	  0%
> >  7	549 GB	12 GB	  2%
> >  8	549 GB	14 GB	  2%
> >  9	549 GB	14 GB	  2%
> > Total	3.8 TB	99 GB	  2%
> >
> > Total virtual image size	256 GB
> 
> I killed all sheep in zone 1057030154, waited for recovery to finish,
> and rejoined the complete zone. After this, node 1 used a lot of
> CPU time for the entire duration of the hang...
> When I had the 20-minute hang, I had about twice as much
> data in the cluster...
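
For anyone trying to repeat the zone-kill test described above, it helps to
see which nodes share a zone. Here is a minimal sketch in Python, assuming the
`collie node list` output keeps the five-column layout quoted above. The
function name nodes_by_zone and the embedded sample are illustrative only and
not part of sheepdog:

```python
from collections import defaultdict

# Sample copied from the `collie node list` output quoted in this message.
SAMPLE = """\
M   Id   Host:Port         V-Nodes       Zone
-    0   10.0.1.61:7000          0 1023475722
-    1   10.0.1.61:7001         64 1023475722
-    2   10.0.1.62:7000          0 1040252938
-    3   10.0.1.62:7001         64 1040252938
-    4   10.0.1.62:7002         64 1040252938
-    5   10.0.1.62:7003         64 1040252938
-    6   10.0.1.63:7000          0 1057030154
-    7   10.0.1.63:7001         64 1057030154
-    8   10.0.1.63:7002         64 1057030154
-    9   10.0.1.63:7003         64 1057030154
"""


def nodes_by_zone(text):
    """Group node rows by zone; each row becomes (id, host_port, vnodes)."""
    zones = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five fields with a numeric Id in the second column;
        # this check skips the header line and anything unexpected.
        if len(fields) != 5 or not fields[1].isdigit():
            continue
        _, node_id, host_port, vnodes, zone = fields
        zones[zone].append((int(node_id), host_port, int(vnodes)))
    return dict(zones)


if __name__ == "__main__":
    for zone, nodes in nodes_by_zone(SAMPLE).items():
        ids = [n[0] for n in nodes]
        total_vnodes = sum(n[2] for n in nodes)
        print(f"zone {zone}: nodes {ids}, {total_vnodes} v-nodes")
```

With the listing above, zone 1057030154 groups nodes 6 to 9, which are the
sheep that were killed and rejoined in the test.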

I encountered a similar problem with 6 servers today.  I'll dig into
it this weekend.

Thanks,

Kazutaka