[sheepdog-users] Cluster hung...

Fri Jul 27 09:00:53 CEST 2012

Hi Yuan,

On 2012-07-27 04:06, Liu Yuan wrote:
> I can't reproduce it on latest master. I have tried following steps:

Just for my information, while I try to find a test case that is easy
to reproduce: how big was your test setup?

I just tried it with a small setup and don't see the problem,
but on my working cluster I can reproduce it. I have no logs at the
moment, so I can only describe it from memory...

When I rejoin the sheep, I can see that one or more of the
non-failed sheep need a lot of CPU time. After they finish, they
log an ''update_cluster_info''; after that, the real recovery of
objects starts and the cluster responds normally again.

My working cluster...
# collie node list
> M   Id   Host:Port         V-Nodes       Zone
> -    0   10.0.1.61:7000      	 0 1023475722
> -    1   10.0.1.61:7001      	64 1023475722
> -    2   10.0.1.62:7000      	 0 1040252938
> -    3   10.0.1.62:7001      	64 1040252938
> -    4   10.0.1.62:7002      	64 1040252938
> -    5   10.0.1.62:7003      	64 1040252938
> -    6   10.0.1.63:7000      	 0 1057030154
> -    7   10.0.1.63:7001      	64 1057030154
> -    8   10.0.1.63:7002      	64 1057030154
> -    9   10.0.1.63:7003      	64 1057030154

# collie node info
> Id	Size	Used	Use%
>  0	9.7 GB	0.0 MB	  0%
>  1	541 GB	19 GB	  3%
>  2	9.8 GB	0.0 MB	  0%
>  3	549 GB	14 GB	  2%
>  4	549 GB	11 GB	  2%
>  5	549 GB	15 GB	  2%
>  6	9.8 GB	0.0 MB	  0%
>  7	549 GB	12 GB	  2%
>  8	549 GB	14 GB	  2%
>  9	549 GB	14 GB	  2%
> Total	3.8 TB	99 GB	  2%
>
> Total virtual image size	256 GB

I killed all sheep in zone 1057030154, waited for recovery to finish,
and rejoined the complete zone. After this, node 1 needs a lot of
CPU time for the whole duration of the hang...
When I had the 20-minute hang, I had around double the amount of
data in the cluster...
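
To make it easier to pick out the sheep of one zone when repeating the
steps above, here is a minimal Python sketch. It assumes the text of the
node list shown earlier is available as a string; the names
nodes_by_zone and NODE_LIST are only illustrative and not part of
sheepdog, and the parsing relies on the five-column layout seen in this
mail.

```python
from collections import defaultdict

# Node list as shown earlier in this mail (quote markers kept on purpose).
NODE_LIST = """\
> M   Id   Host:Port         V-Nodes       Zone
> -    0   10.0.1.61:7000       0 1023475722
> -    1   10.0.1.61:7001      64 1023475722
> -    2   10.0.1.62:7000       0 1040252938
> -    3   10.0.1.62:7001      64 1040252938
> -    4   10.0.1.62:7002      64 1040252938
> -    5   10.0.1.62:7003      64 1040252938
> -    6   10.0.1.63:7000       0 1057030154
> -    7   10.0.1.63:7001      64 1057030154
> -    8   10.0.1.63:7002      64 1057030154
> -    9   10.0.1.63:7003      64 1057030154
"""


def nodes_by_zone(text):
    """Group (id, host:port, v-nodes) tuples by zone id."""
    zones = defaultdict(list)
    for line in text.splitlines():
        # Drop a leading "> " quote marker, then split on whitespace.
        fields = line.lstrip("> ").split()
        # Data rows have five columns and a numeric id; this skips the header.
        if len(fields) != 5 or not fields[1].isdigit():
            continue
        _, node_id, host_port, vnodes, zone = fields
        zones[zone].append((int(node_id), host_port, int(vnodes)))
    return dict(zones)


# List the sheep that belong to the zone that was killed and rejoined.
for node_id, host_port, vnodes in nodes_by_zone(NODE_LIST)["1057030154"]:
    print(node_id, host_port, vnodes)
```

The grouping only reads the listing; stopping and restarting the sheep
on those ports still has to be done by hand.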

Thanks

Bastian