[sheepdog-users] Cluster hung...
Bastian Scholz
nimrodxx at gmx.de
Fri Jul 27 09:00:53 CEST 2012
Hi Yuan,
On 2012-07-27 04:06, Liu Yuan wrote:
> I can't reproduce it on latest master. I have tried following steps:
Just for my information, since I am trying to find a test case that is
easy to reproduce: how big was your test setup?
I just tried it with a small setup and don't see the problem,
but on my working cluster I can reproduce it. I have no logs at the
moment, so I can only describe it from memory...
When I rejoin the sheep, I can see that one or more of the
non-failed sheep use a lot of CPU time. After they finish, they
log an "update_cluster_info"; after that, the actual recovery of
objects starts and the cluster responds normally again.
My working cluster...
# collie node list
> M Id Host:Port V-Nodes Zone
> - 0 10.0.1.61:7000 0 1023475722
> - 1 10.0.1.61:7001 64 1023475722
> - 2 10.0.1.62:7000 0 1040252938
> - 3 10.0.1.62:7001 64 1040252938
> - 4 10.0.1.62:7002 64 1040252938
> - 5 10.0.1.62:7003 64 1040252938
> - 6 10.0.1.63:7000 0 1057030154
> - 7 10.0.1.63:7001 64 1057030154
> - 8 10.0.1.63:7002 64 1057030154
> - 9 10.0.1.63:7003 64 1057030154
# collie node info
> Id Size Used Use%
> 0 9.7 GB 0.0 MB 0%
> 1 541 GB 19 GB 3%
> 2 9.8 GB 0.0 MB 0%
> 3 549 GB 14 GB 2%
> 4 549 GB 11 GB 2%
> 5 549 GB 15 GB 2%
> 6 9.8 GB 0.0 MB 0%
> 7 549 GB 12 GB 2%
> 8 549 GB 14 GB 2%
> 9 549 GB 14 GB 2%
> Total 3.8 TB 99 GB 2%
>
> Total virtual image size 256 GB
I killed all sheep in zone 1057030154, waited for recovery to finish,
and rejoined the complete zone. After this, node 1 used a lot of
CPU time for the whole duration of the hang...
When I had the 20-minute hang, I had about twice as much
data in the cluster...
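To make the size of the test clearer, here is a small sketch that groups the nodes from the `collie node list` output above by zone and shows what the killed zone contained. The helper name `nodes_by_zone` is hypothetical and only for illustration; it parses the pasted text and does not talk to sheepdog at all.

```python
from collections import defaultdict

# Node list copied from the `collie node list` output in this mail
# (without the leading "> " quote markers).
NODE_LIST = """\
M Id Host:Port V-Nodes Zone
- 0 10.0.1.61:7000 0 1023475722
- 1 10.0.1.61:7001 64 1023475722
- 2 10.0.1.62:7000 0 1040252938
- 3 10.0.1.62:7001 64 1040252938
- 4 10.0.1.62:7002 64 1040252938
- 5 10.0.1.62:7003 64 1040252938
- 6 10.0.1.63:7000 0 1057030154
- 7 10.0.1.63:7001 64 1057030154
- 8 10.0.1.63:7002 64 1057030154
- 9 10.0.1.63:7003 64 1057030154
"""


def nodes_by_zone(text):
    """Hypothetical helper: map zone id -> list of (node id, host:port, v-nodes)."""
    zones = defaultdict(list)
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore anything that is not a node row
        _marker, node_id, host_port, vnodes, zone = fields
        zones[int(zone)].append((int(node_id), host_port, int(vnodes)))
    return dict(zones)


zones = nodes_by_zone(NODE_LIST)
killed = zones[1057030154]  # the zone that was killed and rejoined

killed_ids = [node_id for node_id, _, _ in killed]
killed_vnodes = sum(vnodes for _, _, vnodes in killed)
total_vnodes = sum(v for nodes in zones.values() for _, _, v in nodes)

print("nodes in killed zone:", killed_ids)
print("v-nodes removed:", killed_vnodes, "of", total_vnodes)
```

So the test takes out nodes 6 to 9, that is 192 of the 448 v-nodes, while node 1 (the one using the CPU time) is the only node with v-nodes in its zone.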
Thanks
Bastian