[Sheepdog] unstable behavior when nodes join or leave

Thu Sep 1 11:37:32 CEST 2011

At Tue, 30 Aug 2011 14:35:50 +0900,
Keiichi SHIMA wrote:
> 
> Hello,
> 
> I'm now testing sheepdog with 46 PCs.  Making a sheepdog cluster and formatting works perfectly.  I could create several sheepdog disk images in the cluster.
> 
> But when I try to add a storage node, the cluster usually stops working.  Adding a node sometimes works, but when I keep adding nodes to the cluster, it eventually breaks.
> 
> Also, I couldn't remove a storage node from the cluster.  What is the right way to remove a node from a cluster?  I tried killing the sheep processes on the storage node, which immediately breaks cluster operation...
> 
> 
> I'm using the following software set.
> - sheepdog cloned from the git repo (2cda774833e4bbf4754485b620eb7b28ffaa8b07)
> - corosync provided by apt (1.2.1-4ubuntu1)
> 
> 
> Below is one sequence of operations that causes the above problem (at least in my environment).
> 
> 1. start sheep processes on 21 PCs out of 46.
> 2. format a cluster with --copies=3 option (with the above 21 storage nodes).
> 3. create 50 disk images in the cluster with 'qemu-img create sheepdog:xxx 1G'
> 4. add 25 storage nodes one by one to the cluster.
> 5. at some point, the sheep cluster stops working.
>   'collie vdi list' starts showing errors ('failed to read a inode header ...')
>   'collie node info' starts showing errors (the same message as above)
> 
> The following is a log file taken from one of the sheep storage nodes.  (I didn't attach the file since it is a bit big)
>   http://member.wide.ad.jp/~shima/tmp/sheep-201108301431.log
> 
> Am I doing something wrong?
> Are there any suggestions?

Thanks for your information!

I've sent some patches that fix node joins in large cluster
environments.  Can you try them?

Those patches are also in the devel branch:
  git://github.com/collie/sheepdog.git devel
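
For example, in an existing sheepdog checkout, something like the
following should fetch that branch.  This is only a sketch: the function
name fetch_devel and the local branch name devel-test are made up for
illustration and are not part of sheepdog.

```shell
# Hypothetical helper: fetch the devel branch into a local test branch.
# Assumes it is run from inside an existing sheepdog git checkout.
fetch_devel() {
    # Fetch the tip of 'devel' from the repository named above.
    git fetch git://github.com/collie/sheepdog.git devel &&
    # Create a local branch (name is an assumption) at the fetched commit.
    git checkout -b devel-test FETCH_HEAD
}
```

Run fetch_devel from the top of the checkout, then rebuild and restart
sheep on the nodes before repeating the test.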

Thanks,

Kazutaka