[sheepdog-users] Sheep daemons don't start

Mon Apr 15 13:21:49 CEST 2013

On 04/15/2013 07:07 PM, Tomoe Kitahara wrote:
> I have a problem with starting sheep daemons.
> 
> The version is 0.5.5_249_g588d61b.
> Sheep daemons often fail to restart after they were shut down with the
> following command:
> "service sheepdog stop".

If you want to shut down the whole cluster, use 'collie cluster shutdown'.
Stopping the sheep one by one can easily leave each sheep with a different
epoch. Some sheep will then be refused when they try to join the cluster,
because sheep checks the epoch internally during the join phase.
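
As a rough sketch of the clean shutdown (shutdown_cluster is just a
hypothetical helper name, and I am assuming collie is in PATH on the node
where you run it), issue it once for the whole cluster instead of running
"service sheepdog stop" on every server:

```shell
# Sketch: stop the whole cluster in one step instead of node by node.
# shutdown_cluster is a hypothetical helper name, not part of sheepdog;
# 'collie cluster shutdown' is the command mentioned above.
shutdown_cluster() {
    # Bail out early if collie is not available on this node.
    if ! command -v collie >/dev/null 2>&1; then
        echo "collie not found in PATH" >&2
        return 1
    fi
    collie cluster shutdown
}
```

This way all sheep should stop at the same epoch, so none of them is
rejected by the epoch check at the next start.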

> 
> My cluster consists of 6 servers and 18 daemons (each server has 3 HDDs,
> and one sheep daemon runs on each HDD).
> 
> On some nodes the sheep daemon starts, but the other nodes cannot join
> the cluster.
> 
> The log from a node where the sheep daemon cannot start is below.
> 
> Apr 15 15:41:16 [main] main(684) sheepdog daemon (version
> 0.5.5_249_g588d61b) started
> Apr 15 15:41:16 [main] sd_join_handler(1089) Failed to join, exiting.
> Apr 15 15:41:16 [main] crash_handler(482) sheep pid 10609 exited
> unexpectedly.
> 
> Is there any way to fix this problem?

Can you confirm whether the failed node has an epoch number higher than
the running cluster's epoch? (The epoch files are in /path/to/store/epoch/.)
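
Something like the following might help with the check. It is only a
sketch: latest_epoch is a hypothetical helper, and it assumes each file in
the epoch directory is named after its epoch number in zero-padded decimal
(e.g. 00000012), so please verify that against your store first.

```shell
# Sketch: print the highest epoch number recorded in a store's epoch directory.
# Assumption: every epoch file is named after its epoch number
# (zero-padded decimal). latest_epoch is a hypothetical helper name.
latest_epoch() {
    epoch_dir="$1"
    # Keep only all-digit names, sort them numerically, take the largest,
    # then strip the leading zeros (a lone "0" is kept).
    ls "$epoch_dir" 2>/dev/null \
        | grep -E '^[0-9]+$' \
        | sort -n \
        | tail -n 1 \
        | sed 's/^0*\([0-9]\)/\1/'
}
```

Run it as "latest_epoch /path/to/store/epoch/" on the failed node and
compare the result with the epoch that a running node reports (for
example in the output of 'collie cluster info').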

What is the status of your cluster, running or waiting for join?

We can fix the problem manually, but the right fix depends on why the
sheep fails to join.

Thanks
Yuan