[sheepdog-users] add node back-in

Hitoshi Mitake mitake.hitoshi at gmail.com
Sun Oct 6 18:05:21 CEST 2013


Hi Kees,

At Fri, 4 Oct 2013 13:40:23 +0200,
Kees Bos wrote:
> 
> Hi,
> 
> 
> Maybe this is documented somewhere, but I couldn't find it. I'm using
> version 0.7.3 (sheep -v : Sheepdog daemon version 0.7.3)
> 
> 
> When a node gets a power failure, what's the procedure to get it up and
> running in the cluster?
> 
> What I saw was that if you just start the sheep daemon, it will try to
> replay the journal. Which is OK, unless it is outdated.
> 
> In the test, I shut down a node, then started the VM I had running on a
> different node. After a while, I started the node that had the power
> failure.
> 
> After starting up, the sheep daemon aborted. Some logging:
> Oct 04 10:52:11   INFO [main] replay_journal_entry(156) /data/sheepdog7000/obj/007cb569000020c3, size 65536, off 3211264, 0
> Oct 04 10:52:11  ERROR [main] replay_journal_entry(163) open No such file or directory
> Oct 04 10:52:11  EMERG [main] check_recover_journal_file(259) PANIC: recoverying from journal file (new) failed
> Oct 04 10:52:11  EMERG [main] crash_handler(250) sheep exits unexpectedly (Aborted).
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:252: crash_handler
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcaf) [0x7f3d134d4caf]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7f3d12515424]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17a) [0x7f3d12518b8a]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) journal.c:259: check_recover_journal_file
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:801: main
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xec) [0x7f3d1250076c]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) sheep() [0x405ff8]
> Oct 04 10:52:11  DEBUG [main] dump_stack_frames(753) cannot find gdb
> Oct 04 10:52:11  DEBUG [main] __sd_dump_variable(707) cannot find gdb
> Oct 04 10:52:12  ERROR [main] crash_handler(490) sheep pid 3210 exited unexpectedly.
> 
> Which just tells me that sheep sees that the journal-to-be-replayed is
> incorrect and that sheep just aborts (which makes me happy).
> 
> 
> What seems to work is to remove the journal files (I had two).
> 
> Is this a correct action, or should I do something else? Also, is there
> more to be done (e.g. wiping the disks)?
> 
> BTW, the drives that were added with "node md plug" (before the power
> failure) had disappeared from the configuration (though they were
> mounted at boot time via fstab).

Removing the journal is not a correct action. It seems that your
problem was caused by a bug in sheep. Do you still have the journal
files? If you do, I'd like to post a patch that lets sheep produce
friendlier error messages. If you can test again with the patch, it
would help in analyzing the problem.
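
Since the journal files are needed for this kind of analysis, one
option is to copy them aside before touching anything. The following is
only a minimal sketch, not part of sheepdog: the helper name
preserve_journal, the backup location, and the default file name
pattern "journal_file*" are all assumptions, so check what is actually
in your journal directory and adjust.

```python
import shutil
from pathlib import Path


def preserve_journal(journal_dir, backup_dir, pattern="journal_file*"):
    """Copy files matching `pattern` from journal_dir into backup_dir.

    The default pattern is an assumption about how the journal files
    are named; adjust it to whatever is actually on disk.
    Returns the list of backup paths, sorted by file name.
    """
    src = Path(journal_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            target = dst / path.name
            # copy2 also preserves timestamps, which may help later analysis
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

For example, preserve_journal("/data/sheepdog7000", "/root/journal-backup")
would copy matching files out of the store directory seen in your log
(the backup path is made up). The originals are left in place, so this
does not change what sheep sees on the next start.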

Thanks,
Hitoshi


