[sheepdog-users] add node back-in

Fri Oct 4 13:40:23 CEST 2013

Hi,

Maybe this is documented somewhere, but I couldn't find it. I'm using
version 0.7.3 (sheep -v : Sheepdog daemon version 0.7.3)

When a node suffers a power failure, what's the procedure to get it
back up and running in the cluster?

What I saw was that if you just start the sheep daemon, it tries to
replay the journal, which is OK unless the journal is outdated.

In my test, I shut down a node and started the VM I had running on a
different node. After a while, I started the node that had the power
failure again.

After starting up, the sheep daemon aborted. Some logging:
Oct 04 10:52:11   INFO [main] replay_journal_entry(156) /data/sheepdog7000/obj/007cb569000020c3, size 65536, off 3211264, 0
Oct 04 10:52:11  ERROR [main] replay_journal_entry(163) open No such file or directory
Oct 04 10:52:11  EMERG [main] check_recover_journal_file(259) PANIC: recoverying from journal file (new) failed
Oct 04 10:52:11  EMERG [main] crash_handler(250) sheep exits unexpectedly (Aborted).
Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:252: crash_handler
Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcaf) [0x7f3d134d4caf]
Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7f3d12515424]
Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17a) [0x7f3d12518b8a]
Oct 04 10:52:11  EMERG [main] sd_backtrace(843) journal.c:259: check_recover_journal_file
Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:801: main
Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xec) [0x7f3d1250076c]
Oct 04 10:52:11  EMERG [main] sd_backtrace(857) sheep() [0x405ff8]
Oct 04 10:52:11  DEBUG [main] dump_stack_frames(753) cannot find gdb
Oct 04 10:52:11  DEBUG [main] __sd_dump_variable(707) cannot find gdb
Oct 04 10:52:12  ERROR [main] crash_handler(490) sheep pid 3210 exited unexpectedly.
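
To spot this failure automatically, the log can be scanned for the
journal PANIC. Below is a minimal Python sketch. The line format is
inferred only from the excerpt above, and the helper names
(parse_line, journal_replay_failure) are made up for illustration; they
are not part of sheepdog.

```python
import re

# Line layout inferred from the excerpt above (an assumption):
#   "Oct 04 10:52:11  LEVEL [thread] function(line) message"
LOG_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+) \[(?P<thread>[^\]]+)\] "
    r"(?P<func>\w+)\((?P<line>\d+)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Split one sheep log line into its fields, or return None."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def journal_replay_failure(lines):
    """Return the object path last replayed before the journal PANIC.

    Returns None if no journal PANIC appears in the given lines.
    """
    last_path = None
    for raw in lines:
        rec = parse_line(raw)
        if rec is None:
            continue
        if rec["func"] == "replay_journal_entry" and rec["level"] == "INFO":
            # The message starts with the object path, then ", size ...".
            last_path = rec["msg"].split(",")[0]
        elif rec["func"] == "check_recover_journal_file" and "PANIC" in rec["msg"]:
            return last_path or "<unknown object>"
    return None


SAMPLE = [
    "Oct 04 10:52:11   INFO [main] replay_journal_entry(156) "
    "/data/sheepdog7000/obj/007cb569000020c3, size 65536, off 3211264, 0",
    "Oct 04 10:52:11  ERROR [main] replay_journal_entry(163) "
    "open No such file or directory",
    "Oct 04 10:52:11  EMERG [main] check_recover_journal_file(259) "
    "PANIC: recoverying from journal file (new) failed",
]

if __name__ == "__main__":
    print(journal_replay_failure(SAMPLE))
```

Run against the excerpt, this reports the object file that the replay
could not open.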

This just tells me that sheep sees that the journal to be replayed is
incorrect and aborts (which makes me happy).

What seems to work is to remove the journal files (I had two).

Is this a correct action, or should I do something else? Also, is there
more to be done (e.g. wiping the disks)?
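
If the journal files are to be taken out of the way, moving them aside
is more cautious than deleting them. Below is a minimal Python sketch.
The "journal_file*" pattern and the directory argument are assumptions
about the layout, not confirmed sheepdog paths, and
set_aside_journals is a made-up helper name. It should only be run
while sheep is stopped, and it does not settle whether dropping the
journal is the right recovery step.

```python
import shutil
import time
from pathlib import Path


def set_aside_journals(journal_dir, pattern="journal_file*", dry_run=True):
    """Move journal files into a timestamped backup directory.

    journal_dir and pattern are assumptions; adjust them to wherever
    your sheep daemon actually keeps its journal. With dry_run=True
    nothing is touched and the planned moves are only returned.
    """
    src = Path(journal_dir)
    backup = src / ("journal-backup-" + time.strftime("%Y%m%d-%H%M%S"))
    moved = []
    for f in sorted(src.glob(pattern)):
        if not f.is_file():
            continue  # skip directories that happen to match
        target = backup / f.name
        if not dry_run:
            backup.mkdir(exist_ok=True)
            shutil.move(str(f), str(target))
        moved.append((f, target))
    return moved


if __name__ == "__main__":
    import sys

    # Usage: python set_aside.py <journal dir>   (dry run only)
    for old, new in set_aside_journals(sys.argv[1], dry_run=True):
        print(f"would move {old} -> {new}")
```

Calling it with dry_run=False performs the move; the files can then be
restored by hand if removing them turns out to be the wrong action.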

BTW, the drives that were added with "node md plug" (before the power
failure) had disappeared from the configuration (though they were
mounted at boot time via fstab).

- Kees