[sheepdog-users] add node back-in

Mon Oct 7 09:03:29 CEST 2013

On Fri, Oct 04, 2013 at 01:40:23PM +0200, Kees Bos wrote:
> Hi,
> 
> 
> Maybe this is documented somewhere, but I couldn't find it. I'm using
> version 0.7.3 (sheep -v : Sheepdog daemon version 0.7.3)
> 
> 
> When a node gets a power failure, what's the procedure to get it up and
> running in the cluster?
> 
> What I saw was that if you just start the sheep daemon, it will try to
> replay the journal, which is OK unless it is outdated.
> 
> In the test, I shut down a node and started the VM I had running on a
> different node. After a while, I started the node that had the power
> failure.
> 
> After starting up, the sheep daemon aborted. Some logging:
> Oct 04 10:52:11   INFO [main] replay_journal_entry(156) /data/sheepdog7000/obj/007cb569000020c3, size 65536, off 3211264, 0
> Oct 04 10:52:11  ERROR [main] replay_journal_entry(163) open No such file or directory
> Oct 04 10:52:11  EMERG [main] check_recover_journal_file(259) PANIC: recoverying from journal file (new) failed
> Oct 04 10:52:11  EMERG [main] crash_handler(250) sheep exits unexpectedly (Aborted).
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:252: crash_handler
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcaf) [0x7f3d134d4caf]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7f3d12515424]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17a) [0x7f3d12518b8a]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) journal.c:259: check_recover_journal_file
> Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:801: main
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xec) [0x7f3d1250076c]
> Oct 04 10:52:11  EMERG [main] sd_backtrace(857) sheep() [0x405ff8]
> Oct 04 10:52:11  DEBUG [main] dump_stack_frames(753) cannot find gdb
> Oct 04 10:52:11  DEBUG [main] __sd_dump_variable(707) cannot find gdb
> Oct 04 10:52:12  ERROR [main] crash_handler(490) sheep pid 3210 exited unexpectedly.
> 
> This just tells me that sheep sees that the journal to be replayed is
> incorrect and that sheep just aborts (which makes me happy).
> 
> 
> What seems to work is to remove the journal files (I had two).
> 
> Is this a correct action, or should I do something else? Also, is there
> more to be done (e.g. wiping the disks)?
> 
> BTW, the drives that were added with "node md plug" (before the power
> failure) had disappeared from the configuration (though they were
> mounted at boot time via fstab).
> 

For now we don't have a persistent configuration file that remembers which disks
the running sheep daemon uses, but the error cases are handled nicely:

1. If a disk that was plugged in no longer works (or was removed for
   maintenance) when sheep restarts, users don't need to do anything to inform
   sheep about it. If you don't specify it as a disk for sheep at startup, it is
   treated as if the disk had been unplugged, and a minor node-level recovery
   will take place.

2. If you add a new disk when you restart sheep, it is treated as if you had
   plugged in a new disk, and a node-level recovery will take place as well to
   rebalance data across all disks in this node.
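
To make the two cases concrete, here is a rough sketch in Python of the
behaviour described above. It is only an illustrative model, not sheepdog code:
the function name, the result keys, and the /mnt/... paths are all made up for
the example. It simply compares the disks sheep used before the restart with
the disks given at startup.

```python
def classify_disks(before, at_startup):
    """Illustrative model only (hypothetical helper, not part of sheepdog).

    before:     disks the sheep daemon was using before it stopped
    at_startup: disks specified when sheep is started again
    """
    before, at_startup = set(before), set(at_startup)
    return {
        # Case 1: a disk left out at startup behaves as if it was unplugged.
        "treated_as_unplugged": sorted(before - at_startup),
        # Case 2: a disk newly given at startup behaves as if it was plugged in.
        "treated_as_plugged": sorted(at_startup - before),
        # Either kind of change leads to a node-level recovery.
        "node_level_recovery": before != at_startup,
    }


# Hypothetical mount points, chosen only for this example.
result = classify_disks(
    before=["/mnt/disk1", "/mnt/disk2"],
    at_startup=["/mnt/disk1", "/mnt/disk3"],
)
print(result)
```

In this example /mnt/disk2 ends up in the "unplugged" list and /mnt/disk3 in
the "plugged" list, so a node-level recovery would be expected in both
directions.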

Thanks
Yuan