[sheepdog-users] add node back-in

Mon Oct 7 07:48:48 CEST 2013

On Sun, 2013-10-06 at 18:05 +0200, Hitoshi Mitake wrote:
> Hi Kees,
> 
> At Fri, 4 Oct 2013 13:40:23 +0200,
> Kees Bos wrote:
> > 
> > Hi,
> > 
> > 
> > Maybe this is documented somewhere, but I couldn't find it. I'm using
> > version 0.7.3 (sheep -v : Sheepdog daemon version 0.7.3)
> > 
> > 
> > When a node gets a power failure, what's the procedure to get it up and
> > running in the cluster?
> > 
> > What I saw was that if you just start the sheep daemon, it will try to
> > replay the journal, which is OK unless the journal is outdated.
> > 
> > In the test, I shut down a node and started the VM I had running on a
> > different node. After a while, I started the node that had the power
> > failure.
> > 
> > After starting up, the sheep daemon aborted. Some logging:
> > Oct 04 10:52:11   INFO [main] replay_journal_entry(156) /data/sheepdog7000/obj/007cb569000020c3, size 65536, off 3211264, 0
> > Oct 04 10:52:11  ERROR [main] replay_journal_entry(163) open No such file or directory
> > Oct 04 10:52:11  EMERG [main] check_recover_journal_file(259) PANIC: recoverying from journal file (new) failed
> > Oct 04 10:52:11  EMERG [main] crash_handler(250) sheep exits unexpectedly (Aborted).
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:252: crash_handler
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcaf) [0x7f3d134d4caf]
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7f3d12515424]
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17a) [0x7f3d12518b8a]
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(843) journal.c:259: check_recover_journal_file
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(843) sheep.c:801: main
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(857) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xec) [0x7f3d1250076c]
> > Oct 04 10:52:11  EMERG [main] sd_backtrace(857) sheep() [0x405ff8]
> > Oct 04 10:52:11  DEBUG [main] dump_stack_frames(753) cannot find gdb
> > Oct 04 10:52:11  DEBUG [main] __sd_dump_variable(707) cannot find gdb
> > Oct 04 10:52:12  ERROR [main] crash_handler(490) sheep pid 3210 exited unexpectedly.
> > 
> > This just tells me that sheep sees that the journal-to-be-replayed is
> > incorrect and that sheep just aborts (which makes me happy).
> > 
> > 
> > What seems to work is to remove the journal files (I had two).
> > 
> > Is this a correct action, or should I do something else? Also, is there
> > more to be done (e.g. wiping the disks)?
> > 
> > BTW, the drives that were added with "node md plug" (before the power
> > failure) had disappeared from the configuration (though they were
> > mounted at boot time via fstab).
> 
> Removing the journal is not a correct action. It seems that your
> problem was caused by a bug in sheep. Do you still have the journal
> file? If you still have it, I'd like to post a patch which lets sheep
> produce friendlier error messages. If you can test with the patch
> again, it would be helpful for analyzing the problem.
> 

I've removed the journal files (rm, not mv). I'll try to replicate this
problem this week and then report back on the dev-list (OK?). A patch
with friendlier error messages is welcome anyway, so I can apply it
beforehand.
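
When replicating the problem, it may help to list which object files the
journal replay refers to that are no longer on disk. The following is only
a rough sketch: it assumes the log line format shown in the excerpt above
("replay_journal_entry(N) <path>, size <n>, off <n>, ..."), and the
function names journal_targets and missing_targets are made up for this
example and are not part of sheepdog.

```python
import os
import re

# Assumed format, taken from the quoted log excerpt:
# "... INFO [main] replay_journal_entry(156) /path/to/obj, size 65536, off 3211264, 0"
ENTRY_RE = re.compile(
    r"replay_journal_entry\(\d+\)\s+(?P<path>/\S+), size (?P<size>\d+), off (?P<off>\d+)"
)


def journal_targets(log_lines):
    """Yield (path, size, offset) for each journal entry sheep reported replaying."""
    for line in log_lines:
        m = ENTRY_RE.search(line)
        if m:
            yield m.group("path"), int(m.group("size")), int(m.group("off"))


def missing_targets(log_lines, exists=os.path.exists):
    """Return the sorted, de-duplicated object paths from the log that are absent.

    `exists` is injectable so the check can be exercised without real files.
    """
    return sorted({path for path, _, _ in journal_targets(log_lines) if not exists(path)})
```

Feeding the lines of the sheep log to missing_targets() on the affected
node would print nothing if every referenced object is present, and the
missing paths otherwise. It only reads the log and never touches the
journal files themselves.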

Just for the record: what is the correct procedure to start sheep again
after a power failure, in a setup with plugged devices? I noticed that
the devices have to be added manually, so I wonder whether I have to
wipe the devices before adding them back in.

- Kees