[sheepdog] zookeeper quitting unexpectedly causes recovering from journal file failed

Hongyi Wang hongyi at zelin.io
Thu May 23 08:25:58 CEST 2013


I removed the files in the journal folder and started the sheep again. It started successfully.


However, all the VDIs appear to be lost, even though recovery has completed on every node. What happened, and how can I get my lost VDIs back?


thanks,


Hongyi
================================================
> collie vdi list
  Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
Failed to read object 801d5fbd00000000 No object found
Failed to read inode header
Failed to read object 80791a2400000000 No object found
Failed to read inode header
Failed to read object 809133bf00000000 No object found
Failed to read inode header
Failed to read object 809133c000000000 No object found
Failed to read inode header
Failed to read object 80d322dd00000000 No object found
Failed to read inode header
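
For readers decoding the IDs above, here is a minimal sketch. It assumes the object ID layout I believe sheepdog uses: the top bit marks a VDI inode object, the VDI id sits above bit 32, and the low 32 bits are the data object index. The helper name decode_oid is hypothetical and is not part of collie.

```python
VDI_BIT = 1 << 63          # assumed: set on inode (VDI metadata) objects
VDI_SPACE_SHIFT = 32       # assumed: VDI id is stored above the low 32 bits


def decode_oid(oid_hex):
    """Split a hex object ID into (is_inode, vdi_id, index)."""
    oid = int(oid_hex, 16)
    is_inode = bool(oid & VDI_BIT)
    vdi_id = (oid & ~VDI_BIT) >> VDI_SPACE_SHIFT
    index = oid & ((1 << VDI_SPACE_SHIFT) - 1)
    return is_inode, vdi_id, index


# One ID from the collie output and one from a .stale path in sheep.log.
for oid_hex in ["801d5fbd00000000", "001d5fbd00001480"]:
    is_inode, vdi_id, index = decode_oid(oid_hex)
    kind = "inode" if is_inode else "data"
    print(f"{oid_hex}: {kind} object, vdi {vdi_id:x}, index {index:#x}")
```

If that layout holds, the five IDs that collie fails to read are the inode objects of five different VDIs, so the listing fails because the VDI metadata objects cannot be found, whatever the state of the data objects.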

================================================
I started sheep on 3 nodes. Here is what sheep.log shows on each of them:
z1:
....
May 23 21:53:49 [rw] sheep_exec_req(547) failed No object found
May 23 21:53:49 [rw] default_link(374) failed to link from /sheep/disk1/.stale/00d322dd000039f0.14 to /sheep/disk1/00d322dd000039f0, No such file or directory
May 23 21:53:49 [rw] sheep_exec_req(547) failed No object found
May 23 21:53:49 [rw] default_link(374) failed to link from /sheep/disk1/.stale/00d322dd000039f0.13 to /sheep/disk1/00d322dd000039f0, No such file or directory
May 23 21:53:49 [rw] do_epoch_log_read(93) failed to open epoch 12 log, No such file or directory
May 23 21:53:49 [main] recover_object_main(612) done:9311 count:9311, oid:d322dd000039f0
May 23 21:53:49 [main] modify_event(151) event info for fd 29 not found



z2:
....
May 23 22:14:38 [rw] default_link(374) failed to link from /sheep/disk2/.stale/001d5fbd00001480.14 to /sheep/disk2/001d5fbd00001480, No such file or directory
May 23 22:14:38 [rw] sheep_exec_req(547) failed No object found
May 23 22:14:38 [rw] do_epoch_log_read(93) failed to open epoch 13 log, No such file or directory
May 23 22:14:38 [rw] sheep_exec_req(547) failed No object found
May 23 22:14:38 [rw] do_epoch_log_read(93) failed to open epoch 12 log, No such file or directory
May 23 22:14:38 [main] recover_object_main(612) done:15244 count:15244, oid:1d5fbd00001480
May 23 22:14:38 [main] modify_event(151) event info for fd 36 not found
May 23 22:14:38 [main] modify_event(151) event info for fd 38 not found
May 23 22:14:38 [main] modify_event(151) event info for fd 39 not found
May 23 22:14:38 [main] modify_event(151) event info for fd 43 not found
May 23 22:14:38 [main] modify_event(151) event info for fd 42 not found



z3:
....
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 11 log, No such file or directory
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 10 log, No such file or directory
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 9 log, No such file or directory
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 8 log, No such file or directory
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 7 log, No such file or directory
May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 6 log, No such file or directory
May 08 07:22:57 [gway 20019] gateway_read_obj(60) local read 801d5fbd00000000 failed, No object found
May 08 07:22:57 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 19975] gateway_read_obj(60) local read 809133c000000000 failed, No object found
May 08 07:22:57 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:22:57 [gway 20019] gateway_read_obj(60) local read 80d322dd00000000 failed, No object found
May 08 07:22:57 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 19975] gateway_read_obj(60) local read 801d5fbd00000000 failed, No object found
May 08 07:30:46 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 20019] gateway_read_obj(60) local read 809133c000000000 failed, No object found
May 08 07:30:46 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:30:46 [gway 19975] gateway_read_obj(60) local read 80d322dd00000000 failed, No object found
May 08 07:30:46 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 20019] gateway_read_obj(60) local read 801d5fbd00000000 failed, No object found
May 08 07:31:13 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 19975] gateway_read_obj(60) local read 809133c000000000 failed, No object found
May 08 07:31:13 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:13 [gway 20019] gateway_read_obj(60) local read 80d322dd00000000 failed, No object found
May 08 07:31:13 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 19975] gateway_read_obj(60) local read 801d5fbd00000000 failed, No object found
May 08 07:31:32 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 19975] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 20019] gateway_read_obj(60) local read 809133c000000000 failed, No object found
May 08 07:31:32 [gway 20019] sheep_exec_req(547) failed No object found
May 08 07:31:32 [gway 19975] gateway_read_obj(60) local read 80d322dd00000000 failed, No object found
May 08 07:31:32 [gway 19975] sheep_exec_req(547) failed No object found
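
Because the same few failures repeat many times, a short script can condense a log like the ones above. This is a rough sketch only: the regular expressions are written against the lines quoted in this message, other sheep versions may log differently, and the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Patterns match the sheep.log lines quoted in this message.
EPOCH_RE = re.compile(r"failed to open epoch (\d+) log")
READ_RE = re.compile(r"local read ([0-9a-f]{16}) failed")
LINK_RE = re.compile(r"failed to link from (\S+) to (\S+),")


def summarize(lines):
    """Return (missing epochs, unreadable oid counts, failed link targets)."""
    epochs, reads, links = set(), Counter(), set()
    for line in lines:
        if m := EPOCH_RE.search(line):
            epochs.add(int(m.group(1)))
        elif m := READ_RE.search(line):
            reads[m.group(1)] += 1
        elif m := LINK_RE.search(line):
            links.add(m.group(2))
    return sorted(epochs), reads, sorted(links)


sample = [
    "May 23 21:53:49 [rw] default_link(374) failed to link from "
    "/sheep/disk1/.stale/00d322dd000039f0.14 to /sheep/disk1/00d322dd000039f0, "
    "No such file or directory",
    "May 23 21:53:49 [rw] do_epoch_log_read(93) failed to open epoch 12 log, "
    "No such file or directory",
    "May 08 07:22:52 [io 19984] do_epoch_log_read(93) failed to open epoch 11 log, "
    "No such file or directory",
    "May 08 07:22:57 [gway 20019] gateway_read_obj(60) local read "
    "801d5fbd00000000 failed, No object found",
    "May 08 07:30:46 [gway 19975] gateway_read_obj(60) local read "
    "801d5fbd00000000 failed, No object found",
]

epochs, reads, links = summarize(sample)
print("missing epoch logs:", epochs)
print("unreadable objects:", dict(reads))
print("failed link targets:", links)
```

To run it on a real node, pass an open file instead of the sample list, for example summarize(open(path)) where path points at that node's sheep.log (the location depends on the installation).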



============================================================================

------------------ Original ------------------
From:  "Liu Yuan"<namei.unix at gmail.com>;
Date:  Thu, May 23, 2013 01:37 PM
To:  "Hongyi Wang"<hongyi at zelin.io>; 
Cc:  "sheepdog"<sheepdog at lists.wpkg.org>; "k"<k at zelin.io>; 
Subject:  Re: [sheepdog] zookeeper quitting unexpectedly causes recovering from journal file failed

 
On 05/23/2013 01:29 PM, Hongyi Wang wrote:
> Hi, 
> 
> This follows up on our last test. One sheep node in our cluster timed
> out connecting to zookeeper, so we tried to restart sheep on that node.
> However, the sheep could not be started successfully.
> I am not sure whether a zk connection timeout could somehow cause recovery
> from the journal file to fail. Is this a bug in journal replay?

I guess it is a bug in journal replay for some corner cases. Passing 'skip'
to -j, or simply removing the files in the journal dir, will let the sheep
start again.

Thanks,
Yuan