[sheepdog-users] Failed to read object 80244e5600000000 No object found

Liu Yuan namei.unix at gmail.com
Mon May 5 05:23:32 CEST 2014


On Sun, May 04, 2014 at 08:11:02PM +0100, Struan Bartlett wrote:
> 
> Having been running sheepdog-0.8.0 successfully for a number of weeks,
> earlier last month I suddenly found that my cluster would no longer
> launch. After reattempting the launch this evening, I finally got the
> cluster launched, but to make matters worse it now looks like
> sheepdog has deleted all the underlying objects! Here is some data:
> 

There might be a bug in the 0.8.0 release. The latest v0.8.1 will probably
solve your problems.
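
Before upgrading, it would also help to capture the current cluster state for
reference, using just the standard dog subcommands (the output, especially the
epoch history, would be useful for debugging as well):

  dog cluster info
  dog node list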

> A. Before start-up today 'ls -l /var/lib/sheepdog/obj | wc -l' returned
> the following on the three nodes that were running sheep:
> 
> server1
> 1
> server2
> 4453
> server3
> 4453
> 
> B. After start-up of sheep on each of the three nodes, the same command
> returns only '1' on each server! I guess this means my vdis are no more!
> 
> C. dog vdi list outputs the following on any of the three nodes:
> 
> Name        Id    Size    Used  Shared    Creation time   VDI id 
> Copies  Tag
> Failed to read object 80244e5600000000 No object found
> Failed to read inode header
> Failed to read object 802b5c3a00000000 No object found
> Failed to read inode header
> Failed to read object 802b5c3b00000000 No object found
> Failed to read inode header
> Failed to read object 802b5c3c00000000 No object found
> Failed to read inode header
> Failed to read object 80cde59c00000000 No object found
> Failed to read inode header
> Failed to read object 80cde59d00000000 No object found
> Failed to read inode header
> Failed to read object 80cde59e00000000 No object found
> Failed to read inode header
> Failed to read object 80cde59f00000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a000000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a100000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a200000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a300000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a400000000 No object found
> Failed to read inode header
> Failed to read object 80cde5a500000000 No object found
> Failed to read inode header
> Failed to read object 80d8c70600000000 No object found
> Failed to read inode header
> Failed to read object 80ddce9a00000000 No object found
> Failed to read inode header
> 
> Here is a grep for the first object ID on each of the three nodes
> running sheep:
> 
> 1.
> Apr 09 16:46:14  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 16:46:14  DEBUG [main] err_to_sderr(100) object 80244e5600000000
> not found locally
> Apr 09 16:46:54  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 16:46:54  DEBUG [main] err_to_sderr(100) object 80244e5600000000
> not found locally
> May 04 19:38:50  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> May 04 19:38:50  DEBUG [main] err_to_sderr(100) object 80244e5600000000
> not found locally
> May 04 19:39:13  DEBUG [main] prepare_schedule_oid(566) 80244e5600000000
> nr_prio_oids 1
> May 04 19:39:13  DEBUG [main] request_in_recovery(195) 80244e5600000000
> wait on oid
> May 04 19:39:13  DEBUG [rw] default_get_hash(646) the message digest of
> 80244e5600000000 at epoch 12 is f5a828c6a3c5b6dc99520d31fcbc3fd76f080a34
> May 04 19:39:13  DEBUG [rw] recover_replication_object(369) try recover
> object 80244e5600000000 from epoch 16
> May 04 19:39:13  DEBUG [rw] recover_replication_object(369) try recover
> object 80244e5600000000 from epoch 15
> May 04 19:39:13  DEBUG [main] wakeup_requests_on_oid(250) retry
> 80244e5600000000
> May 04 19:39:13  DEBUG [main] request_in_recovery(195) 80244e5600000000
> wait on oid
> May 04 19:39:13   INFO [main] recover_object_main(855) object
> 80244e5600000000 is recovered (322/2745)
> May 04 19:39:41  DEBUG [main] oid_in_recovery(596) 80244e5600000000 has
> been already recovered
> May 04 19:39:41  DEBUG [io 12960] do_process_work(1393) a4,
> 80244e5600000000, 17
> May 04 19:39:41  DEBUG [io 12960] err_to_sderr(100) object
> 80244e5600000000 not found locally
> May 04 19:39:41  DEBUG [io 12960] do_process_work(1400) failed: a4,
> 80244e5600000000 , 17, No object found
> May 04 19:39:55  DEBUG [main] oid_in_recovery(596) 80244e5600000000 has
> been already recovered
> May 04 19:39:55  DEBUG [io 12950] do_process_work(1393) a4,
> 80244e5600000000, 17
> May 04 19:39:55  DEBUG [io 12950] err_to_sderr(100) object
> 80244e5600000000 not found locally
> May 04 19:39:55  DEBUG [io 12950] do_process_work(1400) failed: a4,
> 80244e5600000000 , 17, No object found
> May 04 19:40:04  DEBUG [main] oid_in_recovery(596) 80244e5600000000 has
> been already recovered
> May 04 19:40:04  DEBUG [io 12960] do_process_work(1393) a4,
> 80244e5600000000, 17
> May 04 19:40:04  DEBUG [io 12960] err_to_sderr(100) object
> 80244e5600000000 not found locally
> May 04 19:40:04  DEBUG [io 12960] do_process_work(1400) failed: a4,
> 80244e5600000000 , 17, No object found
> 
> 2.
> Mar 31 22:14:45   INFO [main] recover_object_main(856) object
> 80244e5600000000 is recovered (2868/4370)
> May 04 19:39:02   INFO [main] recover_object_main(856) object
> 80244e5600000000 is recovered (181/3034)
> May 04 19:40:04  ERROR [gway 11266] gateway_replication_read(294) local
> read 80244e5600000000 failed, No object found
> May 04 19:41:59  ERROR [gway 12177] gateway_replication_read(294) local
> read 80244e5600000000 failed, No object found
> May 04 19:43:43  ERROR [gway 12177] gateway_replication_read(294) local
> read 80244e5600000000 failed, No object found
> May 04 19:43:58  ERROR [gway 12177] gateway_replication_read(294) local
> read 80244e5600000000 failed, No object found
> May 04 19:45:15  ERROR [gway 12177] gateway_replication_read(294) local
> read 80244e5600000000 failed, No object found
> 
> 3.
> Mar 31 22:14:04   INFO [main] recover_object_main(855) object
> 80244e5600000000 is recovered (2868/4370)
> Apr 09 12:23:11   INFO [main] recover_object_main(855) object
> 80244e5600000000 is recovered (2923/4452)
> Apr 09 16:45:32  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 16:50:46  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 16:58:26  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:02:34  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:03:17  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:03:34  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:03:47  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:03:48  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> Apr 09 17:07:34  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> May 04 19:35:00  DEBUG [main] init_objlist_and_vdi_bitmap(234) found the
> VDI object 80244e5600000000
> May 04 19:35:00  DEBUG [main] move_object_to_stale_dir(507) moved object
> 80244e5600000000
> May 04 19:39:02  DEBUG [io 21128] do_process_work(1393) b4,
> 80244e5600000000, 17
> May 04 19:39:02  DEBUG [io 21128] do_process_work(1400) failed: b4,
> 80244e5600000000 , 17, No object found
> May 04 19:39:13  DEBUG [io 21128] do_process_work(1393) b4,
> 80244e5600000000, 17
> May 04 19:39:13  DEBUG [io 21128] do_process_work(1400) failed: b4,
> 80244e5600000000 , 17, No object found
> May 04 19:39:41  DEBUG [gway 21110] do_process_work(1393) 2,
> 80244e5600000000, 17
> 
> Can anyone explain what has happened, and why sheepdog has just now
> deleted all the objects associated with my cluster, presumably rendering
> it completely unrecoverable? Please let me know if there are other
> investigations I should perform.
> 
> Thank you!
> 
> Struan Bartlett

Hmmm, this looks like a fatal problem. Sheepdog will temporarily put the
objects into the stale directory if your cluster wasn't shut down normally
beforehand. Then all the nodes will try to do a recovery.
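
If the objects were only moved aside during recovery rather than deleted, they
may still be sitting in each node's stale area on disk. A quick check along
these lines would tell us (only a sketch: the .stale subdirectory under the
object store and the /var/lib/sheepdog base path are assumptions based on a
default plain-store layout; adjust if sheep was started with a different
directory):

  # count live vs. stale objects on every node (server1..3 as in your report)
  for host in server1 server2 server3; do
      echo "== $host =="
      ssh "$host" "ls /var/lib/sheepdog/obj | wc -l"
      ssh "$host" "ls /var/lib/sheepdog/obj/.stale 2>/dev/null | wc -l"
  done

If the stale directories still hold the objects, the data is not necessarily
lost.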

Is there any method to reproduce the problem?

Thanks
Yuan


