[sheepdog-users] Fwd: dog vdi check: object is inconsistent
Micha Kersloot
micha at kovoks.nl
Fri Oct 3 15:39:49 CEST 2014
Hi again,
----- Original Message -----
> From: "Micha Kersloot" <micha at kovoks.nl>
> To: "Valerio Pachera" <sirio81 at gmail.com>
> Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> Sent: Friday, October 3, 2014 2:37:00 PM
> Subject: Re: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
>
> Hi,
>
> ----- Original Message -----
> > From: "Micha Kersloot" <micha at kovoks.nl>
> > To: "Valerio Pachera" <sirio81 at gmail.com>
> > Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> > Sent: Friday, October 3, 2014 12:27:55 PM
> > Subject: Re: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
> >
> > Hi,
> >
> > ----- Original Message -----
> > > From: "Valerio Pachera" <sirio81 at gmail.com>
> > > To: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> > > Sent: Friday, October 3, 2014 12:18:50 PM
> > > Subject: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
> > >
> > > 2014-10-03 10:35 GMT+02:00 Micha Kersloot <micha at kovoks.nl>:
> > > > Recovery kicked in and the VM continued. After recovery was complete, I
> > > > tried to start the sheep daemon and that failed because of some journal
> > > > errors. I added the 'skip' to the journal option and recovery kicked in
> > > > again. Looked all fine to me!
> > >
> > > What version of sheepdog are you running?
> > > Could you report the options you use to run sheep?
> > sheep -v
> > Sheepdog daemon version 0.8.3
> >
> > /usr/sbin/sheep -y 10.10.0.30 -c
> > zookeeper:10.10.0.21:2181,10.10.0.22:2181,10.10.0.30:2181 -j size=512M -w
> > size=50G --upgrade --pidfile /var/run/sheepdog.pid /var/lib/sheepdog
> > /mnt/sheep/30
> > > > But... To be sure I did a dog vdi check and that gives me:
> > > > object 2e754900000844 is inconsistent
> > >
> > > Wow, I've never seen that.
> > > Did you shut down the guest before running vdi check?
> >
> > First the guest was running, then I shut down the guest and that made no
> > difference. Here is info about the vdi itself.
> >
> > dog vdi list
> >   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
> >   micha_test   0   10 GB  9.7 GB  0.0 MB 2014-10-02 17:36   2e7549     2:1
> I've done some more reading and working with the cluster and decided that
> using the object cache is maybe not the best solution in my situation. So
> I've shut down the cluster with dog cluster shutdown, replaced the
> "-w size=50G" with "-n" on all nodes and restarted the cluster without any
> errors. Now I have major filesystem errors on the kvm guest, but dog vdi
> check now runs without any problems. To me it looks like there were problems
> with the VDI which have now been corrected by sheepdog, but these corrections
> somehow corrupted the filesystem. So two things to do for me now: 1. test
> whether the current setup is more stable; 2. set up a new cluster to see if
> I can reproduce the problems.
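For reference, the restart sequence described above boils down to the following
(a sketch based on the node 30 command line quoted earlier; the other nodes use
their own address and store path, and "-w size=50G" is simply replaced by "-n"):

dog cluster shutdown
/usr/sbin/sheep -y 10.10.0.30 -c zookeeper:10.10.0.21:2181,10.10.0.22:2181,10.10.0.30:2181 \
    -j size=512M -n --upgrade --pidfile /var/run/sheepdog.pid /var/lib/sheepdog /mnt/sheep/30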
Looks like I'm able to reproduce the problem:
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 INFO [main] send_join_request(787) IPv4 ip:10.10.0.22 port:7000 going to join the cluster
Oct 03 15:23:03 INFO [main] replay_journal_entry(159) /mnt/sheep/22/00f19e4e000000aa, size 6144, off 1781760, 0
Oct 03 15:23:03 ERROR [main] replay_journal_entry(166) open No such file or directory
Oct 03 15:23:03 EMERG [main] check_recover_journal_file(262) PANIC: recoverying from journal file (new) failed
Oct 03 15:23:03 EMERG [main] crash_handler(267) sheep exits unexpectedly (Aborted).
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:04 EMERG [main] sd_backtrace(833) :
Oct 03 15:25:19 INFO [main] md_add_disk(338) /mnt/sheep/22, vdisk nr 467, total disk 1
Oct 03 15:25:19 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:25:19 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
I killed the node; when it rejoins the cluster it finds out there is something wrong with the journal and kills itself.
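Roughly the sequence I used to reproduce it (a sketch; node 22's sheep command
line is assumed to be the same as the one shown above, just with its own address
and store path):

# 1. node 10.10.0.22 goes down uncleanly (plug pulled / hard power-off)
# 2. power it back on and restart sheep with the same options:
/usr/sbin/sheep -y 10.10.0.22 ... -j size=512M -n ... /mnt/sheep/22
# 3. sheep sends its join request, starts replaying the journal, hits the missing
#    object file and panics in check_recover_journal_file(), as in the log above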
Another node gives:
Oct 03 15:23:04 ERROR [block] do_read(236) failed to read from socket: -1, Connection reset by peer
Oct 03 15:23:04 ERROR [block] exec_req(347) failed to read a response
Oct 03 15:23:04 ALERT [block] do_get_vdis(499) failed to get vdi bitmap from IPv4 ip:10.10.0.22 port:7000
Oct 03 15:23:04 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 03 15:23:04 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 03 15:23:04 ALERT [rw] fetch_object_list(931) cannot get object list from 10.10.0.22:7000
Oct 03 15:23:04 ALERT [rw] fetch_object_list(933) some objects may be not recovered at epoch 12
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] read_erasure_object(206) can not read 2e75490000009e idx 2
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] read_erasure_object(206) can not read 2e754900000758 idx 2
and maybe more interesting:
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] read_erasure_object(206) can not read 2e7549000008df idx 2
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:07 INFO [main] recover_object_main(863) object recovery progress 1%
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] read_erasure_object(206) can not read 2e7549000006f7 idx 2
It looks like the VDI gets 'recovered' to an unstable state, hence dog vdi check gives:
97.0 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000476 is inconsistent
97.0 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000477 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000478 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000479 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e0000047a is inconsistent
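(That output comes from running the check against the affected VDI, i.e.
something along the lines of "dog vdi check <vdiname>" with the name taken from
dog vdi list.)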
I will try again, this time enabling the journal skip option on the first restart after I pulled the plug on one of the nodes.
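That first restart would then be something like the command above with the skip
keyword appended to the journal option (a sketch; I'm assuming skip is simply
added to -j the same way I did during the earlier recovery):

/usr/sbin/sheep -y 10.10.0.22 ... -j size=512M,skip -n ... /mnt/sheep/22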
Greets,
Micha