[sheepdog-users] Fwd: dog vdi check: object is inconsistent
Micha Kersloot
micha at kovoks.nl
Fri Oct 3 15:39:49 CEST 2014
Hi again,
----- Original Message -----
> From: "Micha Kersloot" <micha at kovoks.nl>
> To: "Valerio Pachera" <sirio81 at gmail.com>
> Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> Sent: Friday, October 3, 2014 2:37:00 PM
> Subject: Re: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
>
> Hi,
>
> ----- Original Message -----
> > From: "Micha Kersloot" <micha at kovoks.nl>
> > To: "Valerio Pachera" <sirio81 at gmail.com>
> > Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> > Sent: Friday, October 3, 2014 12:27:55 PM
> > Subject: Re: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
> >
> > Hi,
> >
> > ----- Original Message -----
> > > From: "Valerio Pachera" <sirio81 at gmail.com>
> > > To: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
> > > Sent: Friday, October 3, 2014 12:18:50 PM
> > > Subject: [sheepdog-users] Fwd: dog vdi check: object is inconsistent
> > >
> > > 2014-10-03 10:35 GMT+02:00 Micha Kersloot <micha at kovoks.nl>:
> > > > Recovery kicked in and the VM continued. After recovery was complete, I
> > > > tried to start the sheep daemon and that failed because of some journal
> > > > errors. I added the 'skip' to the journal option and recovery kicked in
> > > > again. Looked all fine to me!
> > >
> > > What version of sheepdog are you running?
> > > Could you report the options you use to run sheep?
> > sheep -v
> > Sheepdog daemon version 0.8.3
> >
> > /usr/sbin/sheep -y 10.10.0.30 -c
> > zookeeper:10.10.0.21:2181,10.10.0.22:2181,10.10.0.30:2181 -j size=512M -w
> > size=50G --upgrade --pidfile /var/run/sheepdog.pid /var/lib/sheepdog
> > /mnt/sheep/30
> > > > But... To be sure I did a dog vdi check and that gives me:
> > > > object 2e754900000844 is inconsistent
> > >
> > > Wow, I've never seen that.
> > > Did you shut down the guest before running vdi check?
> >
> > First the guest was running, then I shut down the guest and that made no
> > difference. Here is info about the vdi itself.
> >
> > dog vdi list
> >   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
> >   micha_test   0   10 GB  9.7 GB  0.0 MB 2014-10-02 17:36   2e7549     2:1
> I've done some more reading and working with the cluster and decided that
> using the object cache is maybe not the best solution in my situation. So
> I've shut down the cluster with dog cluster shutdown, replaced the
> "-w size=50G" with "-n" on all nodes and restarted the cluster without any
> errors. Now I have major filesystem errors on the kvm guest, but dog vdi
> check now runs without any problems. To me it looks like there were problems
> with the VDI which have now been corrected by sheepdog, but these corrections
> somehow corrupted the filesystem. So two things to do for me now: 1. test
> whether the current setup is more stable; 2. set up a new cluster to see if
> I can reproduce the problems.
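For reference, the restart sequence described above boils down to the following
(a sketch based on the node 30 command line quoted earlier; the other nodes use
their own address and store path, and "-w size=50G" is simply replaced by "-n"):

dog cluster shutdown
/usr/sbin/sheep -y 10.10.0.30 -c zookeeper:10.10.0.21:2181,10.10.0.22:2181,10.10.0.30:2181 \
    -j size=512M -n --upgrade --pidfile /var/run/sheepdog.pid /var/lib/sheepdog /mnt/sheep/30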
Looks like I'm able to reproduce the problem:
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:23:03 INFO [main] send_join_request(787) IPv4 ip:10.10.0.22 port:7000 going to join the cluster
Oct 03 15:23:03 INFO [main] replay_journal_entry(159) /mnt/sheep/22/00f19e4e000000aa, size 6144, off 1781760, 0
Oct 03 15:23:03 ERROR [main] replay_journal_entry(166) open No such file or directory
Oct 03 15:23:03 EMERG [main] check_recover_journal_file(262) PANIC: recoverying from journal file (new) failed
Oct 03 15:23:03 EMERG [main] crash_handler(267) sheep exits unexpectedly (Aborted).
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:03 EMERG [main] sd_backtrace(833) :
Oct 03 15:23:04 EMERG [main] sd_backtrace(833) :
Oct 03 15:25:19 INFO [main] md_add_disk(338) /mnt/sheep/22, vdisk nr 467, total disk 1
Oct 03 15:25:19 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 03 15:25:19 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
I killed the node; when it rejoins the cluster it finds out there is something wrong with the journal and kills itself.
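Roughly the sequence I used to reproduce it (a sketch; node 22's sheep command
line is assumed to be the same as the one shown above, just with its own address
and store path):

# 1. node 10.10.0.22 goes down uncleanly (plug pulled / hard power-off)
# 2. power it back on and restart sheep with the same options:
/usr/sbin/sheep -y 10.10.0.22 ... -j size=512M -n ... /mnt/sheep/22
# 3. sheep sends its join request, starts replaying the journal, hits the missing
#    object file and panics in check_recover_journal_file(), as in the log above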
Another node gives:
Oct 03 15:23:04 ERROR [block] do_read(236) failed to read from socket: -1, Connection reset by peer
Oct 03 15:23:04 ERROR [block] exec_req(347) failed to read a response
Oct 03 15:23:04 ALERT [block] do_get_vdis(499) failed to get vdi bitmap from IPv4 ip:10.10.0.22 port:7000
Oct 03 15:23:04 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 03 15:23:04 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 03 15:23:04 ALERT [rw] fetch_object_list(931) cannot get object list from 10.10.0.22:7000
Oct 03 15:23:04 ALERT [rw] fetch_object_list(933) some objects may be not recovered at epoch 12
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] read_erasure_object(206) can not read 2e75490000009e idx 2
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:04 ERROR [rw] read_erasure_object(206) can not read 2e754900000758 idx 2
and maybe more interesting:
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] read_erasure_object(206) can not read 2e7549000008df idx 2
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:07 INFO [main] recover_object_main(863) object recovery progress 1%
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 03 15:23:07 ERROR [rw] read_erasure_object(206) can not read 2e7549000006f7 idx 2
It looks like the VDI gets 'recovered' to an unstable state, hence dog vdi check gives:
97.0 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000476 is inconsistent
97.0 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000477 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000478 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e00000479 is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4e0000047a is inconsistent
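(That output comes from running the check against the affected VDI, i.e.
something along the lines of "dog vdi check <vdiname>" with the name taken from
dog vdi list.)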
I will try again, this time enabling the journal skip option on the first restart after I pulled the plug on one of the nodes.
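That first restart would then be something like the command above with the skip
keyword appended to the journal option (a sketch; I'm assuming skip is simply
added to -j the same way I did during the earlier recovery):

/usr/sbin/sheep -y 10.10.0.22 ... -j size=512M,skip -n ... /mnt/sheep/22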
Greets,
Micha