[sheepdog-users] Locking problems on 0.9

Hitoshi Mitake mitake.hitoshi at gmail.com
Thu Nov 20 15:54:01 CET 2014


On Tue, Nov 11, 2014 at 6:08 PM, Micha Kersloot <micha at kovoks.nl> wrote:
> Hi Hitoshi,
>
> thank you for your time.
>
>
> Cluster status: Waiting for other nodes to join cluster
>
> Cluster created at Tue Nov  4 14:22:03 2014
>
> Epoch Time           Version
> 2014-11-04 16:55:02      9 [10.10.0.21:7001, 10.10.0.22:7001, 10.10.0.30:7001]
> 2014-11-04 16:54:56      8 [10.10.0.21:7001, 10.10.0.30:7001]
> 2014-11-04 16:54:33      7 [10.10.0.21:7001, 10.10.0.22:7001, 10.10.0.30:7001]
> 1970-01-01 01:00:00      6 []

The above 6th epoch would be the root cause of the problem. An epoch
with no nodes (which clearly cannot happen in a normal situation) can
wipe data under sheepdog's recovery logic.
I'll prepare a patch later that avoids creating such an epoch.
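
For illustration, below is a minimal sketch of the kind of guard such a
patch could add. The names here (sd_node, log_current_epoch) are
hypothetical stand-ins modeled on sheepdog's C conventions, not the
actual patch:

#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in for the real sheepdog node struct */
struct sd_node { char addr[32]; int port; };

/* Refuse to record an epoch that lists no member nodes. */
static int log_current_epoch(uint32_t epoch, const struct sd_node *nodes,
                             size_t nr_nodes)
{
        if (nodes == NULL || nr_nodes == 0) {
                /* An epoch with no members makes recovery treat every
                 * object as belonging to no node, so it can wipe the
                 * store. Fail loudly instead of writing it. */
                fprintf(stderr, "refusing to log epoch %" PRIu32
                        " with no nodes\n", epoch);
                return -1;
        }

        /* ... write the epoch record to the epoch log as usual ... */
        return 0;
}

int main(void)
{
        /* Epoch 6 from the output above: an empty node list is rejected. */
        return log_current_epoch(6, NULL, 0) == -1 ? 0 : 1;
}

The point is simply to reject an empty membership list before it ever
reaches the epoch log, since recovery trusts that log.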

BTW, can you see such an epoch with no nodes on the other sheep daemons?

Thanks,
Hitoshi

> 2014-11-04 16:52:45      5 [10.10.0.21:7001, 10.10.0.22:7001, 10.10.0.30:7001]
> 2014-11-04 16:52:32      4 [10.10.0.22:7001, 10.10.0.30:7001]
> 2014-11-04 16:47:43      3 [10.10.0.21:7001, 10.10.0.22:7001, 10.10.0.30:7001]
> 2014-11-04 16:46:43      2 [10.10.0.21:7001, 10.10.0.30:7001]
> 2014-11-04 14:22:03      1 [10.10.0.30:7001]
>
>
> /root/sheep/usr/sbin/sheep -y 10.10.0.30 -c zookeeper:10.10.0.21:2181,10.10.0.22:2181,10.10.0.30:2181 -n /var/lib/sheepdog/0.9 /mnt/sheep/0.9
>
> ls -la /mnt/sheep/0.9/
> total 3.2M
> drwxr-xr-x 3 root root 980K Nov  7 15:25 .
> drwxr-xr-x 4 root root 1.1M Nov  4 13:21 ..
> drwxr-x--- 2 root root 1.1M Nov  7 15:25 .stale
>
> du -hs /mnt/sheep/0.9/.stale/
> 113G    /mnt/sheep/0.9/.stale/
>
>
> and the last part of the sheep.log:
>
> Nov 07 13:57:04   INFO [main] cluster_release_vdi_main(1370) node: IPv4 ip:10.10.0.30 port:7001 is unlocking VDI (type: normal): 0
> Nov 07 13:57:04  ERROR [main] vdi_unlock(496) no vdi state entry of 0 found
> Nov 07 13:57:04   INFO [main] cluster_lock_vdi_main(1347) node: IPv4 ip:10.10.0.30 port:7001 is locking VDI (type: normal): 9cc242
> Nov 07 13:57:04   INFO [main] vdi_lock(454) VDI 9cc242 is already locked
> Nov 07 13:57:04  ERROR [main] cluster_lock_vdi_main(1350) locking 9cc242failed
> Nov 07 14:00:40   INFO [main] rx_main(830) req=0x2556ab0, fd=25, client=148.251.76.165:47287, op=DEL_VDI, data=(not string)
> Nov 07 14:00:41   INFO [main] tx_main(882) req=0x2556ab0, fd=25, client=148.251.76.165:47287, op=DEL_VDI, result=00
> Nov 07 15:21:55   INFO [main] rx_main(830) req=0x26a7d00, fd=25, client=148.251.76.165:48014, op=SHUTDOWN, data=(null)
> Nov 07 15:21:55   INFO [main] tx_main(882) req=0x26a7d00, fd=25, client=148.251.76.165:48014, op=SHUTDOWN, result=00
> Nov 07 15:21:55   INFO [main] main(959) shutdown
> Nov 07 15:21:55   INFO [main] zk_leave(989) leaving from cluster
> Nov 07 15:25:50   INFO [main] md_add_disk(343) /mnt/sheep/0.9, vdisk nr 844, total disk 1
> Nov 07 15:25:50   INFO [main] send_join_request(1006) IPv4 ip:10.10.0.30 port:7000 going to join the cluster
> Nov 07 15:25:50 NOTICE [main] nfs_init(607) nfs server service is not compiled
> Nov 07 15:25:50   WARN [main] check_host_env(497) Allowed open files 1024 too small, suggested 6144000
> Nov 07 15:25:50   INFO [main] main(951) sheepdog daemon (version 0.9.0) started
> Nov 07 15:26:51  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.21:7000, op name: GET_EPOCH
> Nov 07 15:26:53  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.21:7000, op name: GET_EPOCH
> Nov 07 15:28:03  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.21:7000, op name: GET_EPOCH
> Nov 07 15:28:03  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.22:7000, op name: GET_EPOCH
> Nov 11 10:01:41  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.21:7000, op name: GET_EPOCH
> Nov 11 10:01:41  ERROR [io 7413] sheep_exec_req(1170) failed Waiting for other nodes to join cluster, remote address: 10.10.0.22:7000, op name: GET_EPOCH
>
>
>
>
> Kind regards,
>
> Micha Kersloot
>
> Stay up to date and receive the latest tips about Zimbra/KovoKs Contact:
> http://twitter.com/kovoks
>
> KovoKs B.V. is registered under Chamber of Commerce (KvK) number: 11033334
>
> ----- Original Message -----
>> From: "Hitoshi Mitake" <mitake.hitoshi at lab.ntt.co.jp>
>> To: "Micha Kersloot" <info at kovoks.nl>
>> Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
>> Sent: Tuesday, November 11, 2014 9:09:58 AM
>> Subject: Re: [sheepdog-users] Locking problems on 0.9
>>
>>
>> Hi Micha, sorry for my late reply.
>>
>> At Fri, 7 Nov 2014 15:39:44 +0100 (CET),
>> Micha Kersloot wrote:
>> >
>> > Hi,
>> >
>> > I've done it again...
>> >
>> > Shut down all sheepdog instances with dog cluster shutdown.
>> >
>> > Started the 0.9 version of sheepdog on the default port 7000 instead of
>> > port 7001, on the default zookeeper cluster instead of the alternate one
>> > I created for the conversion.
>> >
>> > Cluster status: Waiting for other nodes to join cluster
>> > on all servers, and the directories assigned to the 0.9 version are all
>> > empty now; all the converted vdi's are lost.
>> >
>> > So I guess my procedure has some faults, but why is all the data lost?
>>
>> Hmm... it is strange. Could you show your "dog cluster info" output on the
>> cluster?
>>
>> Thanks,
>> Hitoshi
>>
>> >
>> > Kind regards,
>> >
>> > Micha Kersloot
>> >
>> > Stay up to date and receive the latest tips about Zimbra/KovoKs Contact:
>> > http://twitter.com/kovoks
>> >
>> > KovoKs B.V. is registered under Chamber of Commerce (KvK) number: 11033334
>> >
>> > ----- Original Message -----
>> > > From: "Micha Kersloot" <micha at kovoks.nl>
>> > > To: "Valerio Pachera" <sirio81 at gmail.com>
>> > > Cc: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
>> > > Sent: Friday, November 7, 2014 3:11:27 PM
>> > > Subject: Re: [sheepdog-users] Locking problems on 0.9
>> > >
>> > > Hi,
>> > >
>> > > I do feel it could be either running both daemons at the same time,
>> > > running the 0.9 daemon on a non-default port, or maybe the version of
>> > > qemu I'm using (the default Debian wheezy version).
>> > >
>> > > Kind regards,
>> > >
>> > > Micha Kersloot
>> > >
>> > > Stay up to date and receive the latest tips about Zimbra/KovoKs Contact:
>> > > http://twitter.com/kovoks
>> > >
>> > > KovoKs B.V. is registered under Chamber of Commerce (KvK) number: 11033334
>> > >
>> > > ----- Original Message -----
>> > > > From: "Valerio Pachera" <sirio81 at gmail.com>
>> > > > To: "Lista sheepdog user" <sheepdog-users at lists.wpkg.org>
>> > > > Sent: Friday, November 7, 2014 2:43:30 PM
>> > > > Subject: Re: [sheepdog-users] Locking problems on 0.9
>> > > >
>> > > > 2014-11-07 13:59 GMT+01:00 Micha Kersloot <micha at kovoks.nl>:
>> > > > > Hi Valerio,
>> > > > >
>> > > > > good idea, but unfortunately the problem remains.
>> > > >
>> > > > I tried to import a qcow2 on a sheepdog 0.9 cluster and start the
>> > > > guest, and it works fine.
>> > > > Could you post the commands you use to run both sheep daemons (0.8.3
>> > > > and 0.9.0)?
>> > > >
>> > > > Thank you.
> --
> sheepdog-users mailing lists
> sheepdog-users at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog-users


