[sheepdog-users] [sheepdog-0.9.2] Data corruption after two nodes were killed and brought back
Marcin Mirosław
marcin at mejor.pl
Tue Sep 29 15:47:04 CEST 2015
On 29.09.2015 at 15:17, Marcin Mirosław wrote:
> Hi!
> I've made an attempt to build a real sheepdog cluster. After a short
> time of using it I'm getting various errors.
>
>
> # dog node list
> Id Host:Port V-Nodes Zone
> 0 192.168.1.202:7000 45 3389106368
> 1 192.168.1.230:7000 706 3858868416
> 2 192.168.8.200:7000 4 3356010688
> 3 192.168.8.201:7000 10 3372787904
> 4 193.221.122.241:7000 2 4051361217
> 5 193.221.122.242:7000 2 4068138433
> # dog cluster format -c 2:1
> using backend plain store
>
> # dog cluster info -v
> Cluster status: running, auto-recovery enabled
> Cluster store: plain with 2:1 redundancy policy
> Cluster vnode mode: node
> Cluster created at Tue Sep 29 11:57:55 2015
>
> Epoch Time Version
> 2015-09-29 11:57:56 1 [192.168.1.202:7000, 192.168.1.230:7000,
> 192.168.8.200:7000, 192.168.8.201:7000, 193.221.122.241:7000,
> 193.221.122.242:7000]
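>
> (For context on the redundancy policy above: as far as I understand,
> "-c 2:1" requests erasure coding with 2 data strips plus 1 parity strip
> per object, while plain replication would be requested with a single
> copy count, e.g.:
>
>   dog cluster format -c 2   # 2 full copies instead of 2:1 erasure coding
>
> so a 2:1 cluster should be able to tolerate losing one zone at a time.)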
>
>
> Next I kill one sheep and wait for synchronization; then I kill a second
> sheep and wait for synchronization again. Next I write some data to the
> block device and then start both sheep again. At that point I get a lot
> of data corruption on the block device, and dog cluster check reports
> many messages (a rough sketch of the commands involved follows the
> output below):
> 42.5 % [=========================================================>  ] 4.3 GB / 10 GB  object fda51e00000441 is inconsistent
> 42.7 % [==========================================================> ] 4.3 GB / 10 GB  object fda51e00000443 is inconsistent
> 42.8 % [==========================================================> ] 4.3 GB / 10 GB  object fda51e00000442 is inconsistent
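>
> For reference, the steps above roughly correspond to the following
> commands (the VDI name "test", the write size, the store path
> /var/lib/sheepdog, and the use of "dog node recovery" to watch for
> recovery completion are just my shorthand for what I actually did):
>
>   # on the first node to be killed (192.168.8.200):
>   pkill sheep
>   # on any surviving node, wait until recovery has finished:
>   watch dog node recovery
>
>   # on the second node to be killed (192.168.8.201):
>   pkill sheep
>   watch dog node recovery
>
>   # write some data into the VDI from a surviving node:
>   dd if=/dev/urandom bs=1M count=512 | dog vdi write test
>
>   # bring both killed sheep back:
>   sheep /var/lib/sheepdog     # on 192.168.8.200 and 192.168.8.201
>
>   # after recovery completes again:
>   dog cluster check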
>
>
> On the first killed sheep (192.168.8.200), sheep.log contains:
>> Sep 29 12:10:29 INFO [main] recover_object_main(904) object recovery
>> progress 100%
>> Sep 29 12:11:19 NOTICE [main] cluster_recovery_completion(714) all
>> nodes are recovered, epoch 5
>> Sep 29 15:02:10 INFO [main] main(966) shutdown
>> Sep 29 15:02:10 INFO [main] zk_leave(989) leaving from cluster
>> Sep 29 15:06:40 INFO [main] md_add_disk(343) /var/lib/sheepdog/obj,
>> vdisk nr 35, total disk 1
>> Sep 29 15:06:41 NOTICE [main] get_local_addr(522) found IPv4 address
>> Sep 29 15:06:41 INFO [main] send_join_request(1032) IPv4
>> ip:192.168.8.200 port:7000 going to join the cluster
>> Sep 29 15:06:41 NOTICE [main] nfs_init(608) nfs server service is not
>> compiled
>> Sep 29 15:06:41 INFO [main] check_host_env(500) Allowed open files
>> 1024000, suggested 6144000
>> Sep 29 15:06:41 INFO [main] main(958) sheepdog daemon (version
>> 0.9.2) started
>> Sep 29 15:06:41 ERROR [block] sheep_exec_req(1170) failed The buffer
>> is too small, remote address: 192.168.1.202:7000, op name:
>> VDI_STATE_SNAPSHOT_CTL
>> Sep 29 15:06:42 ERROR [block] sheep_exec_req(1170) failed The buffer
>> is too small, remote address: 192.168.1.202:7000, op name:
>> VDI_STATE_SNAPSHOT_CTL
>> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery
>> progress 1%
>> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery
>> progress 2%
>> Sep 29 15:06:43 ERROR [rw 945515] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery
>> progress 3%
>> Sep 29 15:06:43 ERROR [rw 945542] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 945542] atomic_create_and_write(694) failed
>> to rename /var/lib/sheepdog/epoch/00000007, No such file or directory
>> Sep 29 15:06:44 ERROR [rw 945540] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:44 INFO [main] recover_object_main(904) object recovery
>> progress 4%
>> Sep 29 15:06:44 ERROR [rw 945542] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:45 INFO [main] recover_object_main(904) object recovery
>> progress 5%
>
>
> On the second killed sheep (192.168.8.201), sheep.log contains:
>> Sep 29 15:03:49 INFO [main] recover_object_main(904) object recovery
>> progress 98%
>> Sep 29 15:03:49 INFO [main] recover_object_main(904) object recovery
>> progress 99%
> ========== WHY IS THERE NO LINE HERE WITH THE MESSAGE
> "recover_object_main(904) object recovery progress 100%"?
>> Sep 29 15:04:17 NOTICE [main] cluster_recovery_completion(714) all
>> nodes are recovered, epoch 6
>> Sep 29 15:05:58 INFO [main] main(966) shutdown
>> Sep 29 15:05:58 INFO [main] zk_leave(989) leaving from cluster
>> Sep 29 15:06:38 INFO [main] md_add_disk(343) /var/lib/sheepdog/obj,
>> vdisk nr 88, total disk 1
>> Sep 29 15:06:38 NOTICE [main] get_local_addr(522) found IPv4 address
>> Sep 29 15:06:38 INFO [main] send_join_request(1032) IPv4
>> ip:192.168.8.201 port:7000 going to join the cluster
>> Sep 29 15:06:39 NOTICE [main] nfs_init(608) nfs server service is not
>> compiled
>> Sep 29 15:06:39 INFO [main] check_host_env(500) Allowed open files
>> 1024000, suggested 6144000
>> Sep 29 15:06:39 INFO [main] main(958) sheepdog daemon (version
>> 0.9.2) started
>> Sep 29 15:06:39 ERROR [block] sheep_exec_req(1170) failed The buffer
>> is too small, remote address: 192.168.1.202:7000, op name:
>> VDI_STATE_SNAPSHOT_CTL
>> Sep 29 15:06:39 ERROR [block] sheep_exec_req(1170) failed The buffer
>> is too small, remote address: 192.168.1.202:7000, op name:
>> VDI_STATE_SNAPSHOT_CTL
>> Sep 29 15:06:40 INFO [main] recover_object_main(904) object recovery
>> progress 1%
>> Sep 29 15:06:41 ERROR [rw 293160] sheep_exec_req(1170) failed Request
>> has an old epoch, remote address: 193.221.122.241:7000, op name:
>> READ_PEER
>> Sep 29 15:06:41 ERROR [rw 293160] recover_object_work(574) failed to
>> recover object fda51e0000002e
>> Sep 29 15:06:42 INFO [main] local_vdi_state_snapshot_ctl(1398)
>> freeing vdi state snapshot at epoch 7
>> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery
>> progress 1%
>> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 ERROR [rw 293160] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery
>> progress 2%
>> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 ERROR [rw 293174] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery
>> progress 3%
>> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No
>> object found, remote address: 192.168.8.201:7000, op name: READ_PEER
>
>
> Is this a known issue?
Also, I found messages like the following in the logs of three sheep.
The first one (192.168.1.242):
> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery progress 100%
> Sep 29 15:08:20 NOTICE [main] cluster_recovery_completion(714) all nodes are recovered, epoch 9
> Sep 29 15:11:36 ERROR [net 18963] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 18963] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 18963] tx_work(863) failed to send a request
Another one (192.168.8.200):
> Sep 29 15:07:36 ERROR [rw 945540] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:07:36 INFO [main] recover_object_main(904) object recovery progress 97%
> Sep 29 15:07:37 INFO [main] recover_object_main(904) object recovery progress 99%
> Sep 29 15:07:37 ERROR [rw 945542] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:07:37 INFO [main] recover_object_main(904) object recovery progress 100%
> Sep 29 15:08:20 NOTICE [main] cluster_recovery_completion(714) all nodes are recovered, epoch 9
> Sep 29 15:11:36 ERROR [net 947221] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 947221] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 947221] tx_work(863) failed to send a request
And the third sheep (192.168.8.201):
> Sep 29 15:08:19 ERROR [rw 293160] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:08:20 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:08:20 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:08:20 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:08:20 NOTICE [main] cluster_recovery_completion(714) all nodes are recovered, epoch 9
> Sep 29 15:11:36 ERROR [net 294037] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 294037] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 294037] tx_work(863) failed to send a request
> Sep 29 15:11:36 ERROR [net 293170] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 293170] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 293170] tx_work(863) failed to send a request
> Sep 29 15:11:36 ERROR [net 294054] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 294054] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 294054] tx_work(863) failed to send a request
> Sep 29 15:11:36 ERROR [net 293172] do_write(281) failed to write to socket: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 293172] send_req(319) failed to send request a4, 2097152: Connection reset by peer
> Sep 29 15:11:36 ERROR [net 293172] tx_work(863) failed to send a request