[sheepdog-users] [sheepdog-0.9.2] Data corruption after two nodes were killed and came back again
Marcin Mirosław
marcin at mejor.pl
Tue Sep 29 15:17:43 CEST 2015
Hi!
I've made an attempt to build a real sheepdog cluster. After a short time
of using it I'm getting various errors.
# dog node list
Id Host:Port V-Nodes Zone
0 192.168.1.202:7000 45 3389106368
1 192.168.1.230:7000 706 3858868416
2 192.168.8.200:7000 4 3356010688
3 192.168.8.201:7000 10 3372787904
4 193.221.122.241:7000 2 4051361217
5 193.221.122.242:7000 2 4068138433
# dog cluster format -c 2:1
using backend plain store
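(As I understand the documentation, "-c 2:1" selects erasure coding with 2 data
strips plus 1 parity strip per object, so at most one strip of an object may be
missing at any time; plain two-copy replication would instead be formatted with:
# dog cluster format -c 2
)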
# dog cluster info -v
Cluster status: running, auto-recovery enabled
Cluster store: plain with 2:1 redundancy policy
Cluster vnode mode: node
Cluster created at Tue Sep 29 11:57:55 2015
Epoch Time Version
2015-09-29 11:57:56 1 [192.168.1.202:7000, 192.168.1.230:7000,
192.168.8.200:7000, 192.168.8.201:7000, 193.221.122.241:7000,
193.221.122.242:7000]
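The test volume is a 10 GB VDI attached as a block device on a client; it was
created roughly like this (the VDI name below is only an example):
# dog vdi create test 10G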
Next I kill one sheep and wait for synchronization, then I kill another sheep
and wait for synchronization again. After that I write some data to the block
device and start both sheep again (the exact sequence is sketched below the
check output). Then I get a lot of data corruption on the block device, and
dog cluster check throws many messages:
42.5 % [=========================================================> ] 4.3 GB / 10 GB object fda51e00000441 is inconsistent
42.7 % [==========================================================> ] 4.3 GB / 10 GB object fda51e00000443 is inconsistent
42.8 % [==========================================================> ] 4.3 GB / 10 GB object fda51e00000442 is inconsistent
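To be precise, the whole sequence was roughly as follows (a sketch from memory;
the exact commands on the nodes were equivalent to these):
# pkill sheep                        (on 192.168.8.200)
  ... wait until sheep.log reports "all nodes are recovered" ...
# pkill sheep                        (on 192.168.8.201)
  ... wait for recovery to finish again ...
# dd if=/dev/urandom of=/dev/<attached block device> bs=1M count=<N>   (on the client)
# sheep /var/lib/sheepdog            (start sheep again on both killed nodes)
# dog cluster check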
On the first killed sheep (192.168.8.200) the log contains:
> Sep 29 12:10:29 INFO [main] recover_object_main(904) object recovery progress 100%
> Sep 29 12:11:19 NOTICE [main] cluster_recovery_completion(714) all nodes are recovered, epoch 5
> Sep 29 15:02:10 INFO [main] main(966) shutdown
> Sep 29 15:02:10 INFO [main] zk_leave(989) leaving from cluster
> Sep 29 15:06:40 INFO [main] md_add_disk(343) /var/lib/sheepdog/obj, vdisk nr 35, total disk 1
> Sep 29 15:06:41 NOTICE [main] get_local_addr(522) found IPv4 address
> Sep 29 15:06:41 INFO [main] send_join_request(1032) IPv4 ip:192.168.8.200 port:7000 going to join the cluster
> Sep 29 15:06:41 NOTICE [main] nfs_init(608) nfs server service is not compiled
> Sep 29 15:06:41 INFO [main] check_host_env(500) Allowed open files 1024000, suggested 6144000
> Sep 29 15:06:41 INFO [main] main(958) sheepdog daemon (version 0.9.2) started
> Sep 29 15:06:41 ERROR [block] sheep_exec_req(1170) failed The buffer is too small, remote address: 192.168.1.202:7000, op name: VDI_STATE_SNAPSHOT_CTL
> Sep 29 15:06:42 ERROR [block] sheep_exec_req(1170) failed The buffer is too small, remote address: 192.168.1.202:7000, op name: VDI_STATE_SNAPSHOT_CTL
> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery progress 1%
> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery progress 2%
> Sep 29 15:06:43 ERROR [rw 945515] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery progress 3%
> Sep 29 15:06:43 ERROR [rw 945542] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 945542] atomic_create_and_write(694) failed to rename /var/lib/sheepdog/epoch/00000007, No such file or directory
> Sep 29 15:06:44 ERROR [rw 945540] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:44 INFO [main] recover_object_main(904) object recovery progress 4%
> Sep 29 15:06:44 ERROR [rw 945542] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:45 INFO [main] recover_object_main(904) object recovery progress 5%
On the second killed sheep (192.168.8.201) sheep.log contains:
> Sep 29 15:03:49 INFO [main] recover_object_main(904) object recovery progress 98%
> Sep 29 15:03:49 INFO [main] recover_object_main(904) object recovery progress 99%
> ========== Why is there no line here with the message
> "recover_object_main(904) object recovery progress 100%"?
> Sep 29 15:04:17 NOTICE [main] cluster_recovery_completion(714) all nodes are recovered, epoch 6
> Sep 29 15:05:58 INFO [main] main(966) shutdown
> Sep 29 15:05:58 INFO [main] zk_leave(989) leaving from cluster
> Sep 29 15:06:38 INFO [main] md_add_disk(343) /var/lib/sheepdog/obj, vdisk nr 88, total disk 1
> Sep 29 15:06:38 NOTICE [main] get_local_addr(522) found IPv4 address
> Sep 29 15:06:38 INFO [main] send_join_request(1032) IPv4 ip:192.168.8.201 port:7000 going to join the cluster
> Sep 29 15:06:39 NOTICE [main] nfs_init(608) nfs server service is not compiled
> Sep 29 15:06:39 INFO [main] check_host_env(500) Allowed open files 1024000, suggested 6144000
> Sep 29 15:06:39 INFO [main] main(958) sheepdog daemon (version 0.9.2) started
> Sep 29 15:06:39 ERROR [block] sheep_exec_req(1170) failed The buffer is too small, remote address: 192.168.1.202:7000, op name: VDI_STATE_SNAPSHOT_CTL
> Sep 29 15:06:39 ERROR [block] sheep_exec_req(1170) failed The buffer is too small, remote address: 192.168.1.202:7000, op name: VDI_STATE_SNAPSHOT_CTL
> Sep 29 15:06:40 INFO [main] recover_object_main(904) object recovery progress 1%
> Sep 29 15:06:41 ERROR [rw 293160] sheep_exec_req(1170) failed Request has an old epoch, remote address: 193.221.122.241:7000, op name: READ_PEER
> Sep 29 15:06:41 ERROR [rw 293160] recover_object_work(574) failed to recover object fda51e0000002e
> Sep 29 15:06:42 INFO [main] local_vdi_state_snapshot_ctl(1398) freeing vdi state snapshot at epoch 7
> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery progress 1%
> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 ERROR [rw 293160] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 INFO [main] recover_object_main(904) object recovery progress 2%
> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:42 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 INFO [main] recover_object_main(904) object recovery progress 3%
> Sep 29 15:06:43 ERROR [rw 293160] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293174] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293176] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
> Sep 29 15:06:43 ERROR [rw 293175] sheep_exec_req(1170) failed No object found, remote address: 192.168.8.201:7000, op name: READ_PEER
Is this a known issue?
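(In the meantime, if I understand correctly, I can try "dog vdi check <vdi
name>" to repair the inconsistent objects, but I'd first like to understand why
the corruption happened at all.)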
Marcin