[sheepdog-users] issues re-joining a cluster
Valerio Pachera
sirio81 at gmail.com
Sat Dec 6 12:29:55 CET 2014
In a previous discussion, 'Re-joining the cluster doesn't remove
orphan objects', we pointed out that you have to remove the metadata
directory when rejoining the cluster if any vdi was deleted in the
meantime.
It seems this no longer holds with sheepdog daemon version
0.9.0_12_g139ab59 built with --enable-diskvnodes.
First test:
remove a node;
wait for recovery to finish;
delete a vdi;
add the node back, keeping its metadata directory.
(This is the case known to cause problems; see the sketch below.)
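To make the steps concrete, this is roughly what I run; the vdi name
'test' is just an example, and the commands assume the dog client:

  # on the node being removed: stop the daemon
  pkill sheep
  # on any remaining node: repeat until no recovery is in progress
  dog node recovery
  # delete a vdi
  dog vdi delete test
  # back on the removed node: restart sheep with its store untouched
  sheep <same device paths and options as before>

The log then shows: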
Dec 06 11:19:38 INFO [main] recover_object_main(905) object recovery
progress 100%
Dec 06 11:19:38 ERROR [rw 3122] sheep_exec_req(1170) failed Network
error between sheep, remote address: 192.168.10.4:7000, op name:
READ_PEER
Dec 06 11:19:38 ERROR [rw 3122] recover_replication_object(411) can
not recover oid fd3815000004f8
Dec 06 11:19:38 ERROR [rw 3122] recover_object_work(575) failed to
recover object fd3815000004f8
Dec 06 11:19:38 ERROR [io 3272] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3272] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3272] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3158] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3159] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3273] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3276] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3277] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3278] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3272] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3147] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3280] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:38 ERROR [io 3158] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:38 ERROR [io 3158] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3273] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3276] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3277] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3278] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3272] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3147] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3280] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3158] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3159] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3273] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3276] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3277] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3278] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3279] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3147] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3280] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3275] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3158] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:19:39 ERROR [io 3159] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3273] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:19:39 ERROR [io 3276] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:20:15 NOTICE [main] cluster_recovery_completion(726) all
nodes are recovered, epoch 3
Second test:
remove a node;
wait for recovery to finish;
delete a vdi;
add the node back, clearing its metadata directory first.
(This procedure didn't use to give problems; see the sketch below.)
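The only difference from the first test is wiping the metadata
directory before restarting the daemon; /path/to/meta below is a
placeholder for the actual metadata directory:

  # on the removed node, before restarting sheep
  rm -rf /path/to/meta/*
  sheep <same device paths and options as before>

This time the log shows: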
Dec 06 11:37:37 ERROR [rw 4215] sheep_exec_req(1170) failed Network
error between sheep, remote address: 192.168.10.5:7000, op name:
READ_PEER
Dec 06 11:37:37 ERROR [rw 4216] recover_replication_object(411) can
not recover oid 7c2b2500000096
Dec 06 11:37:37 ERROR [rw 4216] recover_object_work(575) failed to
recover object 7c2b2500000096
Dec 06 11:37:37 ERROR [rw 4215] recover_replication_object(411) can
not recover oid 7c2b250000008e
Dec 06 11:37:37 ERROR [rw 4215] recover_object_work(575) failed to
recover object 7c2b250000008e
Dec 06 11:37:37 ERROR [io 4296] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:37:37 ERROR [io 4362] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:37:37 ERROR [io 4298] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:37:37 ERROR [io 4243] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:37:37 ERROR [rw 4219] sheep_exec_req(1170) failed Network
error between sheep, remote address: 192.168.10.5:7000, op name:
READ_PEER
Dec 06 11:37:37 ERROR [rw 4219] recover_replication_object(411) can
not recover oid 7c2b25000000a2
Dec 06 11:37:37 ERROR [rw 4219] recover_object_work(575) failed to
recover object 7c2b25000000a2
Third test:
remove a node;
delete a vdi while recovery is still running.
(This is another test I didn't try before; see the sketch below.)
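Roughly (again, the vdi name is just an example):

  # on the node being removed
  pkill sheep
  # on any remaining node, while recovery is still in progress
  dog vdi delete test

The log then shows: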
Dec 06 11:27:23 ERROR [io 3996] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:27:23 ERROR [io 3994] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:27:23 ERROR [io 3993] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:27:23 ERROR [io 3996] err_to_sderr(101) /mnt/sheep/1/.stale corrupted
Dec 06 11:27:23 ERROR [io 3994] err_to_sderr(101) /mnt/sheep/0/.stale corrupted
Dec 06 11:27:24 INFO [main] recover_object_main(905) object recovery
progress 61%
Dec 06 11:27:24 INFO [main] recover_object_main(905) object recovery
progress 62%
Dec 06 11:27:24 INFO [main] recover_object_main(905) object recovery
progress 63%
Dec 06 11:27:25 INFO [main] recover_object_main(905) object recovery
progress 64%
Dec 06 11:27:26 INFO [main] recover_object_main(905) object recovery
progress 65%
Dec 06 11:27:26 INFO [main] recover_object_main(905) object recovery
progress 66%
Dec 06 11:27:27 INFO [main] recover_object_main(905) object recovery
progress 67%
Dec 06 11:27:27 INFO [main] recover_object_main(905) object recovery
progress 69%
Dec 06 11:27:27 ERROR [rw 3962] recover_replication_object(411) can
not recover oid fd381500000000
Dec 06 11:27:27 ERROR [rw 3962] recover_object_work(575) failed to
recover object fd381500000000
Dec 06 11:27:27 ERROR [rw 3963] recover_replication_object(411) can
not recover oid fd381500000001
Dec 06 11:27:27 ERROR [rw 3963] recover_object_work(575) failed to
recover object fd381500000001
Dec 06 11:27:27 ERROR [rw 3958] sheep_exec_req(1170) failed Network
error between sheep, remote address: 192.168.10.6:7000, op name:
READ_PEER
Dec 06 11:27:27 ERROR [rw 3958] recover_replication_object(411) can
not recover oid fd381500000002
Dec 06 11:27:27 ERROR [rw 3958] recover_object_work(575) failed to
recover object fd381500000002
Dec 06 11:27:27 ERROR [rw 3961] recover_replication_object(411) can
not recover oid fd381500000004
The only way to avoid these errors is to completely wipe out both the
object store and the metadata directory before re-joining the cluster.
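In other words, something along these lines on the node before
restarting its daemon (/mnt/sheep/0 and /mnt/sheep/1 are my data
disks; /path/to/meta is a placeholder for the metadata directory):

  rm -rf /mnt/sheep/0/* /mnt/sheep/1/* /path/to/meta/*
  sheep <same device paths and options as before>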