[sheepdog-users] Unexpected freeze of sheep on one node
Valerio Pachera
sirio81 at gmail.com
Thu Nov 20 18:56:30 CET 2014
I think I found the root cause: time shifting!
I can see from Nagios that node id0 was in a warning state because its
clock was 2-3 minutes behind.
Later on, a scheduled job synchronized the clock via NTP.
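(To check the offset by hand, a query like the one below shows the
drift without touching the clock; the NTP server name is just a
placeholder, use whatever server your nodes normally sync against.)

ntpdate -q pool.ntp.org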
I tried to replicate this scenario on my test cluster.
Sure enough, ZooKeeper does not like time shifting: presumably its
session timers follow the wall clock, so a jump of a few minutes, far
larger than the session timeout, makes the client session expire (the
log below shows exactly that).
A guest is running and writing 1 MB of random data every second.
Here I move the clock on one node 5 minutes back; a rough sketch of the
steps follows.
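(The write loop, file path and relative date string are only
illustrative, not the exact commands I used.)

# inside the guest: append 1 MB of random data every second
while true; do
    dd if=/dev/urandom bs=1M count=1 >> /root/testfile 2>/dev/null
    sleep 1
done

# on the sheep node under test: step the clock 5 minutes back
date -s "5 minutes ago"

Right after the clock step, sheep.log on that node fills with errors: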
Nov 21 18:36:28 ERROR [main] zk_create_seq_node(251) failed,
path:/sheepdog/queue/, connection loss
Nov 21 18:36:28 ERROR [main] zk_get_data(264) failed,
path:/sheepdog/queue/0000000898, session expired
Nov 21 18:36:28 ERROR [main] zk_find_seq_node(503) failed, session expired
Nov 21 18:36:28 ERROR [main] zk_queue_push(534) failed,
path:/sheepdog/queue/, connection loss
Nov 21 18:36:28 ERROR [main] add_event(683) failed, type: 3, connection loss
Nov 21 18:36:28 ERROR [main] zk_create_seq_node(251) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] zk_queue_push(534) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] add_event(683) failed, type: 3, invalid
zhandle state
Nov 21 18:36:28 ERROR [main] zk_create_seq_node(251) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] zk_queue_push(534) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] add_event(683) failed, type: 3, invalid
zhandle state
Nov 21 18:36:28 ERROR [main] zk_create_seq_node(251) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] zk_queue_push(534) failed,
path:/sheepdog/queue/, invalid zhandle state
Nov 21 18:36:28 ERROR [main] add_event(683) failed, type: 3, invalid
zhandle state
Nov 21 18:36:28 ERROR [main] zk_event_handler(1239) detect a session
timeout. reconnecting...
Nov 21 18:36:28 INFO [main] send_join_request(1006) IPv4
ip:192.168.10.4 port:7000 going to join the cluster
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:28 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 21 18:36:29 INFO [main] zk_event_handler(1241) reconnected
Nov 21 18:36:30 INFO [main] recover_object_main(905) object recovery
progress 2%
Nov 21 18:36:30 INFO [main] recover_object_main(905) object recovery
progress 3%
Nov 21 18:36:30 INFO [main] recover_object_main(905) object recovery
progress 5%
Nov 21 18:36:30 INFO [main] recover_object_main(905) object recovery
progress 7%
Nov 21 18:36:30 INFO [main] recover_object_main(905) object recovery
progress 8%
I read this in another node's sheep.log:
Nov 20 18:44:35 ERROR [rw 19890] sheep_exec_req(1170) failed Request
has an old epoch, remote address: 192.168.10.5:7000, op name:
READ_PEER
Nov 20 18:44:35 ERROR [rw 19890] recover_object_work(575) failed to
recover object 7c2b2500000036
Nov 20 18:44:35 INFO [main] local_vdi_state_snapshot_ctl(1386)
freeing vdi state snapshot at epoch 3
I re-synced the clocks with ntpdate on all nodes, roughly as sketched
below.
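(Placeholder hostnames and NTP server, not my exact commands:)

for host in node0 node1 node2; do
    ssh root@$host ntpdate pool.ntp.org
done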
When I shut down the guest I get these messages:
Nov 20 18:47:08 ERROR [main] zk_create_seq_node(251) failed,
path:/sheepdog/queue/, operation timeout
Nov 20 18:47:18 ERROR [main] zk_get_data(264) failed,
path:/sheepdog/queue/0000000918, session expired
Nov 20 18:47:18 ERROR [main] zk_find_seq_node(503) failed, session expired
Nov 20 18:47:18 ERROR [main] zk_queue_push(534) failed,
path:/sheepdog/queue/, operation timeout
Nov 20 18:47:18 ERROR [main] add_event(683) failed, type: 4, operation timeout
Nov 20 18:47:18 ERROR [main] queue_cluster_request(321) failed to
broadcast block to cluster, Cluster driver error
Nov 20 18:47:18 ERROR [main] zk_event_handler(1239) detect a session
timeout. reconnecting...
Nov 20 18:47:18 INFO [main] send_join_request(1006) IPv4
ip:192.168.10.4 port:7000 going to join the cluster
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_find_master(838) detect master leave,
start to compete master
Nov 20 18:47:18 INFO [main] zk_event_handler(1241) reconnected
Anyway, in another test I ran earlier I got different kinds of errors.
I wasn't able to reproduce the sheep freeze (note that my test cluster
runs the master branch, not 0.8.2).