[sheepdog-users] Very high wait time during recovery

Simon Hürlimann simon.huerlimann at cyt.ch
Mon Apr 13 23:33:31 CEST 2015


Hi

I'm using a 3-node setup of sheepdog 0.7.5-1 as provided by Ubuntu
12.04.4 LTS, together with qemu. The VDIs have two copies.
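
For reference, the cluster was set up roughly like this; the commands are
from memory, and the VDI name and size are just examples, not the actual
guests:

  # format the cluster so every object is stored on two nodes
  dog cluster format --copies 2
  # create a VDI for a guest
  dog vdi create example-vdi 20G
  # attach it to a guest via qemu's sheepdog protocol
  qemu-system-x86_64 -m 1024 -drive file=sheepdog:example-vdi,if=virtio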

I'm experiencing high wait times in the guests, even leading to some
short outages, during recovery or node kills. One actual occurrence was
when I killed one of the nodes with 'dog node kill'. As far as I
understand, the other two nodes then start recovery to restore the
two-copy policy. During this, I get the following logs:

Apr 13 20:24:54   INFO [main] recover_object_main(624) object
7b7f0400001414 is recovered (1/70169)
Apr 13 20:24:54   INFO [main] recover_object_main(624) object
7b7f0400000c5f is recovered (2/70169)
[...]
Apr 13 20:25:29   INFO [main] recover_object_main(624) object
3eb96700002d9f is recovered (1346/70169)
Apr 13 20:25:29   INFO [main] recover_object_main(624) object
b8729c00000255 is recovered (1347/70169)
Apr 13 20:25:29   INFO [main] recover_object_main(624) object
3eb9670000003a is recovered (1348/70169)
Apr 13 20:25:29  ERROR [io 28218] do_epoch_log_read(54) failed to open
epoch 54 log, No such file or directory
Apr 13 20:25:29  ERROR [io 28230] do_epoch_log_read(54) failed to open
epoch 53 log, No such file or directory
[...]
Apr 13 20:25:29  ERROR [io 28234] do_epoch_log_read(54) failed to open
epoch 2 log, No such file or directory
Apr 13 20:25:29  ERROR [io 28235] do_epoch_log_read(54) failed to open
epoch 1 log, No such file or directory
Apr 13 20:25:29   INFO [main] recover_object_main(624) object
b8729c00000329 is recovered (1350/70169)
Apr 13 20:25:29   INFO [main] recover_object_main(624) object
3eb96700003017 is recovered (1351/70169)
[...]
Apr 13 20:26:16   INFO [main] recover_object_main(624) object
38499b00000041 is recovered (2053/70169)
Apr 13 20:26:17   INFO [main] recover_object_main(624) object
43f30a0000003c is recovered (2054/70169)
Apr 13 20:26:18   INFO [main] recover_object_main(624) object
7b7f0400001414 is recovered (1/45485)
Apr 13 20:26:18   INFO [main] recover_object_main(624) object
7b7f0400000c5f is recovered (2/45485)
Apr 13 20:26:18   INFO [main] recover_object_main(624) object
eefb9700000f35 is recovered (3/45485)
[...]
Apr 13 20:26:19   INFO [main] recover_object_main(624) object
eefb9700005897 is recovered (34123/45485)
Apr 13 20:26:27   WARN [gway 28175] wait_forward_request(177) poll
timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Apr 13 20:27:38  ERROR [gway 28175] wait_forward_request(212) fail
c4cbc3000001cf, Node is killed
Apr 13 20:27:38  ERROR [gway 28194] wait_forward_request(212) fail
c4cbc3000001cf, Node is killed
[...]
Apr 13 20:27:38  ERROR [gway 28175] wait_forward_request(212) fail
c4cbc3000001cf, Node is killed
Apr 13 20:27:38  ERROR [gway 28194] wait_forward_request(212) fail
c4cbc3000001cf, Node is killed
Apr 13 20:27:38   INFO [main] recover_object_main(624) object
7b7f0400001414 is recovered (1/70169)
Apr 13 20:27:38   INFO [main] recover_object_main(624) object
7b7f0400000c5f is recovered (2/70169)
Apr 13 20:27:38   INFO [main] recover_object_main(624) object
eefb9700000f35 is recovered (3/70169)
[...]
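
For completeness, this is roughly how I trigger the kill and watch the
recovery; the commands are from memory and the node index is just an
example:

  # list the nodes to find the index of the one to remove
  dog node list
  # kill that node; the remaining nodes start recovery
  dog node kill 2
  # show which nodes are still recovering
  dog node recovery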

Is this something that has been fixed in newer versions?

I can also dig out logs from earlier occurrences. The problem mostly
seems to be related to similar 'wait_forward_request' messages.

Cheers
Simon



-- 
CyT GmbH
Dorfstr. 17
6300 Zug
Tel.   +41 41 544 47 60
Mobile +41 76 497 80 07



