[sheepdog-users] Errors after recovery

Valerio Pachera sirio81 at gmail.com
Wed Nov 6 09:14:06 CET 2013


Hi, I have a production cluster with 3 nodes running the reclaim
branch (0.7.0_19_g331cdcf).

It's using 672G of the 1.5T total available.
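
For reference, those figures come from the collie client that ships with this build (the client is still called 'collie' in 0.7), roughly:

  # assumption: standard collie client from the same build
  collie node info    # per-node and total disk usage
  collie node list    # the 3 nodes currently in the cluster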

I had a previous issue with this cluster: a host disconnected itself
during a backup (qemu-img convert).
I put the node back in, but after the recovery sheep.log was showing
these messages:

Oct 23 09:00:40  ERROR [gway 2827] wait_forward_request(212) fail
105f5400000000, No object found
Oct 23 09:00:40  ERROR [io 19821] remove_object(452) failed to remove
object 105f5400000000, No object found
Oct 23 09:00:40  ERROR [io 19821] peer_decref_object(1030) error No object
found
Oct 23 09:08:39  ERROR [gway 2827] gateway_forward_request(315) fail to
write local 105f5400000080, No object found
Oct 23 09:08:39  ERROR [gway 2827] wait_forward_request(212) fail
105f5400000080, No object found
Oct 23 09:08:39  ERROR [io 20567] remove_object(452) failed to remove
object 105f5400000080, No object found
Oct 23 09:08:39  ERROR [io 20567] peer_decref_object(1030) error No object
found
Oct 23 09:08:41  ERROR [gway 2827] gateway_forward_request(315) fail to
write local 105f5400000120, No object found

Yesterday evening I stopped all the guests and then shut down the cluster.
Note: after 'cluster shutdown' all the sheep daemons were frozen; I had to
kill them with kill -9.
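
For the record, the sequence was roughly this (the force-kill was a plain kill -9 of the sheep pids on each node):

  # assumption: one sheep daemon per node
  collie cluster shutdown    # this is where the daemons froze instead of exiting
  pgrep -l sheep             # they were still running on every node
  pkill -9 sheep             # had to force-kill them on each node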

Once I restarted sheep on all 3 nodes, a recovery started (all 3 nodes
involved).
It was quick (~20 minutes).
I ran vdi check on the larger vdi (200G) and no messages were reported in
sheep.log, so it looked fine.
(I also made a backup of it with qemu-img convert and that went fine.)
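
The check and the backup were along these lines (the vdi name is just a placeholder here):

  # assumption: 'vm-disk' stands for the real 200G vdi
  collie vdi check vm-disk                               # no errors reported, nothing new in sheep.log
  qemu-img convert sheepdog:vm-disk /backup/vm-disk.raw  # backup via the qemu sheepdog driver, output defaults to raw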

I then killed node id 0 (the one that had failed on 23 October).
The node was storing 303G of data.
It took all night (~ 8h 40') to complete the recovery.
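
By "killed" I mean I simply killed the sheep process on that host, something like:

  # assumption: run directly on the host holding node id 0;
  # the other two nodes then start recovering its data
  pgrep -l sheep
  kill $(pgrep sheep)
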
Up until this morning (i.e. the end of the recovery), no messages were
reported in either sheep.log.
At the end of it, I got these messages:


id 0
...
Nov 06 08:22:09   INFO [main] recover_object_main(623) object
10b38ba700000240 is recovered (171829/171830)
Nov 06 08:22:09   INFO [main] recover_object_main(623) object
b38ba700000240 is recovered (171830/171830)
Nov 06 08:22:09  ERROR [main] modify_event(150) event info for fd 28 not
found
Nov 06 08:22:09  ERROR [main] modify_event(150) event info for fd 30 not
found
Nov 06 08:22:09  ERROR [main] modify_event(150) event info for fd 33 not
found
Nov 06 08:22:09  ERROR [main] modify_event(150) event info for fd 24 not
found
Nov 06 08:35:01  ERROR [io 4032] do_epoch_log_read(54) failed to open epoch
3 log, No such file or directory
Nov 06 08:35:01  ERROR [io 4032] sheep_exec_req(855) failed Failed to find
requested tag
Nov 06 08:35:01  ERROR [io 4032] do_epoch_log_read(54) failed to open epoch
1 log, No such file or directory
Nov 06 08:35:01  ERROR [io 4032] sheep_exec_req(855) failed Failed to find
requested tag

id 1
...
Nov 06 06:04:21   INFO [main] recover_object_main(623) object
10b38ba700000240 is recovered (171829/171830)
Nov 06 06:04:21   INFO [main] recover_object_main(623) object
b38ba700000240 is recovered (171830/171830)
Nov 06 06:13:14  ERROR [net 14080] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14  ERROR [net 14080] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14  ERROR [net 13405] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 13405] rx_work(536) failed to read a header
Nov 06 06:13:14  ERROR [net 14080] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14  ERROR [net 14080] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14  ERROR [net 13405] do_read(220) connection is closed (48
bytes left)
Nov 06 06:13:14  ERROR [net 13405] rx_work(536) failed to read a header
Nov 06 06:15:19  ERROR [gway 15164] gateway_forward_request(315) fail to
write local 68779600000034, No object found
Nov 06 06:15:19  ERROR [gway 15164] wait_forward_request(212) fail
68779600000034, No object found
Nov 06 06:15:19  ERROR [gway 4343] remove_object(452) failed to remove
object 68779600000034, No object found
Nov 06 06:15:19  ERROR [gway 4343] peer_decref_object(1030) error No object
found
Nov 06 06:15:19  ERROR [gway 4343] gateway_forward_request(315) fail to
write local 1068779600000034, No object found
Nov 06 06:15:19  ERROR [gway 15126] dec_object_refcnt(476) failed to
decrement reference 1068779600000034, No object found
Nov 06 06:15:19  ERROR [gway 15126] update_obj_refcnt(379) fail, 2
Nov 06 06:15:43  ERROR [io 14819] do_epoch_log_read(54) failed to open
epoch 3 log, No such file or directory
Nov 06 06:15:43  ERROR [io 14683] do_epoch_log_read(54) failed to open
epoch 1 log, No such file or directory
Nov 06 06:15:43  ERROR [io 14683] do_epoch_log_read(54) failed to open
epoch 3 log, No such file or directory
Nov 06 06:15:43  ERROR [io 14819] do_epoch_log_read(54) failed to open
epoch 1 log, No such file or directory
Nov 06 06:15:48   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:15:53   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:15:58   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:03   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:08   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:11  ERROR [net 15117] do_read(220) connection is closed (48
bytes left)
Nov 06 06:16:11  ERROR [net 15117] rx_work(536) failed to read a header
Nov 06 06:16:11  ERROR [net 15118] do_read(220) connection is closed (48
bytes left)
Nov 06 06:16:11  ERROR [net 15118] rx_work(536) failed to read a header
Nov 06 06:16:11  ERROR [net 15118] do_read(220) connection is closed (48
bytes left)
Nov 06 06:16:11  ERROR [net 15118] rx_work(536) failed to read a header
Nov 06 06:16:12  ERROR [net 15123] do_read(220) connection is closed (48
bytes left)
Nov 06 06:16:12  ERROR [net 15123] rx_work(536) failed to read a header
Nov 06 06:16:13   WARN [gway 15126] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:18  ERROR [gway 15210] gateway_forward_request(315) fail to
write local 68779600000052, No object found
Nov 06 06:16:18  ERROR [gway 15210] wait_forward_request(212) fail
68779600000052, No object found
Nov 06 06:16:18  ERROR [gway 15165] remove_object(452) failed to remove
object 68779600000052, No object found
Nov 06 06:16:18  ERROR [gway 15165] peer_decref_object(1030) error No
object found
Nov 06 06:16:19  ERROR [gway 15165] gateway_forward_request(315) fail to
write local 1068779600000052, No object found
Nov 06 06:16:24   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:29   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:34   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:39   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:44   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:49   WARN [gway 15165] wait_forward_request(177) poll timeout
1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:57  ERROR [gway 15164] gateway_forward_request(315) fail to
write local 68779600000053, No object found
Nov 06 06:16:57  ERROR [gway 15164] wait_forward_request(212) fail
68779600000053, No object found
Nov 06 06:16:57  ERROR [gway 15209] remove_object(452) failed to remove
object 68779600000053, No object found
Nov 06 06:16:57  ERROR [gway 15209] peer_decref_object(1030) error No
object found
Nov 06 06:16:57  ERROR [gway 15209] gateway_forward_request(315) fail to
write local 1068779600000053, No object found
Nov 06 06:16:57  ERROR [gway 15210] dec_object_refcnt(476) failed to
decrement reference 1068779600000053, No object found
...
Nov 06 08:18:43  ERROR [gway 17070] gateway_forward_request(315) fail to
write local 102748e90000962d, No object found
Nov 06 08:18:43  ERROR [gway 17071] dec_object_refcnt(476) failed to
decrement reference 102748e90000962d, No object found
Nov 06 08:18:43  ERROR [gway 17071] update_obj_refcnt(379) fail, 2
Nov 06 08:18:44  ERROR [gway 17071] gateway_forward_request(315) fail to
write local 2748e90000962f, No object found
Nov 06 08:18:44  ERROR [gway 17071] wait_forward_request(212) fail
2748e90000962f, No object found
Nov 06 08:18:44  ERROR [gway 17072] remove_object(452) failed to remove
object 2748e90000962f, No object found
Nov 06 08:18:44  ERROR [gway 17072] peer_decref_object(1030) error No
object found
Nov 06 08:18:44  ERROR [gway 17072] gateway_forward_request(315) fail to
write local 102748e90000962f, No object found
Nov 06 08:18:44  ERROR [gway 17070] dec_object_refcnt(476) failed to
decrement reference 102748e90000962f, No object found
Nov 06 08:18:44  ERROR [gway 17070] update_obj_refcnt(379) fail, 2
Nov 06 08:18:44  ERROR [gway 17071] gateway_forward_request(315) fail to
write local 2748e900009630, No object found
Nov 06 08:18:44  ERROR [gway 17071] wait_forward_request(212) fail
2748e900009630, No object found
Nov 06 08:18:44  ERROR [gway 17072] remove_object(452) failed to remove
object 2748e900009630, No object found
Nov 06 08:18:44  ERROR [gway 17072] peer_decref_object(1030) error No
object found
Nov 06 08:18:44  ERROR [gway 17072] gateway_forward_request(315) fail to
write local 102748e900009630, No object found
Nov 06 08:18:44  ERROR [gway 17070] dec_object_refcnt(476) failed to
decrement reference 102748e900009630, No object found
Nov 06 08:18:44  ERROR [gway 17070] update_obj_refcnt(379) fail, 2
Nov 06 08:27:55  ERROR [gway 17071] gateway_forward_request(315) fail to
write local 2748e9000030d0, No object found
Nov 06 08:27:55  ERROR [gway 17071] wait_forward_request(212) fail
2748e9000030d0, No object found
Nov 06 08:27:55  ERROR [gway 17070] remove_object(452) failed to remove
object 2748e9000030d0, No object found
Nov 06 08:27:55  ERROR [gway 17070] peer_decref_object(1030) error No
object found
Nov 06 08:27:55  ERROR [gway 17070] gateway_forward_request(315) fail to
write local 102748e9000030d0, No object found
Nov 06 08:27:55  ERROR [gway 17072] dec_object_refcnt(476) failed to
decrement reference 102748e9000030d0, No object found
Nov 06 08:27:55  ERROR [gway 17072] update_obj_refcnt(379) fail, 2
Nov 06 08:32:38  ERROR [io 16541] do_epoch_log_read(54) failed to open
epoch 3 log, No such file or directory
Nov 06 08:32:38  ERROR [io 15571] do_epoch_log_read(54) failed to open
epoch 1 log, No such file or directory
<and keeps printing errors>

What do you think about these messages/errors?
Do I have to worry?