<div dir="ltr"><div><div><div>Hi, I have a production cluster with 3 nodes and it's running reclaim branch (0.7.0_19_g331cdcf).<br><br></div><div>It's using 672G of 1,5T total available.<br></div><div><br></div><div>
I had a previous issue with this cluster: a host disconnected itself during a backup (qemu-img convert).
I inserted the node back, but after the recovery sheep.log was showing these messages:

Oct 23 09:00:40 ERROR [gway 2827] wait_forward_request(212) fail 105f5400000000, No object found
Oct 23 09:00:40 ERROR [io 19821] remove_object(452) failed to remove object 105f5400000000, No object found
Oct 23 09:00:40 ERROR [io 19821] peer_decref_object(1030) error No object found
Oct 23 09:08:39 ERROR [gway 2827] gateway_forward_request(315) fail to write local 105f5400000080, No object found
Oct 23 09:08:39 ERROR [gway 2827] wait_forward_request(212) fail 105f5400000080, No object found
Oct 23 09:08:39 ERROR [io 20567] remove_object(452) failed to remove object 105f5400000080, No object found
Oct 23 09:08:39 ERROR [io 20567] peer_decref_object(1030) error No object found
Oct 23 09:08:41 ERROR [gway 2827] gateway_forward_request(315) fail to write local 105f5400000120, No object found

Yesterday evening, I stopped all guests and then the cluster.
Note: after 'cluster shutdown' all sheep daemons were frozen; I had to kill them with kill -9.
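For reference, the shutdown went roughly like this (the admin tool is written as 'dog' here; it may be 'collie' depending on the build):

  dog cluster shutdown   # asked the cluster to stop cleanly
  # the sheep daemons never exited, so on each of the 3 nodes:
  pkill -9 sheep         # i.e. kill -9 on the stuck sheep processes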
Once I restarted sheep on all 3 nodes, a recovery started (all 3 nodes involved).
It was quick (~20 minutes).
I ran vdi check on the largest vdi (200G) and no messages were reported in sheep.log, so it was fine.
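The check itself was just the standard vdi check, something like this (the vdi name is only an example):

  dog vdi check vm-disk-200g   # 'collie vdi check vm-disk-200g' on older builds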
(I also made a backup of it with qemu-img convert and it went fine; the command is sketched below.)

I then killed node id 0 (the one that failed on October 23).
The node was storing 303G of data.
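The backup command was of roughly this form (vdi name and output path are placeholders):

  qemu-img convert -p -O qcow2 sheepdog:vm-disk-200g /backup/vm-disk-200g.qcow2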
It took all night (~8h 40') to complete the recovery.
Until this morning (i.e. the end of the recovery), no messages were reported in either sheep.log.
At the end of it, I got these messages:
id 0
...
Nov 06 08:22:09 INFO [main] recover_object_main(623) object 10b38ba700000240 is recovered (171829/171830)
Nov 06 08:22:09 INFO [main] recover_object_main(623) object b38ba700000240 is recovered (171830/171830)
Nov 06 08:22:09 ERROR [main] modify_event(150) event info for fd 28 not found
Nov 06 08:22:09 ERROR [main] modify_event(150) event info for fd 30 not found
Nov 06 08:22:09 ERROR [main] modify_event(150) event info for fd 33 not found
Nov 06 08:22:09 ERROR [main] modify_event(150) event info for fd 24 not found
Nov 06 08:35:01 ERROR [io 4032] do_epoch_log_read(54) failed to open epoch 3 log, No such file or directory
Nov 06 08:35:01 ERROR [io 4032] sheep_exec_req(855) failed Failed to find requested tag
Nov 06 08:35:01 ERROR [io 4032] do_epoch_log_read(54) failed to open epoch 1 log, No such file or directory
Nov 06 08:35:01 ERROR [io 4032] sheep_exec_req(855) failed Failed to find requested tag

id 1
...
Nov 06 06:04:21 INFO [main] recover_object_main(623) object 10b38ba700000240 is recovered (171829/171830)
Nov 06 06:04:21 INFO [main] recover_object_main(623) object b38ba700000240 is recovered (171830/171830)
Nov 06 06:13:14 ERROR [net 14080] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14 ERROR [net 14080] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14 ERROR [net 13405] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 13405] rx_work(536) failed to read a header
Nov 06 06:13:14 ERROR [net 14080] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14 ERROR [net 14080] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 14080] rx_work(536) failed to read a header
Nov 06 06:13:14 ERROR [net 13405] do_read(220) connection is closed (48 bytes left)
Nov 06 06:13:14 ERROR [net 13405] rx_work(536) failed to read a header
Nov 06 06:15:19 ERROR [gway 15164] gateway_forward_request(315) fail to write local 68779600000034, No object found
Nov 06 06:15:19 ERROR [gway 15164] wait_forward_request(212) fail 68779600000034, No object found
Nov 06 06:15:19 ERROR [gway 4343] remove_object(452) failed to remove object 68779600000034, No object found
Nov 06 06:15:19 ERROR [gway 4343] peer_decref_object(1030) error No object found
Nov 06 06:15:19 ERROR [gway 4343] gateway_forward_request(315) fail to write local 1068779600000034, No object found
Nov 06 06:15:19 ERROR [gway 15126] dec_object_refcnt(476) failed to decrement reference 1068779600000034, No object found
Nov 06 06:15:19 ERROR [gway 15126] update_obj_refcnt(379) fail, 2
Nov 06 06:15:43 ERROR [io 14819] do_epoch_log_read(54) failed to open epoch 3 log, No such file or directory
Nov 06 06:15:43 ERROR [io 14683] do_epoch_log_read(54) failed to open epoch 1 log, No such file or directory
Nov 06 06:15:43 ERROR [io 14683] do_epoch_log_read(54) failed to open epoch 3 log, No such file or directory
Nov 06 06:15:43 ERROR [io 14819] do_epoch_log_read(54) failed to open epoch 1 log, No such file or directory
Nov 06 06:15:48 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:15:53 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:15:58 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:03 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:08 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:11 ERROR [net 15117] do_read(220) connection is closed (48 bytes left)
Nov 06 06:16:11 ERROR [net 15117] rx_work(536) failed to read a header
Nov 06 06:16:11 ERROR [net 15118] do_read(220) connection is closed (48 bytes left)
Nov 06 06:16:11 ERROR [net 15118] rx_work(536) failed to read a header
Nov 06 06:16:11 ERROR [net 15118] do_read(220) connection is closed (48 bytes left)
Nov 06 06:16:11 ERROR [net 15118] rx_work(536) failed to read a header
Nov 06 06:16:12 ERROR [net 15123] do_read(220) connection is closed (48 bytes left)
Nov 06 06:16:12 ERROR [net 15123] rx_work(536) failed to read a header
Nov 06 06:16:13 WARN [gway 15126] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:18 ERROR [gway 15210] gateway_forward_request(315) fail to write local 68779600000052, No object found
Nov 06 06:16:18 ERROR [gway 15210] wait_forward_request(212) fail 68779600000052, No object found
Nov 06 06:16:18 ERROR [gway 15165] remove_object(452) failed to remove object 68779600000052, No object found
Nov 06 06:16:18 ERROR [gway 15165] peer_decref_object(1030) error No object found
Nov 06 06:16:19 ERROR [gway 15165] gateway_forward_request(315) fail to write local 1068779600000052, No object found
Nov 06 06:16:24 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:29 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:34 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:39 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:44 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:49 WARN [gway 15165] wait_forward_request(177) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
Nov 06 06:16:57 ERROR [gway 15164] gateway_forward_request(315) fail to write local 68779600000053, No object found
Nov 06 06:16:57 ERROR [gway 15164] wait_forward_request(212) fail 68779600000053, No object found
Nov 06 06:16:57 ERROR [gway 15209] remove_object(452) failed to remove object 68779600000053, No object found
Nov 06 06:16:57 ERROR [gway 15209] peer_decref_object(1030) error No object found
Nov 06 06:16:57 ERROR [gway 15209] gateway_forward_request(315) fail to write local 1068779600000053, No object found
Nov 06 06:16:57 ERROR [gway 15210] dec_object_refcnt(476) failed to decrement reference 1068779600000053, No object found
...
Nov 06 08:18:43 ERROR [gway 17070] gateway_forward_request(315) fail to write local 102748e90000962d, No object found
Nov 06 08:18:43 ERROR [gway 17071] dec_object_refcnt(476) failed to decrement reference 102748e90000962d, No object found
Nov 06 08:18:43 ERROR [gway 17071] update_obj_refcnt(379) fail, 2
Nov 06 08:18:44 ERROR [gway 17071] gateway_forward_request(315) fail to write local 2748e90000962f, No object found
Nov 06 08:18:44 ERROR [gway 17071] wait_forward_request(212) fail 2748e90000962f, No object found
Nov 06 08:18:44 ERROR [gway 17072] remove_object(452) failed to remove object 2748e90000962f, No object found
Nov 06 08:18:44 ERROR [gway 17072] peer_decref_object(1030) error No object found
Nov 06 08:18:44 ERROR [gway 17072] gateway_forward_request(315) fail to write local 102748e90000962f, No object found
Nov 06 08:18:44 ERROR [gway 17070] dec_object_refcnt(476) failed to decrement reference 102748e90000962f, No object found
Nov 06 08:18:44 ERROR [gway 17070] update_obj_refcnt(379) fail, 2
Nov 06 08:18:44 ERROR [gway 17071] gateway_forward_request(315) fail to write local 2748e900009630, No object found
Nov 06 08:18:44 ERROR [gway 17071] wait_forward_request(212) fail 2748e900009630, No object found
Nov 06 08:18:44 ERROR [gway 17072] remove_object(452) failed to remove object 2748e900009630, No object found
Nov 06 08:18:44 ERROR [gway 17072] peer_decref_object(1030) error No object found
Nov 06 08:18:44 ERROR [gway 17072] gateway_forward_request(315) fail to write local 102748e900009630, No object found
Nov 06 08:18:44 ERROR [gway 17070] dec_object_refcnt(476) failed to decrement reference 102748e900009630, No object found
Nov 06 08:18:44 ERROR [gway 17070] update_obj_refcnt(379) fail, 2
Nov 06 08:27:55 ERROR [gway 17071] gateway_forward_request(315) fail to write local 2748e9000030d0, No object found
Nov 06 08:27:55 ERROR [gway 17071] wait_forward_request(212) fail 2748e9000030d0, No object found
Nov 06 08:27:55 ERROR [gway 17070] remove_object(452) failed to remove object 2748e9000030d0, No object found
Nov 06 08:27:55 ERROR [gway 17070] peer_decref_object(1030) error No object found
Nov 06 08:27:55 ERROR [gway 17070] gateway_forward_request(315) fail to write local 102748e9000030d0, No object found
Nov 06 08:27:55 ERROR [gway 17072] dec_object_refcnt(476) failed to decrement reference 102748e9000030d0, No object found
Nov 06 08:27:55 ERROR [gway 17072] update_obj_refcnt(379) fail, 2
Nov 06 08:32:38 ERROR [io 16541] do_epoch_log_read(54) failed to open epoch 3 log, No such file or directory
Nov 06 08:32:38 ERROR [io 15571] do_epoch_log_read(54) failed to open epoch 1 log, No such file or directory
<and it keeps printing errors>

What do you think about these messages/errors?
Should I be worried?