[sheepdog-users] Stability regression with erasure coding
Liu Yuan
namei.unix at gmail.com
Fri Jul 11 04:16:35 CEST 2014
On Fri, Jul 04, 2014 at 04:54:35PM +0200, Valerio Pachera wrote:
> Hi, I was testing the master branch on a 4-node cluster.
>
> I got severe issues formatting the cluster with -c 2:1.
>
> I imported a vdi and ran the guest.
> qemu crashed after boot.
> Sheep.log was showing:
>
> Jul 04 16:46:25 ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:46:26 ERROR [rw 19500] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:46:26 ERROR [rw 19500] exec_req(347) failed to read a response
>
> I restarted the guest and it was working.
> I tried then to kill a node (obviously not the one the guest was running
> on).
> After that, I wasn't even able to login inside the guest.
>
> Sheep.log is showing
>
> Jul 04 16:46:55 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ERROR [rw 19500] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:46:56 ERROR [rw 19500] exec_req(347) failed to read a response
> Jul 04 16:46:56 ERROR [rw 19500] recover_replication_object(412) can not
> recover oid 87c2b260000008e
> Jul 04 16:46:56 ERROR [rw 19500] recover_object_work(576) failed to
> recover object 87c2b260000008e
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_number(100) copy number for
> fd3815 not found, set 3
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_number(100) copy number for
> fd3815 not found, set 3
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56 ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25 ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25 ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25 ERROR [rw 19447] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:47:25 ERROR [rw 19447] exec_req(347) failed to read a response
> Jul 04 16:47:25 ERROR [rw 19447] read_erasure_object(228) can not read
> fd3815000007f3 idx 0
> Jul 04 16:47:25 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:25 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:25 ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ERROR [rw 19498] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:47:26 ERROR [rw 19498] exec_req(347) failed to read a response
> Jul 04 16:47:26 ERROR [rw 19498] read_erasure_object(228) can not read
> fd381500000502 idx 2
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26 ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:50 ERROR [gway 23895] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23896] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23890] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23889] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23886] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23867] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23892] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50 ERROR [gway 23888] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
>
> There is an endless number of these last error messages.
>
> Sheepdog daemon version 0.8.0_223_ge4735ba.
Might this be related to the sparsing patch for the new reclaim algorithm,
Hitoshi? I remember that when I developed EC, I found that it can't work with
the zeroing/sparsing functions.
Thanks
Yuan