[sheepdog-users] Stability regression with erasure coding

Liu Yuan namei.unix at gmail.com
Fri Jul 11 04:16:35 CEST 2014


On Fri, Jul 04, 2014 at 04:54:35PM +0200, Valerio Pachera wrote:
> Hi, I was testing master branch on a 4 nodes cluster.
> 
> I got severe issues formatting the cluster with -c 2:1.
> 
> I imported a vdi and ran the guest.
> qemu crashed after boot.
> Sheep.log was showing:
> 
> Jul 04 16:46:25  ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:46:26  ERROR [rw 19500] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:46:26  ERROR [rw 19500] exec_req(347) failed to read a response
> 
> I restarted the guest and it was working.
> I then tried to kill a node (obviously not the one the guest was running
> on).
> After that, I wasn't even able to log in to the guest.
> 
> Sheep.log is showing
> 
> Jul 04 16:46:55  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:55  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ERROR [rw 19500] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:46:56  ERROR [rw 19500] exec_req(347) failed to read a response
> Jul 04 16:46:56  ERROR [rw 19500] recover_replication_object(412) can not
> recover oid 87c2b260000008e
> Jul 04 16:46:56  ERROR [rw 19500] recover_object_work(576) failed to
> recover object 87c2b260000008e
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_number(100) copy number for
> fd3815 not found, set 3
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_number(100) copy number for
> fd3815 not found, set 3
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:46:56  ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25  ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25  ERROR [main] check_request_epoch(176) new node version 1,
> 9 (READ_PEER)
> Jul 04 16:47:25  ERROR [rw 19447] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:47:25  ERROR [rw 19447] exec_req(347) failed to read a response
> Jul 04 16:47:25  ERROR [rw 19447] read_erasure_object(228) can not read
> fd3815000007f3 idx 0
> Jul 04 16:47:25  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:25  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:25  ALERT [rw 19447] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ERROR [rw 19498] do_read(236) failed to read from socket:
> -1, Resource temporarily unavailable
> Jul 04 16:47:26  ERROR [rw 19498] exec_req(347) failed to read a response
> Jul 04 16:47:26  ERROR [rw 19498] read_erasure_object(228) can not read
> fd381500000502 idx 2
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:26  ALERT [rw 19498] get_vdi_copy_policy(117) copy policy for
> fd3815 not found, set 17
> Jul 04 16:47:50  ERROR [gway 23895] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23896] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23890] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23889] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23886] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23867] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23892] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> Jul 04 16:47:50  ERROR [gway 23888] wait_forward_request(438) fail
> 7c2b2500000208, Node is killed
> 
> There is an endless number of these last error messages.
> 
> Sheepdog daemon version 0.8.0_223_ge4735ba.

Might this be related to the sparsing patch in the new reclaim algorithm,
Hitoshi? I remember that when I developed EC, I found that it couldn't work
with the zeroing/sparsing functions.

Thanks
Yuan


