[sheepdog-users] VM stops IO if node fails
Fabian Zimmermann
dev.faz at gmail.com
Thu May 22 20:15:34 CEST 2014
Hello,
I just installed sheepdog (Ubuntu package 0.8.1-3) on 3 nodes. Everything is
working as expected, but as soon as I remove or hard-reset one node, my VM
stops processing I/O.
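For reference, this is roughly how I look at the cluster from a surviving node
after the failure (a minimal sketch; the exact output may look different in
0.8.1, so I leave it out):
--
# run on a surviving node, e.g. node1, after hard-resetting 192.168.20.23
dog node list      # the failed node should no longer appear in the member list
dog cluster info   # cluster status plus the epoch / node-change history
--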
I checked the logs: corosync detects the lost connection, and sheepdog picks
up the membership change as well.
--
May 22 20:11:45 node1 corosync[7147]: [MAIN ] Member left: r(0) ip(192.168.20.23)
May 22 20:11:45 node1 corosync[7147]: [TOTEM ] waiting_trans_ack changed to 1
May 22 20:11:45 node1 corosync[7147]: [TOTEM ] entering OPERATIONAL state.
May 22 20:11:45 node1 corosync[7147]: [TOTEM ] A new membership (192.168.20.21:120) was formed. Members left: -1062726633
May 22 20:11:45 node1 corosync[7147]: [SYNC ] Committing synchronization for corosync configuration map access
May 22 20:11:45 node1 corosync[7147]: [QB ] Not first sync -> no action
May 22 20:11:45 node1 corosync[7147]: [CPG ] got joinlist message from node c0a81416
May 22 20:11:45 node1 corosync[7147]: [CPG ] comparing: sender r(0) ip(192.168.20.21) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]: [CPG ] comparing: sender r(0) ip(192.168.20.22) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]: [CPG ] chosen downlist: sender r(0) ip(192.168.20.21) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]: [CPG ] left_list_entries:1
May 22 20:11:45 node1 corosync[7147]: [CPG ] left_list[0] group:sheepdog, ip:r(0) ip(192.168.20.23) , pid:2338
May 22 20:11:45 node1 corosync[7147]: [CPG ] got joinlist message from node c0a81415
May 22 20:11:45 node1 corosync[7147]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
May 22 20:11:45 node1 corosync[7147]: [CPG ] joinlist_messages[0] group:sheepdog, ip:r(0) ip(192.168.20.21) , pid:8158
May 22 20:11:45 node1 corosync[7147]: [CPG ] joinlist_messages[1] group:sheepdog, ip:r(0) ip(192.168.20.22) , pid:18633
May 22 20:11:45 node1 corosync[7147]: [MAIN ] Completed service synchronization, ready to provide service.
May 22 20:11:45 node1 corosync[7147]: [TOTEM ] waiting_trans_ack changed to 0
May 22 20:11:46 node1 sheep[8159]: DEBUG [main] cdrv_cpg_confchg(555) mem:2, joined:0, left:1
May 22 20:11:46 node1 sheep[8159]: DEBUG [main] __corosync_dispatch(373) wait for a next dispatch event
May 22 20:11:49 node1 sheep[8159]: WARN [gway 8693] wait_forward_request(389) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again
May 22 20:12:14 node1 sheep[8159]: message repeated 5 times: [ WARN [gway 8693] wait_forward_request(389) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again]
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8693] sockfd_cache_close(388) 192.168.20.23:7000 idx 0
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8693] sockfd_cache_del_node(475) 192.168.20.23:7000, count 2
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8693] do_process_work(1435) failed: 3, 18f71800000827 , 22, Network error between sheep
May 22 20:12:19 node1 sheep[8159]: DEBUG [main] gateway_op_done(113) retrying failed I/O request op WRITE_OBJ result 86 epoch 22, sys epoch 22
May 22 20:12:19 node1 sheep[8159]: DEBUG [main] queue_request(454) WRITE_OBJ, 1
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8576] do_process_work(1428) 3, 18f71800000827, 22
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8576] gateway_forward_request(480) 18f71800000827
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8576] prepare_replication_requests(38) 18f71800000827
May 22 20:12:19 node1 sheep[8159]: DEBUG [gway 8576] sockfd_cache_grab(120) failed node 192.168.20.23:7000
May 22 20:12:24 node1 sheep[8159]: ERROR [gway 8576] connect_to(193) failed to connect to 192.168.20.23:7000: Operation now in progress
May 22 20:12:24 node1 sheep[8159]: DEBUG [gway 8576] connect_to(209) -1, 192.168.20.23:7000
May 22 20:12:29 node1 sheep[8159]: ERROR [gway 8576] connect_to(193) failed to connect to 192.168.20.23:7000: Operation now in progress
May 22 20:12:29 node1 sheep[8159]: DEBUG [gway 8576] connect_to(209) -1, 192.168.20.23:7000
May 22 20:12:29 node1 sheep[8159]: DEBUG [gway 8576] gateway_forward_request(539) nr_sent 0, err 86
May 22 20:12:29 node1 sheep[8159]: DEBUG [gway 8576] do_process_work(1435) failed: 3, 18f71800000827 , 22, Network error between sheep
May 22 20:12:29 node1 sheep[8159]: DEBUG [main] gateway_op_done(113) retrying failed I/O request op WRITE_OBJ result 86 epoch 22, sys epoch 22
May 22 20:12:29 node1 sheep[8159]: DEBUG [main] queue_request(454) WRITE_OBJ, 1
...
--
but I have to execute
"dog cluster recover enable"
to get my VM to do I/O again.
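For completeness, this is the full workaround as I run it on one of the
surviving nodes (the commands after the recover one are only sanity checks;
output omitted):
--
# manually allow recovery to start after the node loss
dog cluster recover enable
# verify the remaining members and the cluster status afterwards
dog node list
dog cluster info
--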
If I then execute "dog cluster check", this is printed:
--
fixed missing 18f71800000022
 91.7 % [========================================================================================>] 18 GB / 20 GB
fixed missing 18f7180000002e
 91.7 % [========================================================================================>] 18 GB / 20 GB
fixed missing 18f7180000002b
...
--
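In case it matters, the same check can also be limited to a single image; the
VDI name below is only a placeholder:
--
# check (and repair) just one image instead of the whole cluster
dog vdi check <vdiname>
--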
Any hints on how to solve this?
Thanks a lot,
Fabian