[sheepdog-users] VM stops IO if node fails

Fabian Zimmermann dev.faz at gmail.com
Thu May 22 20:15:34 CEST 2014


Hello,

I just installed sheepdog (Ubuntu package 0.8.1-3) on 3 nodes. Everything
works as expected, but as soon as I remove or hard-reset one node, my VM
stops processing I/O.
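
For reference, the cluster was brought up roughly like this; the copies
value and the exact commands are from memory, so treat this as a sketch
rather than a literal transcript:

--
# format the freshly started cluster with 3 copies per object
dog cluster format -c 3

# confirm that all 3 sheep daemons joined and how the cluster was formatted
dog node list
dog cluster info
--

With 3 copies on 3 nodes I would have expected the two remaining nodes to
keep serving I/O after one node dies.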

I checked the logs: corosync detects the lost connection, and sheepdog
also notices the membership change.

--
May 22 20:11:45 node1 corosync[7147]:   [MAIN  ] Member left: r(0)
ip(192.168.20.23)
May 22 20:11:45 node1 corosync[7147]:   [TOTEM ] waiting_trans_ack
changed to 1
May 22 20:11:45 node1 corosync[7147]:   [TOTEM ] entering OPERATIONAL state.
May 22 20:11:45 node1 corosync[7147]:   [TOTEM ] A new membership
(192.168.20.21:120) was formed. Members left: -1062726633
May 22 20:11:45 node1 corosync[7147]:   [SYNC  ] Committing
synchronization for corosync configuration map access
May 22 20:11:45 node1 corosync[7147]:   [QB    ] Not first sync -> no action
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] got joinlist message
from node c0a81416
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] comparing: sender r(0)
ip(192.168.20.21) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] comparing: sender r(0)
ip(192.168.20.22) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.20.21) ; members(old:3 left:1)
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] left_list_entries:1
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] left_list[0]
group:sheepdog, ip:r(0) ip(192.168.20.23) , pid:2338
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] got joinlist message
from node c0a81415
May 22 20:11:45 node1 corosync[7147]:   [SYNC  ] Committing
synchronization for corosync cluster closed process group service v1.01
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] joinlist_messages[0]
group:sheepdog, ip:r(0) ip(192.168.20.21) , pid:8158
May 22 20:11:45 node1 corosync[7147]:   [CPG   ] joinlist_messages[1]
group:sheepdog, ip:r(0) ip(192.168.20.22) , pid:18633
May 22 20:11:45 node1 corosync[7147]:   [MAIN  ] Completed service
synchronization, ready to provide service.
May 22 20:11:45 node1 corosync[7147]:   [TOTEM ] waiting_trans_ack
changed to 0
May 22 20:11:46 node1 sheep[8159]:  DEBUG [main] cdrv_cpg_confchg(555)
mem:2, joined:0, left:1
May 22 20:11:46 node1 sheep[8159]:  DEBUG [main]
__corosync_dispatch(373) wait for a next dispatch event
May 22 20:11:49 node1 sheep[8159]:   WARN [gway 8693]
wait_forward_request(389) poll timeout 1, disks of some nodes or network
is busy. Going to poll-wait again
May 22 20:12:14 node1 sheep[8159]: message repeated 5 times: [   WARN
[gway 8693] wait_forward_request(389) poll timeout 1, disks of some
nodes or network is busy. Going to poll-wait again]
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8693]
sockfd_cache_close(388) 192.168.20.23:7000 idx 0
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8693]
sockfd_cache_del_node(475) 192.168.20.23:7000, count 2
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8693]
do_process_work(1435) failed: 3, 18f71800000827 , 22, Network error
between sheep
May 22 20:12:19 node1 sheep[8159]:  DEBUG [main] gateway_op_done(113)
retrying failed I/O request op WRITE_OBJ result 86 epoch 22, sys epoch 22
May 22 20:12:19 node1 sheep[8159]:  DEBUG [main] queue_request(454)
WRITE_OBJ, 1
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8576]
do_process_work(1428) 3, 18f71800000827, 22
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8576]
gateway_forward_request(480) 18f71800000827
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8576]
prepare_replication_requests(38) 18f71800000827
May 22 20:12:19 node1 sheep[8159]:  DEBUG [gway 8576]
sockfd_cache_grab(120) failed node 192.168.20.23:7000
May 22 20:12:24 node1 sheep[8159]:  ERROR [gway 8576] connect_to(193)
failed to connect to 192.168.20.23:7000: Operation now in progress
May 22 20:12:24 node1 sheep[8159]:  DEBUG [gway 8576] connect_to(209)
-1, 192.168.20.23:7000
May 22 20:12:29 node1 sheep[8159]:  ERROR [gway 8576] connect_to(193)
failed to connect to 192.168.20.23:7000: Operation now in progress
May 22 20:12:29 node1 sheep[8159]:  DEBUG [gway 8576] connect_to(209)
-1, 192.168.20.23:7000
May 22 20:12:29 node1 sheep[8159]:  DEBUG [gway 8576]
gateway_forward_request(539) nr_sent 0, err 86
May 22 20:12:29 node1 sheep[8159]:  DEBUG [gway 8576]
do_process_work(1435) failed: 3, 18f71800000827 , 22, Network error
between sheep
May 22 20:12:29 node1 sheep[8159]:  DEBUG [main] gateway_op_done(113)
retrying failed I/O request op WRITE_OBJ result 86 epoch 22, sys epoch 22
May 22 20:12:29 node1 sheep[8159]:  DEBUG [main] queue_request(454)
WRITE_OBJ, 1
...
--

But I have to execute

"dog cluster recover enable"

to get my VM to do I/O again.
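
For what it's worth, these are the recovery-mode commands I am aware of;
I assume "disable" is the counterpart of the "enable" subcommand above, so
take this as a sketch:

--
# switch automatic recovery off: after a node leaves, the cluster waits
# for manual intervention instead of re-replicating objects on its own
dog cluster recover disable

# switch automatic recovery back on; this is what made the VM resume I/O here
dog cluster recover enable
--

So it looks as if automatic recovery was not active on my cluster until I
enabled it by hand.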

If I execute "dog cluster check", this is printed:

--
fixed missing 18f71800000022
fixed missing 18f7180000002e
fixed missing 18f7180000002b
(progress bar interleaved with the output: 91.7 %, 18 GB / 20 GB)
...
--
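
For completeness, I believe the same kind of check also exists per image;
"vm-disk" below is just a placeholder for the VDI name:

--
# list the images in the cluster
dog vdi list

# check (and repair) the object replicas of a single image
dog vdi check vm-disk
--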

Any hints on how to solve this?


Thanks a lot,

Fabian


