Hi,

my production cluster has crashed, and I'm trying to understand the causes.

3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).

The cluster had been working fine until today, and we copy lots of data onto it each night.
This morning I had to expand a vdi from 600G to 1T.
Then I ran a backup process on this vdi; the backup was reading from and writing to the same vdi.
The guest was running on sheepdog004.

From the logs I can see that sheepdog002 died first (8:44).
Rebuilding started and, later (10:38), sheepdog004 died too.
The cluster stopped.

Right now I have two qemu processes on sheepdog004 that I can't kill, even with kill -9.
Corosync and sheepdog processes are running only on sheepdog001.

I'm going to force a reboot of sheepdog004 and do a normal reboot of the other nodes.
Then I'll start sheep in this order: sheepdog001, sheepdog004, sheepdog002.
Any suggestions?

Here is more info:

root@sheepdog001:~# collie vdi list
  Name           Id  Size    Used    Shared  Creation time     VDI id  Copies  Tag
  zimbra_backup   0  100 GB  99 GB   0.0 MB  2013-04-16 21:41  2e519   2
  systemrescue    0  350 MB  0.0 MB  0.0 MB  2013-05-07 08:44  c8be4d  2
  backup_data     0  1.0 TB  606 GB  0.0 MB  2013-04-16 21:45  c8d128  2
  crmdelta        0  50 GB   7.7 GB  0.0 MB  2013-04-16 21:32  e149bf  2
  backp           0  10 GB   3.8 GB  0.0 MB  2013-04-16 21:31  f313b6  2

SHEEPDOG002

/var/log/messages
May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped

sheep.log
May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.

/var/log/syslog
...
May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] FAILED TO RECEIVE
May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped

SHEEPDOG004

/var/log/syslog
May 7 08:35:33 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
...
May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] FAILED TO RECEIVE

/var/log/messages
May 7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped

sheep.log
May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797, oid:c8d12800006f80
May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797, oid:c8d1280000b162
May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797, oid:c8d1280001773b
May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797, oid:c8d1280000b5ce
May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797, oid:c8d1280000b709
May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797, oid:2e51900004acf
...
May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869 count:181797, oid:c8d1280000b5ae
May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870 count:181797, oid:c8d128000202ff
May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871 count:181797, oid:c8d12800022fdf
May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
...
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872 count:181797, oid:c8d1280000b355
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873 count:181797, oid:c8d1280000afa4
May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874 count:181797, oid:c8d128000114ac
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875 count:181797, oid:c8d128000140e9
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876 count:181797, oid:c8d1280001f031
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:c8d12800008d92
...
May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
SHEEPDOG001

/var/log/syslog
May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 7 10:39:02 sheepdog001 corosync[2695]: [CPG ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
May 7 10:39:02 sheepdog001 corosync[2695]: [MAIN ] Completed service synchronization, ready to provide service.
May 7 10:39:03 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 7 10:39:03 sheepdog001 corosync[2695]: [CPG ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
May 7 10:39:03 sheepdog001 corosync[2695]: [MAIN ] Completed service synchronization, ready to provide service.

sheep.log
...
May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794 count:181797, oid:c8d1280000117b
May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795 count:181797, oid:c8d128000055fe
May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796 count:181797, oid:c8d12800012667