On 05/07/2013 06:15 PM, Valerio Pachera wrote:
> Hi, my production cluster has crashed.
> I'm trying to understand the causes.
> 3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).
>
> The cluster had been working fine till today, and we copy lots of data onto it each night.
> This morning I had to expand a vdi from 600G to 1T.
> Then I ran a backup process using this vdi; the backup was reading from and writing to the same vdi.
> The guest was running on sheepdog004.
>
> From the logs I can see that sheepdog002 died first (8:44).
> Rebuilding started and, later (10:38), sheepdog004 died too. The cluster stopped.
>
> Right now I have two qemu processes on sheepdog004 that I can't kill -9.
> Corosync and sheepdog processes are running only on sheepdog001.
>
> I'm going to force a reboot on sheepdog004 and reboot the other nodes normally.
> Then I'll run sheep in this order: sheepdog001, sheepdog004, sheepdog002.
> Any suggestions?
>
> Here is more info:
>
> root@sheepdog001:~# collie vdi list
>   Name           Id    Size    Used    Shared   Creation time     VDI id   Copies  Tag
>   zimbra_backup   0   100 GB   99 GB   0.0 MB   2013-04-16 21:41  2e519    2
>   systemrescue    0   350 MB   0.0 MB  0.0 MB   2013-05-07 08:44  c8be4d   2
>   backup_data     0   1.0 TB   606 GB  0.0 MB   2013-04-16 21:45  c8d128   2
>   crmdelta        0    50 GB   7.7 GB  0.0 MB   2013-04-16 21:32  e149bf   2
>   backp           0    10 GB   3.8 GB  0.0 MB   2013-04-16 21:31  f313b6   2
>
> SHEEPDOG002
> /var/log/messages
> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>
> sheep.log
> May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
>
> /var/log/syslog
> ...
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] FAILED TO RECEIVE
> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped

This looks like a Corosync issue, but I have no idea what these logs mean; it is outside my knowledge. I've CC'ed the corosync devel list for help.
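Not knowing corosync well, the only generic suggestion I can offer is to capture some cluster state on each node for the corosync folks to look at; this is just a rough sketch, assuming the usual corosync 1.x command-line tools are installed on your nodes:

  corosync-cfgtool -s    # ring status as seen by the local corosync daemon
  corosync-objctl        # dump the runtime object database (configuration and totem runtime state)
  dmesg | tail -n 100    # recent kernel messages, to spot NIC/link errors around the failure time

The corosync people will know better which of these is actually useful.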
> SHEEPDOG004
> /var/log/syslog
> May  7 08:35:33 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> ...
> May  7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] FAILED TO RECEIVE
>
> /var/log/messages
> May  7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped
>
> sheep.log
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797, oid:c8d12800006f80
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797, oid:c8d1280000b162
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797, oid:c8d1280001773b
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797, oid:c8d1280000b5ce
> May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797, oid:c8d1280000b709
> May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797, oid:2e51900004acf
> ...
> May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869 count:181797, oid:c8d1280000b5ae
> May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870 count:181797, oid:c8d128000202ff
> May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871 count:181797, oid:c8d12800022fdf
> May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
> ...
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872 count:181797, oid:c8d1280000b355
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873 count:181797, oid:c8d1280000afa4
> May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874 count:181797, oid:c8d128000114ac
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875 count:181797, oid:c8d128000140e9
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876 count:181797, oid:c8d1280001f031
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:c8d12800008d92
> ...
> May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.

This means the corosync process on sheepdog004 was gone (killed?).
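If you want to confirm whether corosync exited on its own or was killed, one rough check (assuming a standard syslog/dmesg setup; adjust the log path for your distribution) is to look for OOM-killer or crash messages around 10:39 on sheepdog004:

  grep -i -E 'oom|killed process|segfault' /var/log/syslog   # did the kernel kill or crash a process at that time?
  dmesg | grep -i corosync                                   # any kernel messages mentioning corosync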
> SHEEPDOG001
> /var/log/syslog
> May  7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
> May  7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
> May  7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
> May  7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May  7 10:39:02 sheepdog001 corosync[2695]: [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
> May  7 10:39:02 sheepdog001 corosync[2695]: [MAIN  ] Completed service synchronization, ready to provide service.
> May  7 10:39:03 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May  7 10:39:03 sheepdog001 corosync[2695]: [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
> May  7 10:39:03 sheepdog001 corosync[2695]: [MAIN  ] Completed service synchronization, ready to provide service.
>
> sheep.log
> ...
> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794 count:181797, oid:c8d1280000117b
> May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795 count:181797, oid:c8d128000055fe
> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796 count:181797, oid:c8d12800012667

I hope someone from the Corosync community can take a look at this issue; they will probably need more information about Corosync, such as its version and your host platform.

Thanks,
Yuan
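P.S. A rough sketch of how that information could be collected (the exact commands depend on your distribution):

  corosync -v                       # corosync version
  uname -a                          # kernel and architecture
  lsb_release -a                    # distribution details, if the tool is installed
  cat /etc/corosync/corosync.conf   # totem/network configuration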