[sheepdog-users] [corosync] cluster has crashed
Jan Friesse
jfriesse at redhat.com
Tue May 7 12:48:48 CEST 2013
The corosync crash after the "FAILED TO RECEIVE" message is fixed by
commit 81ff0e8c94589bb7139d89e573a75473cfc5d173, which is included in corosync 1.4.5.
Please try installing that version.
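For reference, a rough sketch of how one might check the running version and
build 1.4.5 from source (the repository URL and tag name here are assumptions;
a 1.4.5 package from your distribution works just as well):

  # check the version currently installed on each node
  corosync -v

  # build and install 1.4.5 from a source checkout (assumed tag name v1.4.5)
  git clone https://github.com/corosync/corosync.git
  cd corosync && git checkout v1.4.5
  ./autogen.sh && ./configure && make && make install

Then restart corosync (and sheep) on each node.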
Regards,
Honza
Liu Yuan wrote:
> On 05/07/2013 06:15 PM, Valerio Pachera wrote:
>> Hi, my production cluster has crashed.
>> I'm trying to understand the causes.
>> 3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).
>>
>> The cluster has been working fine until today, and we copy lots of data onto
>> it each night.
>> This morning I had to expand a vdi from 600G to 1T.
>> Then I ran a backup process using this vdi; the backup was reading from and
>> writing to the same vdi.
>> The guest was running on sheepdog004.
>>
>> From the logs I can see sheepdog002 died first (8:44).
>> Rebuilding started and, later (10:38), sheepdog004 died too. The cluster
>> stopped.
>>
>> Right now I have two qemu processes on sheepdog004 that I can't kill, even
>> with kill -9.
>> Corosync and sheepdog processes are running only on sheepdog001.
>>
>> I'm going to force a reboot of sheepdog004 and reboot the other nodes normally.
>> Then I'll start sheep in this order: sheepdog001, sheepdog004, sheepdog002.
>> Any suggestion?
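>> For the restart I'm planning roughly the following on each node, in that
>> order (the store path and checks below are just placeholders for what we
>> normally use):
>>
>>   # start the sheep daemon against its usual store directory
>>   sheep /var/lib/sheepdog        # placeholder path
>>   collie node list               # confirm the node has joined
>>   collie cluster info            # check the overall cluster state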
>>
>> Here is more info:
>>
>> root at sheepdog001:~# collie vdi list
>>   Name           Id    Size    Used  Shared     Creation time  VDI id  Copies  Tag
>>   zimbra_backup   0  100 GB   99 GB  0.0 MB  2013-04-16 21:41   2e519       2
>>   systemrescue    0  350 MB  0.0 MB  0.0 MB  2013-05-07 08:44  c8be4d       2
>>   backup_data     0  1.0 TB  606 GB  0.0 MB  2013-04-16 21:45  c8d128       2
>>   crmdelta        0   50 GB  7.7 GB  0.0 MB  2013-04-16 21:32  e149bf       2
>>   backp           0   10 GB  3.8 GB  0.0 MB  2013-04-16 21:31  f313b6       2
>>
>> SHEEPDOG002
>> /var/log/messages
>> May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>> sheep.log
>> May 07 08:44:44 [main] corosync_handler(740) corosync driver received
>> EPOLLHUP event, exiting.
>>
>> /var/log/syslog
>> ...
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] FAILED TO RECEIVE
>> May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>>
>
> Looks like a Corosync issue; I have no idea what these logs mean, it is
> outside my knowledge. CC'ed the corosync devel list for help.
>
>> SHEEPDOG004
>> /var/log/syslog
>> May 7 08:35:33 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> ...
>> May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
>> May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] FAILED TO RECEIVE
>>
>> /var/log/messages
>> May 7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped
>>
>> sheep.log
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797,
>> oid:c8d12800006f80
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797,
>> oid:c8d1280000b162
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797,
>> oid:c8d1280001773b
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797,
>> oid:c8d1280000b5ce
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797,
>> oid:c8d1280000b709
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797,
>> oid:2e51900004acf
>> ...
>> May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869
>> count:181797, oid:c8d1280000b5ae
>> May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870
>> count:181797, oid:c8d128000202ff
>> May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871
>> count:181797, oid:c8d12800022fdf
>> May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
>> ...
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872
>> count:181797, oid:c8d1280000b355
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873
>> count:181797, oid:c8d1280000afa4
>> May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874
>> count:181797, oid:c8d128000114ac
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875
>> count:181797, oid:c8d128000140e9
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876
>> count:181797, oid:c8d1280001f031
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877
>> count:181797, oid:c8d12800008d92
>> ...
>> May 07 10:39:03 [main] corosync_handler(740) corosync driver received
>> EPOLLHUP event, exiting.
>>
>
> This means the corosync process was gone (killed?).
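> One way to check whether the kernel killed it or it crashed on its own
> (just a sketch; log paths depend on the distribution):
>
>   grep -i -e oom -e 'killed process' -e segfault /var/log/syslog /var/log/kern.log
>   ls -l /var/lib/corosync/   # a fdata file here would suggest corosync aborted (blackbox dump)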
>
>>
>>
>> SHEEPDOG001
>> /var/log/syslog
>> May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96
>> 97 98 99 9a 9b 9c
>> May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0
>> a1 a2 a3 a4 a5 a6
>> May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96
>> 97 98 99 9a 9b 9c
>> May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0
>> a1 a2 a3 a4 a5 a6
>> May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> failed, forming new configuration.
>> May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> joined or left the membership and a new membership was formed.
>> May 7 10:39:02 sheepdog001 corosync[2695]: [CPG ] chosen downlist:
>> sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
>> May 7 10:39:02 sheepdog001 corosync[2695]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> May 7 10:39:03 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> joined or left the membership and a new membership was formed.
>> May 7 10:39:03 sheepdog001 corosync[2695]: [CPG ] chosen downlist:
>> sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
>> May 7 10:39:03 sheepdog001 corosync[2695]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>>
>> sheep.log
>> ...
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794
>> count:181797, oid:c8d1280000117b
>> May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795
>> count:181797, oid:c8d128000055fe
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796
>> count:181797, oid:c8d12800012667
>>
>>
>
> I hope somebody from the Corosync community can take a look at this issue;
> you might need to provide more information about Corosync, such as its
> version and your host platform.
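> For example, something like:
>
>   corosync -v                      # corosync version
>   uname -a                         # kernel and architecture
>   corosync-cfgtool -s              # ring/network status on each node
>   cat /etc/corosync/corosync.conf  # totem configuration (interface, token timeouts)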
>
> Thanks,
> Yuan
>
> _______________________________________________
> discuss mailing list
> discuss at corosync.org
> http://lists.corosync.org/mailman/listinfo/discuss
>