[sheepdog-users] [corosync] cluster has crashed
Jan Friesse
jfriesse at redhat.com
Tue May 7 12:48:48 CEST 2013
The corosync crash after the "FAILED TO RECEIVE" message is fixed by
commit 81ff0e8c94589bb7139d89e573a75473cfc5d173, which is included in corosync 1.4.5.
Please try installing that version.
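For reference, a rough sketch of how one might check the running version and
build 1.4.5 from source (the repository URL and tag name here are assumptions;
a 1.4.5 package from your distribution works just as well):

  # check the version currently installed on each node
  corosync -v

  # build and install 1.4.5 from a source checkout (assumed tag name v1.4.5)
  git clone https://github.com/corosync/corosync.git
  cd corosync && git checkout v1.4.5
  ./autogen.sh && ./configure && make && make install

Then restart corosync (and sheep) on each node.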
Regards,
Honza
Liu Yuan wrote:
> On 05/07/2013 06:15 PM, Valerio Pachera wrote:
>> Hi, my production cluster has crashed.
>> I'm trying to understand the causes.
>> 3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).
>>
>> The cluster has been working fine until today, and we copy lots of data onto
>> it each night.
>> This morning I had to expand a vdi from 600G to 1T.
>> Then I ran a backup process using this vdi; the backup was reading from and
>> writing to the same vdi.
>> The guest was running on sheepdog004.
>>
>> From the logs I can see sheepdog002 died first (8:44).
>> Rebuilding started and, later (10:38), sheepdog004 died too. The cluster
>> stopped.
>>
>> Right now I have two qemu processes on sheepdog004 that I can't kill, even
>> with kill -9.
>> Corosync and sheepdog processes are running only on sheepdog001.
>>
>> I'm going to force a reboot of sheepdog004 and reboot the other nodes normally.
>> Then I'll start sheep in this order: sheepdog001, sheepdog004, sheepdog002.
>> Any suggestion?
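>> For the restart I'm planning roughly the following on each node, in that
>> order (the store path and checks below are just placeholders for what we
>> normally use):
>>
>>   # start the sheep daemon against its usual store directory
>>   sheep /var/lib/sheepdog        # placeholder path
>>   collie node list               # confirm the node has joined
>>   collie cluster info            # check the overall cluster state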
>>
>> Here is more info:
>>
>> root at sheepdog001:~# collie vdi list
>>   Name           Id    Size    Used  Shared     Creation time  VDI id  Copies  Tag
>>   zimbra_backup   0  100 GB   99 GB  0.0 MB  2013-04-16 21:41   2e519       2
>>   systemrescue    0  350 MB  0.0 MB  0.0 MB  2013-05-07 08:44  c8be4d       2
>>   backup_data     0  1.0 TB  606 GB  0.0 MB  2013-04-16 21:45  c8d128       2
>>   crmdelta        0   50 GB  7.7 GB  0.0 MB  2013-04-16 21:32  e149bf       2
>>   backp           0   10 GB  3.8 GB  0.0 MB  2013-04-16 21:31  f313b6       2
>>
>> SHEEPDOG002
>> /var/log/messages
>> May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>> sheep.log
>> May 07 08:44:44 [main] corosync_handler(740) corosync driver received
>> EPOLLHUP event, exiting.
>>
>> /var/log/syslog
>> ...
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] FAILED TO RECEIVE
>> May 7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>>
>
> Looks like a Corosync issue; I have no idea what these logs mean, it is
> outside my knowledge. CC'ed the corosync devel list for help.
>
>> SHEEPDOG004
>> /var/log/syslog
>> May 7 08:35:33 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May 7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8
>> 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> ...
>> May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
>> May 7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] FAILED TO RECEIVE
>>
>> /var/log/messages
>> May 7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped
>>
>> sheep.log
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797,
>> oid:c8d12800006f80
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797,
>> oid:c8d1280000b162
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797,
>> oid:c8d1280001773b
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797,
>> oid:c8d1280000b5ce
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797,
>> oid:c8d1280000b709
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797,
>> oid:2e51900004acf
>> ...
>> May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869
>> count:181797, oid:c8d1280000b5ae
>> May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870
>> count:181797, oid:c8d128000202ff
>> May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871
>> count:181797, oid:c8d12800022fdf
>> May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
>> ...
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872
>> count:181797, oid:c8d1280000b355
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873
>> count:181797, oid:c8d1280000afa4
>> May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874
>> count:181797, oid:c8d128000114ac
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875
>> count:181797, oid:c8d128000140e9
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876
>> count:181797, oid:c8d1280001f031
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877
>> count:181797, oid:c8d12800008d92
>> ...
>> May 07 10:39:03 [main] corosync_handler(740) corosync driver received
>> EPOLLHUP event, exiting.
>>
>
> This means the corosync process was gone (killed?).
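> One way to check whether the kernel killed it or it crashed on its own
> (just a sketch; log paths depend on the distribution):
>
>   grep -i -e oom -e 'killed process' -e segfault /var/log/syslog /var/log/kern.log
>   ls -l /var/lib/corosync/   # a fdata file here would suggest corosync aborted (blackbox dump)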
>
>>
>>
>> SHEEPDOG001
>> /var/log/syslog
>> May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96
>> 97 98 99 9a 9b 9c
>> May 7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0
>> a1 a2 a3 a4 a5 a6
>> May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96
>> 97 98 99 9a 9b 9c
>> May 7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List:
>> 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0
>> a1 a2 a3 a4 a5 a6
>> May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> failed, forming new configuration.
>> May 7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> joined or left the membership and a new membership was formed.
>> May 7 10:39:02 sheepdog001 corosync[2695]: [CPG ] chosen downlist:
>> sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
>> May 7 10:39:02 sheepdog001 corosync[2695]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> May 7 10:39:03 sheepdog001 corosync[2695]: [TOTEM ] A processor
>> joined or left the membership and a new membership was formed.
>> May 7 10:39:03 sheepdog001 corosync[2695]: [CPG ] chosen downlist:
>> sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
>> May 7 10:39:03 sheepdog001 corosync[2695]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>>
>> sheep.log
>> ...
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794
>> count:181797, oid:c8d1280000117b
>> May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795
>> count:181797, oid:c8d128000055fe
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796
>> count:181797, oid:c8d12800012667
>>
>>
>
> I hope somebody from the Corosync community can take a look at this issue;
> you might need to provide more information about Corosync, such as its
> version and your host platform.
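> For example, something like:
>
>   corosync -v                      # corosync version
>   uname -a                         # kernel and architecture
>   corosync-cfgtool -s              # ring/network status on each node
>   cat /etc/corosync/corosync.conf  # totem configuration (interface, token timeouts)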
>
> Thanks,
> Yuan
>
> _______________________________________________
> discuss mailing list
> discuss at corosync.org
> http://lists.corosync.org/mailman/listinfo/discuss
>