On 05/07/2013 06:15 PM, Valerio Pachera wrote:
> Hi, my production cluster has crashed.
> I'm trying to understand the causes.
> 3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).
>
> The cluster had been working fine till today, and we copy lots of data onto it each night.
> This morning I had to expand a vdi from 600G to 1T.
> Then I ran a backup process using this vdi; the backup was reading from and writing to the same vdi.
> The guest was running on sheepdog004.
>
> From the logs I can see that sheepdog002 died first (8:44).
> Rebuilding started and, later (10:38), sheepdog004 died too. The cluster stopped.
>
> Right now I have two qemu processes on sheepdog004 that I can't kill -9.
> Corosync and sheepdog processes are running only on sheepdog001.
>
> I'm going to force a reboot on sheepdog004 and reboot the other nodes normally.
> Then I'll run sheep in this order: sheepdog001, sheepdog004, sheepdog002.
> Any suggestions?
>
> Here is more info:
>
> root@sheepdog001:~# collie vdi list
>   Name           Id    Size    Used    Shared   Creation time     VDI id   Copies  Tag
>   zimbra_backup   0   100 GB   99 GB   0.0 MB   2013-04-16 21:41  2e519    2
>   systemrescue    0   350 MB   0.0 MB  0.0 MB   2013-05-07 08:44  c8be4d   2
>   backup_data     0   1.0 TB   606 GB  0.0 MB   2013-04-16 21:45  c8d128   2
>   crmdelta        0    50 GB   7.7 GB  0.0 MB   2013-04-16 21:32  e149bf   2
>   backp           0    10 GB   3.8 GB  0.0 MB   2013-04-16 21:31  f313b6   2
>
> SHEEPDOG002
> /var/log/messages
> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>
> sheep.log
> May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
>
> /var/log/syslog
> ...
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:44:40 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:44:41 sheepdog002 corosync[2777]: [TOTEM ] FAILED TO RECEIVE
> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped

This looks like a Corosync issue, but I have no idea what these logs mean; it is outside my knowledge. I've CC'ed the corosync devel list for help.
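Not knowing corosync well, the only generic suggestion I can offer is to capture some cluster state on each node for the corosync folks to look at; this is just a rough sketch, assuming the usual corosync 1.x command-line tools are installed on your nodes:

  corosync-cfgtool -s    # ring status as seen by the local corosync daemon
  corosync-objctl        # dump the runtime object database (configuration and totem runtime state)
  dmesg | tail -n 100    # recent kernel messages, to spot NIC/link errors around the failure time

The corosync people will know better which of these is actually useful.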
> SHEEPDOG004
> /var/log/syslog
> May  7 08:35:33 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> May  7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
> May  7 08:35:34 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
> ...
> May  7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:38:59 sheepdog004 corosync[5314]: [TOTEM ] FAILED TO RECEIVE
>
> /var/log/messages
> May  7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped
>
> sheep.log
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797, oid:c8d12800006f80
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797, oid:c8d1280000b162
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797, oid:c8d1280001773b
> May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797, oid:c8d1280000b5ce
> May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797, oid:c8d1280000b709
> May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797, oid:2e51900004acf
> ...
> May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869 count:181797, oid:c8d1280000b5ae
> May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870 count:181797, oid:c8d128000202ff
> May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871 count:181797, oid:c8d12800022fdf
> May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
> May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
> ...
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872 count:181797, oid:c8d1280000b355
> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873 count:181797, oid:c8d1280000afa4
> May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874 count:181797, oid:c8d128000114ac
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875 count:181797, oid:c8d128000140e9
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876 count:181797, oid:c8d1280001f031
> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:c8d12800008d92
> ...
> May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.

This means the corosync process on sheepdog004 was gone (killed?).
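If you want to confirm whether corosync exited on its own or was killed, one rough check (assuming a standard syslog/dmesg setup; adjust the log path for your distribution) is to look for OOM-killer or crash messages around 10:39 on sheepdog004:

  grep -i -E 'oom|killed process|segfault' /var/log/syslog   # did the kernel kill or crash a process at that time?
  dmesg | grep -i corosync                                   # any kernel messages mentioning corosync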
> SHEEPDOG001
> /var/log/syslog
> May  7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
> May  7 10:38:58 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
> May  7 10:38:59 sheepdog001 corosync[2695]: [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
> May  7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor failed, forming new configuration.
> May  7 10:39:02 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May  7 10:39:02 sheepdog001 corosync[2695]: [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
> May  7 10:39:02 sheepdog001 corosync[2695]: [MAIN  ] Completed service synchronization, ready to provide service.
> May  7 10:39:03 sheepdog001 corosync[2695]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May  7 10:39:03 sheepdog001 corosync[2695]: [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
> May  7 10:39:03 sheepdog001 corosync[2695]: [MAIN  ] Completed service synchronization, ready to provide service.
>
> sheep.log
> ...
> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794 count:181797, oid:c8d1280000117b
> May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795 count:181797, oid:c8d128000055fe
> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796 count:181797, oid:c8d12800012667

I hope someone from the Corosync community can take a look at this issue; they will probably need more information about Corosync, such as its version and your host platform.

Thanks,
Yuan
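P.S. A rough sketch of how that information could be collected (the exact commands depend on your distribution):

  corosync -v                       # corosync version
  uname -a                          # kernel and architecture
  lsb_release -a                    # distribution details, if the tool is installed
  cat /etc/corosync/corosync.conf   # totem/network configuration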