[sheepdog-users] cluster has crashed

Valerio Pachera sirio81 at gmail.com
Tue May 7 12:15:14 CEST 2013


Hi, my production cluster has crashed.
I'm trying to understand the causes.
3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).

The cluster has been working fine until today, and we copy lots of data onto it
each night.
This morning I had to expand a vdi from 600G to 1T.
Then I ran a backup process that used this vdi.
The backup was reading from and writing to the same vdi.
The guest was running on sheepdog004.

From the logs I can see sheepdog002 died first (8:44).
Rebuilding started and, later (10:38), sheepdog004 died too. The cluster
stopped.

Right now I have two qemu processes on sheepdog004 that I can't kill, even with kill -9.
Corosync and sheepdog processes are running only on sheepdog001.

I'm going to force a reboot of sheepdog004 and do a normal reboot of the other nodes.
Then I'll start sheep in this order: sheepdog001, sheepdog004, sheepdog002.
Any suggestions?

Here is more info:

root@sheepdog001:~# collie vdi list
  Name           Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
  zimbra_backup   0  100 GB   99 GB  0.0 MB 2013-04-16 21:41    2e519       2
  systemrescue    0  350 MB  0.0 MB  0.0 MB 2013-05-07 08:44   c8be4d       2
  backup_data     0  1.0 TB  606 GB  0.0 MB 2013-04-16 21:45   c8d128       2
  crmdelta        0   50 GB  7.7 GB  0.0 MB 2013-04-16 21:32   e149bf       2
  backp           0   10 GB  3.8 GB  0.0 MB 2013-04-16 21:31   f313b6       2
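
As a rough sanity check on how much raw space these VDIs need, the Used column can be multiplied by the copy count and summed. This is only a sketch: it assumes that Used is reported per copy and that the units are binary multiples, neither of which I have verified. The helper names (to_gb, raw_usage_gb) are made up for this example and are not part of collie.

```python
import re

# Rows as (name, used, copies), copied from the `collie vdi list` output above.
VDIS = [
    ("zimbra_backup", "99 GB", 2),
    ("systemrescue", "0.0 MB", 2),
    ("backup_data", "606 GB", 2),
    ("crmdelta", "7.7 GB", 2),
    ("backp", "3.8 GB", 2),
]

# Assumption: sizes use binary multiples (1 TB = 1024 GB, 1 GB = 1024 MB).
UNIT_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def to_gb(text):
    """Convert a size string such as '7.7 GB' to a float number of GB."""
    m = re.fullmatch(r"([\d.]+)\s*(MB|GB|TB)", text.strip())
    if m is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(m.group(1)) * UNIT_GB[m.group(2)]


def raw_usage_gb(vdis):
    """Sum used space times copy count over all VDIs (assumes Used is per copy)."""
    return sum(to_gb(used) * copies for _, used, copies in vdis)


if __name__ == "__main__":
    print(f"approx. raw usage: {raw_usage_gb(VDIS):.1f} GB")
```

The result can then be compared by hand against the disk sizes of the nodes that are still alive.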

SHEEPDOG002
/var/log/messages
May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped

sheep.log
May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.

/var/log/syslog
...
May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] FAILED TO RECEIVE
May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped


SHEEPDOG004
/var/log/syslog
May  7 08:35:33 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
May  7 08:35:34 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
May  7 08:35:34 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
...
May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] FAILED TO RECEIVE

/var/log/messages
May  7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped

sheep.log
May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797, oid:c8d12800006f80
May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797, oid:c8d1280000b162
May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797, oid:c8d1280001773b
May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797, oid:c8d1280000b5ce
May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797, oid:c8d1280000b709
May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797, oid:2e51900004acf
...
May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869 count:181797, oid:c8d1280000b5ae
May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870 count:181797, oid:c8d128000202ff
May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871 count:181797, oid:c8d12800022fdf
May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
...
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872 count:181797, oid:c8d1280000b355
May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873 count:181797, oid:c8d1280000afa4
May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874 count:181797, oid:c8d128000114ac
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875 count:181797, oid:c8d128000140e9
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876 count:181797, oid:c8d1280001f031
May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:c8d12800008d92
...
May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
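
The recover_object_work entries above carry a done/count pair, so recovery progress and a rough recovery rate can be estimated from sheep.log. Below is a minimal Python sketch, assuming the log format is exactly as in the excerpt and that all entries fall in the same year. The function names (recovery_progress, summarize) are hypothetical helpers for this example, not part of sheepdog.

```python
import re
from datetime import datetime

# Matches entries such as:
#   May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:...
# \s+ between the fields tolerates entries that were wrapped onto two lines.
ENTRY = re.compile(
    r"(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[rw \d+\] recover_object_work\(\d+\)\s+"
    r"done:(\d+)\s+count:(\d+)"
)


def recovery_progress(log_text, year=2013):
    """Return (timestamp, done, count) tuples in the order they appear.

    The log has no year field, so `year` is an assumption supplied by the caller.
    Month names are parsed with %b, which expects an English/C locale.
    """
    points = []
    for stamp, done, count in ENTRY.findall(log_text):
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        points.append((when, int(done), int(count)))
    return points


def summarize(points):
    """Return (fraction recovered, objects per second) from first to last entry."""
    if not points:
        raise ValueError("no recover_object_work entries found")
    t0, d0, _ = points[0]
    t1, d1, total = points[-1]
    elapsed = (t1 - t0).total_seconds()
    rate = (d1 - d0) / elapsed if elapsed > 0 else 0.0
    return d1 / total, rate


if __name__ == "__main__":
    with open("sheep.log") as fh:  # path is an assumption
        fraction, rate = summarize(recovery_progress(fh.read()))
    print(f"{fraction:.1%} recovered, about {rate:.2f} objects/s")
```

Only the first and last matching entries are used, so the rate is an average over the whole excerpt and hides any stalls in between.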



SHEEPDOG001
/var/log/syslog
May  7 10:38:58 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
May  7 10:38:58 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May  7 10:38:59 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
May  7 10:38:59 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
May  7 10:39:02 sheepdog001 corosync[2695]:   [TOTEM ] A processor failed, forming new configuration.
May  7 10:39:02 sheepdog001 corosync[2695]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May  7 10:39:02 sheepdog001 corosync[2695]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
May  7 10:39:02 sheepdog001 corosync[2695]:   [MAIN  ] Completed service synchronization, ready to provide service.
May  7 10:39:03 sheepdog001 corosync[2695]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May  7 10:39:03 sheepdog001 corosync[2695]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
May  7 10:39:03 sheepdog001 corosync[2695]:   [MAIN  ] Completed service synchronization, ready to provide service.

sheep.log
...
May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794 count:181797, oid:c8d1280000117b
May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795 count:181797, oid:c8d128000055fe
May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796 count:181797, oid:c8d12800012667
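
To line up what happened on the three nodes, the key events from the excerpts above can be merged into a single timeline. This is a sketch only: it assumes the node clocks are reasonably in sync and that every line belongs to 2013, and the helper names (parse, timeline) are made up for the example.

```python
import re
from datetime import datetime

# Key events copied from the log excerpts above as (node, log line) pairs.
EVENTS = [
    ("sheepdog002", "May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] FAILED TO RECEIVE"),
    ("sheepdog002", "May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting."),
    ("sheepdog002", "May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped"),
    ("sheepdog004", "May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] FAILED TO RECEIVE"),
    ("sheepdog004", "May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting."),
    ("sheepdog004", "May  7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped"),
    ("sheepdog001", "May  7 10:39:02 sheepdog001 corosync[2695]:   [TOTEM ] A processor failed, forming new configuration."),
]

# Accepts both the syslog stamp ("May  7 08:44:41") and the sheep.log stamp ("May 07 08:44:44").
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2})\s+(.*)$")


def parse(line, year=2013):
    """Split a log line into (datetime, rest of line); the year is an assumption."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError(f"no timestamp in: {line!r}")
    month, day, clock, rest = m.groups()
    when = datetime.strptime(f"{year} {month} {int(day):02d} {clock}", "%Y %b %d %H:%M:%S")
    return when, rest


def timeline(events):
    """Return (datetime, node, message) tuples sorted by time."""
    rows = []
    for node, line in events:
        when, rest = parse(line)
        rows.append((when, node, rest))
    return sorted(rows)


if __name__ == "__main__":
    for when, node, message in timeline(EVENTS):
        print(when.strftime("%H:%M:%S"), node, message)
```

On both sheepdog002 and sheepdog004 the corosync "FAILED TO RECEIVE" message comes a few seconds before sheep exits on EPOLLHUP.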