[Sheepdog] Cluster appears down but nodes report different epochs

Shawn Moore smmoore at gmail.com
Mon Nov 7 16:03:19 CET 2011


When I checked on the cluster this morning, I saw the following from
collie cluster info.  Sheep and corosync processes were found on all
nodes except blade162, which had a corosync process but no sheep
process.  I'm not sure what has happened.  We have not had a
network interruption that we are aware of as all nodes are on the same
switch (along with countless other production systems).  Logs from
each node can be found at
http://www.stormpoint.com/files/sd_2011-11-07.zip.  The total
uncompressed size is about 254MB and the download size is around 21MB.
When I left Friday, this is how our cluster looked:

All nodes were running version 0.2.4_63_gd56e3b6

   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 192.168.217.152:7000 	64          1
     1 - 192.168.217.153:7000 	64          1
     2 - 192.168.217.154:7000 	64          1
     3 - 192.168.217.155:7000 	64          1
     4 - 192.168.217.156:7000 	64          1
     5 - 192.168.217.157:7000 	64          2
     6 - 192.168.217.159:7000 	64          2
     7 - 192.168.217.160:7000 	64          2
     8 - 192.168.217.161:7000 	64          2
     9 - 192.168.217.162:7000 	64          2

[root@blade152 sheep]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:22     14 [192.168.217.152:7000]
2011-11-04 17:26:22     13 [192.168.217.152:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     12 [192.168.217.152:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     11 [192.168.217.152:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     10 [192.168.217.152:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]
2011-11-04 17:26:21      9 [192.168.217.152:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:21      8 [192.168.217.152:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:21      7 [192.168.217.152:7000,
192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]


[root@blade153 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-05 00:05:19     14 [192.168.217.153:7000]
2011-11-05 00:05:19     13 [192.168.217.153:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     12 [192.168.217.153:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     11 [192.168.217.153:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     10 [192.168.217.153:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]
2011-11-05 00:05:19      9 [192.168.217.153:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:18      8 [192.168.217.153:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:18      7 [192.168.217.153:7000,
192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]


[root@blade154 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 13:25:06      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 06:58:12      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 05:57:43      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:34      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 07:01:26      1 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]


[root@blade155 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 13:24:42      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 06:57:48      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 05:57:19      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:07      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 10:33:17      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 07:00:59      1 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]


[root@blade156 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-05 07:39:11      9 [192.168.217.154:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000]
2011-11-05 07:39:11      8 [192.168.217.154:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 18:47:30      7 [192.168.217.153:7000,
192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]
2011-11-04 17:26:26      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:30      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:03      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:03      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]


[root@blade157 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-05 07:39:11      9 [192.168.217.154:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000]
2011-11-05 07:39:11      8 [192.168.217.154:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 18:47:30      7 [192.168.217.153:7000,
192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.162:7000]
2011-11-04 17:26:26      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:32      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 10:59:32      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:34      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]


[root@blade159 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:11      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:17      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:58:48      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:50:37      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 14:34:46      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 11:02:28      1 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]


[root@blade160 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:26      6 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:30      5 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:02      4 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:50:46      3 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-02 14:34:55      2 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 11:02:37      1 [192.168.217.152:7000,
192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]


[root@blade161 ~]# collie cluster info
Cluster status: The sheepdog is stopped doing IO, short of living nodes

Cluster created at Wed Nov  2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:51     14 [192.168.217.161:7000]
2011-11-04 17:26:51     13 [192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:51     12 [192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:51     11 [192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48     10 [192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]
2011-11-04 17:26:48      9 [192.168.217.156:7000,
192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48      8 [192.168.217.155:7000,
192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48      7 [192.168.217.154:7000,
192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
192.168.217.162:7000]


[root@blade162 ~]# collie cluster info
failed to connect to localhost:7000, Connection refused
failed to connect to localhost:7000, Connection refused
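
To compare the outputs above side by side, something like the following
Python sketch could be used.  It is only an illustration: it assumes each
node's collie cluster info output has been saved as text in the format
shown above (including the mail-wrapped member lists), and the function
names (parse_epochs, latest_view, group_by_view) are made up for this
example and are not part of sheepdog or collie.

```python
import re

# One epoch entry: timestamp, epoch number, then a bracketed member list.
# The member list may wrap across several lines, so match up to the next "]".
ENTRY_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[([^\]]*)\]"
)


def parse_epochs(text):
    """Return {epoch: (timestamp, [members])} parsed from captured output."""
    epochs = {}
    for timestamp, epoch, members in ENTRY_RE.findall(text):
        nodes = [m.strip() for m in members.split(",") if m.strip()]
        epochs[int(epoch)] = (timestamp, nodes)
    return epochs


def latest_view(text):
    """Return (latest_epoch, members), or None if no epoch entries were found
    (for example when the sheep daemon refused the connection)."""
    epochs = parse_epochs(text)
    if not epochs:
        return None
    latest = max(epochs)
    return latest, epochs[latest][1]


def group_by_view(outputs):
    """Group hosts by their latest (epoch, members) view.

    outputs maps a host name to that host's captured output text.
    Hosts with no parsable epoch entries are grouped under the key None.
    """
    groups = {}
    for host, text in outputs.items():
        view = latest_view(text)
        key = None if view is None else (view[0], tuple(view[1]))
        groups.setdefault(key, []).append(host)
    return groups
```

Passing a dict such as {"blade152": <saved output>, ...} to group_by_view
would return one group per distinct latest epoch and member list, which
makes it easier to see which nodes agree with each other.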


