When I checked on the cluster this morning, I saw the following from cluster info. Sheep and corosync processes were running on all nodes except blade162, which had a corosync process but no sheep process. I'm not sure what has happened. We have not had a network interruption that we are aware of, as all nodes are on the same switch (along with countless other production systems).

Logs from each node can be found at http://www.stormpoint.com/files/sd_2011-11-07.zip. The total uncompressed size is ~254MB and the download is around 21MB.

All nodes were running version 0.2.4_63_gd56e3b6.

When I left Friday, this is how our cluster looked:

   Idx - Host:Port              Vnodes  Zone
---------------------------------------------
     0 - 192.168.217.152:7000     64     1
     1 - 192.168.217.153:7000     64     1
     2 - 192.168.217.154:7000     64     1
     3 - 192.168.217.155:7000     64     1
     4 - 192.168.217.156:7000     64     1
     5 - 192.168.217.157:7000     64     2
     6 - 192.168.217.159:7000     64     2
     7 - 192.168.217.160:7000     64     2
     8 - 192.168.217.161:7000     64     2
     9 - 192.168.217.162:7000     64     2

[root@blade152 sheep]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:22     14 [192.168.217.152:7000]
2011-11-04 17:26:22     13 [192.168.217.152:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     12 [192.168.217.152:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     11 [192.168.217.152:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:22     10 [192.168.217.152:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:21      9 [192.168.217.152:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:21      8 [192.168.217.152:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:21      7 [192.168.217.152:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]

[root@blade153 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-05 00:05:19     14 [192.168.217.153:7000]
2011-11-05 00:05:19     13 [192.168.217.153:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     12 [192.168.217.153:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     11 [192.168.217.153:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:19     10 [192.168.217.153:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:19      9 [192.168.217.153:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:18      8 [192.168.217.153:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-05 00:05:18      7 [192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]

[root@blade154 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 13:25:06      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 06:58:12      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 05:57:43      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:34      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 07:01:26      1 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]

[root@blade155 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 13:24:42      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 06:57:48      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 05:57:19      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:07      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:33:17      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 07:00:59      1 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]

[root@blade156 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-05 07:39:11      9 [192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000]
2011-11-05 07:39:11      8 [192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 18:47:30      7 [192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:26      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:30      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:03      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:03      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]

[root@blade157 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-05 07:39:11      9 [192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000]
2011-11-05 07:39:11      8 [192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 18:47:30      7 [192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 17:26:26      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:32      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 10:59:32      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:49:34      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 10:33:44      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]

[root@blade159 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:11      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:17      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:58:48      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:50:37      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:34:46      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 11:02:28      1 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]

[root@blade160 ~]# collie cluster info
Cluster status: running

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:26      6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-04 10:59:30      5 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 09:59:02      4 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:50:46      3 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-02 14:34:55      2 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
2011-11-02 11:02:37      1 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]

[root@blade161 ~]# collie cluster info
Cluster status: The sheepdog is stopped doing IO, short of living nodes

Cluster created at Wed Nov 2 11:02:26 2011

Epoch Time           Version
2011-11-04 17:26:51     14 [192.168.217.161:7000]
2011-11-04 17:26:51     13 [192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:51     12 [192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:51     11 [192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48     10 [192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48      9 [192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48      8 [192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
2011-11-04 17:26:48      7 [192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]

[root@blade162 ~]# collie cluster info
failed to connect to localhost:7000, Connection refused
failed to connect to localhost:7000, Connection refused
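
In case it helps anyone reading through the output above, here is a minimal Python sketch that compares the epoch rows reported by different nodes and lists the epochs where they recorded different membership. It is not part of sheepdog or collie; parse_epochs and disagreements are hypothetical helper names, and the regular expression assumes each row has the shape shown above (timestamp, epoch number, bracketed member list). The sample uses only a few rows copied from the blade152, blade156 and blade160 output.

```python
import re

# One epoch row of `collie cluster info`, e.g.
# "2011-11-04 17:26:22 14 [192.168.217.152:7000, 192.168.217.162:7000]"
ROW = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\[([^\]]*)\]")


def parse_epochs(text):
    """Return {epoch: frozenset(members)} for one node's output."""
    epochs = {}
    for _time, epoch, members in ROW.findall(text):
        names = (m.strip() for m in members.split(","))
        epochs[int(epoch)] = frozenset(n for n in names if n)
    return epochs


def disagreements(views):
    """views is {node: {epoch: members}}.

    Return {epoch: {node: members}} for every epoch where the nodes
    that recorded that epoch do not all agree on the member list.
    """
    result = {}
    for epoch in sorted({e for v in views.values() for e in v}):
        seen = {node: v[epoch] for node, v in views.items() if epoch in v}
        if len(set(seen.values())) > 1:
            result[epoch] = seen
    return result


# Rows copied from the output in this post (epochs 6 and 7 only).
SAMPLE = {
    "blade152": "2011-11-04 17:26:21 7 [192.168.217.152:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]",
    "blade156": "2011-11-04 18:47:30 7 [192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] "
                "2011-11-04 17:26:26 6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]",
    "blade160": "2011-11-04 17:26:26 6 [192.168.217.152:7000, 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]",
}

views = {node: parse_epochs(text) for node, text in SAMPLE.items()}
diff = disagreements(views)

for epoch, seen in diff.items():
    print("epoch", epoch, "differs:")
    for node, members in sorted(seen.items()):
        print("  ", node, sorted(members))
```

On these rows only epoch 7 is reported: blade152 and blade156 each recorded an epoch 7, but with different member lists, while blade156 and blade160 agree on epoch 6. Running the same comparison over the full output from every node would show where each node's history branches off.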