At Mon, 7 Nov 2011 10:03:19 -0500,
Shawn Moore wrote:
>
> When I checked on the cluster this morning I see the following from
> cluster info. A sheep and corosync process was found on all nodes
> except blade162 which didn't have a sheep process but did have a
> corosync one. I'm not sure what has happened. We have not had a

In blade162.log:

  Nov 05 00:06:30 sd_leave_handler(1222) Network Patition Bug: I should have exited.

This is probably a corosync bug, and Yunkai is trying to solve it.

  http://lists.wpkg.org/pipermail/sheepdog/2011-November/001835.html

> network interruption that we are aware of as all nodes are on the same
> switch (along with countless other production systems). Logs from
> each node can be found
> http://www.stormpoint.com/files/sd_2011-11-07.zip. Total
> un-compressed size is ~ 254MB and this download size is around 21MB.
> When I left Friday, this is how our cluster looked:
>
> All nodes were running version 0.2.4_63_gd56e3b6
>
> Idx - Host:Port Vnodes Zone
> ---------------------------------------------
> 0 - 192.168.217.152:7000 64 1
> 1 - 192.168.217.153:7000 64 1
> 2 - 192.168.217.154:7000 64 1
> 3 - 192.168.217.155:7000 64 1
> 4 - 192.168.217.156:7000 64 1
> 5 - 192.168.217.157:7000 64 2
> 6 - 192.168.217.159:7000 64 2
> 7 - 192.168.217.160:7000 64 2
> 8 - 192.168.217.161:7000 64 2
> 9 - 192.168.217.162:7000 64 2
>
> [root at blade152 sheep]# collie cluster info
> Cluster status: running
>
> Cluster created at Wed Nov 2 11:02:26 2011
>
> Epoch Time Version
> 2011-11-04 17:26:22 14 [192.168.217.152:7000]
> 2011-11-04 17:26:22 13 [192.168.217.152:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22 12 [192.168.217.152:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22 11 [192.168.217.152:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22 10 [192.168.217.152:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 2011-11-04
17:26:21 9 [192.168.217.152:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 17:26:21 8 [192.168.217.152:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 17:26:21 7 [192.168.217.152:7000, > 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.162:7000] > > > [root at blade153 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-05 00:05:19 14 [192.168.217.153:7000] > 2011-11-05 00:05:19 13 [192.168.217.153:7000, 192.168.217.162:7000] > 2011-11-05 00:05:19 12 [192.168.217.153:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-05 00:05:19 11 [192.168.217.153:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-05 00:05:19 10 [192.168.217.153:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.162:7000] > 2011-11-05 00:05:19 9 [192.168.217.153:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-05 00:05:18 8 [192.168.217.153:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-05 00:05:18 7 [192.168.217.153:7000, > 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.162:7000] > > > [root at blade154 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-04 13:25:06 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 
192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 06:58:12 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 05:57:43 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-02 10:49:34 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 10:33:44 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-02 07:01:26 1 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > > > [root at blade155 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-04 13:24:42 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 06:57:48 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 05:57:19 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 
192.168.217.162:7000] > 2011-11-02 10:49:07 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 10:33:17 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-02 07:00:59 1 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > > > [root at blade156 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-05 07:39:11 9 [192.168.217.154:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000] > 2011-11-05 07:39:11 8 [192.168.217.154:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 18:47:30 7 [192.168.217.153:7000, > 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.162:7000] > 2011-11-04 17:26:26 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 10:59:30 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 09:59:03 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 
192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 09:59:03 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 10:33:44 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > > > [root at blade157 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-05 07:39:11 9 [192.168.217.154:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000] > 2011-11-05 07:39:11 8 [192.168.217.154:7000, > 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 18:47:30 7 [192.168.217.153:7000, > 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.162:7000] > 2011-11-04 17:26:26 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 10:59:32 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 10:59:32 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-02 10:49:34 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 
192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 10:33:44 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > > > [root at blade159 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-04 17:26:11 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 10:59:17 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 09:58:48 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-02 14:50:37 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 14:34:46 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-02 11:02:28 1 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > > > [root at blade160 ~]# collie cluster info > Cluster status: running > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-04 17:26:26 6 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 
192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-04 10:59:30 5 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 09:59:02 4 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-02 14:50:46 3 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-02 14:34:55 2 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000] > 2011-11-02 11:02:37 1 [192.168.217.152:7000, > 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > > > [root at blade161 ~]# collie cluster info > Cluster status: The sheepdog is stopped doing IO, short of living nodes > > Cluster created at Wed Nov 2 11:02:26 2011 > > Epoch Time Version > 2011-11-04 17:26:51 14 [192.168.217.161:7000] > 2011-11-04 17:26:51 13 [192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 17:26:51 12 [192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 17:26:51 11 [192.168.217.159:7000, > 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 17:26:48 10 [192.168.217.157:7000, > 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000, > 192.168.217.162:7000] > 2011-11-04 17:26:48 9 [192.168.217.156:7000, > 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000, > 192.168.217.161:7000, 192.168.217.162:7000] > 2011-11-04 17:26:48 8 
[192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:48 7 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
>
>
> [root at blade162 ~]# collie cluster info
> failed to connect to localhost:7000, Connection refused
> failed to connect to localhost:7000, Connection refused

It seems that a network partition was wrongly detected. To make the
explanation simpler, I'll use the following labels for each node:

  n0: 192.168.217.152
  n1: 192.168.217.153
  n2: 192.168.217.154
  n3: 192.168.217.155
  n4: 192.168.217.156
  n5: 192.168.217.157
  n6: 192.168.217.159
  n7: 192.168.217.160
  n8: 192.168.217.161
  n9: 192.168.217.162

I guess your cluster was split into 5 groups:
{n0}, {n1}, {n2, n3, n4, n5, n6, n7}, {n8}, {n9}.

 - n0 received a notification that n[1-9] had left.
 - n1 received a notification that n0 and n[2-9] had left.
 - n[2-7] received a notification that n0, n1, n8, and n9 had left.
 - n8 received a notification that n[0-7] and n9 had left.
 - n9 received a notification that n[0-8] had left (and aborted due to
   the above bug).

Currently, Sheepdog cannot handle this kind of false detection. We
may be able to avoid this problem by setting appropriate values in
corosync.conf (totem.merge or totem.seqno_unchanged_const?), but I'm
not sure. Does anyone know more about this?

Thanks,

Kazutaka
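
P.S. To make the guess above easier to check, here is a rough sketch in
Python. It only does set arithmetic on the labels n0..n9: from the
guessed groups it derives which nodes each node would have been told had
left, and it compares the guess with the newest member list some nodes
printed. The names GROUPS, OBSERVED, left_notifications and mismatches
are made up for this mail and are not part of Sheepdog or corosync. The
OBSERVED table is copied by hand from the quoted output and only covers
nodes whose output shows an epoch after the split.

```python
# Labels used in this mail for each node.
LABELS = {
    "n0": "192.168.217.152", "n1": "192.168.217.153",
    "n2": "192.168.217.154", "n3": "192.168.217.155",
    "n4": "192.168.217.156", "n5": "192.168.217.157",
    "n6": "192.168.217.159", "n7": "192.168.217.160",
    "n8": "192.168.217.161", "n9": "192.168.217.162",
}

# Guessed split (an assumption, not something Sheepdog reports).
GROUPS = [
    {"n0"},
    {"n1"},
    {"n2", "n3", "n4", "n5", "n6", "n7"},
    {"n8"},
    {"n9"},
]

# Newest member list per node, copied from the quoted
# "collie cluster info" output.
OBSERVED = {
    "n0": {"n0"},                                  # blade152, epoch 14
    "n1": {"n1"},                                  # blade153, epoch 14
    "n4": {"n2", "n3", "n4", "n5", "n6", "n7"},    # blade156, epoch 9
    "n5": {"n2", "n3", "n4", "n5", "n6", "n7"},    # blade157, epoch 9
    "n8": {"n8"},                                  # blade161, epoch 14
}


def left_notifications(groups):
    """Map each node to the sorted list of nodes it would see leaving."""
    all_nodes = set().union(*groups)
    return {
        node: sorted(all_nodes - group)
        for group in groups
        for node in group
    }


def mismatches(groups, observed):
    """Return nodes whose observed member list differs from their group."""
    group_of = {node: group for group in groups for node in group}
    return sorted(
        node for node, members in observed.items()
        if group_of[node] != members
    )


if __name__ == "__main__":
    for node, left in sorted(left_notifications(GROUPS).items()):
        print(f"{node} ({LABELS[node]}) saw leave: {', '.join(left)}")
    print("nodes that contradict the guess:", mismatches(GROUPS, OBSERVED))
```

If mismatches() returns an empty list, the quoted member lists are at
least consistent with the five groups; it does not prove that the split
happened this way.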