I went ahead this morning, pulled down g066d753, and applied it to the four nodes. I brought up node174 first, as usual, and then tried node173. It still refused to join, but with a different message:

.......
Sep 22 08:28:49 update_cluster_info(756) status = 2, epoch = 17, 66, 0
Sep 22 08:28:49 update_cluster_info(759) failed to join sheepdog, 66
Sep 22 08:28:49 leave_cluster(1984) 16
Sep 22 08:28:49 update_cluster_info(761) I am really hurt and gonna leave cluster.
Sep 22 08:28:49 update_cluster_info(762) Fix yourself and restart me later, pleaseeeee...Bye.
Sep 22 08:28:49 log_sigsegv(367) sheep logger exits abnormally, pid:24265

I then brought up node157, and it recovered. Then I brought up node156, and it recovered as well. After that I was able to bring up node173.

Watching "collie node info" during the node startups, I noticed some odd things, mostly the in-use sizes changing. I'm not sure whether this is normal, but below is what I saw; the output runs from top (oldest) to bottom (most recent). Do objects redistribute themselves around the cluster during recovery or epoch changes? (My guess at the mechanism is sketched at the end of this mail.)

node174 and node157:

[root@node174 ~]# collie node info
Id      Size     Used     Use%
 0      382 GB   17 GB    4%
 1      394 GB   17 GB    4%
Total   775 GB   34 GB    4%, total virtual VDI Size 100 GB

Then I added node156:

[root@node174 ~]# collie node info
Id      Size     Used     Use%
 0      365 GB   720 MB   0%
 1      376 GB   12 GB    3%
 2      380 GB   3.6 GB   0%
failed to read object, 80f5969500000000
Remote node has a new epoch
failed to read a inode header
failed to read object, 80f5969600000000
Remote node has a new epoch
failed to read a inode header
Total   1.1 TB   16 GB    1%, total virtual VDI Size 0.0 MB

[root@node174 ~]# collie node info
Id      Size     Used     Use%
 0      365 GB   1008 MB  0%
 1      377 GB   12 GB    3%
 2      382 GB   5.2 GB   1%
Total   1.1 TB   19 GB    1%, total virtual VDI Size 100 GB

Then, after every node finished recovery, I added node173 back:

[root@node174 ~]# collie node info
Id      Size     Used     Use%
 0      374 GB   10 GB    2%
 1      377 GB   12 GB    3%
 2      399 GB   22 GB    5%
Total   1.1 TB   45 GB    3%, total virtual VDI Size 100 GB

[root@node174 ~]# collie node info
Id      Size     Used     Use%
 0      365 GB   496 MB   0%
 1      366 GB   1.3 GB   0%
 2      377 GB   792 MB   0%
 3      394 GB   17 GB    4%
failed to read object, 80f5969400000000
Remote node has a new epoch
failed to read a inode header
Total   1.5 TB   20 GB    1%, total virtual VDI Size 100 GB

[root@node174 sheepdog]# collie node info
Id      Size     Used     Use%
 0      386 GB   21 GB    5%
 1      381 GB   17 GB    4%
 2      397 GB   21 GB    5%
 3      394 GB   17 GB    4%
Total   1.5 TB   76 GB    4%, total virtual VDI Size 100 GB

But as far as I can tell, everything is working correctly right now.
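For what it's worth, my working assumption (I haven't read sheepdog's placement code, so treat this as a guess) is that objects are placed by consistent hashing over the current node list. That would explain why the per-node "Used" figures shift when a node joins or leaves: the objects whose hash falls in the new node's arc of the ring get re-homed to it during recovery. The toy C program below is my own illustration of that idea, not sheepdog's code; the hash function, node ids, and object counts are all made up.

/* Toy sketch of consistent-hash object placement (not sheepdog's
 * actual code). Each object maps to the first node clockwise of its
 * hash on a ring, so when a node joins, only the objects that now
 * fall in its arc change homes. */
#include <stdio.h>
#include <stdint.h>

/* FNV-1a over the 8 bytes of x, standing in for the real hash. */
static uint64_t hash64(uint64_t x)
{
	uint64_t h = 14695981039346656037ULL;
	for (int i = 0; i < 8; i++) {
		h ^= (x >> (i * 8)) & 0xff;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Pick the node whose ring position is the first at or after the
 * object's hash, wrapping around to the lowest position. */
static int pick_node(uint64_t oid, const uint64_t *ring, int nr_nodes)
{
	uint64_t h = hash64(oid);
	int best = -1, lowest = 0;

	for (int i = 0; i < nr_nodes; i++) {
		if (ring[i] < ring[lowest])
			lowest = i;
		if (ring[i] >= h && (best < 0 || ring[i] < ring[best]))
			best = i;
	}
	return best >= 0 ? best : lowest;
}

int main(void)
{
	/* Hypothetical ring positions for four nodes (ids hashed). */
	uint64_t ring[4] = { hash64(156), hash64(157),
			     hash64(173), hash64(174) };
	int moved = 0;

	for (uint64_t oid = 0; oid < 10000; oid++) {
		int before = pick_node(oid, ring, 3); /* 3-node cluster */
		int after  = pick_node(oid, ring, 4); /* 4th node joins */
		if (before != after)
			moved++;
	}
	printf("%d of 10000 objects re-homed when the 4th node joined\n",
	       moved);
	return 0;
}

Running it prints how many of the 10000 toy objects re-home when the fourth node joins; only the objects in the new node's arc move, which looks like the same flavor of shuffling the "Used" columns above are showing. Whether sheepdog really works this way, I'd be glad to have someone confirm.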