A follow-up to the follow-up. The hint was in build_node_list(460) nr_sd_nodes:2. Turned out there was a hung sheep instance on node 3 that never went down and caused all attempts at restarting other nodes to immediately halt. I killed it, rebooted everything for good measure. Back up, recovered and instances running again.

On 08/23/2013 01:17 PM, Andrew J. Hobbs wrote:

Reverted to the previously running rc of 0.7.0. This now also fails. Attempting a reboot now. Will try again and attach a log segment with debug messages enabled.

<large number of these snipped>
Aug 23 13:14:50 DEBUG [main] add_to_lru_cache(684) oid cf50080000091f added
Aug 23 13:14:50 DEBUG [main] load_cache_object(1262) cf50080000091f
Aug 23 13:14:50 DEBUG [main] add_to_lru_cache(684) oid cf500300000036 added
Aug 23 13:14:50 DEBUG [main] load_cache_object(1262) cf500300000036
Aug 23 13:14:50 INFO [main] check_host_env(465) Allowed open files 100000, suggested 1024000
Aug 23 13:14:50 DEBUG [main] check_host_env(471) Allowed core file size 0, suggested unlimited
Aug 23 13:14:50 INFO [main] main(854) sheepdog daemon (version 0.7.0_26_gc65bb2f) started
Aug 23 13:14:50 DEBUG [main] zk_event_handler(1012) 1, 1761
Aug 23 13:14:50 DEBUG [main] zk_queue_pop_advance(402) /sheepdog/queue/0000001761, type:2, len:114872, pos:1761
Aug 23 13:14:50 DEBUG [main] zk_handle_accept(854) ACCEPT
Aug 23 13:14:50 DEBUG [main] init_node_list(838) 1
Aug 23 13:14:50 DEBUG [main] zk_handle_accept(859) IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] zk_handle_accept(865) create path:/sheepdog/member/IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] zk_watcher(522) path:/sheepdog/member/IPv4 ip:10.254.0.1 port:7000, type:1
Aug 23 13:14:50 DEBUG [main] build_node_list(460) nr_sd_nodes:2
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(886) join IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(888) [0] IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(888) [1] IPv4 ip:10.254.0.3 port:7000
Aug 23 13:14:50 DEBUG [main] zk_watcher(522) path:/sheepdog/member, type:4
Aug 23 13:14:50 INFO [main] main(861) shutdown
Aug 23 13:14:50 INFO [main] zk_leave(780) leaving from cluster
Aug 23 13:14:50 DEBUG [main] zk_watcher(522) path:/sheepdog/queue/0000001762, type:1
Aug 23 13:14:50 DEBUG [main] zk_queue_push(362) create path:/sheepdog/queue/0000001762, queue_pos:0000001762, len:144
Aug 23 13:14:50 DEBUG [main] zk_watcher(522) path:/sheepdog/member/IPv4 ip:10.254.0.1 port:7000, type:2
Aug 23 13:14:50 INFO [main] main(866) cleaning journal file
Aug 23 13:14:50 DEBUG [main] zk_queue_push(362) create path:/sheepdog/queue/0000001763, queue_pos:0000001762, len:144

Note, I'm not seeing anything that indicates an issue, simply started then stopped. Is it possible the cluster shutdown command has persisted in zookeeper or some other location?

On 08/23/2013 12:57 PM, Andrew J. Hobbs wrote:

Aug 23 12:52:05 INFO [main] send_join_request(770) IPv4 ip:10.254.0.1 port:7000
Aug 23 12:52:05 ERROR [main] for_each_object_in_stale(383) /var/lib/sheepdog/obj/.stale
Aug 23 12:52:05 INFO [main] check_host_env(465) Allowed open files 100000, suggested 1024000
Aug 23 12:52:05 INFO [main] main(854) sheepdog daemon (version 0.7.0_26_gc65bb2f) started
Aug 23 12:52:05 INFO [main] main(861) shutdown
Aug 23 12:52:05 INFO [main] zk_leave(780) leaving from cluster

For now I'm going to revert to a prior build, but I'm not sure how to proceed. The /var/lib/sheepdog/obj/.stale directory has no content. This build was pulled from git approximately 20 minutes ago.
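
For anyone hitting the same symptom, the pattern that gave it away in this thread (build_node_list reporting nr_sd_nodes:2, a member listed that was supposed to be down, and "started" followed straight away by "shutdown") can be pulled out of a debug log with a short script. What follows is only a rough sketch in Python, not part of sheepdog: the line format is inferred solely from the log lines quoted in this thread, and the names LINE_RE, summarize and the expected parameter are made up for illustration.

```python
import re

# Line shape inferred from the log excerpts quoted in this thread, e.g.
# "Aug 23 13:14:50 DEBUG [main] build_node_list(460) nr_sd_nodes:2".
# This is an assumption, not a documented sheepdog log format.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2})\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+(?P<func>\w+)\((?P<line>\d+)\)\s*(?P<msg>.*)$"
)

# A few lines copied from the 13:14:50 log segment above.
SAMPLE_LOG = """\
Aug 23 13:14:50 INFO [main] main(854) sheepdog daemon (version 0.7.0_26_gc65bb2f) started
Aug 23 13:14:50 DEBUG [main] build_node_list(460) nr_sd_nodes:2
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(886) join IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(888) [0] IPv4 ip:10.254.0.1 port:7000
Aug 23 13:14:50 DEBUG [main] sd_accept_handler(888) [1] IPv4 ip:10.254.0.3 port:7000
Aug 23 13:14:50 INFO [main] main(861) shutdown
"""


def summarize(lines, expected=()):
    """Collect the node count, the listed members, and whether a shutdown
    was logged after the daemon reported it had started.

    `expected` is a hypothetical parameter: the IPs the operator believes
    are currently up. Anything listed beyond that is reported as unexpected.
    """
    nr_sd_nodes = None
    members = []
    started = False
    shutdown_after_start = False

    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip anything that does not look like a sheep log line
        func, msg = m.group("func"), m.group("msg")

        if func == "build_node_list":
            n = re.search(r"nr_sd_nodes:(\d+)", msg)
            if n:
                nr_sd_nodes = int(n.group(1))
        elif func == "sd_accept_handler":
            # Only the indexed "[i] IPv4 ip:... port:..." lines list members.
            n = re.match(r"\[\d+\]\s+IPv4 ip:(\S+) port:(\d+)", msg)
            if n:
                members.append((n.group(1), int(n.group(2))))
        elif func == "main" and msg.endswith("started"):
            started = True
        elif func == "main" and msg == "shutdown" and started:
            shutdown_after_start = True

    expected = set(expected)
    unexpected = [ip for ip, _port in members if ip not in expected]
    return {
        "nr_sd_nodes": nr_sd_nodes,
        "members": members,
        "unexpected": unexpected,
        "shutdown_after_start": shutdown_after_start,
    }


if __name__ == "__main__":
    # Only 10.254.0.1 was being started at this point in the thread.
    print(summarize(SAMPLE_LOG.splitlines(), expected={"10.254.0.1"}))
```

A member showing up as unexpected does not prove a hung sheep on its own; it only points at the node worth checking by hand, which is what resolved it here.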