AI research moves fast enough that any snapshot becomes dated within months. Rather than cataloging current benchmarks, this lesson examines the structural forces shaping the field: what is scaling, what is hitting limits, where the economic incentives point, and what the open questions are as of early 2025.
The frontier of AI capability is defined by a handful of models from a handful of organizations. As of early 2025: OpenAI’s GPT-4o and o1/o3, Anthropic’s Claude 3.5 Sonnet and Claude with extended thinking, Google DeepMind’s Gemini 1.5 and 2.0, and Meta’s Llama 3.1 405B (open weights).
These models share a common architecture (dense or mixture-of-experts transformers), similar training approaches (pre-training on web-scale data, RLHF/constitutional AI alignment), and comparable capabilities on standard benchmarks. The differences are in the details: context window length, multimodal capabilities, reasoning depth, tool use, and the specific safety-capability tradeoffs each lab has chosen.
The pace of improvement has been striking. Tasks that were impossible for AI in 2022 (passing the bar exam, writing production code, solving competition math problems) are routine in 2025. Whether this pace continues is the central question.
Standard benchmarks are losing their ability to discriminate between frontier models. When multiple models score above 90% on a benchmark, the remaining errors may reflect ambiguity in the benchmark rather than meaningful capability differences. New benchmarks (GPQA for PhD-level science, SWE-bench for real-world software engineering, ARC-AGI for abstraction and reasoning) are designed to be harder, but they too will eventually saturate. The field needs evaluation methods that scale with capability — potentially using AI systems to generate and evaluate increasingly difficult challenges.
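Sampling error alone shows why scores above 90% stop discriminating. The sketch below assumes a hypothetical 500-question benchmark and treats each question as an independent pass/fail trial. The function names and scores are illustrative and not taken from any real evaluation.

```python
import math

def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def gap_in_standard_errors(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Score gap divided by the standard error of the difference.

    Assumes the two scores are independent, which is a simplification:
    on a shared test set the two models' errors are correlated.
    """
    se_diff = math.sqrt(
        standard_error(acc_a, n_questions) ** 2
        + standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) / se_diff

# Hypothetical scores on a hypothetical 500-question benchmark.
z = gap_in_standard_errors(0.92, 0.91, 500)
print(f"gap = {z:.2f} standard errors")
```

Under these assumptions a one-point gap is well under one standard error, so it cannot be separated from noise before label ambiguity is even considered.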
The scaling paradigm — bigger models, more data, more compute — has driven progress for five years. But several walls are approaching.
The data wall. Pre-training requires vast amounts of text data. Estimates suggest that high-quality English text on the internet totals roughly 10 trillion tokens. Current frontier models have already been trained on a significant fraction of this data. Repeating data during training yields diminishing returns after a few epochs. Where does the next 10x of training data come from?
The compute wall. Training a frontier model requires tens of thousands of GPUs running for months, at a cost exceeding $100 million. The next generation may require $1 billion or more. The supply chain for AI chips (accelerator design and networking by NVIDIA, fabrication by TSMC, high-bandwidth memory by Samsung and SK Hynix) is concentrated and capacity-constrained. Power consumption for large training runs rivals that of small cities.
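A back-of-the-envelope calculation shows how "tens of thousands of GPUs running for months" reaches nine figures. The GPU count, run length, and hourly rate below are assumed round numbers chosen for illustration, not reported figures for any specific model.

```python
def training_cost_usd(num_gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Rental-style cost estimate: GPU-hours times an hourly rate.

    Ignores staff, failed runs, storage, and networking, so it is a
    lower bound on the real cost.
    """
    gpu_hours = num_gpus * days * 24
    return gpu_hours * usd_per_gpu_hour

# Assumed inputs: 20,000 GPUs, a 90-day run, $2.50 per GPU-hour.
cost = training_cost_usd(num_gpus=20_000, days=90, usd_per_gpu_hour=2.50)
print(f"${cost / 1e6:.0f} million")  # prints "$108 million"
```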
The efficiency wall. Even if data and compute were unlimited, the transformer architecture may have fundamental efficiency limits. Attention scales quadratically with sequence length. The dense model architecture activates all parameters for every token. Mixture-of-experts partially addresses this, but introduces routing complexity and training instability.
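Both efficiency claims can be made concrete with rough counts. The sketch below uses the common approximation that an (m x k) by (k x n) matrix product costs about 2*m*k*n floating-point operations. The layer sizes and expert counts are hypothetical.

```python
def attention_matmul_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for the two seq_len x seq_len products in one
    attention layer: Q @ K^T, then the attention weights @ V.

    Each costs about 2 * seq_len * seq_len * d_model FLOPs, so the total
    grows with the square of the sequence length.
    """
    return 2 * (2 * seq_len * seq_len * d_model)

def moe_active_fraction(num_experts: int, experts_per_token: int) -> float:
    """Fraction of expert parameters used for a single token.

    Applies only to the expert (feed-forward) layers. Attention and
    embedding parameters are still shared by every token.
    """
    return experts_per_token / num_experts

short = attention_matmul_flops(seq_len=4096, d_model=4096)
long = attention_matmul_flops(seq_len=8192, d_model=4096)
print(f"2x the context costs {long / short:.0f}x the attention FLOPs")
print(f"8 experts, 2 active: {moe_active_fraction(8, 2):.0%} of expert parameters per token")
```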
There is growing evidence that the returns to scaling are diminishing for some capability dimensions. The jump from GPT-3 to GPT-4 was larger than the jump from GPT-4 to GPT-4o on many benchmarks. Some researchers argue this indicates fundamental limits to the scaling approach. Others argue that reasoning models (o1, o3) represent a new scaling axis (inference compute) that restores rapid improvement. The truth is likely task-dependent: scaling continues to improve performance on reasoning and code, while gains on commonsense understanding and social reasoning are plateauing.
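One simple way to picture inference compute as a scaling axis is repeated sampling with a verifier: attempt the problem several times and keep any attempt that checks out. This is a stand-in chosen for illustration, not a description of how o1 or o3 work. It assumes attempts are independent and the verifier is perfect, and both assumptions are optimistic.

```python
def solve_rate(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# Hypothetical model that solves a problem 20% of the time per attempt.
for k in (1, 4, 16):
    print(f"{k:>2} attempts: {solve_rate(0.2, k):.2f}")
```

With these assumed numbers the solve rate rises from 0.20 to 0.59 to 0.97 with no change to the model itself. The gain depends entirely on having a reliable check, which is one reason improvements concentrate in math and code.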
If natural data is running out, can models generate their own training data? Synthetic data — text, code, reasoning traces, or other content generated by AI models — is increasingly used to supplement natural data.
Where synthetic data works. Mathematics, formal logic, and code are verifiable. A model can generate millions of math problems and solutions, and an automated checker can filter out the incorrect ones. The remaining correct solutions are high-quality training data. This approach drove the performance of reasoning models — generating chains of reasoning, verifying the final answers, and training on the correct chains.
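The generate-then-verify loop described above can be sketched with a toy task. Here a deliberately unreliable generator stands in for the model, and exact recomputation stands in for the automated checker. All names and the 30% error rate are hypothetical.

```python
import random

def noisy_generator(rng: random.Random, n: int, error_rate: float = 0.3):
    """Stand-in for a model: proposes (a, b, claimed_sum), sometimes wrong."""
    samples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        claimed = a + b
        if rng.random() < error_rate:
            claimed += rng.choice([-2, -1, 1, 2])  # inject a wrong answer
        samples.append((a, b, claimed))
    return samples

def verify(sample) -> bool:
    """Automated checker: recompute the answer and compare."""
    a, b, claimed = sample
    return a + b == claimed

rng = random.Random(0)
candidates = noisy_generator(rng, 1000)
training_set = [s for s in candidates if verify(s)]  # keep only verified samples
print(f"kept {len(training_set)} of {len(candidates)} candidates")
```

The filter is only as good as the checker. For arithmetic it is exact, while for reasoning traces a correct final answer does not guarantee that every intermediate step was sound.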
Where synthetic data fails. For tasks without automated verification (creative writing, nuanced analysis, ethical reasoning), synthetic data risks model collapse — a feedback loop where the model trains on its own outputs, amplifying biases and narrowing the distribution. Each generation of synthetic data is slightly less diverse than the previous, eventually converging to a degenerate distribution.
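The narrowing effect can be reproduced in a toy setting: fit a Gaussian to a small sample, draw the next generation's data from the fit, and repeat. This is a deliberately minimal caricature of model collapse, assuming a one-dimensional distribution and only 10 samples per generation. Real training pipelines are far more complex.

```python
import random
import statistics

def collapse_simulation(generations: int = 200,
                        samples_per_generation: int = 10,
                        seed: int = 0) -> list[float]:
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: the "natural" distribution
    history = [std]
    for _ in range(generations):
        # Each generation trains only on samples from the previous fit.
        data = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # maximum-likelihood estimate
        history.append(std)
    return history

history = collapse_simulation()
print(f"initial std = {history[0]:.3f}, final std = {history[-1]:.3g}")
```

Each refit loses a little of the tails, and because nothing reintroduces the lost variation, the spread drifts toward zero over many generations.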
Companies with large user bases have a structural advantage: user interactions generate a continuous stream of natural data. Every conversation with a chatbot, every code suggestion accepted or rejected, every search query and click is potential training signal. This creates a flywheel: better models attract more users, who generate more data, which trains better models. The flywheel advantage compounds over time and creates a barrier to entry that raw compute alone cannot overcome.
Current language models learn about the world through text descriptions. They know that “water flows downhill” because they have read sentences stating this, not because they have observed or simulated fluid dynamics. World models aim to give AI systems an internal simulation of how the world works.
Video generation models (OpenAI’s Sora and the models from Runway and Pika) learn physics implicitly by predicting future video frames. To predict that a ball will bounce, the model must have learned something about elasticity and gravity, not as explicit equations but as patterns in pixel space. Whether these learned representations constitute genuine understanding of physics or merely pattern matching over visual statistics is an active debate.
Robotics is the ultimate test of world models. A robot operating in the physical world must predict the consequences of its actions in real time. Foundation models for robotics (RT-2, Mobile ALOHA) combine language understanding with physical manipulation, using language models to interpret instructions and world models to plan actions.
Training robots in simulation is cheaper and safer than training in the real world, but simulated physics is never perfectly accurate. A policy learned in simulation may fail when transferred to a real robot because the real world has friction, compliance, sensor noise, and edge cases that the simulator does not capture. This sim-to-real gap is narrowing as simulators improve and domain randomization (training on varied simulated environments) makes policies more robust, but it remains a significant barrier to deploying capable robots outside controlled environments.
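Domain randomization amounts to drawing new physics parameters for every training episode, so that the policy cannot overfit to one simulator configuration. The sketch below shows only the sampling step. The parameter names and ranges are assumptions, and the simulator and policy rollout are omitted.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float          # surface friction coefficient
    mass_kg: float           # mass of the manipulated object
    sensor_noise_std: float  # std of additive sensor noise

def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized environment configuration (hypothetical ranges)."""
    return SimParams(
        friction=rng.uniform(0.4, 1.2),
        mass_kg=rng.uniform(0.8, 1.2),
        sensor_noise_std=rng.uniform(0.0, 0.05),
    )

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(5)]
for params in episodes:
    # A real training loop would reset the simulator with `params` here
    # and run one policy rollout; that part is left out of this sketch.
    print(params)
```

The design bet is that if the ranges are wide enough to bracket the real world, the real robot looks like one more sample from the training distribution.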
The economics of AI are reshaping the technology industry and, increasingly, the broader economy.
Training costs for frontier models have increased roughly 4x per year, from millions of dollars in 2020 to hundreds of millions in 2024. This concentration of investment limits frontier model development to a handful of well-capitalized organizations — a dynamic some call the “GPU rich” and “GPU poor” divide.
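Compound growth makes the jump from millions to hundreds of millions easy to check. The starting and ending costs below are assumed round numbers used only to illustrate the arithmetic.

```python
def cost_after(start_usd: float, growth_per_year: float, years: int) -> float:
    """Cost after compounding a fixed yearly growth factor."""
    return start_usd * growth_per_year ** years

def implied_growth(start_usd: float, end_usd: float, years: int) -> float:
    """Yearly growth factor implied by a start cost, end cost, and time span."""
    return (end_usd / start_usd) ** (1 / years)

# Assumed: a $1 million run in 2020, growing 4x per year for 4 years.
print(f"${cost_after(1_000_000, 4, 4) / 1e6:.0f} million")  # prints "$256 million"

# Assumed endpoints: $1 million in 2020, $100 million in 2024.
print(f"{implied_growth(1_000_000, 100_000_000, 4):.1f}x per year")
```

Depending on which endpoints are assumed, the implied yearly factor lands between roughly 3x and 4x, which is why the growth rate is best read as approximate.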
Inference costs are declining rapidly through hardware improvements, quantization, distillation, and architectural innovations. A task that cost on the order of $1 per query in 2023 can cost on the order of $0.01 in 2025. This cost reduction expands the set of economically viable AI applications: a 100x cost reduction does not just make existing applications cheaper, it makes entirely new categories of applications feasible.
Labor market effects are beginning to materialize. Coding assistants measurably increase developer productivity. AI-generated first drafts accelerate writing, analysis, and design. Customer service, data entry, and routine research are being automated. The historical pattern with automation technologies — displacement of some jobs, augmentation of others, creation of new categories — is playing out, but the breadth and pace of AI’s impact across white-collar work is unprecedented.
The Jevons paradox: when a resource becomes more efficient to use, total consumption can increase rather than decrease. More coal-efficient steam engines did not reduce coal consumption; they made steam power economical for more applications, increasing total coal use. Similarly, cheaper AI inference does not necessarily reduce total AI spending. It makes AI economical for tasks that were previously too expensive to automate, expanding total usage. If demand is elastic enough, a 100x reduction in inference cost can lead to more than a 100x increase in AI usage, because entirely new application categories become viable. This dynamic suggests that AI compute demand may continue to outpace supply for the foreseeable future.
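The condition under which cheaper inference raises total spending can be made concrete with a constant-elasticity demand curve, in which usage scales as price to the power of minus the elasticity. This functional form is a textbook simplification assumed here for illustration. The elasticity of real AI demand is not known.

```python
def usage_multiplier(price_drop_factor: float, elasticity: float) -> float:
    """How much usage grows when price falls by `price_drop_factor`,
    under constant-elasticity demand: Q is proportional to P ** (-elasticity)."""
    return price_drop_factor ** elasticity

def spending_multiplier(price_drop_factor: float, elasticity: float) -> float:
    """Change in total spending (price times usage) after the price drop."""
    return usage_multiplier(price_drop_factor, elasticity) / price_drop_factor

for elasticity in (0.5, 1.0, 1.5):
    usage = usage_multiplier(100, elasticity)
    spend = spending_multiplier(100, elasticity)
    print(f"elasticity {elasticity}: usage x{usage:.0f}, spending x{spend:.1f}")
```

Under this model a 100x price cut increases total spending only when elasticity exceeds 1. At 1.5, usage grows 1000x and spending grows 10x. At 0.5, usage grows 10x and spending falls to a tenth.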
Artificial General Intelligence (AGI) — a system that matches or exceeds human cognitive ability across all domains — is the field’s stated long-term goal and its most contentious topic.
Short-timeline arguments. Current systems already exceed human performance on many cognitive tasks. The rate of improvement is steep. Scaling and architectural innovations continue to unlock new capabilities. Some researchers predict AGI within 5-10 years.
Long-timeline arguments. Current systems fail at tasks requiring genuine understanding, long-horizon planning, physical reasoning, and robust generalization. Benchmark performance does not equal general intelligence. The easy gains from scaling may be exhausted, and the remaining problems (grounding, reasoning, embodiment) may require fundamental breakthroughs that are not predictable.
The definitional problem. There is no consensus definition of AGI. Some define it as “passing any human cognitive test.” Others require embodied competence (a robot that can do any physical task a human can). Still others define it economically: “a system that can perform any economically valuable cognitive task.” The timeline estimate depends heavily on the definition.
Capability overhang occurs when a system has latent abilities that are not yet being utilized because no one has discovered the right prompt, the right fine-tuning approach, or the right tool-use pattern. After GPT-4 was released, months of experimentation progressively revealed capabilities (complex reasoning, code generation, creative writing) that were not apparent at launch.
If current frontier models have significant untapped capability overhang, then AGI-relevant capabilities might emerge from existing architectures through better prompting, fine-tuning, or tool use — without requiring fundamentally new architectures. This is a reason for uncertainty in either direction: we may be closer to AGI than benchmarks suggest (if the capabilities are latent), or further away (if the benchmarks accurately reflect the limits of current approaches).
The AI field is split between proprietary models (GPT-4, Claude, Gemini) and open-weight models (Llama, Mistral, Qwen, DeepSeek).
Arguments for open models. Transparency enables external safety research and auditing. Competition drives innovation and prevents monopoly. Developers can fine-tune and deploy models without dependence on a single provider. Open models democratize access — a university researcher with a single GPU can fine-tune a 7B parameter model.
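A rough memory estimate shows what single-GPU fine-tuning of a 7B model involves. The bytes-per-parameter figures below are common rules of thumb, treated here as assumptions: 16 bytes for full mixed-precision fine-tuning with Adam (half-precision weights and gradients plus 32-bit master weights and two optimizer moments), 2 bytes for half-precision weights alone, and 0.5 bytes for 4-bit quantized weights. Activation memory is ignored.

```python
def memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_param

full_finetune_gb = memory_gb(7, 16)   # weights + gradients + Adam state
fp16_weights_gb = memory_gb(7, 2)     # frozen half-precision weights only
four_bit_weights_gb = memory_gb(7, 0.5)  # frozen 4-bit quantized weights

print(f"full fine-tuning: ~{full_finetune_gb:.0f} GB")     # ~112 GB
print(f"fp16 weights:     ~{fp16_weights_gb:.0f} GB")      # ~14 GB
print(f"4-bit weights:    ~{four_bit_weights_gb:.1f} GB")  # ~3.5 GB
```

Under these assumptions, full fine-tuning exceeds the memory of any single consumer GPU. In practice, single-GPU fine-tuning usually relies on parameter-efficient methods such as LoRA, which train a small set of added weights on top of a frozen and often quantized base model.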
Arguments for closed models. Open-weight frontier models can be fine-tuned to remove safety training, enabling misuse. The cost of training frontier models is so high that open-sourcing represents a massive subsidy to competitors. Safety measures (monitoring, rate limiting, content filtering) are easier to maintain in a controlled API.
The practical landscape: Meta releases Llama models openly and has catalyzed a large ecosystem. DeepSeek and Qwen (Chinese labs) release competitive open models. Mistral (European) occupies a middle ground. The open model ecosystem increasingly narrows the capability gap with closed models, particularly for tasks where fine-tuning on domain-specific data matters more than raw scale.
“Open weights” means the model parameters are publicly available. It does not mean the training data, training code, RLHF data, or evaluation framework are public. Running and fine-tuning a model from its weights alone is possible; reproducing its training, or fully understanding why it behaves as it does, requires the full training pipeline. True open-source AI, where the entire pipeline is public and reproducible, remains rare. The distinction matters: open weights enable deployment and fine-tuning, but not the deep understanding needed for safety research and accountability.
AI is accelerating scientific discovery across multiple fields, producing results that would have taken years or decades with traditional methods.
Protein structure. AlphaFold 2 (2020) achieved a breakthrough on the 50-year-old protein structure prediction problem, predicting 3D structures from amino acid sequences with accuracy approaching that of experimental methods for many proteins. AlphaFold 3 (2024) extended this to complexes of proteins with ligands, DNA, and RNA. The AlphaFold Protein Structure Database contains predicted structures for over 200 million proteins, covering nearly every catalogued protein sequence.
Materials science. GNoME (Google DeepMind, 2023) predicted 2.2 million new crystal structures, of which roughly 380,000 were identified as stable, expanding the number of known stable materials by about an order of magnitude. These predictions guide experimental synthesis of materials for batteries, solar cells, and superconductors.
Drug discovery. AI-designed drug candidates have entered clinical trials. The time from target identification to candidate molecule has compressed from years to months. AI systems predict drug-target interactions, optimize molecular properties, and identify potential side effects before synthesis.
Mathematics. AI systems have discovered new mathematical conjectures, identified patterns in mathematical data, and (with AlphaProof) solved competition-level problems with formal verification. The combination of neural intuition and formal proof is opening a new mode of mathematical discovery.
Several questions will shape the field over the next 2-3 years. Will new architectures or inference-time compute restore rapid improvement, or will progress plateau? Will AI agents mature enough to take multi-step actions in the real world autonomously? How large will the economic impact be as AI tools become ubiquitous in professional work? Will governments implement workable governance frameworks, or fail to? And as models become more capable, will alignment techniques keep pace? The answers will determine whether AI’s impact is primarily constructive, primarily disruptive, or some combination that nobody predicted.