The first generation of LLMs improved by training longer on more data — scaling training compute. Reasoning models represent a second scaling axis: spending more compute at inference time. Instead of producing an answer immediately, the model thinks step by step, sometimes for minutes, producing a chain of reasoning before committing to an answer.
Traditional LLMs generate an answer token by token, with one forward pass through the network per token and no separate deliberation phase. The compute spent per token is fixed regardless of problem difficulty: a trivial question and a competition-level math problem with answers of similar length receive roughly the same computational budget.
Reasoning models break this constraint. They allocate variable compute at inference time, spending more tokens (and more time) on harder problems. The key insight: for many problems, it is more efficient to invest additional compute at inference time than to scale the training further. A smaller model that reasons for 60 seconds can outperform a larger model that answers immediately.
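To make "variable compute" concrete, here is a rough back-of-the-envelope sketch. It uses the common rule of thumb that decoding costs about 2 FLOPs per parameter per generated token (this ignores attention over the growing context). The model sizes and token counts are hypothetical, chosen only for illustration.

```python
def inference_flops(n_params: float, n_tokens: int) -> float:
    """Approximate decoding cost: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params * n_tokens


# Hypothetical configurations, not measurements of any real model.
large_direct = inference_flops(n_params=70e9, n_tokens=200)       # answers immediately
small_reasoning = inference_flops(n_params=7e9, n_tokens=10_000)  # thinks before answering

ratio = small_reasoning / large_direct
print(f"large model, direct answer:  {large_direct:.2e} FLOPs")
print(f"small model, long reasoning: {small_reasoning:.2e} FLOPs")
print(f"small/large inference compute ratio: {ratio:.1f}x")
```

Under these assumptions the smaller model spends several times more inference compute on this one query than the larger model does, which is the trade the paragraph above describes: less training-time scale, more thinking per hard problem.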
Training compute is spent once. Inference compute is spent on every query. But for difficult reasoning tasks, the marginal value of additional inference compute can exceed the marginal value of additional training compute. This creates a new optimization surface: for a fixed total budget, what is the optimal split between training a better base model and giving that model more time to think at inference?
The answer depends on the query distribution. If most queries are easy, invest in training. If queries are concentrated in the hard tail (mathematics, competitive programming, complex analysis), invest in inference compute.
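The dependence on the query mix can be illustrated with a toy model. Everything here is an assumption made for the sketch: quality gains are modeled as `log(1 + C)` simply because that function is increasing with diminishing returns, every query benefits from training compute, and only hard queries benefit from extra thinking.

```python
import math


def mean_quality(train_share: float, hard_fraction: float, budget: float = 100.0) -> float:
    """Toy average quality for one split of a fixed compute budget.

    Assumption: all queries gain from training compute; only the hard
    fraction gains from inference-time thinking.
    """
    c_train = train_share * budget
    c_infer = (1.0 - train_share) * budget
    return math.log1p(c_train) + hard_fraction * math.log1p(c_infer)


def best_inference_share(hard_fraction: float, grid: int = 1000) -> float:
    """Grid-search the share of the budget that should go to inference."""
    shares = [i / grid for i in range(grid + 1)]
    best_train = max(shares, key=lambda s: mean_quality(s, hard_fraction))
    return 1.0 - best_train


for h in (0.05, 0.25, 0.50, 0.90):
    print(f"hard fraction {h:.2f} -> inference share {best_inference_share(h):.2f}")
```

In this toy setting the optimal inference share grows as the hard fraction grows, matching the qualitative claim above. The specific numbers carry no meaning beyond that trend.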
Chain-of-thought reasoning was initially a prompting technique ("think step by step"). Reasoning models internalize this pattern. They generate extended reasoning traces not because they are prompted to, but because training (typically reinforcement learning on problems with checkable answers) has rewarded step-by-step reasoning that leads to better final answers.
The reasoning trace serves as working memory. Each intermediate step conditions the generation of the next step. The model can check its own work, explore alternatives, backtrack from dead ends, and build up to a conclusion through a sequence of simpler operations.
Reasoning traces in models like o1 and Claude’s extended thinking can run to thousands of tokens, often far more than a human would write down to solve the same problem. This apparent verbosity serves a purpose: the model’s per-step reasoning is less reliable than a human’s, so it compensates by taking smaller steps and checking more frequently. The optimal trace length is problem-dependent: overthinking wastes tokens on easy problems, and underthinking fails on hard ones. Some models are trained to adapt trace length to problem difficulty, and some APIs let the caller cap it with an explicit thinking budget.
The reasoning trace is a sequence of natural language statements leading from the problem to the answer. Unlike a neural network’s internal activations, the trace is human-readable and, where the provider exposes it, auditable. This opens the possibility of verification: checking whether each step in the trace follows logically from the previous steps.
Process reward models (PRMs) evaluate the correctness of each reasoning step, not just the final answer. A PRM is trained on labeled reasoning traces where each step is annotated as correct or incorrect. During inference, the PRM scores intermediate steps, allowing the system to detect and discard reasoning paths that go wrong early.
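A minimal sketch of step-level scoring follows. A real PRM is a trained model. Here `toy_prm` is a hypothetical stand-in that only checks simple arithmetic steps of the form `a op b = c`, and `first_bad_step` and the threshold are likewise illustrative names, not part of any library.

```python
import operator
from typing import Callable, List, Optional

# (previous steps, candidate step) -> score in [0, 1]
StepScorer = Callable[[List[str], str], float]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_prm(previous: List[str], step: str) -> float:
    """Stand-in for a learned PRM: 1.0 if the arithmetic in the step is right."""
    a, op, b, _, claimed = step.split()
    return 1.0 if OPS[op](int(a), int(b)) == int(claimed) else 0.0


def first_bad_step(steps: List[str], scorer: StepScorer, threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below threshold, or None."""
    for i, step in enumerate(steps):
        if scorer(steps[:i], step) < threshold:
            return i
    return None


trace = ["12 * 3 = 36", "36 + 5 = 42", "42 - 2 = 40"]
bad = first_bad_step(trace, toy_prm)
print(f"first low-scoring step index: {bad}")
```

Note that the third step is locally valid but builds on the wrong value from the second. Flagging the path at the first bad step is what lets a system discard it before spending more tokens on it.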
An outcome reward model (ORM) scores only the final answer — right or wrong. A process reward model (PRM) scores each step. PRMs provide denser training signal (every step is a training example, not just the final answer) and enable step-level search: the model can generate multiple candidate next steps, score each with the PRM, and continue from the best one. This is analogous to AlphaGo’s tree search — evaluating intermediate positions rather than only the final game result.
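The step-level search described above can be sketched as a greedy loop: propose several candidate next steps, score each, keep the best, repeat. This is a simplified illustration, not any lab's actual search procedure. `propose`, `toy_prm`, and the toy summation task are all hypothetical stand-ins for a language model and a trained PRM.

```python
from typing import Callable, List

Trace = List[str]


def step_level_search(
    propose: Callable[[Trace], List[str]],
    score: Callable[[Trace, str], float],
    is_done: Callable[[Trace], bool],
    max_steps: int = 10,
) -> Trace:
    """Greedy search: at each step, keep the candidate the scorer rates highest."""
    trace: Trace = []
    for _ in range(max_steps):
        if is_done(trace):
            break
        candidates = propose(trace)
        if not candidates:
            break
        trace.append(max(candidates, key=lambda step: score(trace, step)))
    return trace


# Toy task: add up NUMBERS one at a time.
NUMBERS = [4, 7, 9]


def propose(trace: Trace) -> List[str]:
    """Stand-in for sampling from a model: one correct and two wrong candidates."""
    k = len(trace)
    running = sum(NUMBERS[:k])
    correct = running + NUMBERS[k]
    return [f"{running} + {NUMBERS[k]} = {correct + delta}" for delta in (1, 0, -2)]


def toy_prm(trace: Trace, step: str) -> float:
    """Stand-in for a learned PRM: checks the addition in the candidate step."""
    a, _, b, _, claimed = step.split()
    return 1.0 if int(a) + int(b) == int(claimed) else 0.0


def is_done(trace: Trace) -> bool:
    return len(trace) == len(NUMBERS)


result = step_level_search(propose, toy_prm, is_done)
print(result)
```

A beam search would keep the top few partial traces at each step instead of only one, trading more compute for robustness to scorer mistakes.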
Research from OpenAI (Lightman et al., 2023, "Let's Verify Step by Step") and others reports that PRMs outperform ORMs on multi-step mathematical reasoning, but training them requires expensive step-level human annotations.
OpenAI’s o1 (2024) was the first widely deployed reasoning model. It generates an internal chain of thought before producing a visible response, spending seconds to minutes “thinking” on hard problems. On benchmarks requiring multi-step reasoning — competition mathematics, PhD-level science questions, complex coding — o1 substantially outperformed prior models.
Claude’s extended thinking provides a similar capability: the model generates a reasoning trace (visible to the developer) before committing to a response. The thinking tokens are billed as output tokens, making the cost-quality tradeoff explicit.
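Because thinking tokens are billed as output tokens, the cost of a response scales with how long the model thinks. The sketch below shows the arithmetic. The price is a placeholder rather than a published rate, the token counts are made up, and input-token cost is ignored for simplicity.

```python
def response_cost_usd(
    visible_tokens: int,
    thinking_tokens: int,
    usd_per_million_output: float,
) -> float:
    """Output-side cost when thinking tokens are billed like visible output tokens."""
    billed_output = visible_tokens + thinking_tokens
    return billed_output * usd_per_million_output / 1_000_000


PRICE = 15.0  # hypothetical USD per million output tokens

no_thinking = response_cost_usd(visible_tokens=500, thinking_tokens=0, usd_per_million_output=PRICE)
with_thinking = response_cost_usd(visible_tokens=500, thinking_tokens=8_000, usd_per_million_output=PRICE)

print(f"without thinking: ${no_thinking:.4f}")
print(f"with thinking:    ${with_thinking:.4f}")
print(f"multiplier:       {with_thinking / no_thinking:.0f}x")
```

With these illustrative numbers, the same visible answer costs many times more when preceded by a long reasoning trace, which is why a thinking budget is worth setting per use case.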
o3 (announced in December 2024) pushed further, achieving high scores on the ARC-AGI benchmark (a test designed to be easy for humans and hard for AI systems) through massive inference-time compute, reportedly spending thousands of dollars of compute per task at the highest setting.
Reasoning models improve on problems where the answer can be reached through a chain of verifiable steps: mathematics, formal logic, code, structured analysis. They improve less on tasks requiring broad world knowledge, creative generation, or nuanced judgment. The reasoning trace is only as good as the model’s per-step reliability — if the model makes a subtle logical error in step 3 of a 20-step proof, the remaining steps may propagate the error confidently. Formal verification (checking the proof with a theorem prover) remains necessary for high-stakes mathematical and logical reasoning.
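The sensitivity to per-step reliability can be quantified with a simple model. Assume, purely for illustration, that each step is correct independently with the same probability and that the model never catches its own errors. The chance that the whole chain is correct is then that probability raised to the number of steps.

```python
def chain_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that all steps are correct, assuming independent errors
    and no self-correction (a deliberately pessimistic simplification)."""
    return per_step_accuracy ** n_steps


for p in (0.99, 0.95, 0.90):
    print(f"per-step accuracy {p:.2f} -> 20-step chain correct with "
          f"probability {chain_success_probability(p, 20):.2f}")
```

Even 99% per-step accuracy leaves roughly a one-in-five chance of an error somewhere in a 20-step argument under this model. Real traces do better when the model checks and backtracks, but the compounding effect is why external verification matters for long proofs.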
The reasoning model paradigm introduces a new dimension to the scaling discussion. Traditional scaling laws (Chinchilla) describe how loss decreases with training compute. Inference scaling laws describe how answer quality increases with inference compute (measured in thinking tokens).
Early results suggest that inference scaling has a favorable curve for reasoning-heavy tasks: doubling inference compute can improve performance more than doubling training compute at the margin, especially for models that are already well-trained. This has economic implications — a lab can improve performance by giving users the option to “think longer” rather than training a new, larger model.
$$\text{Performance}(C_{\text{train}}, C_{\text{infer}}) = f(C_{\text{train}}) + g(C_{\text{infer}})$$
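where $$f$$ and $$g$$ are increasing functions with diminishing returns. The additive form is a stylized simplification (in practice the two terms likely interact, since a stronger base model tends to make better use of its thinking tokens), and the optimal allocation depends on the cost ratio and task difficulty distribution.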
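As a worked illustration, suppose $$f(C) = a \log C$$ and $$g(C) = b \log C$$ with $$a, b > 0$$, and a fixed total budget $$B = C_{\text{train}} + C_{\text{infer}}$$. The logarithmic forms and the single shared budget are assumptions made for this example, not measured scaling laws. Maximizing performance under the budget constraint means equalizing marginal returns:

$$\frac{a}{C_{\text{train}}} = \frac{b}{C_{\text{infer}}} \quad\Rightarrow\quad C_{\text{train}}^{*} = \frac{a}{a+b}\,B, \qquad C_{\text{infer}}^{*} = \frac{b}{a+b}\,B$$

Under these assumptions the budget splits in proportion to $$a : b$$. A workload whose tasks benefit more from thinking (larger $$b$$ relative to $$a$$) shifts the optimum toward inference compute.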
Training a frontier model is reported to cost tens to hundreds of millions of dollars and takes months. Inference compute is pay-per-query. The test-time compute paradigm shifts some of the cost from a fixed upfront investment to a variable per-query cost. This changes the economics of AI deployment: instead of one-size-fits-all, providers can offer tiered service, with fast, cheap responses for easy queries and slow, expensive deep reasoning for hard ones. Users pay for the compute they need.
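Tiered service can be sketched as a router that maps an estimated difficulty to a thinking budget. The tier names, budgets, and thresholds below are hypothetical, and the difficulty score is assumed to come from some upstream estimator that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    thinking_budget_tokens: int  # maximum thinking tokens allowed for a query


# Hypothetical tiers, not any provider's actual offering.
FAST = Tier("fast", 0)
STANDARD = Tier("standard", 2_000)
DEEP = Tier("deep", 32_000)


def choose_tier(difficulty: float) -> Tier:
    """Map a difficulty estimate in [0, 1] to a service tier.

    The thresholds are illustrative and would need tuning against real traffic.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if difficulty < 0.3:
        return FAST
    if difficulty < 0.7:
        return STANDARD
    return DEEP


for d in (0.1, 0.5, 0.95):
    tier = choose_tier(d)
    print(f"difficulty {d:.2f} -> {tier.name} ({tier.thinking_budget_tokens} thinking tokens)")
```

The design choice here is to cap compute per tier instead of letting every query think freely, which keeps per-query cost predictable for both the provider and the user.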