LLM Pretraining

The core objective is deceptively simple: predict the next token. Every capability of a modern LLM — translation, reasoning, code generation, summarization — emerges from training a model to do this one thing over trillions of tokens.

Next-Token Prediction

Given a sequence of tokens, the model outputs a probability distribution over the vocabulary for the next token. Training minimizes the cross-entropy loss between the predicted distribution and the actual next token across the entire training corpus.
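
Written out, the objective for a single sequence looks like the following. The symbols are introduced here for illustration and are not part of the original text: $$x_1, \dots, x_T$$ are the tokens, $$x_{<t}$$ is the context before position $$t$$, and $$p_\theta$$ is the model's predicted distribution under parameters $$\theta$$.

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

The loss is low when the model assigns high probability to the token that actually came next, and it is averaged over every position in the corpus.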

This objective is self-supervised — no human labels required. The training data itself provides the supervision signal. Every token in every document becomes a training example: the preceding context is the input, the next token is the label.
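
A minimal sketch of how one token sequence turns into training examples and a loss value. The `uniform_model` function is a hypothetical stand-in for a real network, and the four-token vocabulary is an assumption chosen to keep the numbers readable.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_examples(tokens):
    """Every position yields one example: (preceding context, next token)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def next_token_loss(tokens, model):
    """Mean cross-entropy of the model's predictions over one sequence."""
    examples = make_examples(tokens)
    total = 0.0
    for context, target in examples:
        probs = softmax(model(context))
        total += -math.log(probs[target])  # penalize low probability on the true token
    return total / len(examples)

VOCAB_SIZE = 4  # assumed toy vocabulary

def uniform_model(context):
    """Hypothetical stand-in for a network: the same score for every token."""
    return [0.0] * VOCAB_SIZE

tokens = [2, 0, 3, 1, 2]
examples = make_examples(tokens)
loss = next_token_loss(tokens, uniform_model)
```

A model that knows nothing scores a loss of log(vocabulary size) here. Training pushes the loss below that baseline by shifting probability toward the tokens that actually follow each context.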

Why Next-Token Prediction Works

The argument, articulated by Ilya Sutskever and others, is that truly accurate next-token prediction requires world knowledge. Predicting the next word in a physics textbook requires understanding physics. Predicting the next token in code requires understanding programming. Compressing language well enough to predict it forces the model to build internal representations of the concepts the language describes.

This is not memorization. The model must generalize across contexts — the same concept appears in different phrasings, different documents, different languages. The loss function rewards models that extract the underlying structure, not those that memorize surface patterns.

Pretraining Data

Modern LLMs train on datasets measured in trillions of tokens. The primary source is Common Crawl — a regularly updated archive of the public web. But raw web data is noisy: duplicated pages, spam, machine-generated text, low-quality content.

Quality filtering removes pages with low text-to-HTML ratios, excessive boilerplate, or content below a language quality classifier’s threshold. Deduplication removes near-identical documents to prevent the model from memorizing repeated content and to reduce training compute waste. Domain mixing blends web data with curated sources — books, academic papers, code repositories, Wikipedia — to improve coverage of high-quality reasoning.
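
The filtering and deduplication steps can be sketched roughly as follows. This is a toy illustration, not a production pipeline: the repetition heuristic, the thresholds, and every function name are assumptions, and real systems typically replace the pairwise comparison with scalable techniques such as MinHash.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def passes_quality_filter(text, min_words=5, max_repeat_ratio=0.5):
    """Toy heuristic: reject very short or highly repetitive documents."""
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    repeat_ratio = 1.0 - len(set(words)) / len(words)
    return repeat_ratio <= max_repeat_ratio

def shingles(text, n=3):
    """Set of overlapping n-word windows used to compare documents."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def clean_corpus(docs, near_dup_threshold=0.6):
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "buy buy buy buy buy buy buy now",
    "Gradient descent updates parameters in the direction that reduces the loss",
]
cleaned = clean_corpus(docs)
```

Of the five sample documents, only the first and last survive: the second is an exact duplicate once normalized, the third is a near-duplicate, and the fourth fails the repetition check.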

The Pretraining Data Wall

By 2024, frontier labs recognized a problem: high-quality text data on the internet is finite. Estimates suggest the total stock of quality text is in the low tens of trillions of tokens. Models are approaching the point where they have seen most of what is worth learning from public text. This “data wall” is driving interest in synthetic data generation (using models to generate training data for other models), multi-epoch training, and multimodal data as alternative scaling paths.

Scaling Laws

In 2020, researchers at OpenAI (Kaplan et al.) published scaling laws showing that model performance (measured by loss) follows a predictable power law as three variables increase: model parameters, dataset size, and training compute. Double the compute, and loss decreases by a predictable factor. This relationship held over several orders of magnitude.

The Chinchilla paper (2022, DeepMind) refined these laws. The key finding: most large models of the time were over-parameterized and under-trained. For a fixed compute budget, the optimal strategy scales model size and training tokens in roughly equal proportion, rather than spending most of the extra compute on parameters. A 70B parameter model trained on 1.4 trillion tokens outperforms a 175B model trained on 300 billion tokens, despite being less than half the size.

Chinchilla Scaling in Practice

The Chinchilla ratio (roughly 20 tokens per parameter) became the baseline for compute-optimal training. But inference cost matters too. Llama and similar models deliberately train smaller models on far more tokens than the Chinchilla optimum, because a smaller model is cheaper to serve at scale, even though reaching a given loss this way takes more training compute than the compute-optimal configuration would. The “optimal” ratio depends on whether the objective is training cost or total lifetime cost including inference.
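
A back-of-the-envelope sketch of the 20-tokens-per-parameter rule. It relies on the common approximation that training costs about 6 FLOPs per parameter per token, which is an assumption brought in here rather than something stated above. The function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6  # assumed approximation: C ~ 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops_budget):
    """Solve C = 6 * N * (20 * N) for N, then set D = 20 * N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Compute used by a 70B model on 1.4T tokens, under the 6ND approximation.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(budget)

# Tokens seen per parameter for the two configurations named in the text.
ratio_70b = 1.4e12 / 70e9
ratio_175b = 300e9 / 175e9
```

Under these assumptions the 70B configuration sits at the 20:1 ratio, while the 175B configuration sees fewer than 2 tokens per parameter. A model trained well past 20:1 gives up some training efficiency in exchange for a smaller model to serve.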

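$$L(N) \propto N^{-\alpha}, \qquad L(D) \propto D^{-\beta}$$

where $$N$$ is parameter count, $$D$$ is dataset size in tokens, and $$\alpha, \beta$$ are empirically fitted exponents (approximately 0.076 and 0.095 respectively in the original Kaplan et al. formulation). Each law holds when the other quantity is not the bottleneck. The Chinchilla paper instead fits a combined form, $$L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$$, with an irreducible loss term $$E$$ and noticeably larger fitted exponents (roughly 0.34 and 0.28).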
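
To get a feel for what these exponents mean, the sketch below treats each quantity as a single-variable power law, assuming the other factor is not the bottleneck, and computes how much the loss shrinks when that quantity is scaled up. The function name is illustrative.

```python
ALPHA_N = 0.076  # parameter-count exponent quoted in the text
BETA_D = 0.095   # dataset-size exponent quoted in the text

def loss_ratio(scale_factor, exponent):
    """If L is proportional to X**(-exponent), scaling X multiplies L by this."""
    return scale_factor ** (-exponent)

double_params = loss_ratio(2, ALPHA_N)
double_data = loss_ratio(2, BETA_D)
tenx_params = loss_ratio(10, ALPHA_N)

print(f"2x parameters  -> loss multiplied by {double_params:.3f}")
print(f"2x data        -> loss multiplied by {double_data:.3f}")
print(f"10x parameters -> loss multiplied by {tenx_params:.3f}")
```

With exponents this small, each doubling trims the loss by only a few percent. That is why meaningful gains require scaling by orders of magnitude rather than by small multiples.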

Emergent Abilities at Scale

Some capabilities appear suddenly as models cross parameter thresholds. Small models cannot do multi-step arithmetic, chain-of-thought reasoning, or in-context learning. Scale the same architecture to sufficient size and these abilities appear without being explicitly trained.

This observation drove the scaling hypothesis: intelligence is a function of scale. Pour more compute, data, and parameters into the same architecture, and qualitatively new capabilities emerge. Whether these are truly emergent or simply cross detection thresholds at scale is debated, but the practical effect is the same — larger models can do things smaller models cannot.

The Emergent Abilities Debate

A 2023 paper by Schaeffer et al. argued that “emergent abilities” are a mirage of metric choice. Smooth, continuous improvement on a log scale looks like a sudden jump when measured with a sharp threshold metric (e.g., exact-match accuracy). The underlying capability may be improving gradually — it just crosses the “correct answer” threshold at a particular scale. This does not change the practical observation that only large models can perform complex reasoning, but it challenges the narrative of phase transitions in capability.
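
The metric argument can be illustrated with a toy calculation. Suppose per-token accuracy improves smoothly with scale, and an answer counts as correct only if every one of its tokens is right. The ten-token answer length and the assumption of independent per-token errors are simplifications made for this sketch.

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that all tokens are right, assuming independent errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Smoothly improving per-token accuracy, standing in for increasing scale.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
scores = [exact_match(p, ANSWER_LENGTH) for p in per_token]

for p, score in zip(per_token, scores):
    print(f"per-token {p:.2f} -> exact-match {score:.3f}")
```

Exact-match accuracy stays close to zero for the first several steps and then climbs steeply, even though the underlying per-token accuracy rises steadily throughout.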

Training Infrastructure

Pretraining a frontier model requires thousands of GPUs running for weeks to months. The compute cost of training GPT-4-class models is estimated in the tens to hundreds of millions of dollars.

Data parallelism replicates the model across devices, each processing a different batch, and synchronizes gradients so that every replica applies the same update. Model parallelism (tensor parallelism and pipeline parallelism) splits the model itself across devices when it is too large to fit in a single GPU’s memory. Mixed precision training uses 16-bit or 8-bit floating point for most operations and 32-bit only where numerical stability requires it, which reduces memory traffic and can roughly double throughput or better on hardware with dedicated low-precision units.
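
A minimal single-process sketch of the data-parallel idea, using a one-parameter linear model. The `all_reduce_mean` function is a stand-in for the collective operation a real framework would run across devices, and the even split of the batch is an assumption.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across devices."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, batch, num_workers, lr):
    """One update in which each worker sees an equal shard of the batch."""
    shard_size = len(batch) // num_workers  # assumes the batch splits evenly
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # On real hardware these local gradients are computed in parallel.
    local_grads = [gradient(w, shard) for shard in shards]
    synced = all_reduce_mean(local_grads)
    # Every replica applies the same averaged gradient, so weights stay identical.
    return [w - lr * g for g in synced]

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w0 = 0.0
replica_weights = data_parallel_step(w0, batch, num_workers=2, lr=0.01)
single_device_weight = w0 - 0.01 * gradient(w0, batch)
```

With equal-sized shards, averaging the per-worker gradients gives the same update as processing the whole batch on one device, so data parallelism changes the throughput but not the math.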

The Hardware Bottleneck

At frontier scale, training is often bottlenecked by interconnect bandwidth rather than raw compute. GPUs can spend significant time waiting for gradient synchronization across the network. NVIDIA’s NVLink, InfiniBand fabrics, and the interconnects inside purpose-built systems (Google’s TPU pods, Meta’s Grand Teton-based clusters) are designed to minimize this communication overhead. The limiting factor for frontier model training is increasingly the ability to build and cool clusters of tens of thousands of GPUs with sufficient interconnect bandwidth, not the availability of the GPUs themselves.
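
A rough estimate shows why link speed matters. The sketch below uses the standard bandwidth cost of a ring all-reduce, in which each device sends about 2(p-1)/p times the payload. The model size, gradient precision, and link speeds are hypothetical round numbers. Latency, overlap with computation, and gradient sharding are all ignored.

```python
def ring_all_reduce_seconds(num_devices, payload_bytes, link_bytes_per_sec):
    """Bandwidth-only estimate: each device sends 2*(p-1)/p of the payload."""
    traffic = 2 * (num_devices - 1) / num_devices * payload_bytes
    return traffic / link_bytes_per_sec

# Hypothetical setup: 70B parameters with 2-byte gradients across 8 devices.
params = 70e9
bytes_per_gradient = 2
payload = params * bytes_per_gradient

slow_link = ring_all_reduce_seconds(8, payload, 50e9)   # assumed 50 GB/s per link
fast_link = ring_all_reduce_seconds(8, payload, 400e9)  # assumed 400 GB/s per link

print(f"slow link: {slow_link:.2f} s per synchronization")
print(f"fast link: {fast_link:.2f} s per synchronization")
```

If a training step's computation takes less time than the synchronization, the devices sit idle unless communication overlaps with computation or the link gets faster.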

Key Takeaways

This lesson establishes:

Why next-token prediction produces general-purpose language capabilities
The pretraining data pipeline: sourcing, filtering, deduplication, domain mixing
The Chinchilla finding and why it changed how labs allocate compute budgets
The distinctions among data parallelism, model parallelism, and mixed precision training
The pretraining data wall and why it matters for future scaling

Next: Fine-Tuning and Alignment