The transformer is the architecture behind every major language model — GPT, BERT, T5, LLaMA, Claude. It processes sequences using only attention and feed-forward layers, with no recurrence and no convolution. This design is simple enough to describe in a few pages, flexible enough to handle language, vision, audio, and code, and scalable enough to train models with hundreds of billions of parameters.
The original transformer has two stacks: an encoder and a decoder. Each stack is a series of identical blocks (layers): 6 per stack in the original model, and up to 96 or more in large modern models.
Encoder block: self-attention layer, then a feed-forward network. The self-attention lets each position attend to all positions in the input. The feed-forward network processes each position independently. Both are wrapped in residual connections and layer normalization.
Decoder block: masked self-attention layer, then a cross-attention layer (attending to the encoder output), then a feed-forward network. The mask in self-attention prevents future positions from being attended to — the model can only look at tokens it has already generated.
The encoder processes the entire input in parallel. The decoder generates output one token at a time, using both the encoder’s representation and its own previously generated tokens.
The encoder-decoder split serves a specific purpose: the encoder builds a rich representation of the input with full bidirectional attention (every position sees every other position). The decoder generates output autoregressively with causal attention (each position sees only previous positions). This separation makes sense for tasks like translation where the input is fully known but the output is generated sequentially. For other tasks, a single stack suffices — hence encoder-only and decoder-only variants.
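The difference between bidirectional and causal attention comes down to which entries of the attention score matrix are allowed to receive weight. The following is a minimal sketch in plain Python, assuming the raw scores have already been computed; the function names are illustrative and not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row-wise softmax of an n x n score matrix.

    With causal=True, position i may only attend to positions j <= i.
    Masked entries are set to -inf, so they receive exactly zero weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Uniform scores make the masking pattern easy to see.
scores = [[0.0] * 3 for _ in range(3)]
encoder_style = attention_weights(scores, causal=False)  # every row spreads over all 3 positions
decoder_style = attention_weights(scores, causal=True)   # row i spreads over positions 0..i only
```

With uniform scores, each encoder-style row assigns one third to every position, while the first decoder-style row puts all of its weight on position 0 and the second splits evenly between positions 0 and 1.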
Neural networks are sensitive to the scale of their inputs. If activations grow too large or shrink too small across layers, training becomes unstable. Layer normalization stabilizes training by normalizing the activations within each layer to have zero mean and unit variance.
For each token’s representation vector, layer normalization computes the mean and standard deviation across the vector’s dimensions, subtracts the mean, divides by the standard deviation, and applies learned scale and shift parameters. This is done independently for each token at each layer.
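The computation described above can be sketched for a single token vector as follows. This is a plain-Python illustration, not a library implementation; the parameter names gamma (scale), beta (shift), and eps (a small constant that avoids division by zero) are assumptions chosen for readability.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector across its feature dimensions.

    x, gamma, and beta are lists of equal length. gamma and beta are the
    learned scale and shift; eps is an assumed small stabilizing constant.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

token = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With gamma set to ones and beta to zeros, the output has mean zero and variance very close to one. In a real model the same function is applied separately to every token in the sequence.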
Layer normalization allows higher learning rates (faster training), reduces sensitivity to parameter initialization, and makes very deep networks trainable. Without some form of normalization, training a 96-layer transformer would be far less stable.
The original transformer applied layer normalization after the residual connection (post-norm). Most modern models apply it before the attention or feed-forward layer (pre-norm). Pre-norm is more stable during training because the residual pathway itself is never normalized, which leaves an identity path from input to output that gradients can follow unchanged, and it is now the default. RMSNorm (root mean square normalization) is a further simplification that skips the mean-centering step and the shift parameter, reducing computation with little or no reported quality loss. LLaMA uses RMSNorm; GPT-3 uses pre-norm with standard layer normalization.
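The two orderings and the RMSNorm simplification can be sketched as follows. This is an illustrative plain-Python version operating on one token vector; the function names, and the choice to pass the sub-layer and the normalization in as callables, are assumptions made for the example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide x by its root mean square, then apply a learned scale.

    Unlike layer normalization there is no mean subtraction and no shift.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def post_norm_step(x, sublayer, norm):
    """Original ordering: add the residual first, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer, norm):
    """Modern ordering: normalize only the sub-layer input.

    The residual path (x itself) is added back without being normalized.
    """
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

# A sub-layer that outputs zeros makes the difference visible:
norm = lambda v: rms_norm(v, [1.0] * len(v))
zero_sublayer = lambda v: [0.0] * len(v)
x = [3.0, 4.0]
pre = pre_norm_step(x, zero_sublayer, norm)    # x passes through untouched
post = post_norm_step(x, zero_sublayer, norm)  # x is rescaled by the norm
```

In the pre-norm step the input reaches the output unchanged when the sub-layer contributes nothing, whereas in the post-norm step every block rescales the residual path.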
Each sub-layer (attention, feed-forward) in a transformer block is wrapped in a residual connection: the input to the sub-layer is added to its output. If the sub-layer’s function is F, the output is x + F(x) instead of just F(x).
Residual connections address the degradation problem in deep networks. Without them, adding more layers can make the network perform worse, because deeper networks are harder to optimize. With residual connections, a layer can easily represent the identity mapping (F(x) = 0, so the output is just x), so an added layer need not make things worse. The gradient also flows directly through the addition, which mitigates vanishing gradients across many layers.
In a 96-layer transformer, the input has a direct additive path to the output through all 96 residual connections. This “gradient highway” is what makes extreme depth feasible.
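A toy calculation shows why the additive path matters. The sketch below replaces each sub-layer with a scalar function F(x) = w * x, which is a deliberate simplification (real sub-layers are nonlinear and operate on vectors), and compares the gradient of the output with respect to the input with and without skip connections.

```python
def forward(x, weights, residual):
    """Apply a stack of scalar 'layers' F(x) = w * x, optionally with skip connections."""
    for w in weights:
        x = x + w * x if residual else w * x
    return x

def input_gradient(weights, residual):
    """d(output)/d(input) for the stack above, by the chain rule.

    Each plain layer contributes a factor w; each residual layer
    contributes a factor (1 + w) because of the identity path.
    """
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad

weights = [0.01] * 96  # 96 layers, each with a small assumed weight
plain = input_gradient(weights, residual=False)   # product of 96 small factors
skip = input_gradient(weights, residual=True)     # product of 96 factors near 1
```

With these assumed weights the plain stack's gradient is on the order of 1e-192, effectively zero, while the residual stack's gradient is about 2.6. Setting every weight to zero turns the residual stack into an exact identity function.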
A useful mental model from interpretability research: think of the transformer as a residual stream. The input embedding enters the stream, and each attention or feed-forward layer reads from the stream, processes what it reads, and writes its output back by addition. The stream carries information forward through the entire network. Each layer contributes incrementally to the final representation rather than completely transforming it. This view clarifies why individual layers can be removed from trained models with surprisingly small impact — each layer’s contribution is additive and somewhat redundant.
Each transformer block contains a position-wise feed-forward network (FFN) — applied independently to each token position. It is a simple two-layer neural network: project up to a larger dimension (typically 4x the model dimension), apply an activation function, project back down.
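The up-project, activate, down-project structure can be sketched in a few lines. This plain-Python version uses ReLU as the activation and a small random initialization; the helper names, the initialization scale, and the seed are assumptions for illustration only.

```python
import random

def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w_up, b_up, w_down, b_down):
    """Position-wise FFN: project up, apply the activation, project back down."""
    return linear(relu(linear(x, w_up, b_up)), w_down, b_down)

def init_ffn(d_model, expansion=4, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    rng = random.Random(seed)
    d_ff = expansion * d_model
    w_up = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_ff)]
    w_down = [[rng.gauss(0, 0.02) for _ in range(d_ff)] for _ in range(d_model)]
    return w_up, [0.0] * d_ff, w_down, [0.0] * d_model

params = init_ffn(d_model=8)
sequence = [[0.1 * i] * 8 for i in range(3)]              # three token vectors
outputs = [feed_forward(tok, *params) for tok in sequence]  # each token handled independently
```

The list comprehension on the last line is the point of "position-wise": the same weights are applied to every token, and no token's output depends on any other token.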
The attention layer handles inter-token interactions — which tokens relate to which. The FFN handles per-token processing — transforming each token’s representation independently. Research suggests that FFNs act as key-value memories: the first layer’s weights serve as keys matching input patterns, and the second layer’s weights serve as values providing the associated information.
The FFN is the most parameter-heavy component of a transformer. Mixture of Experts (MoE) replaces the single FFN with several “expert” FFNs and a routing network that selects which experts to use for each token. Only a subset of experts (typically 2 out of 8 or 16) activate per token, so the model has many more total parameters while per-token compute stays close to that of a dense model the size of the active subset. Mixtral 8x7B uses this approach: it has about 47 billion total parameters but activates only about 13 billion per token, achieving performance comparable to much larger dense models. MoE is a key technique for scaling parameter count without proportionally scaling compute.
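Top-k routing can be sketched as follows. This is a simplified illustration: in a real model the router scores come from a learned linear layer applied to the token, whereas here they are passed in directly, and renormalizing with a softmax over only the selected scores is one common choice rather than the only one. All names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))

def moe_forward(x, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where
    the compute saving comes from.
    """
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Eight toy experts; expert i simply scales its input by i.
experts = [lambda x, i=i: [i * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 1.0]
y = moe_forward([1.0, 2.0], logits, experts, k=2)
```

Here experts 1 and 3 tie for the highest score, so each receives a gate of 0.5 and the other six experts are skipped entirely.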
The original transformer used both stacks. Subsequent work discovered that a single stack often suffices.
Encoder-only (BERT): processes the full input with bidirectional self-attention. Every token can attend to every other token. Produces rich contextual representations. Used for classification, named entity recognition, semantic similarity. Not generative — cannot produce new text.
Decoder-only (GPT, LLaMA, Claude): generates text autoregressively. Each token can attend only to previous tokens (causal masking). Trained by predicting the next token. This is the dominant architecture for modern LLMs. Its simplicity — a single stack, a single training objective — makes scaling straightforward.
Encoder-decoder (T5, BART): the original design. The encoder processes the input bidirectionally; the decoder generates the output autoregressively, attending to the encoder’s representation. Used for translation, summarization, and tasks with distinct input and output sequences.
Decoder-only models dominate modern LLMs for several reasons. The next-token prediction objective is simple and scales to arbitrary data — any text is training data. The architecture is simpler to implement and optimize. And with enough scale, decoder-only models can perform the tasks that encoder-only and encoder-decoder models specialize in, by formulating those tasks as text generation. Classification becomes “generate the label.” Translation becomes “generate the translation.” Summarization becomes “generate the summary.” The generality of text generation subsumes specialized architectures.
The transformer’s dominance is not just architectural elegance — it is the ability to convert compute into capability predictably.
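Parallelism: self-attention processes all positions simultaneously. An RNN must take one sequential step per token, so its training time grows with sequence length no matter how much hardware is available. Attention instead costs compute and memory that grow quadratically with sequence length, but that work has no sequential dependency across positions and parallelizes across GPU cores. This means transformers can efficiently use thousands of GPUs.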
Hardware utilization: the core operations (matrix multiplications for attention and FFN) map directly onto GPU and TPU architectures, which are optimized for exactly these operations. Well-tuned transformer training can sustain a large fraction of the hardware’s theoretical peak throughput.
Scaling laws: Kaplan et al. (2020) demonstrated that transformer loss falls predictably with more parameters, more data, and more compute, following a power law. This means the performance of a larger model can be estimated before training it. This predictability drives investment: if each doubling of compute reliably lowers the loss by a known factor, rational decisions can be made about billion-dollar training runs.
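Under a power law of the form L(C) = a * C^(-alpha), every doubling of compute multiplies the loss by the same constant factor, 2^(-alpha). The sketch below fits such a law to a few small runs by linear regression in log-log space and extrapolates to a larger budget. The data points are synthetic, and the values a = 10 and alpha = 0.05 are assumptions chosen for illustration, not measured results.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - alpha * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    alpha = -slope
    a = math.exp(my + alpha * mx)  # intercept of the fitted line, back in linear space
    return a, alpha

def predict_loss(a, alpha, compute):
    return a * compute ** (-alpha)

# Synthetic "small runs" generated from an assumed law with a=10, alpha=0.05.
small_runs = [1e3, 1e4, 1e5, 1e6]
losses = [10.0 * c ** -0.05 for c in small_runs]

a, alpha = fit_power_law(small_runs, losses)
extrapolated = predict_loss(a, alpha, 1e9)   # prediction for a far larger budget
per_doubling = 2 ** (-alpha)                 # loss multiplier for each 2x of compute
```

With alpha = 0.05 the per-doubling factor is about 0.966, so each doubling of compute lowers the loss by roughly 3.4 percent. The gain per doubling is small but constant, which is what makes the curve useful for planning.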
The scaling hypothesis, that simply making models bigger with more data and compute will continue to produce qualitatively new capabilities, has been the driving thesis of frontier AI labs since 2020. GPT-3 (175B parameters) could do few-shot learning that GPT-2 (1.5B) could not. GPT-4 and Claude 3.5 demonstrate reasoning capabilities absent in smaller models. Whether scaling alone produces genuine intelligence or merely better pattern matching is one of the central debates in AI research. What the record so far shows: capabilities that were absent at one scale have repeatedly appeared at larger scales, and no one has yet identified a ceiling.
This lesson establishes: