The Attention Mechanism

Attention is the mechanism that lets a model decide which parts of its input matter most for each part of its output. Before attention, sequence models compressed an entire input into a single fixed-size vector, a brutal bottleneck that discarded information. Attention removed that bottleneck and, in the process, became the core building block of nearly every modern language model.

The Bottleneck Problem in Seq2Seq

The original sequence-to-sequence (seq2seq) architecture for machine translation used an encoder RNN and a decoder RNN. The encoder processed the entire input sentence and compressed it into a single hidden state vector — the context vector. The decoder then generated the translation from this single vector.

The problem: that single vector must encode everything about the input sentence. For short sentences, this works. For long sentences, information is inevitably lost. Translating a 50-word sentence through a 256-dimensional vector is like describing a painting through a keyhole. Performance degraded sharply as sentence length increased.

Quantifying the Bottleneck

Bahdanau et al. (2014) demonstrated the degradation empirically: BLEU scores (translation quality) dropped steeply for sentences longer than 20-30 words. The fixed-size context vector simply could not carry enough information. Their solution — attention — allowed the decoder to look back at the entire input sequence at each decoding step, not just the compressed summary. This single change dramatically improved long-sentence translation.

Attention as Learned Alignment

Instead of compressing the entire input into one vector, attention lets the decoder look at all encoder hidden states and decide which ones are relevant for the current output token.

At each decoding step, the model computes a relevance score between the current decoder state and every encoder hidden state. These scores are normalized to a probability distribution (using softmax). The decoder then computes a weighted sum of encoder states, weighted by these relevance scores — the context vector is now dynamic, changing at every step.
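The three steps above (score, normalize, weighted sum) can be sketched in a few lines of NumPy. This is a minimal illustration, not the original formulation: it assumes a plain dot product as the relevance score, whereas Bahdanau et al. used a small learned scoring network. The function name attend and the toy vectors are made up for the example.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """One decoding step: score every encoder state, normalize, take the weighted sum."""
    scores = encoder_states @ decoder_state   # one relevance score per input position (assumed: dot product)
    weights = softmax(scores)                 # probability distribution over input positions
    context = weights @ encoder_states        # dynamic context vector for this step
    return weights, context

# Toy values (hypothetical): three encoder hidden states of dimension 2.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])

weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

With these toy values the second and third encoder states get the same score, so they receive equal weight, and both outweigh the first. Calling attend again with a different decoder state yields a different context vector, which is what makes the context dynamic.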

For example, when translating “le chat noir” to “the black cat,” the decoder generating “black” should attend primarily to “noir” — the attention mechanism learns this alignment automatically from training data.

Attention Is Soft Alignment

Classical statistical machine translation typically relied on hard word alignment, in which each output word is linked to a single input word. But language is not one-to-one: words reorder, merge, and split across languages. Attention provides soft alignment: each output word can attend to multiple input words with varying strength. This handles reordering, phrasal translation, and one-to-many mappings naturally. The alignment is not prescribed; it emerges from training.

Query-Key-Value Framework

Modern attention formalizes the mechanism with three projections of the input:

Query (Q): “What am I looking for?” — represents the current position’s information need.

Key (K): “What do I contain?” — represents what each position offers.

Value (V): “What do I provide?” — the actual content to be aggregated.

The attention computation: for each query, compute a dot product with every key to get relevance scores. Divide by the square root of the key dimension (scaling prevents the dot products from growing too large). Apply softmax to get attention weights. Multiply each value by its weight and sum.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

The query asks a question. The keys announce what they have. The dot product measures compatibility. The values provide the answer.
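The formula translates almost directly into NumPy. The sketch below is one straightforward reading of it, assuming Q, K, and V are already given as 2-D arrays with one row per position; the function name and the random toy inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) compatibility scores
    weights = softmax(scores, axis=-1)   # each query's weights over the keys sum to 1
    return weights @ V, weights          # weighted sum of values, plus the weights

# Toy shapes (hypothetical): 2 queries, 3 key/value pairs, d_k = 4, value size 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Note that the final step is a matrix product of the weights with V, not an elementwise multiplication: each output row is a mixture of the value rows.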

Why Scale by sqrt(d_k)

Without scaling, the dot products between queries and keys grow in magnitude with the dimension d_k: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so its typical size grows like sqrt(d_k). Large dot products push softmax into regions where the gradient is extremely small: the distribution becomes nearly one-hot, and learning slows to a crawl. Dividing by sqrt(d_k) brings the variance back to roughly 1 and keeps the dot products in a range where softmax has useful gradients. This is a small detail with large practical impact, especially at larger values of d_k.
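The growth of unscaled dot products is easy to check numerically. The sketch below assumes queries and keys with independent standard-normal components, which is an idealization and not a property of any trained model; the helper name dot_product_std is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical spread of q.k before and after dividing by sqrt(d_k)."""
    q = rng.normal(size=(n_samples, d_k))   # assumed: unit-variance components
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)              # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()

raw_small, scaled_small = dot_product_std(4)
raw_large, scaled_large = dot_product_std(256)

print(f"d_k=4:   raw std {raw_small:.2f}, scaled std {scaled_small:.2f}")
print(f"d_k=256: raw std {raw_large:.2f}, scaled std {scaled_large:.2f}")
```

Under these assumptions the raw standard deviation should land near sqrt(d_k), about 2 for d_k = 4 and about 16 for d_k = 256, while the scaled version stays near 1 in both cases.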

Self-Attention: Attending to Yourself

The original attention mechanism attended from decoder states to encoder states, two separate sequences. This arrangement is now called cross-attention. Self-attention applies attention within a single sequence: each position attends to every position in the same sequence, including itself.

For the sentence “The animal didn’t cross the street because it was too tired,” self-attention allows “it” to attend strongly to “animal” — learning the coreference. For “it was too wide,” “it” would attend to “street.” The same mechanism resolves different relationships based on context.

Self-attention captures dependencies regardless of distance. Position 1 can attend to position 100 in a single step — no information must flow through intermediate positions. This is fundamentally different from RNNs, where information between distant positions must survive propagation through every step in between.
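In code, the only thing that distinguishes self-attention is that queries, keys, and values are all projections of the same sequence. The following sketch assumes random projection matrices in place of learned ones; the names W_q, W_k, W_v and the toy sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Q, K, and V all come from the same sequence X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every position against every position
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy setup (hypothetical): 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)   # one row of attention weights per position
```

Entry (i, j) of the weight matrix is how strongly position i attends to position j, and it is computed directly no matter how far apart the two positions are.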

Computational Cost of Self-Attention

Self-attention computes a score between every pair of positions. For a sequence of length n, this produces an n x n attention matrix for each head in each layer, so the cost is quadratic in sequence length. For a 1,000-token sequence, that is 1 million attention scores. For 100,000 tokens, 10 billion. This quadratic cost is a major reason context windows are limited and why efficient attention variants (sparse attention, linear attention, ring attention) are active research areas. The 128K or 200K context windows in modern LLMs require significant engineering to handle this scaling.
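The arithmetic is simple enough to check directly. The sketch below counts scores for a single attention matrix and estimates the memory needed if that matrix were stored in full; the 2 bytes per score is an assumption (16-bit floats), and real systems often avoid materializing the whole matrix.

```python
def attention_scores(n):
    """Number of pairwise scores in one n x n attention matrix."""
    return n * n

def attention_matrix_bytes(n, bytes_per_score=2):
    """Memory for one fully materialized matrix; 2 bytes assumes 16-bit floats."""
    return attention_scores(n) * bytes_per_score

for n in (1_000, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {attention_scores(n):>14,} scores, {gib:.3f} GiB per matrix")
```

Under the 2-byte assumption, one matrix at 100,000 tokens comes to roughly 18.6 GiB, and a model has one such matrix per head per layer. That is why long-context systems depend on techniques that never hold the full matrix in memory.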

Multi-Head Attention

A single attention head produces one attention distribution per position, so it tends to capture one type of relationship at a time, perhaps syntactic agreement or semantic similarity. Multi-head attention runs multiple attention heads in parallel, each with its own Q, K, V projections. Each head can learn a different relationship pattern.

In practice, different heads specialize: one head might track subject-verb agreement, another might track coreference, another might attend to nearby tokens for local syntactic patterns. The outputs of all heads are concatenated and projected to the final output.

With the same total dimensionality, multi-head attention can represent several types of relationships at once, whereas a single head has to blend them into one attention distribution per position.
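A compact way to see the split, attend, concatenate, and project pattern is the sketch below. It assumes the common arrangement in which the model width is divided evenly among the heads, and it uses random matrices where a trained model would have learned weights; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads   # assumed: d_model is divisible by n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                      # separate distribution per head
    heads = weights @ V                                     # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate head outputs
    return concat @ W_o, weights                            # final output projection

# Toy setup (hypothetical): 5 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape, weights.shape)
```

Each head has its own n x n weight matrix, so two heads can attend to entirely different positions for the same token before their outputs are merged.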

Attention Head Pruning

Research has shown that many attention heads in trained models are redundant — removing them has minimal impact on performance. This suggests that models learn more heads than strictly necessary, and the important patterns are concentrated in a subset of heads. This observation informs model compression: pruning redundant heads reduces computation without significant quality loss.

Positional Encoding: Why Order Matters

Self-attention on its own is permutation-equivariant: shuffling the input tokens simply shuffles the outputs in the same way, so the representation computed for each token ignores word order. But word order matters: “dog bites man” and “man bites dog” have very different meanings. Positional encodings inject position information into the input, telling the model where each token sits in the sequence.
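The order-blindness of plain self-attention can be verified numerically: without positional information, shuffling the input tokens just shuffles the output rows the same way, and each token's output vector is unchanged. The sketch below assumes random projection matrices and an arbitrary permutation chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

# Toy setup (hypothetical): 4 tokens, width 6, no positional encoding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
W_q, W_k, W_v = (rng.normal(size=(6, 6)) for _ in range(3))

perm = np.array([2, 0, 3, 1])                  # an arbitrary reordering of the tokens
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Each token gets the same output vector; only the row order differs.
same_up_to_reordering = np.allclose(out_perm, out[perm])
print(same_up_to_reordering)
```

Because every token's output is identical in both orderings, nothing downstream could tell “dog bites man” from “man bites dog” unless position information is added to the input.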

The original transformer used sinusoidal positional encodings: fixed, non-learned functions of position that are added to the token embeddings and that the model learns to interpret. Later models often use learned positional embeddings, and many recent LLMs use Rotary Position Embeddings (RoPE), which encode relative position directly in the attention computation.
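The sinusoidal scheme from the original transformer paper can be written out in a few lines: even dimensions hold sines, odd dimensions hold cosines, and the wavelength increases geometrically across dimensions. The sketch below assumes an even d_model; the function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]      # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))            # assumed: d_model is even
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=16)
print(pe.shape)
```

In the original design this matrix is added row by row to the token embeddings before the first layer, so each input vector carries both content and position.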

Relative vs. Absolute Position

Absolute positional encodings assign a fixed vector to each position (position 0, position 1, …). With learned absolute embeddings, the model has no vector for positions beyond the maximum seen during training, and even fixed encodings tend to generalize poorly past that length. Relative positional encodings (like ALiBi and RoPE) encode the distance between positions rather than their absolute location. This can make the model more robust to sequences longer than it was trained on, a critical property for LLMs that need flexible context windows.
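The relative-position property can be illustrated with a minimal RoPE sketch: each pair of dimensions in a query or key is rotated by an angle proportional to its position, and the resulting dot product depends only on the offset between the two positions. This version pairs adjacent dimensions and uses a base of 10000; implementations differ on the pairing convention, so treat these details as assumptions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]                               # assumed: d is even
    freqs = base ** (-np.arange(0, d, 2) / d)     # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin        # standard 2-D rotation per pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_near_start = rope(q, 5) @ rope(k, 2)        # positions 5 and 2, offset 3
score_far_along = rope(q, 105) @ rope(k, 102)     # positions 105 and 102, offset 3
score_other_offset = rope(q, 5) @ rope(k, 4)      # positions 5 and 4, offset 1

print(np.isclose(score_near_start, score_far_along))
```

The first two scores agree because both pairs of positions are three apart, even though one pair sits 100 positions further along. The third score uses a different offset and so generally differs.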

Why Attention Replaced RNNs

RNNs had two fatal limitations that attention solved:

Parallelization: RNNs process tokens sequentially, because each step depends on the previous one. Self-attention processes all positions in parallel as a few large matrix multiplications. On modern GPUs with thousands of cores, this makes far better use of the hardware and cuts training time substantially.

Long-range dependencies: in an RNN, information from position 0 must survive propagation through every intermediate step to reach position 100. Gradients vanish, and the signal degrades. Self-attention connects every position to every other in a single step — the path length is always 1, regardless of distance.
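The structural difference between the two approaches shows up clearly in code. The sketch below is deliberately simplified: the RNN is a bare tanh recurrence with random weights, and the attention step omits the Q, K, V projections and uses the inputs directly. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))   # toy sequence (hypothetical): 6 tokens of width 4

# RNN: each hidden state needs the previous one, so positions must be visited in order.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for x_t in X:                             # this loop cannot run in parallel over positions
    h = np.tanh(W_h @ h + W_x @ x_t)      # information from early tokens passes through every step
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention (projections omitted): all pairwise interactions in one matrix product.
scores = X @ X.T / np.sqrt(d)             # position 0 and position n-1 interact directly
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ X

print(rnn_states.shape, attn_out.shape)
```

Both halves produce one vector per position, but the RNN needs n dependent steps to do so, while the attention half consists of matrix operations with no dependency between positions.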

These two advantages — parallelism and direct long-range connections — are why the transformer replaced RNNs as the dominant sequence architecture and why the 2017 paper was titled “Attention Is All You Need.”

The Paper That Changed Everything

Vaswani et al., “Attention Is All You Need” (2017), proposed the transformer: a model built entirely from attention mechanisms with no recurrence and no convolution. It achieved state-of-the-art translation quality with dramatically less training time. But the paper’s true significance was not translation performance — it was demonstrating that attention alone could be the foundation of a general-purpose sequence model. Every GPT, BERT, T5, LLaMA, and Claude model descends from this architecture.

Key Takeaways

This lesson establishes:

The bottleneck problem in seq2seq models and how attention solves it
The query-key-value framework and the attention score computation
The distinction between cross-attention (encoder-decoder) and self-attention
What multi-head attention provides over a single attention head
Why attention replaced RNNs (parallelization and long-range dependencies)

Next: The Transformer