A transformer operates on numbers, not text. Before any attention or feed-forward computation, raw text must be converted to a sequence of integers (tokenization), and those integers must be mapped to dense vectors the model can process (embeddings). These conversions determine what the model can see, how efficiently it uses its context window, and what it can represent.
Neural networks perform matrix multiplications, additions, and activation functions — all numerical operations. Text has no natural numerical representation. The letter “a” is not inherently closer to “b” than to “z.” The word “bank” has no natural position in number space.
The naive approach, assigning each word a unique integer, fails immediately. The vocabulary of English exceeds a million words once technical terms, proper nouns, and morphological variants are included. Unknown words (typos, new slang, code, other languages) have no representation at all. And integers impose a false ordering: if “cat” = 5, “dog” = 6, and “fish” = 9000, the model might incorrectly infer that “cat” is closer to “dog” than to “fish” simply because of the IDs.
Tokenization and embeddings solve these problems by splitting text into subword units and mapping them to learned vector representations.
Character-level models use a tiny vocabulary (256 entries when the units are UTF-8 bytes) but produce very long sequences: “understanding” becomes 13 tokens. Long sequences are expensive because the cost of attention grows quadratically with sequence length. Word-level models have short sequences but enormous vocabularies and cannot handle unknown words. Subword tokenization is the compromise: common words are single tokens (“the,” “and”), and uncommon words are split into smaller pieces (“un” + “derstand” + “ing”) that follow corpus statistics and do not always align with morphemes. This keeps vocabulary size manageable (32K-100K tokens) while handling arbitrary input.
Byte Pair Encoding (BPE): the dominant algorithm. Start with individual characters (or bytes). Count every adjacent pair in the training corpus. Merge the most frequent pair into a new token. Repeat thousands of times. The result is a vocabulary where common words are single tokens and rare words are sequences of subwords. GPT models use BPE.
WordPiece: similar to BPE but selects merges that maximize the likelihood of the training data rather than raw frequency. Used by BERT. Produces slightly different tokenizations than BPE but the practical difference is small.
SentencePiece: treats the input as a raw stream of Unicode characters (no pre-tokenization into words). This handles languages without whitespace-delimited words (Chinese, Japanese) naturally. Can implement either BPE or a unigram model internally.
tiktoken: OpenAI’s fast BPE implementation. Not a different algorithm: it is a library that applies BPE with the pretrained vocabularies used by OpenAI models, optimized for fast encoding and decoding in production.
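The BPE merge loop described above can be sketched in a few lines. This is a minimal illustration on a toy word-frequency table, not the implementation used by any particular tokenizer; the function names (`train_bpe`, `merge_pair`, `encode`), the corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word occurs.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {merge_pair(symbols, best): count for symbols, count in corpus.items()}
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (hypothetical frequencies).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(toy_counts, num_merges=4)
print(merges)
print(encode("lowest", merges))  # "lowest" never appears in the corpus
```

Even though "lowest" is absent from the toy corpus, it is encoded from pieces learned from "low" and "newest"/"widest", which is the property that lets subword vocabularies handle unseen words.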
Tokenization decisions have real consequences. GPT tokenizers split “ChatGPT” into multiple tokens but keep " the" (with a leading space) as one token. Simple arithmetic can fail because numbers are not split at consistent digit boundaries: “1234” might become [“12”, “34”], so arithmetic requires reasoning across tokens. Code tokenization is particularly tricky: indentation, brackets, and operators may be tokenized inconsistently. These are not bugs in the model’s reasoning; they are artifacts of how the input is represented. Understanding tokenization explains many LLM behaviors that otherwise seem mysterious.
Smaller vocabulary (8K-16K tokens): each token is a smaller unit. Sequences are longer (more tokens per text). More of the context window is consumed. But the embedding table is smaller, and the model can represent any input regardless of language or domain.
Larger vocabulary (64K-256K tokens): common words and phrases are single tokens. Sequences are shorter and more efficient. But the embedding table consumes more parameters, and rare tokens get less training exposure. Extremely large vocabularies also waste embedding capacity on tokens the model rarely encounters.
The sweet spot for modern LLMs is typically 32K-100K tokens. This is large enough that common English words are single tokens and small enough that the embedding table is manageable, while BPE still gives rare or novel words reasonable subword decompositions.
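The parameter cost of the vocabulary can be made concrete: an embedding table has one row per token and one column per embedding dimension. The sketch below assumes a hidden size of 4096 purely for illustration, and `embedding_params` is a hypothetical helper name.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in an embedding table: one d_model-sized row per vocabulary entry."""
    return vocab_size * d_model

d_model = 4096  # assumed embedding width, chosen only for illustration

for vocab_size in (8_000, 32_000, 100_000, 256_000):
    params = embedding_params(vocab_size, d_model)
    print(f"{vocab_size:>7} tokens -> {params / 1e6:,.1f}M embedding parameters")
```

Models that do not share weights between the input embedding and the output projection pay this cost twice.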
English-centric tokenizers handle other languages poorly. A word that is a single token in English might require 3-5 tokens in Hindi or Thai, because the BPE merges were learned primarily from English text. This means non-English users consume their context window 2-5x faster for the same content. Models like LLaMA 2 have been criticized for exactly this imbalance. Multilingual models (mGPT, BLOOM) use training corpora balanced across languages to produce more equitable tokenizations.
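For tokenizers that start from UTF-8 bytes, one contributing factor is that many non-Latin scripts need several bytes per character, so a word begins as more base units before any learned merge applies. The sketch below only counts bytes; it does not run a real tokenizer, and the actual token counts depend on the merges a given tokenizer has learned.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))

english = "hello"
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # "namaste" written in Devanagari

for label, text in [("English", english), ("Hindi", hindi)]:
    print(f"{label}: {len(text)} code points, {utf8_len(text)} UTF-8 bytes")
```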
An embedding maps each token ID to a dense vector — typically 768 to 12,288 dimensions. Unlike a one-hot encoding (where each word is an isolated basis vector), embeddings place semantically similar words near each other in vector space.
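Mechanically, an embedding layer is a table lookup: the token ID selects a row. The sketch below uses plain Python lists and toy sizes, with random values standing in for learned weights; it also shows that the lookup gives the same result as multiplying a one-hot vector by the table. The helper names are assumptions for the example.

```python
import random

random.seed(0)
vocab_size, d_model = 6, 4  # toy sizes; real tables have tens of thousands of rows

# One row of d_model floats per token ID. Random here, learned during training in practice.
table = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids):
    """Embedding lookup: pick one table row per token ID."""
    return [table[t] for t in token_ids]

def one_hot(token_id):
    """A vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def one_hot_matmul(token_id):
    """The same lookup written as a one-hot row vector times the table."""
    v = one_hot(token_id)
    return [sum(v[i] * table[i][j] for i in range(vocab_size)) for j in range(d_model)]

vectors = embed([2, 5, 2])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The lookup form is what frameworks actually execute, since it avoids multiplying by a vector that is almost entirely zeros.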
Word2Vec intuition: trained on predicting a word from its context (or vice versa), Word2Vec discovered that embedding arithmetic captures semantic relationships. The classic example: vector(“king”) - vector(“man”) + vector(“woman”) is closest to vector(“queen”). The embedding space encodes gender as a direction. Similarly, country-capital relationships, verb tenses, and other semantic regularities emerge as geometric patterns.
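The analogy arithmetic can be demonstrated with hand-built vectors. The values below are invented so that one axis encodes royalty and another encodes gender; they are not taken from a trained Word2Vec model, where such directions are only approximate. The `analogy` helper is a hypothetical name.

```python
import math

# Toy vectors with axes (royalty, gender, fruit). Hand-assigned, not learned.
VECTORS = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "prince": [0.8,  1.0, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # The three query words are conventionally excluded from the candidates.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # queen
```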
These relationships are not programmed — they emerge from the co-occurrence statistics of language. Words that appear in similar contexts get similar embeddings. This is the distributional hypothesis: a word is characterized by the company it keeps.
Embedding spaces have rich geometric structure. Synonyms cluster together. Antonyms are separated along specific dimensions. Analogies are parallelograms. Linguistic hierarchies (mammal -> dog -> poodle) form consistent directional patterns. This structure is useful far beyond language: recommendation systems embed users and products in the same space, search engines embed queries and documents, and molecular biology embeds protein sequences. The insight — that learned vector representations capture meaningful structure — is one of the most broadly applicable ideas in modern ML.
Word2Vec assigns each word a single fixed embedding regardless of context. But “bank” means something different in “river bank” and “investment bank.” Static embeddings cannot capture this.
Transformer-based models produce contextual embeddings: the representation of each token depends on the entire surrounding sequence. The same word in different contexts gets different vectors. This is computed through the attention mechanism — by attending to the full context, each token’s representation is influenced by every other token.
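A stripped-down version of this mechanism shows why the same token ends up with different vectors. The sketch below is a single attention step with queries, keys, and values all equal to the input vectors; it omits the learned projections, positional information, and stacked layers of a real transformer, and the static vectors are invented for the example.

```python
import math

# Hand-assigned static embeddings (hypothetical values, not from a trained model).
STATIC = {
    "river":      [1.0, 0.0, 0.2],
    "investment": [0.0, 1.0, 0.2],
    "bank":       [0.5, 0.5, 1.0],
}

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the input vectors."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of every vector in the sequence.
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs

def contextual(tokens):
    """Contextual vectors for a token sequence, starting from the static table."""
    return self_attention([STATIC[t] for t in tokens])

bank_near_river = contextual(["river", "bank"])[1]
bank_near_investment = contextual(["investment", "bank"])[1]
print(bank_near_river)
print(bank_near_investment)
```

The static vector for "bank" is identical in both calls, but its output is pulled toward "river" in one context and toward "investment" in the other.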
BERT was the first widely adopted contextual embedding model. Feed a sentence into BERT, and the output for each token is a vector that encodes not just the word itself but its role in that specific sentence. This is why BERT-based models dramatically outperformed Word2Vec on tasks requiring disambiguation, coreference resolution, and pragmatic understanding.
Modern retrieval systems (RAG, semantic search) work by embedding both queries and documents into the same vector space, then finding documents whose embeddings are closest to the query embedding. This is semantic search — it matches meaning, not keywords. A query about “automobile fuel efficiency” will match a document about “car gas mileage” even though they share no words. The quality of the embedding model determines the quality of retrieval, which is why specialized embedding models (like sentence-transformers, text-embedding-ada-002) are a critical component of modern AI systems.
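The retrieval step reduces to a nearest-neighbor search over vectors. In the sketch below the vectors are hand-assigned stand-ins for what an embedding model would produce; no real embedding model is called, and `search` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-assigned vectors standing in for an embedding model's output.
doc_vectors = {
    "car gas mileage":        [0.9, 0.1, 0.0],
    "sourdough bread recipe": [0.0, 0.2, 0.9],
    "electric vehicle range": [0.7, 0.5, 0.1],
}
# Stand-in for the embedding of the query "automobile fuel efficiency".
query_vector = [0.85, 0.2, 0.05]

def search(query_vec, docs, k=2):
    """Return the k documents whose vectors are most similar to the query vector."""
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
    return ranked[:k]

print(search(query_vector, doc_vectors))
```

A production system would replace the exhaustive sort with an approximate nearest-neighbor index once the corpus grows large.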
A model’s context window is measured in tokens, not words. A 128K context window means 128,000 tokens — roughly 96,000 words for English text, but fewer for code (more whitespace tokens) or non-English languages (less efficient tokenization).
The context window is a hard limit: the model cannot see anything outside it. For a 200-page document that exceeds the context window, the model has no access to content beyond the limit. This constraint drives techniques like retrieval-augmented generation (RAG), where a search system selects relevant chunks from a large corpus and places only those chunks in the context window.
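The packing step of such a pipeline can be sketched as a greedy selection under a token budget. The example assumes the chunks already carry relevance scores from a retriever, and it uses a crude word count in place of a real tokenizer; `count_tokens` and `select_chunks` are hypothetical names.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def select_chunks(scored_chunks, budget):
    """Greedily keep the highest-scoring chunks that still fit within the token budget."""
    selected, used = [], 0
    for score, chunk in sorted(scored_chunks, key=lambda sc: sc[0], reverse=True):
        cost = count_tokens(chunk)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected, used

# Hypothetical retriever output: (relevance score, chunk text).
scored = [
    (0.9, "tokenizers split text into subword units"),
    (0.2, "the cafeteria opens at noon"),
    (0.7, "embeddings map token ids to dense vectors"),
]

selected, used = select_chunks(scored, budget=13)
print(selected, used)
```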
Context window size has been a key competitive dimension: GPT-4 Turbo offers 128K tokens, and Claude offers 200K tokens. Longer contexts enable working with larger documents, longer conversations, and more complex multi-document reasoning, but attention’s quadratic cost means computation grows faster than length: doubling the context roughly quadruples the attention computation.
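The effect of the quadratic cost can be checked with simple arithmetic. The sketch counts only the query-key scores of full self-attention in one head of one layer; it ignores the rest of the model and any optimized attention variant, and the function name is an assumption for the example.

```python
def attention_score_count(n_tokens: int) -> int:
    """Query-key scores computed by full self-attention over n_tokens (one head, one layer)."""
    return n_tokens * n_tokens

short_ctx, long_ctx = 128_000, 200_000
length_ratio = long_ctx / short_ctx
cost_ratio = attention_score_count(long_ctx) / attention_score_count(short_ctx)

print(f"context is {length_ratio:.2f}x longer, attention scores are {cost_ratio:.2f}x more numerous")
```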
Having a 200K token context window does not mean the model uses all 200K tokens equally well. Research consistently shows that model performance degrades for information placed in the middle of very long contexts — the “lost in the middle” phenomenon. Models attend most effectively to the beginning and end of the context. This has practical implications: when constructing prompts with many documents, the placement of critical information matters. The nominal context length is an upper bound; the effective context length for reliable retrieval is often shorter.
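One possible way to act on this when assembling a prompt is to put the most relevant documents at the edges and the least relevant in the middle. The sketch below is a simple heuristic under that assumption, not an established algorithm, and `order_for_long_context` is a hypothetical helper name.

```python
def order_for_long_context(docs_by_relevance):
    """Arrange documents so the most relevant sit at the start and end of the prompt.

    The input must be sorted most-relevant first. Documents alternate between the
    front and the back, which leaves the least relevant ones in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```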