LLMs have three well-known limitations. They have a knowledge cutoff: they know nothing about events after their training data was collected. They hallucinate: they generate confident, plausible, wrong answers. They cannot access proprietary data: internal documentation, a private codebase, or a database is not in the training set. Retrieval-augmented generation (RAG) addresses all three by retrieving relevant information at query time and injecting it into the prompt.
The core architecture is a six-stage pipeline:
The model does not “know” the answer — it reads the relevant documents in its context window and synthesizes a response from them, just as a human would answer a question after reading the relevant pages.
Modern models support context windows of 100K to 1M tokens. Why not load an entire document collection and skip retrieval? Three reasons. Cost: API pricing is per-token, and including irrelevant documents wastes money. Quality: models perform worse when relevant information is buried in a sea of irrelevant context (the “lost in the middle” phenomenon). Scale: most document collections exceed even the largest context windows. RAG retrieves only the relevant subset, keeping context focused and costs manageable.
How documents are split determines retrieval quality. Chunks that are too large dilute relevant information with surrounding noise. Chunks that are too small lose context and fragment coherent explanations.
Fixed-size chunking: split every N tokens with M tokens of overlap. Simple, predictable, works as a baseline. Typical values: 512 tokens per chunk, 50 tokens of overlap.
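A minimal sketch of the fixed-size strategy, assuming the text has already been tokenized into a list. The function name `fixed_size_chunks` and the synthetic token list are illustrative only; a real pipeline would use the tokenizer that matches its embedding model.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
tokens = [f"tok{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk as long as it is shorter than the overlap.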
Recursive chunking: split on natural boundaries — paragraphs, sections, sentences — with fallback to fixed-size splits when sections are too long. Preserves document structure better than fixed-size.
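One way to sketch recursive chunking, under simplifying assumptions: lengths are measured in characters rather than tokens, the separator order (paragraph, line, sentence) is a guess at sensible defaults, and `recursive_chunks` is a hypothetical helper rather than any library's API.

```python
def recursive_chunks(text, max_len=200, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first and recurse into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to fixed-size character splits.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_chunks(piece, max_len, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = "Intro paragraph.\n\n" + "A" * 450 + "\n\nShort closing paragraph."
chunks = recursive_chunks(doc, max_len=200)
print([len(c) for c in chunks])
```

In this example the short paragraphs survive intact, while the 450-character paragraph has no finer boundary to split on and is cut into fixed-size pieces. Separators that fall on a chunk boundary are dropped in this sketch.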
Semantic chunking: use an embedding model to detect topic shifts and split where the embedding similarity between adjacent sentences drops. Produces chunks that are topically coherent but variable in size.
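The idea can be illustrated with a small sketch that assumes sentence embeddings have already been computed elsewhere. The two-dimensional vectors and the 0.5 threshold below are made up for illustration; real embeddings and a tuned threshold would replace them.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # similarity dropped: treat it as a topic shift
        groups[-1].append(sentences[i])
    return [" ".join(g) for g in groups]


sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run matrix math."]
# Hand-written toy vectors, not the output of a real embedding model.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chunks = semantic_chunks(sentences, embeddings)
print(chunks)  # ['Cats purr. Cats nap a lot.', 'GPUs run matrix math.']
```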
Smaller chunks (128-256 tokens) improve retrieval precision — the retrieved chunk is more likely to contain only relevant information. But they hurt context — the model may not have enough surrounding information to generate a good answer. Larger chunks (512-1024 tokens) provide more context but include more irrelevant text. The practical solution: retrieve small chunks for precision, then expand them to include surrounding context before injection (parent-document retrieval or windowed expansion).
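Windowed expansion can be sketched in a few lines, assuming chunks are stored in document order and the retriever returns the index of the matching chunk. The helper name `expand_window` and the `window` parameter are illustrative.

```python
def expand_window(chunks, hit_index, window=1):
    """Return the hit chunk joined with up to `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])


chunks = ["a", "b", "c", "d"]
print(expand_window(chunks, 2))  # b c d
```

Retrieval still scores the small chunk, so precision is unaffected; only the text injected into the prompt grows.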
An embedding model maps text to a dense vector in a high-dimensional space (typically 768-1536 dimensions). Texts with similar meaning produce vectors that are close together, measured by cosine similarity or dot product. This enables semantic search — finding relevant documents even when they use different words than the query.
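To make the similarity measure concrete, here is a small brute-force search over hand-written three-dimensional vectors. The vectors are toy values, not real embeddings, and `top_k` is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Score every document against the query and return the k best indices."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]


query = [1.0, 0.0, 0.0]
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k(query, docs))  # [0, 2]
```

This exhaustive scan compares the query against every stored vector, which is fine for small collections but grows linearly with collection size.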
Vector databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector — are optimized for approximate nearest neighbor (ANN) search across millions or billions of vectors. They use indexing algorithms (HNSW, IVF) that trade a small amount of recall for orders-of-magnitude speed improvements over brute-force search.
Embedding model quality strongly influences retrieval quality. The MTEB (Massive Text Embedding Benchmark) leaderboard ranks models on retrieval, classification, and clustering tasks. As of early 2025, widely used options include OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source families such as E5 and BGE; leaderboard positions change often, so check the current rankings before choosing. Model choice usually matters more than vector database choice: a better embedding model with a simple vector store tends to outperform a worse embedding model with a sophisticated database.
Initial retrieval (the vector similarity search) optimizes for recall — returning a broad set of potentially relevant chunks. Reranking then scores these candidates with a more expensive cross-encoder model that jointly processes the query and each candidate, producing a more accurate relevance score.
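The two-stage shape can be sketched with a pluggable scoring function. The `toy_overlap_score` function below is only a stand-in so the example runs without a model; in practice `score_fn` would wrap a cross-encoder that reads the query and passage together.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each (query, candidate) pair and keep the top_n highest-scoring candidates."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]


def toy_overlap_score(query, passage):
    """Stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "billing is monthly",
    "password rules require twelve characters",
    "reset your password from the settings page",
]
best = rerank("how to reset password", candidates, toy_overlap_score, top_n=2)
print(best)
```

Because the scorer runs once per candidate, the first stage should return a short list (tens or low hundreds of chunks) rather than the whole collection.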
Hybrid search combines keyword search (BM25) with semantic search (vector similarity). Keyword search excels at exact matches — product names, error codes, specific identifiers — where semantic search may retrieve conceptually similar but factually wrong results. Hybrid search retrieves candidates from both methods, deduplicates, and reranks.
Reciprocal rank fusion (RRF) is a simple, effective algorithm for combining ranked lists from different retrieval methods. For each document, compute 1/(k + rank) for every list it appears in, using 1-based ranks, and sum the results to get its fused score; documents are then sorted by that score. k is a constant (typically 60) that dampens the advantage of top-ranked results: a larger k gives lower-ranked results relatively more weight. RRF requires no training, so it works out of the box as a combination strategy for hybrid search.
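A short sketch of the scoring rule, fusing one keyword ranking and one vector ranking. The document ids and the two input lists are invented for illustration, and ties are broken alphabetically by id as an arbitrary choice.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Sum 1 / (k + rank) over every list a document appears in (ranks start at 1)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]     # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]   # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Here doc_b wins because it ranks near the top of both lists (1/62 + 1/61), narrowly ahead of doc_a (1/61 + 1/63). Documents found by only one method still appear, just lower down.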
RAG systems have three failure modes that must be measured independently.
Retrieval relevance: did the retriever find the right documents? Measured by precision@k and recall@k against a labeled set of relevant documents per query.
Faithfulness: does the generated answer stick to the retrieved context, or does the model hallucinate information not present in the chunks? Measured by having a separate LLM judge whether each claim in the answer is supported by the retrieved context.
Answer correctness: is the final answer actually correct? Measured against ground-truth answers when available, or by human evaluation when not.
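The retrieval relevance metrics named above are straightforward to compute once a labeled set exists. This sketch assumes `retrieved` is the ranked list of document ids returned for one query and `relevant` is the labeled set for that query; both are illustrative data.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k  # divides by k even if fewer than k documents were returned


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


retrieved = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(retrieved, relevant, 2))  # 0.5
print(recall_at_k(retrieved, relevant, 4))
```

Averaging these values over many labeled queries gives a retrieval score that is independent of whatever the generator does with the chunks.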
Frameworks like RAGAS automate these evaluations by generating test questions from the document collection and scoring retrieval, faithfulness, and correctness programmatically.
Retrieval failure (the right document was not retrieved) is usually a chunking or embedding problem — either the relevant text was split across chunks, or the embedding model failed to capture the semantic relationship between query and document. Faithfulness failure (the model ignores the context and generates from its parametric knowledge) is mitigated by explicit prompting (“Answer only based on the provided context”) and by including a “no answer” option when the context is insufficient. Answer quality failure (the answer is faithful to the context but incomplete or poorly structured) is addressed by prompt engineering and output format constraints.
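The faithfulness mitigations can be sketched as a prompt builder. The exact wording of the template, the numbered-chunk format, and the fallback sentence are assumptions chosen for illustration; they would need tuning against a real model.

```python
def build_prompt(question, chunks, fallback="I don't know based on the provided context."):
    """Assemble a grounded prompt with numbered chunks and an explicit no-answer option."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer only based on the provided context. "
        f'If the context is insufficient, reply exactly: "{fallback}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
)
print(prompt)
```

Numbering the chunks also gives the model a handle for citations, and a fixed fallback string makes unanswered questions easy to detect in logs.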
Document Q&A: the canonical RAG use case. Index a collection of PDFs, documentation, or knowledge base articles. Users ask questions in natural language and receive answers grounded in the source material, with citations linking to the source chunks.
Code search: embed code snippets and documentation. Retrieve relevant code when a developer asks a natural language question about the codebase. Combine with AST-level chunking (splitting on function and class boundaries) for better retrieval precision.
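For Python sources, AST-level chunking can be sketched with the standard-library `ast` module. This version only handles top-level functions and classes, does not capture decorators or module-level statements, and uses an invented sample file.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # Exact source text of the definition (Python 3.8+).
                "code": ast.get_source_segment(source, node),
            })
    return chunks


source = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)
chunks = chunk_python_source(source)
print([c["name"] for c in chunks])  # ['add', 'Greeter']
```

Each chunk is a complete definition, so a retrieved result never starts or ends in the middle of a function body.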
Knowledge bases: enterprise RAG systems over internal documentation, Slack archives, support tickets, and design documents. These systems require careful attention to access control — the retriever must respect the user’s permissions and not surface documents they are not authorized to see.
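A permission check can be sketched as a filter over chunk metadata. The `allowed_groups` field and the group names are hypothetical; how permissions are actually modeled depends on the source systems being indexed.

```python
def filter_by_access(chunks, user_groups):
    """Keep only chunks whose allowed groups intersect the user's groups."""
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["allowed_groups"])]


chunks = [
    {"text": "Public onboarding guide", "allowed_groups": ["everyone"]},
    {"text": "Salary bands 2025", "allowed_groups": ["hr"]},
    {"text": "Incident postmortem", "allowed_groups": ["engineering", "sre"]},
]
visible = filter_by_access(chunks, ["everyone", "engineering"])
print([c["text"] for c in visible])  # ['Public onboarding guide', 'Incident postmortem']
```

Filtering after retrieval, as shown here, is the simplest form. Applying the same condition as a metadata filter inside the vector search is generally preferable, because unauthorized chunks then never leave the store and do not crowd authorized ones out of the top results.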
A production RAG system is more than a retriever and a generator. It includes: a data ingestion pipeline (watching for new and updated documents, re-chunking and re-embedding), metadata filtering (restricting retrieval by date, source, access level), caching (reusing results for repeated queries), observability (logging queries, retrieved chunks, and generated answers for debugging), and feedback loops (users flagging wrong answers to improve the system).