Lab: Build a RAG Pipeline

30 min Python

Goal

Build a complete retrieval-augmented generation pipeline in Python. You will chunk documents, generate embeddings, implement cosine similarity search, retrieve relevant context for a query, and format a prompt ready for an LLM. The entire pipeline runs in about 50 lines of core logic.

Setup

You need Python 3.8 or later and two packages:

pip install sentence-transformers numpy

sentence-transformers provides pre-trained embedding models. numpy handles the vector math. No vector database is needed – you will implement similarity search directly.

Create a file called rag_lab.py.

Step 1: Define Source Documents

Start with a small knowledge base of hardcoded documents on distinct topics. In a production system, these would come from files, databases, or APIs.

import numpy as np
from sentence_transformers import SentenceTransformer

# Source documents -- each is a short passage on a different topic
documents = [
    {
        "title": "Photosynthesis",
        "text": (
            "Photosynthesis is the process by which green plants and certain "
            "other organisms convert light energy into chemical energy. During "
            "photosynthesis, plants capture sunlight using chlorophyll in their "
            "leaves and use it to transform water and carbon dioxide into oxygen "
            "and glucose. The glucose provides energy for the plant's metabolic "
            "processes. Photosynthesis is responsible for producing the oxygen "
            "in Earth's atmosphere."
        ),
    },
    {
        "title": "TCP/IP",
        "text": (
            "TCP/IP is the foundational protocol suite for the internet. TCP "
            "(Transmission Control Protocol) provides reliable, ordered delivery "
            "of data between applications. IP (Internet Protocol) handles "
            "addressing and routing packets across networks. Together they enable "
            "applications to communicate across diverse networks without needing "
            "to know the underlying hardware. TCP uses a three-way handshake "
            "(SYN, SYN-ACK, ACK) to establish connections and sequence numbers "
            "to ensure packets arrive in order."
        ),
    },
    {
        "title": "The French Revolution",
        "text": (
            "The French Revolution began in 1789 with the storming of the "
            "Bastille and ended in 1799 with Napoleon's rise to power. It was "
            "driven by widespread discontent with the monarchy, economic hardship, "
            "and Enlightenment ideals of liberty and equality. The revolution "
            "abolished feudalism, established the Declaration of the Rights of Man "
            "and Citizen, and fundamentally transformed French governance. The Reign "
            "of Terror under Robespierre saw thousands executed by guillotine."
        ),
    },
    {
        "title": "Hash Functions",
        "text": (
            "A cryptographic hash function takes an arbitrary-length input and "
            "produces a fixed-size output called a digest. Good hash functions "
            "have three properties: preimage resistance (cannot reverse the hash), "
            "second preimage resistance (cannot find a different input with the "
            "same hash), and collision resistance (cannot find any two inputs with "
            "the same hash). SHA-256 produces a 256-bit digest and is widely used "
            "in TLS certificates, git, and blockchain systems."
        ),
    },
    {
        "title": "Mitochondria",
        "text": (
            "Mitochondria are membrane-bound organelles found in the cytoplasm "
            "of eukaryotic cells. They generate most of the cell's supply of "
            "adenosine triphosphate (ATP), the molecule used as energy currency. "
            "Mitochondria have their own DNA, separate from nuclear DNA, which "
            "supports the endosymbiotic theory that they originated as free-living "
            "bacteria that were engulfed by ancestral eukaryotic cells. Each cell "
            "can contain hundreds to thousands of mitochondria depending on its "
            "energy demands."
        ),
    },
]

print(f"Loaded {len(documents)} documents")
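In practice the documents list is rarely hardcoded. As a minimal sketch, assuming each source is a UTF-8 .txt file in a single folder, the same list-of-dicts shape can be built from disk. The load_documents helper and the folder layout are illustrative and not part of the lab script.

```python
from pathlib import Path


def load_documents(folder):
    """Read every .txt file in `folder` into the lab's document format.

    Hypothetical helper: the file name (without extension) becomes the title.
    """
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files
            docs.append({"title": path.stem, "text": text})
    return docs
```

The returned list can be passed to the later steps in place of the hardcoded documents.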

Step 2: Chunk the Documents

Split each document into smaller chunks. For this lab, use simple sentence-based chunking. Production systems use more sophisticated strategies (recursive, semantic), but sentence splitting demonstrates the principle.

def chunk_documents(docs, max_chunk_length=200):
    """Split documents into chunks by sentence boundaries."""
    chunks = []
    for doc in docs:
        sentences = doc["text"].replace(". ", ".\n").split("\n")
        current_chunk = ""
        for sentence in sentences:
            if len(current_chunk) + len(sentence) > max_chunk_length and current_chunk:
                chunks.append({
                    "title": doc["title"],
                    "text": current_chunk.strip(),
                })
                current_chunk = sentence
            else:
                current_chunk += (" " if current_chunk else "") + sentence
        if current_chunk.strip():
            chunks.append({
                "title": doc["title"],
                "text": current_chunk.strip(),
            })
    return chunks


chunks = chunk_documents(documents)
print(f"Created {len(chunks)} chunks\n")
for i, chunk in enumerate(chunks):
    print(f"  Chunk {i}: [{chunk['title']}] {chunk['text'][:60]}...")

Loaded 5 documents
Created 10 chunks

  Chunk 0: [Photosynthesis] Photosynthesis is the process by which green plants an...
  Chunk 1: [Photosynthesis] The glucose provides energy for the plant's metabolic p...
  Chunk 2: [TCP/IP] TCP/IP is the foundational protocol suite for the internet...
  Chunk 3: [TCP/IP] Together they enable applications to communicate across div...
  Chunk 4: [The French Revolution] The French Revolution began in 1789 with the storming o...
  Chunk 5: [The French Revolution] The revolution abolished feudalism, established the Decla...
  Chunk 6: [Hash Functions] A cryptographic hash function takes an arbitrary-length in...
  Chunk 7: [Hash Functions] SHA-256 produces a 256-bit digest and is widely used in T...
  Chunk 8: [Mitochondria] Mitochondria are membrane-bound organelles found in the c...
  Chunk 9: [Mitochondria] Mitochondria have their own DNA, separate from nuclear DN...

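Each document is split at sentence boundaries. Treat the listing above as illustrative: max_chunk_length is a soft limit in characters, so the number of chunks and where they break depend on its value, and with the default of 200 you should expect more chunks than the ten listed. The limit is also not a hard cap, because a single sentence longer than max_chunk_length (such as the second sentence of the Hash Functions passage) is kept whole as its own chunk.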
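The replace-and-split approach in chunk_documents only recognizes a period followed by a space. A slightly more robust sketch, assuming a regular expression is good enough for sentence boundaries (it still mis-splits abbreviations such as "e.g. "), also handles question marks and exclamation marks and repeats one sentence between neighboring chunks so that context is not lost at a boundary. The names split_sentences and chunk_with_overlap are illustrative.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_with_overlap(text, max_chunk_length=200, overlap=1):
    """Group sentences into chunks, repeating the last `overlap` sentences
    of each chunk at the start of the next one.

    max_chunk_length is a soft limit: a long sentence is never cut.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_length:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap increases the total amount of text to embed, so it trades index size for better recall near chunk boundaries.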

Step 3: Generate Embeddings

Use a pre-trained embedding model to convert each chunk into a dense vector. The first run downloads the model (about 90 MB).

print("\nLoading embedding model...")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all chunks
chunk_texts = [c["text"] for c in chunks]
embeddings = model.encode(chunk_texts, show_progress_bar=False)

print(f"Embedding shape: {embeddings.shape}")
print(f"Each chunk is now a {embeddings.shape[1]}-dimensional vector")

Loading embedding model...
Embedding shape: (10, 384)
Each chunk is now a 384-dimensional vector

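Each chunk is now represented as a 384-dimensional vector. The first number in the shape is the chunk count, so it matches whatever Step 2 reported on your run. Chunks with similar meaning will have vectors that are close together in this space.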

Step 4: Implement Cosine Similarity Search

Build the retrieval function. Cosine similarity is the cosine of the angle between two vectors: 1.0 means identical direction, 0.0 means orthogonal, and -1.0 means opposite directions.

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(query, model, embeddings, chunks, top_k=3):
    """Embed a query and return the top-k most similar chunks."""
    query_embedding = model.encode([query], show_progress_bar=False)[0]

    # Compute similarity against all chunks
    similarities = []
    for i, emb in enumerate(embeddings):
        sim = cosine_similarity(query_embedding, emb)
        similarities.append((sim, i))

    # Sort by similarity (descending) and return top-k
    similarities.sort(reverse=True)
    results = []
    for sim, idx in similarities[:top_k]:
        results.append({
            "score": float(sim),
            "title": chunks[idx]["title"],
            "text": chunks[idx]["text"],
        })
    return results
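The Python loop in retrieve is fine for a handful of chunks. For larger collections, the same ranking can be computed with a single matrix-vector product. The sketch below assumes embeddings is a 2-D NumPy array with one row per chunk and no all-zero rows. The name top_k_cosine is illustrative, and the toy 2-D vectors stand in for real model output.

```python
import numpy as np


def top_k_cosine(query_vec, embeddings, top_k=3):
    """Return (index, score) pairs for the top_k rows most similar to query_vec."""
    # Scale rows and query to unit length so a dot product equals cosine similarity.
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = emb_unit @ query_unit
    # argsort is ascending, so negate the scores to put the highest first.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy example: three 2-D "embeddings" and a query pointing along the x-axis.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
toy_query = np.array([2.0, 0.0])
ranked = top_k_cosine(toy_query, toy_embeddings, top_k=2)
print(ranked)
```

The row parallel to the query ranks first with a score of 1.0, and the row at 45 degrees ranks second. The query's length does not matter, only its direction.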

Step 5: Test Retrieval

Run a few queries and inspect the results.

queries = [
    "How do plants produce energy from sunlight?",
    "How does TCP establish a connection?",
    "What is SHA-256 used for?",
]

for query in queries:
    print(f"\nQuery: {query}")
    results = retrieve(query, model, embeddings, chunks, top_k=3)
    for rank, r in enumerate(results):
        print(f"  {rank + 1}. [{r['title']}] score={r['score']:.4f}")
        print(f"     {r['text'][:80]}...")

Query: How do plants produce energy from sunlight?
  1. [Photosynthesis] score=0.7234
     Photosynthesis is the process by which green plants and certain other organisms ...
  2. [Photosynthesis] score=0.5102
     The glucose provides energy for the plant's metabolic processes...
  3. [Mitochondria] score=0.2841
     Mitochondria are membrane-bound organelles found in the cytoplasm of eukaryotic...

Query: How does TCP establish a connection?
  1. [TCP/IP] score=0.7518
     Together they enable applications to communicate across diverse networks without...
  2. [TCP/IP] score=0.6293
     TCP/IP is the foundational protocol suite for the internet...
  3. [Hash Functions] score=0.1204
     A cryptographic hash function takes an arbitrary-length input and produces a fix...

Query: What is SHA-256 used for?
  1. [Hash Functions] score=0.6847
     SHA-256 produces a 256-bit digest and is widely used in TLS certificates, git, ...
  2. [Hash Functions] score=0.5231
     A cryptographic hash function takes an arbitrary-length input and produces a fix...
  3. [TCP/IP] score=0.1156
     TCP/IP is the foundational protocol suite for the internet...

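The retriever ranks chunks from the matching document first for each query. Scores for unrelated chunks are much lower, which gives a clear separation between relevant and irrelevant results. The exact scores and text snippets on your run depend on the model version and on the chunk boundaries from Step 2.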
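Because unrelated chunks score so much lower, one common refinement is to drop results below a minimum score instead of always returning top_k chunks. A minimal sketch, assuming results in the format returned by retrieve. The helper name and the 0.3 cutoff are illustrative, and the cutoff would need tuning for each model and corpus.

```python
def filter_by_score(results, min_score=0.3):
    """Keep only results whose similarity score is at least min_score."""
    return [r for r in results if r["score"] >= min_score]


# Example with scores shaped like the first query's output above.
sample = [
    {"score": 0.72, "title": "Photosynthesis", "text": "..."},
    {"score": 0.51, "title": "Photosynthesis", "text": "..."},
    {"score": 0.28, "title": "Mitochondria", "text": "..."},
]
kept = filter_by_score(sample)
print([r["title"] for r in kept])  # ['Photosynthesis', 'Photosynthesis']
```

If every result falls below the cutoff, the list is empty, which is a useful signal that the knowledge base probably does not cover the question.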

Step 6: Format the RAG Prompt

Build a prompt that injects the retrieved context, ready to send to an LLM.

def build_rag_prompt(query, results):
    """Format a RAG prompt with retrieved context."""
    context_parts = []
    for i, r in enumerate(results):
        context_parts.append(f"[Source {i + 1}: {r['title']}]\n{r['text']}")
    context = "\n\n".join(context_parts)

    prompt = (
        "Answer the question based only on the provided context. "
        "If the context does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    return prompt


# Build and display a RAG prompt
query = "How do plants convert sunlight into energy?"
results = retrieve(query, model, embeddings, chunks, top_k=2)
prompt = build_rag_prompt(query, results)

print("\nFull RAG Prompt:")
print("=" * 60)
print(prompt)
print("=" * 60)
print(f"\nPrompt length: {len(prompt)} characters")

Full RAG Prompt:
============================================================
Answer the question based only on the provided context. If the context does not contain enough information, say so.

Context:
[Source 1: Photosynthesis]
Photosynthesis is the process by which green plants and certain other organisms convert light energy into chemical energy. During photosynthesis, plants capture sunlight using chlorophyll in their leaves and use it to transform water and carbon dioxide into oxygen and glucose.

[Source 2: Photosynthesis]
The glucose provides energy for the plant's metabolic processes. Photosynthesis is responsible for producing the oxygen in Earth's atmosphere.

Question: How do plants convert sunlight into energy?

Answer:
============================================================

Prompt length: 612 characters

This prompt follows the standard RAG pattern: an explicit instruction to use only the context, the retrieved chunks with source attribution, and the user's question. An LLM receiving this prompt is steered to ground its answer in the retrieved text rather than rely on its training data alone, although the instruction does not guarantee it.
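Models have a limited context window, so retrieved chunks sometimes have to be trimmed before they go into the prompt. The sketch below assumes a simple character budget as a stand-in for token counting, which is only an approximation; a real system would count tokens with the model's tokenizer. The select_within_budget helper is hypothetical and expects results sorted best-first.

```python
def select_within_budget(results, max_chars=500):
    """Keep the highest-ranked results whose combined text fits in max_chars.

    Assumes `results` is already sorted best-first, as retrieve returns it.
    """
    selected, used = [], 0
    for r in results:
        if used + len(r["text"]) > max_chars:
            break  # stop at the first chunk that would overflow the budget
        selected.append(r)
        used += len(r["text"])
    return selected
```

In the lab script this would sit between the two existing calls: pass the output of retrieve through select_within_budget before handing it to build_rag_prompt.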

Step 7 (Optional): Call an LLM API

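If you have an OpenAI API key, you can complete the pipeline by sending the prompt to a model. This requires the openai package (pip install openai) and the key in the OPENAI_API_KEY environment variable. This step is optional; the retrieval pipeline is complete without it.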

# Uncomment the following to call the OpenAI API:
#
# import os
# from openai import OpenAI
#
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
#
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": prompt}],
#     max_tokens=200,
# )
#
# print("\nLLM Response:")
# print(response.choices[0].message.content)
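If you do enable this step, wrapping the call in a function keeps the rest of the script independent of the provider and makes it easy to test with a stand-in client. This is a sketch, assuming the client object exposes the same chat.completions.create method used in the commented code above. The generate_answer name is illustrative.

```python
def generate_answer(prompt, client, model="gpt-4o-mini", max_tokens=200):
    """Send a finished RAG prompt to a chat model and return the reply text.

    `client` is expected to behave like the OpenAI client shown above.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

Passing the client in as an argument, rather than creating it inside the function, means no API key is needed until the function is actually called with a real client.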

Summary

This lab built a complete RAG pipeline from scratch:

Document chunking: split text at sentence boundaries to create retrievable units
Embedding generation: converted chunks to 384-dimensional vectors using a pre-trained model
Cosine similarity search: found the most relevant chunks for a query by comparing vector directions
Retrieval function: returned top-k chunks ranked by semantic similarity
Prompt formatting: injected retrieved context into a structured prompt with source attribution

After the one-time model download, the entire retrieval pipeline, from query to formatted prompt, runs locally on a laptop in milliseconds per query with no external services. Production systems add vector databases for scale, reranking for precision, and hybrid search for exact-match queries, but the core logic is the same.
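To give a feel for what hybrid search adds, the sketch below blends the vector score with a crude keyword-overlap score so that exact terms such as SHA-256 can lift a chunk in the ranking. It is a simplified stand-in: production systems typically use a proper lexical ranker such as BM25 rather than this word-overlap fraction. The function names and the alpha weight are illustrative assumptions.

```python
import re


def keyword_score(query, text):
    """Fraction of distinct query words that appear in the text (case-insensitive)."""
    query_words = set(re.findall(r"[\w-]+", query.lower()))
    text_words = set(re.findall(r"[\w-]+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def hybrid_rerank(query, results, alpha=0.7):
    """Re-sort results by alpha * vector score + (1 - alpha) * keyword score."""
    rescored = []
    for r in results:
        combined = alpha * r["score"] + (1 - alpha) * keyword_score(query, r["text"])
        rescored.append({**r, "hybrid_score": combined})
    return sorted(rescored, key=lambda r: r["hybrid_score"], reverse=True)
```

With alpha=1.0 the ranking is purely semantic, and lowering alpha gives exact word matches more influence.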