A chatbot generates text. An agent takes actions. The transition from chatbot to agent requires three additions: the ability to call external tools, a loop that observes results and decides what to do next, and memory that persists across steps. This lesson covers the architectures that make LLMs act on the world rather than just describe it.
A chatbot processes a user message and produces a response. An agent operates in a loop:
The agent loop continues until the model determines the task is complete. This is fundamentally different from single-turn generation — the model maintains state across multiple steps and adapts its plan based on intermediate results.
Many useful tasks cannot be completed in a single LLM call. Debugging code requires reading the error, forming a hypothesis, making an edit, running the test, and iterating. Research requires searching multiple sources, synthesizing findings, identifying gaps, and searching again. Planning requires generating a plan, evaluating it against constraints, and revising. The agent loop provides the iteration structure that single-turn generation lacks.
Modern LLM APIs support function calling: the developer defines a set of tools (as JSON schemas describing function names, parameters, and return types), and the model can choose to call a tool instead of generating text. The API returns a structured function call, the application executes the function, and the result is fed back to the model.
Tools extend the model’s capabilities beyond text generation. A model cannot query a database, but it can generate a SQL query and call a tool that executes it. A model cannot browse the web, but it can call a search API. A model cannot run code, but it can call a code execution sandbox.
Tool definitions are prompts — the model decides which tool to use based on the tool’s name, description, and parameter descriptions. Clear, specific tool descriptions produce better tool selection. “search_database: Search the product catalog by name, category, or price range” is better than “search: Run a search.” Parameter descriptions should include constraints and examples: “price_max: Maximum price in USD (e.g., 49.99).” The model treats tool definitions as part of its context, so the same prompt engineering principles apply.
ReAct (Reasoning + Acting) interleaves chain-of-thought reasoning with tool use. At each step, the model first generates a “Thought” explaining its reasoning, then an “Action” specifying the tool call, then receives an “Observation” containing the result. This three-part structure makes the agent’s decision-making transparent and debuggable.
Thought: The user wants the current stock price of AAPL.
I need to call the stock price API.
Action: get_stock_price(symbol="AAPL")
Observation: {"symbol": "AAPL", "price": 187.43, "currency": "USD"}
Thought: I have the price. I can now respond to the user.
Answer: Apple (AAPL) is currently trading at $187.43 USD.
The Thought step is critical — it forces the model to reason about what information it has and what it needs before acting, reducing impulsive or incorrect tool calls.
Modern function-calling APIs often skip the explicit Thought/Action/Observation format in favor of implicit reasoning — the model decides which tool to call without generating visible reasoning. This is faster and cheaper (fewer tokens) but less interpretable. For debugging, auditing, or safety-critical applications, explicit ReAct-style reasoning provides a trace of the agent’s decision-making. Some implementations use both: explicit reasoning for complex decisions, direct function calls for routine operations.
Single agent: one model handles all reasoning and tool use. Simple to build, effective for well-scoped tasks. Most production deployments use this pattern.
Multi-agent: multiple specialized agents collaborate. One agent handles research, another writes code, a third reviews. Communication happens through shared context or message passing. More complex to orchestrate but allows specialization — each agent can have different system prompts, tools, and even different underlying models.
Orchestrator pattern: a central agent decomposes the task, delegates subtasks to specialized agents, collects results, and synthesizes the final output. The orchestrator handles planning and coordination; worker agents handle execution. This is the architecture behind systems like Claude Code and similar coding agents.
Multi-agent architectures are justified when: different subtasks require different tools or permissions (a research agent should not have write access to production databases), the task naturally decomposes into independent parallel subtasks, or different subtasks benefit from different models (a cheap fast model for routine work, an expensive capable model for complex reasoning). For most applications, a single well-prompted agent with good tools is simpler and sufficient.
Complex tasks require planning before execution. Without explicit planning, agents often take the first available action and get stuck in loops or pursue unproductive paths.
Effective agent planning strategies:
Planning quality improves with chain-of-thought prompting, explicit plan formats (“Step 1: … Step 2: …”), and self-critique (“Is this plan complete? What could go wrong?”).
Agents that plan too much waste tokens and time on detailed plans that become obsolete after the first unexpected result. Agents that plan too little take random walks through the action space. The balance is task-dependent. For routine operations (file manipulation, database queries), minimal planning is fine. For open-ended research or complex debugging, upfront planning with adaptive revision is necessary. The best agents recognize when their plan has failed and re-plan rather than continuing to execute a broken plan.
Agents need memory at two scales.
Short-term memory is the context window. Everything the agent has observed, thought, and done in the current session is available as context. This is limited by the context window size and degrades as the context grows (the “lost in the middle” problem). Strategies: summarize earlier steps to free context space, store key findings in a scratchpad at the top of the context.
Long-term memory persists across sessions. Retrieved from a vector database or key-value store, long-term memory lets agents remember user preferences, past conversations, learned procedures, and accumulated knowledge. RAG is the retrieval mechanism — the agent queries its memory store the same way a RAG system queries a document collection.
Effective agent memory systems separate different types of memory. Episodic memory stores what happened (conversation transcripts, action logs). Semantic memory stores what the agent knows (facts, user preferences, domain knowledge). Procedural memory stores how to do things (successful strategies, learned workflows). Each type benefits from different storage and retrieval strategies. Episodic memory is retrieved chronologically; semantic memory is retrieved by similarity; procedural memory is retrieved by task type.
MCP is an open standard for connecting LLMs to external data sources and tools. Instead of each application implementing its own tool integrations, MCP provides a standardized interface. An MCP server exposes resources (data the model can read) and tools (actions the model can take). An MCP client (the LLM application) discovers available servers and their capabilities at runtime.
This is analogous to USB for AI — a universal connector protocol. A coding agent with MCP support can connect to any MCP-compatible database, API, file system, or service without custom integration code. The protocol handles tool discovery, parameter validation, and result formatting.
MCP servers exist for databases (PostgreSQL, SQLite), version control (GitHub), file systems, search engines, and many other services. An agent configured with multiple MCP servers can query a database, search the web, read files, and create pull requests — all through a uniform interface. The agent does not need to know the implementation details of each service; it discovers available tools through the protocol and uses them through standardized function calls. This composability is what makes MCP significant: it shifts the integration burden from application developers to protocol-level standardization.
Coding agents (Claude Code, Cursor, GitHub Copilot Workspace): read code, understand the codebase, plan changes, write code, run tests, iterate on failures. The agent loop is essential — coding rarely succeeds in a single generation step.
Research agents: given a question, search multiple sources, evaluate the credibility and relevance of each, synthesize findings, identify gaps, and search again. The agent produces a research report grounded in cited sources.
Browser automation: navigate web pages, fill forms, extract data, interact with web applications. The agent observes the page state (via screenshot or DOM), decides what element to interact with, takes the action, and observes the result.
The reliability of an agent is the product of the reliability of each step. If each step succeeds with 95% probability and the task requires 10 steps, the overall success rate is 0.95^10 = 60%. This is why agent reliability is an active research area. Strategies: better models (higher per-step reliability), better tools (reducing the number of steps needed), verification loops (checking results before proceeding), and fallback strategies (detecting failures and trying alternative approaches).
This lesson establishes:
Next: LLMs Check