Reframe from three modes to two worlds

Restructures section 01 from "web chat / in-editor / agentic" into "web
chat vs. tools that live with your code," with the autocomplete /
in-project chat / agentic spectrum as a sub-structure of the latter.
Inline edits are reduced to a historical note tied to the 2023
instruction-tuned LLM era.

- Rename 01-three-modes -> 01-two-worlds and 03-in-editor-workflow ->
  03-autocomplete; section 03 narrows to autocomplete (ghost text habits,
  the autocomplete-your-verification trap)
- Section 04 reframes in-project chat as the default venue, web chat as
  a special-case venue; adds "Carrying context across sessions" covering
  dev-log.md, CLAUDE.md, .cursorrules
- Section 05 reworks intro to contrast against in-project chat instead
  of "editor extension"; tightens prose and removes em-dashes
- Update cross-references and tool-mode language in 02, 06, 07, and
  the root README to match the new framing
- Swap the CRDT example in section 04 for finite-volume methods, fitting
  the CHEG audience
- Minor typo/wording fixes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eric Furst 2026-05-28 23:01:09 -04:00
commit d2ca02bd90
10 changed files with 308 additions and 270 deletions

@@ -2,13 +2,13 @@
## Key idea
An agentic tool is an AI that takes actions on its own, whether it's reading a file, running a command, editing, testing, reading the output, editing again, without you mediating each step. You set the goal and the agent runs the loop. That power is only useful when paired with judgment about *when* to deploy an agent and *how* to supervise it.
An agentic tool is an AI that takes actions on its own (reading files, running commands, editing, testing, observing results, editing again) without you mediating each step. You set the goal and the agent runs the loop. That power is only useful when paired with judgment about *when* to deploy an agent and *how* to supervise it.
This section is about *using* agentic tools as an engineer or scientist solving problems with code — models, data analysis, simulations, coursework — rather than as someone building production software for end users. If you want to understand how tool use works under the hood and how to build a system like this, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop) section on tool use and agentic systems.
This section is about *using* agentic tools as an engineer or scientist (for modeling, data analysis, simulations, or coursework), not building production software for end users. For how tool use works under the hood, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop) section on tool use and agentic systems.
## Key goals
- Recognize agentic mode and what distinguishes it from an editor extension
- Recognize when you have moved from in-project chat into agentic territory, and what changes when you do
- Identify the kinds of task where an agent is most useful
- Brief an agent in a way that produces good results
- Supervise effectively: scope, permissions, review
@@ -18,71 +18,72 @@ This section is about *using* agentic tools as an engineer or scientist solving
## What an agent actually does
A typical agentic tool is built on the same underlying model as the chat or editor extension, but wrapped in a *loop* that lets the model take actions in your environment. A simplified version of one step:
An agentic tool wraps a chat model in a *loop* that lets it take actions in your environment. In practice, the same in-project chat panel you use for a one-shot question ([section 04](../04-conversations/)) becomes an agent the moment you give it a multi-step goal. There is no separate "agent app" to launch. One step in the loop:
1. The model receives your goal and the current state (files, terminal output, etc.)
2. The model decides on the next action: read a file, run a command, write an edit
3. The action runs; the result is fed back to the model
4. Repeat until the model believes the goal is met (or it asks you a question)
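The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not how any particular tool is implemented: `scripted_model` is a hypothetical stand-in for a real model call, and the action names (`read_file`, `done`) and the fake filesystem are invented for the example.

```python
from typing import Callable

# Hypothetical action format: (name, argument). Real tools define their own.
Action = tuple[str, str]


def run_agent(goal: str,
              model: Callable[[str, list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> list[str]:
    """Run the decide-act-observe loop until the model says it is done."""
    history: list[str] = []               # observations fed back to the model
    for _ in range(max_steps):            # hard cap so a stuck loop cannot run forever
        name, arg = model(goal, history)  # the model decides on the next action
        if name == "done":
            break
        result = tools[name](arg)         # the action actually runs
        history.append(f"{name}({arg}) -> {result}")
    return history


def scripted_model(goal: str, history: list[str]) -> Action:
    """Stand-in for a real model: read one file, then stop."""
    if not history:
        return ("read_file", "notes.txt")
    return ("done", "")


# A fake filesystem so the sketch runs without touching real files.
fake_files = {"notes.txt": "TODO: add --dry-run flag"}
tools = {"read_file": lambda path: fake_files[path]}

log = run_agent("summarize notes.txt", scripted_model, tools)
print(log)
```

The `max_steps` cap is a design choice worth noticing: the loop itself has no notion of progress, so something outside the model has to bound it.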
The two things that make this different from in-editor edits:
What's new compared to a single chat message or autocomplete:
- **Actions are real.** The agent can run `rm`, `git push`, `pip install`, or hit external APIs. Permission models vary by tool, but the capability is the defining feature.
- **The model owns the plan.** You don't write the steps. Instead, the model figures out the steps.
- **Actions are real.** The agent can run `rm`, `git push`, `pip install`, or hit external APIs. Permission models vary, but the capability is the defining feature.
- **The model owns the plan.** You don't write the steps; the model figures them out and runs them.
**Examples (early 2026):** Claude Code (CLI), Cursor (agent mode and Background Agents), Cline and Windsurf Cascade (VS Code), Microsoft Copilot agent, GitHub Copilot Workspace, Aider, and more autonomous platforms like Devin and Replit Agent.
## Variations on the basic loop
The read-act-observe-act cycle described above is the core, but the agentic landscape has expanded substantially through 2025 and into early 2026, and a working knowledge of the main variations helps you choose the right tool and supervise it well.
The read-act-observe-act cycle is the core, but the landscape expanded substantially through 2025:
- **Sub-agents and parallelism.** A primary agent can spawn sub-agents to handle independent branches of work — searching different parts of a codebase at once, running parallel investigations, or specializing roles such as one agent writing and another reviewing. Claude Code's `Agent` tool and similar features in other platforms enable this. The supervision burden shifts, though. You are no longer watching one loop but several.
- **Plan-then-execute modes.** Many agents now offer a mode where they first produce a written plan, you review and edit it, and only then do they execute. Claude Code's plan mode, Cursor's planning step, and similar features fit this pattern. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task.
- **Async and background agents.** Some agents run while you do other things and report back when finished — Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace. The trade is real-time visibility for parallelism with your own work, and it changes how you brief the agent because you cannot easily course-correct mid-task.
- **MCP and external tools.** The Model Context Protocol, introduced by Anthropic in late 2024 and widely adopted since, lets agents connect to external systems through standardized servers — Slack, Linear, GitHub, databases, monitoring dashboards, file systems on remote machines. "The agent reads files and runs commands" is now a starting point rather than a ceiling; in practice, agents reach into whatever services your team uses.
- **Sandboxed execution.** Some agents run inside isolated virtual machines or containers, which limits what an agent can affect and means destructive actions only impact the sandbox. Devin and some Cursor modes work this way. The downside is reduced access to your real environment, but the upside is genuine freedom to experiment without risking your machine.
- **Plan-then-execute modes.** The agent first produces a written plan; you review and edit it; only then does it execute. Claude Code's plan mode and Cursor's planning step fit here. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task.
- **Sub-agents and parallelism.** A primary agent spawns sub-agents for independent branches of work, such as searching different parts of a codebase at once or specializing roles (one writes, one reviews). The supervision burden shifts: you watch several loops, not one.
- **Async and background agents.** Some agents run while you do other things and report back when finished (Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace). You trade real-time visibility for parallelism with your own work, and you have to brief more carefully because mid-task course-correction is hard.
- **MCP and external tools.** The Model Context Protocol, introduced by Anthropic in late 2024, lets agents connect to external systems (Slack, Linear, GitHub, databases, dashboards, remote filesystems) through standardized servers. "Reads files and runs commands" is now a starting point, not a ceiling.
- **Sandboxed execution.** Some agents run inside isolated VMs or containers, so destructive actions only affect the sandbox. The downside is reduced access to your real environment; the upside is genuine room to experiment.
These variations do not change the supervision principles below — clear briefs, permission control, review the result — but they do change how those principles are applied. A plan-mode agent shifts review from the result to the plan; a sub-agent setup means supervising several flows at once; a sandboxed agent means review can be looser because consequences are contained.
These variations don't change the supervision principles below, only how they're applied: plan mode shifts review from result to plan, sub-agents multiply the loops you watch, sandboxing lets review run looser because consequences are contained.
## When to use an agent
Agents shine on tasks where the *work between steps* is the expensive part for a human:
Agents are best for tasks where the *work between steps* is the expensive part for a human:
- **Multi-file changes that need verification.** *"Rename this concept across the codebase and make sure the tests still pass"* — or, for scientific code without a formal test suite, *"...and make sure my analysis script still reproduces the expected numbers."* The agent reads, edits, re-runs the verification, re-edits if needed. You would do the same thing manually, with much more context-switching.
- **Multi-file changes that need verification.** *"Rename this concept across the codebase and make sure the tests still pass."* For scientific code without a formal test suite: *"...and make sure my analysis script still reproduces the expected numbers."* The agent reads, edits, re-runs verification, re-edits if needed. You would do the same thing manually with much more context-switching.
- **Exploring an unfamiliar codebase.** *"How is authentication handled in this project? Find the entry point and explain the flow."* The agent grep-walks the project; you read the summary.
- **Repetitive maintenance.** *"Update all the imports from `old_lib` to `new_lib` and adjust the calls that changed."* Mechanical, scoped, verifiable.
- **End-to-end small features in well-tested code.** *"Add an endpoint that does X, following the patterns in the existing endpoints. Update the tests."*
Agents are *less* useful for:
- **A single line you already know how to write.** Inline edit is faster.
- **A single line you already know how to write.** Autocomplete or typing it yourself is faster.
- **A targeted edit you can describe in one sentence.** A single message to the in-project chat ([section 04](../04-conversations/)) is faster than spinning up an agent loop.
- **A design discussion.** Use chat. The agent has nowhere to act.
- **Anything where you don't know what "done" looks like.** The agent will reach a state and stop; if you can't tell whether it's the right state, you've shifted the problem rather than solving it.
- **Anything where you don't know what "done" looks like.** The agent will reach a state and stop. If you can't tell whether it's the right state, you've shifted the problem rather than solved it.
## Briefing an agent well
A good agent brief looks more like a task description for a new teammate than a search query. Include:
A good brief looks more like a task description for a new teammate than a search query. Include:
- **The goal**, stated as outcome rather than steps. *"Add a `--dry-run` flag to the `migrate` command that prints what would change without writing anything."*
- **Constraints** the agent might not infer. *"Use the existing logging helper rather than `print`. Match the style of the other flags."*
- **What "done" means.** *"All existing tests still pass. There is a new test verifying the `--dry-run` output for the simple case."* For code without a formal test suite, substitute whichever form of verification you use — sanity-check runs, known-answer comparisons, a regression script's expected output.
- **What "done" means.** *"All existing tests still pass. There is a new test verifying the `--dry-run` output for the simple case."* For code without a formal test suite, substitute whichever verification you use (sanity-check runs, known-answer comparisons, regression scripts).
- **What to ask about, not assume.** *"If the migration step has side effects I can't easily reverse, stop and ask before running it."*
The single biggest predictor of an agent doing the right thing is how well-bounded the task is. *"Improve this code"* is poorly bounded; the agent will improve it in directions you may not want. *"Reduce the duplication between `parse_csv` and `parse_tsv` by extracting a shared helper, preserving the existing return signatures"* is well-bounded.
The biggest predictor of an agent doing the right thing is how well-bounded the task is. *"Improve this code"* is poorly bounded; the agent will improve it in directions you may not want. *"Reduce the duplication between `parse_csv` and `parse_tsv` by extracting a shared helper, preserving the existing return signatures"* is well-bounded.
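One way to make the four parts of a brief hard to skip is to fill in a small template before invoking the tool. The `Brief` class below is a hypothetical helper written for this illustration, not part of any agentic tool; the example text is taken from the bullets above.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Hypothetical container for the four parts of an agent brief."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    done_when: list[str] = field(default_factory=list)
    ask_before: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the brief as plain text, skipping any empty section."""
        sections = [
            ("Goal", [self.goal]),
            ("Constraints", self.constraints),
            ("Done when", self.done_when),
            ("Stop and ask before", self.ask_before),
        ]
        lines: list[str] = []
        for title, items in sections:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


brief = Brief(
    goal="Add a --dry-run flag to the migrate command that prints what "
         "would change without writing anything.",
    constraints=["Use the existing logging helper rather than print."],
    done_when=["All existing tests still pass.",
               "A new test verifies the --dry-run output for the simple case."],
    ask_before=["Running any migration step with side effects that are "
                "hard to reverse."],
)
text = brief.render()
print(text)
```

An empty `done_when` list is a useful warning sign: if you cannot fill it in, the task is probably not yet well-bounded.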
## Supervision
Agentic tools work because they take real actions. That means real consequences if they take the wrong ones. Three things to think about before letting an agent loose:
Agentic tools work because they take real actions, which means real consequences when those actions are wrong. Three things to think about before letting one loose:
### Permissions
Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked outright. Default toward *more* confirmation when you are starting out with a new tool or a new codebase. Speed up later as you learn what the agent does well.
Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked. Default toward *more* confirmation with a new tool or a new codebase, and speed up as you learn what the agent does well.
A useful rule of thumb: **destructive or remote-affecting actions deserve confirmation.** Local edits to a project under version control are reversible. `git push --force`, `rm -rf`, `pip uninstall`, and anything that hits an external service or shared system are not.
Rule of thumb: **destructive or remote-affecting actions deserve confirmation.** Local edits to a project under version control are reversible. `git push --force`, `rm -rf`, `pip uninstall`, and anything that hits an external service or shared system are not.
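The rule of thumb can be pictured as a small classifier over proposed commands. The prefix list and the `needs_confirmation` helper below are hypothetical and deliberately incomplete. Real tools ship their own permission settings, and this is not the configuration format of any of them.

```python
import shlex

# Hypothetical, deliberately incomplete list of command prefixes that are
# destructive or reach beyond the local working tree.
CONFIRM_PREFIXES = [
    ["rm"],
    ["git", "push"],
    ["pip", "uninstall"],
    ["curl"],
]


def needs_confirmation(command: str) -> bool:
    """Return True if the command starts with any prefix on the confirm list."""
    tokens = shlex.split(command)
    return any(tokens[:len(prefix)] == prefix for prefix in CONFIRM_PREFIXES)


for cmd in ["git diff", "git push --force", "rm -rf build", "pytest -q"]:
    verdict = "ask first" if needs_confirmation(cmd) else "auto-run"
    print(f"{cmd!r}: {verdict}")
```

Prefix matching like this is easy to evade (a destructive command wrapped in a shell script would pass), so treat it as a picture of the idea rather than a safeguard. An allowlist of commands that may run automatically, with everything else confirmed, fails in the safer direction.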
### Working directory and damage control
@@ -94,12 +95,12 @@ An agent pointed at a fresh sandbox can experiment freely. An agent pointed at y
### Review
The agent's report *"I added the flag, updated the tests, and they pass"* describes what it intended to do, not necessarily what it did. Always check:
The agent's report (*"I added the flag, updated the tests, and they pass"*) describes what it intended to do, not necessarily what it did. Always check:
- `git diff` (or the equivalent) — what actually changed?
- Any verification — tests, sanity-check scripts, known-answer comparisons — did it actually verify the new behavior, or did it get loosened to pass?
- Any new files — were they expected?
- Any commands run — were there surprises in the output?
- `git diff`: what actually changed?
- Verification: did the tests or sanity checks actually exercise the new behavior, or did they get loosened to pass?
- New files: were they expected?
- Commands run: any surprises in the output?
Spot-checking is fast. Skipping it is how subtle bugs and security issues land in your codebase.
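The verification check in the list above can be partly mechanized. The sketch below takes a list of changed paths (for example, the output of `git diff --name-only`) and flags the ones that look like tests or verification scripts. The naming patterns are assumptions about a typical Python project, and the example paths are made up.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Assumed naming conventions; adjust to match your project.
VERIFICATION_PATTERNS = ["test_*.py", "*_test.py", "check_*.py", "conftest.py"]


def touched_verification_files(changed_paths: list[str]) -> list[str]:
    """Return the changed paths whose file name matches a verification pattern."""
    flagged = []
    for path in changed_paths:
        name = PurePosixPath(path).name
        if any(fnmatch(name, pattern) for pattern in VERIFICATION_PATTERNS):
            flagged.append(path)
    return flagged


# Example input, shaped like the output of `git diff --name-only`.
changed = ["src/migrate.py", "tests/test_migrate.py", "README.md"]
flagged = touched_verification_files(changed)
print(flagged)
```

A flagged file is not evidence that a check was loosened. It only tells you which diffs to read line by line instead of skimming.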
@@ -108,23 +109,23 @@ Spot-checking is fast. Skipping it is how subtle bugs and security issues land i
Agentic tools use many model calls per task. A task that takes one back-and-forth in chat can take thirty in an agent. Watch for:
- **Long-running loops.** If an agent has been working for a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind.
- **Wide context.** Agents that read many files pay for that context on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo.
- **Wandering.** If the agent has drifted from the original goal, stopping and restarting with a tighter brief is usually cheaper than letting it wander back on its own.
- **Long-running loops.** If an agent has been working a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind.
- **Wide context.** Agents pay for every file they read on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo.
- **Wandering.** If the agent has drifted from the original goal, stop and restart with a tighter brief rather than letting it wander back on its own.
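A rough calculation shows why step count and context width compound. The sketch assumes that every step re-sends the full accumulated context and that each step adds a fixed number of tokens. Real tools may cache or compact context, and the token counts are illustrative, not measurements from any tool.

```python
def total_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Sum the context sent on each step when every step re-sends all prior context."""
    return sum(base_tokens + k * added_per_step for k in range(steps))


# Illustrative numbers only.
one_chat_turn = total_input_tokens(steps=1, base_tokens=2_000, added_per_step=1_500)
thirty_step_run = total_input_tokens(steps=30, base_tokens=2_000, added_per_step=1_500)
print(one_chat_turn, thirty_step_run)  # 2000 712500
```

Under these assumptions the total grows roughly with the square of the step count, which is why intervening early in a stuck loop saves more than it might seem.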
## Common failure modes
- **The agent does the wrong thing efficiently.** The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch this in review and brief better next time.
- **Checks get loosened rather than the code being fixed.** The agent finds a failing test or a sanity check that doesn't pass, decides the check was wrong, and weakens it rather than fixing what it was checking. Always look at what changed in your verification scripts and test files.
- **Cascading small edits.** The agent makes a small change, notices a knock-on, fixes that, notices another, fixes that... twenty edits later, half the codebase has been touched. Tight scopes and good initial briefs prevent this.
- **Confident hallucinations about a library or API.** The agent will use a function that doesn't exist with full confidence, then patch around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar.
- **The agent does the wrong thing efficiently.** The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch in review and brief better next time.
- **Checks get loosened rather than the code being fixed.** The agent finds a failing check, decides the check was wrong, and weakens it. Always look at what changed in your verification scripts and test files.
- **Cascading small edits.** A small change triggers a knock-on, which triggers another, and twenty edits later half the codebase has been touched. Tight scope and a good brief prevent this.
- **Confident hallucinations about a library or API.** The agent uses a function that doesn't exist, then patches around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar.
- **Permissions creep.** "Just this once, allow this command unsupervised" turns into a default. Re-tighten when you change tasks.
## Exercises
> **Exercise 1:** Pick a small, well-scoped task you've been putting off — a refactor, a chore, a small feature — and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot.
> **Exercise 1:** Pick a small, well-scoped task you've been putting off (a refactor, a chore, a small feature) and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot.
> **Exercise 2:** Compare an agentic run with a manual run of the same task on a small scale. Time both. Account not just for elapsed time but for the *quality* of the result and the time you spent reviewing.