coding-with-ai/05-agentic-workflow
Eric Furst d2ca02bd90 Reframe from three modes to two worlds
Restructures section 01 from "web chat / in-editor / agentic" into "web
chat vs. tools that live with your code," with the autocomplete /
in-project chat / agentic spectrum as a sub-structure of the latter.
Inline edits are reduced to a historical note tied to the 2023
instruction-tuned LLM era.

- Rename 01-three-modes -> 01-two-worlds and 03-in-editor-workflow ->
  03-autocomplete; section 03 narrows to autocomplete (ghost text habits,
  the autocomplete-your-verification trap)
- Section 04 reframes in-project chat as the default venue, web chat as
  a special-case venue; adds "Carrying context across sessions" covering
  dev-log.md, CLAUDE.md, .cursorrules
- Section 05 reworks intro to contrast against in-project chat instead
  of "editor extension"; tightens prose and removes em-dashes
- Update cross-references and tool-mode language in 02, 06, 07, and
  the root README to match the new framing
- Swap the CRDT example in section 04 for finite-volume methods, fitting
  the CHEG audience
- Minor typo/wording fixes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 23:01:09 -04:00
..
README.md Reframe from three modes to two worlds 2026-05-28 23:01:09 -04:00

Agentic Workflow

Key idea

An agentic tool is an AI that takes actions on its own (reading files, running commands, editing, testing, observing results, editing again) without you mediating each step. You set the goal and the agent runs the loop. That power is only useful when paired with judgment about when to deploy an agent and how to supervise it.

This section is about using agentic tools as an engineer or scientist (for modeling, data analysis, simulations, or coursework), not building production software for end users. For how tool use works under the hood, see the llm-workshop section on tool use and agentic systems.

Key goals

  • Recognize when you have moved from in-project chat into agentic territory, and what changes when you do
  • Identify the kinds of task where an agent is most useful
  • Brief an agent in a way that produces good results
  • Supervise effectively: scope, permissions, review
  • Be aware of cost and failure modes

What an agent actually does

An agentic tool wraps a chat model in a loop that lets it take actions in your environment. In practice, the same in-project chat panel you use for a one-shot question (section 04) becomes an agent the moment you give it a multi-step goal. There is no separate "agent app" to launch. One step in the loop:

  1. The model receives your goal and the current state (files, terminal output, etc.)
  2. The model decides on the next action: read a file, run a command, write an edit
  3. The action runs; the result is fed back to the model
  4. Repeat until the model believes the goal is met (or it asks you a question)

What's new compared to a single chat message or autocomplete:

  • Actions are real. The agent can run rm, git push, pip install, or hit external APIs. Permission models vary, but the capability is the defining feature.
  • The model owns the plan. You don't write the steps; the model figures them out and runs them.

Examples (early 2026): Claude Code (CLI), Cursor (agent mode and Background Agents), Cline and Windsurf Cascade (VS Code), Microsoft Copilot agent, GitHub Copilot Workspace, Aider, and more autonomous platforms like Devin and Replit Agent.

Variations on the basic loop

The read-act-observe-act cycle is the core, but the landscape expanded substantially through 2025:

  • Plan-then-execute modes. The agent first produces a written plan; you review and edit it; only then does it execute. Claude Code's plan mode and Cursor's planning step fit here. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task.
  • Sub-agents and parallelism. A primary agent spawns sub-agents for independent branches of work, such as searching different parts of a codebase at once or specializing roles (one writes, one reviews). The supervision burden shifts: you watch several loops, not one.
  • Async and background agents. Some agents run while you do other things and report back when finished (Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace). You trade real-time visibility for parallelism with your own work, and you have to brief more carefully because mid-task course-correction is hard.
  • MCP and external tools. The Model Context Protocol, introduced by Anthropic in late 2024, lets agents connect to external systems (Slack, Linear, GitHub, databases, dashboards, remote filesystems) through standardized servers. "Reads files and runs commands" is now a starting point, not a ceiling.
  • Sandboxed execution. Some agents run inside isolated VMs or containers, so destructive actions only affect the sandbox. The downside is reduced access to your real environment; the upside is genuine room to experiment.

These variations don't change the supervision principles below, only how they're applied: plan mode shifts review from result to plan, sub-agents multiply the loops you watch, sandboxing lets review run looser because consequences are contained.

When to use an agent

Agents are best for tasks where the work between steps is the expensive part for a human:

  • Multi-file changes that need verification. "Rename this concept across the codebase and make sure the tests still pass." For scientific code without a formal test suite: "...and make sure my analysis script still reproduces the expected numbers." The agent reads, edits, re-runs verification, re-edits if needed. You would do the same thing manually with much more context-switching.
  • Exploring an unfamiliar codebase. "How is authentication handled in this project? Find the entry point and explain the flow." The agent grep-walks the project; you read the summary.
  • Repetitive maintenance. "Update all the imports from old_lib to new_lib and adjust the calls that changed." Mechanical, scoped, verifiable.
  • End-to-end small features in well-tested code. "Add an endpoint that does X, following the patterns in the existing endpoints. Update the tests."

Agents are less useful for:

  • A single line you already know how to write. Autocomplete or typing it yourself is faster.
  • A targeted edit you can describe in one sentence. A single message to the in-project chat (section 04) is faster than spinning up an agent loop.
  • A design discussion. Use chat. The agent has nowhere to act.
  • Anything where you don't know what "done" looks like. The agent will reach a state and stop. If you can't tell whether it's the right state, you've shifted the problem rather than solved it.

Briefing an agent well

A good brief looks more like a task description for a new teammate than a search query. Include:

  • The goal, stated as outcome rather than steps. "Add a --dry-run flag to the migrate command that prints what would change without writing anything."
  • Constraints the agent might not infer. "Use the existing logging helper rather than print. Match the style of the other flags."
  • What "done" means. "All existing tests still pass. There is a new test verifying the --dry-run output for the simple case." For code without a formal test suite, substitute whichever verification you use (sanity-check runs, known-answer comparisons, regression scripts).
  • What to ask about, not assume. "If the migration step has side effects I can't easily reverse, stop and ask before running it."

The biggest predictor of an agent doing the right thing is how well-bounded the task is. "Improve this code" is poorly bounded; the agent will improve it in directions you may not want. "Reduce the duplication between parse_csv and parse_tsv by extracting a shared helper, preserving the existing return signatures" is well-bounded.

Supervision

Agentic tools work because they take real actions, which means real consequences when those actions are wrong. Three things to think about before letting one loose:

Permissions

Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked. Default toward more confirmation with a new tool or a new codebase, and speed up as you learn what the agent does well.

Rule of thumb: destructive or remote-affecting actions deserve confirmation. Local edits to a project under version control are reversible. git push --force, rm -rf, pip uninstall, and anything that hits an external service or shared system are not.

Working directory and damage control

An agent pointed at a fresh sandbox can experiment freely. An agent pointed at your home directory can do real damage. Before starting:

  • Be sure you are in the right directory
  • Have a clean git state (or know what's uncommitted) so you can see what the agent changed
  • Know what the agent has access to outside the project (secrets, environment variables, network)

Review

The agent's report ("I added the flag, updated the tests, and they pass") describes what it intended to do, not necessarily what it did. Always check:

  • git diff: what actually changed?
  • Verification: did the tests or sanity checks actually exercise the new behavior, or did they get loosened to pass?
  • New files: were they expected?
  • Commands run: any surprises in the output?

Spot-checking is fast. Skipping it is how subtle bugs and security issues land in your codebase.

Cost awareness

Agentic tools use many model calls per task. A task that takes one back-and-forth in chat can take thirty in an agent. Watch for:

  • Long-running loops. If an agent has been working a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind.
  • Wide context. Agents pay for every file they read on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo.
  • Wandering. If the agent has drifted from the original goal, stop and restart with a tighter brief rather than letting it wander back on its own.

Common failure modes

  • The agent does the wrong thing efficiently. The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch in review and brief better next time.
  • Checks get loosened rather than the code being fixed. The agent finds a failing check, decides the check was wrong, and weakens it. Always look at what changed in your verification scripts and test files.
  • Cascading small edits. A small change triggers a knock-on, which triggers another, and twenty edits later half the codebase has been touched. Tight scope and a good brief prevent this.
  • Confident hallucinations about a library or API. The agent uses a function that doesn't exist, then patches around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar.
  • Permissions creep. "Just this once, allow this command unsupervised" turns into a default. Re-tighten when you change tasks.

Exercises

Exercise 1: Pick a small, well-scoped task you've been putting off (a refactor, a chore, a small feature) and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot.

Exercise 2: Compare an agentic run with a manual run of the same task on a small scale. Time both. Account not just for elapsed time but for the quality of the result and the time you spent reviewing.

Exercise 3: Deliberately give an agent an under-specified brief and see what interpretation it picks. The point is to develop intuition for what the agent will assume when you leave room for assumption.