coding-with-ai/05-agentic-workflow/README.md
Eric Furst 5780cdf097 Initial commit: coding-with-ai
A practical guide to working effectively with AI coding assistants
(chat interfaces, in-editor extensions, agentic tools) for engineers
and scientists solving problems with code rather than building
production software.

Seven sections:

- 01-three-modes: web chat vs in-editor vs agentic, with heuristics
  for choosing and a framing of chat as natural-language programming.
- 02-errors-and-logs: the canonical copy-paste case; framing the
  paste for useful answers.
- 03-in-editor-workflow: autocomplete, inline edit, side panel,
  quick actions; habits that survive tool changes.
- 04-conversations: multi-turn discussions, context-window
  awareness, opening well, prompt iteration, when to start fresh.
- 05-agentic-workflow: variations on the basic loop (sub-agents,
  plan mode, async, MCP, sandboxing); briefing, supervision,
  damage control, cost and energy.
- 06-verifying-and-citing: hallucinations and silent errors;
  privacy framed against the cloud-services baseline; proportional
  disclosure norms.
- 07-local-models: local models as a cross-cutting alternative
  across all three modes; hardware tiers, tool support,
  capability gap.

Tool-agnostic where possible; current tool examples are
illustrative and expected to date.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 17:48:13 -04:00


# Agentic Workflow
## Key idea
An agentic tool is an AI that takes actions on its own (reading a file, running a command, editing, testing, reading the output, editing again) without you mediating each step. You set the goal and the agent runs the loop. That power is only useful when paired with judgment about *when* to deploy an agent and *how* to supervise it.
This section is about *using* agentic tools as an engineer or scientist solving problems with code — models, data analysis, simulations, coursework — rather than as someone building production software for end users. If you want to understand how tool use works under the hood and how to build a system like this, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop) section on tool use and agentic systems.
## Key goals
- Recognize agentic mode and what distinguishes it from an editor extension
- Identify the kinds of task where an agent is most useful
- Brief an agent in a way that produces good results
- Supervise effectively: scope, permissions, review
- Be aware of cost and failure modes
---
## What an agent actually does
A typical agentic tool is built on the same underlying model as the chat or editor extension, but wrapped in a *loop* that lets the model take actions in your environment. A simplified version of that loop:
1. The model receives your goal and the current state (files, terminal output, etc.)
2. The model decides on the next action: read a file, run a command, write an edit
3. The action runs; the result is fed back to the model
4. Repeat until the model believes the goal is met (or it asks you a question)
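The loop above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool is implemented: `Action`, `AgentState`, `run_agent`, `scripted_model`, and `fake_execute` are hypothetical names, and the "model" here is a fixed script standing in for a real LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str            # "read", "run", "edit", or "done"
    argument: str = ""   # file path, shell command, or edit description


@dataclass
class AgentState:
    goal: str
    # Each entry pairs an action with the result that was fed back.
    history: list = field(default_factory=list)


def run_agent(goal, decide, execute, max_steps=20):
    """Run the decide/act/observe loop until the model says it is done."""
    state = AgentState(goal=goal)            # step 1: goal plus current state
    for _ in range(max_steps):               # hard cap so a stuck loop ends
        action = decide(state)               # step 2: model picks next action
        if action.kind == "done":            # step 4: model believes goal is met
            break
        result = execute(action)             # step 3: the action runs...
        state.history.append((action, result))  # ...and the result is fed back
    return state


# Stand-in for a real model call: replays a fixed script of actions.
def scripted_model(state):
    script = [
        Action("read", "analysis.py"),
        Action("run", "python analysis.py"),
        Action("done"),
    ]
    return script[min(len(state.history), len(script) - 1)]


# Stand-in for real execution: nothing is actually read or run.
def fake_execute(action):
    return f"(pretend output of {action.kind} {action.argument})"


final = run_agent("make analysis.py reproduce the expected numbers",
                  scripted_model, fake_execute)
for action, result in final.history:
    print(action.kind, action.argument, "->", result)
```

The `max_steps` cap is a design choice worth noticing: without some bound, a model that never returns "done" would loop indefinitely.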
The two things that make this different from in-editor edits:
- **Actions are real.** The agent can run `rm`, `git push`, `pip install`, or hit external APIs. Permission models vary by tool, but the capability is the defining feature.
- **The model owns the plan.** You don't write the steps; the model works them out.
**Examples (early 2026):** Claude Code (CLI), Cursor (agent mode and Background Agents), Cline (a VS Code extension), Windsurf Cascade (in the VS Code-derived Windsurf editor), Microsoft Copilot agent, GitHub Copilot Workspace, Aider, and more autonomous platforms like Devin and Replit Agent.
## Variations on the basic loop
The decide-act-observe cycle described above is the core, but the agentic landscape expanded substantially through 2025 and into early 2026. A working knowledge of the main variations helps you choose the right tool and supervise it well.
- **Sub-agents and parallelism.** A primary agent can spawn sub-agents to handle independent branches of work — searching different parts of a codebase at once, running parallel investigations, or specializing roles such as one agent writing and another reviewing. Claude Code's `Agent` tool and similar features in other platforms enable this. The supervision burden shifts, though. You are no longer watching one loop but several.
- **Plan-then-execute modes.** Many agents now offer a mode where they first produce a written plan, you review and edit it, and only then do they execute. Claude Code's plan mode, Cursor's planning step, and similar features fit this pattern. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task.
- **Async and background agents.** Some agents run while you do other things and report back when finished: Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace. You give up real-time visibility in exchange for doing your own work in parallel, and that changes how you brief the agent because you cannot easily course-correct mid-task.
- **MCP and external tools.** The Model Context Protocol, introduced by Anthropic in late 2024 and widely adopted since, lets agents connect to external systems through standardized servers — Slack, Linear, GitHub, databases, monitoring dashboards, file systems on remote machines. "The agent reads files and runs commands" is now a starting point rather than a ceiling; in practice, agents reach into whatever services your team uses.
- **Sandboxed execution.** Some agents run inside isolated virtual machines or containers, which limits what an agent can affect and means destructive actions only impact the sandbox. Devin and some Cursor modes work this way. The downside is reduced access to your real environment, but the upside is genuine freedom to experiment without risking your machine.
These variations do not change the supervision principles below — clear briefs, permission control, review the result — but they do change how those principles are applied. A plan-mode agent shifts review from the result to the plan; a sub-agent setup means supervising several flows at once; a sandboxed agent means review can be looser because consequences are contained.
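To make the plan-then-execute idea concrete, here is a minimal sketch in which review happens on the plan before anything runs. The function name `plan_then_execute`, the `reviewer` callback, and the example plan are all hypothetical; real tools present the plan interactively rather than through a callback.

```python
def plan_then_execute(plan, review, execute):
    """Run only the steps that survive human review of the plan.

    plan:    list of step descriptions proposed by the agent
    review:  callable returning the approved (possibly edited) list of steps
    execute: callable that carries out one step and returns its result
    """
    approved = review(plan)      # the human may drop, reorder, or rewrite steps
    if not approved:
        return []                # nothing approved, so nothing runs
    return [execute(step) for step in approved]


proposed = [
    "rename Widget to Gadget in src/",
    "run the test suite",
    "git push",
]


# Example review policy: strike any step that would push to a remote.
def reviewer(plan):
    return [step for step in plan if not step.startswith("git push")]


results = plan_then_execute(proposed, reviewer, lambda step: f"done: {step}")
print(results)
```

The point of the structure is that `execute` is never called on a step the reviewer has not seen.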
## When to use an agent
Agents shine on tasks where the *work between steps* is the expensive part for a human:
- **Multi-file changes that need verification.** *"Rename this concept across the codebase and make sure the tests still pass"* — or, for scientific code without a formal test suite, *"...and make sure my analysis script still reproduces the expected numbers."* The agent reads, edits, re-runs the verification, re-edits if needed. You would do the same thing manually, with much more context-switching.
- **Exploring an unfamiliar codebase.** *"How is authentication handled in this project? Find the entry point and explain the flow."* The agent grep-walks the project; you read the summary.
- **Repetitive maintenance.** *"Update all the imports from `old_lib` to `new_lib` and adjust the calls that changed."* Mechanical, scoped, verifiable.
- **End-to-end small features in well-tested code.** *"Add an endpoint that does X, following the patterns in the existing endpoints. Update the tests."*
Agents are *less* useful for:
- **A single line you already know how to write.** Inline edit is faster.
- **A design discussion.** Use chat. The agent has nowhere to act.
- **Anything where you don't know what "done" looks like.** The agent will reach a state and stop; if you can't tell whether it's the right state, you've shifted the problem rather than solving it.
## Briefing an agent well
A good agent brief looks more like a task description for a new teammate than a search query. Include:
- **The goal**, stated as outcome rather than steps. *"Add a `--dry-run` flag to the `migrate` command that prints what would change without writing anything."*
- **Constraints** the agent might not infer. *"Use the existing logging helper rather than `print`. Match the style of the other flags."*
- **What "done" means.** *"All existing tests still pass. There is a new test verifying the `--dry-run` output for the simple case."* For code without a formal test suite, substitute whichever form of verification you use — sanity-check runs, known-answer comparisons, a regression script's expected output.
- **What to ask about, not assume.** *"If the migration step has side effects I can't easily reverse, stop and ask before running it."*
The single biggest predictor of an agent doing the right thing is how well-bounded the task is. *"Improve this code"* is poorly bounded; the agent will improve it in directions you may not want. *"Reduce the duplication between `parse_csv` and `parse_tsv` by extracting a shared helper, preserving the existing return signatures"* is well-bounded.
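One way to build the habit is to treat the brief as a small structured object with the four parts listed above, and check that none is empty before handing it to a tool. The sketch below is only an illustration: the `Brief` class and its field names are hypothetical, and the rendered text is plain prose that could be pasted into any agent.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    goal: str
    constraints: list = field(default_factory=list)
    done_when: list = field(default_factory=list)
    ask_before: list = field(default_factory=list)

    def missing(self):
        """Return the names of sections that were left empty."""
        sections = {
            "goal": [self.goal] if self.goal.strip() else [],
            "constraints": self.constraints,
            "done_when": self.done_when,
            "ask_before": self.ask_before,
        }
        return [name for name, items in sections.items() if not items]

    def render(self):
        """Format the brief as plain text to paste into an agent."""
        sections = [
            ("Goal", [self.goal]),
            ("Constraints", self.constraints),
            ("Done when", self.done_when),
            ("Stop and ask before", self.ask_before),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


brief = Brief(
    goal="Add a --dry-run flag to the migrate command that prints what "
         "would change without writing anything.",
    constraints=["Use the existing logging helper rather than print.",
                 "Match the style of the other flags."],
    done_when=["All existing tests still pass.",
               "A new test verifies the --dry-run output for the simple case."],
    ask_before=["Running any migration step with side effects that are "
                "hard to reverse."],
)
print(brief.missing())   # an empty list means every section is filled in
print(brief.render())
```

An under-specified brief such as `Brief(goal="Improve this code")` reports three missing sections, which is a cheap prompt to think before invoking the tool.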
## Supervision
Agentic tools work because they take real actions. That means real consequences if they take the wrong ones. Three things to think about before letting an agent loose:
### Permissions
Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked outright. Default toward *more* confirmation when you are starting out with a new tool or a new codebase. Speed up later as you learn what the agent does well.
A useful rule of thumb: **destructive or remote-affecting actions deserve confirmation.** Local edits to a project under version control are reversible, provided the prior state was committed. `git push --force`, `rm -rf`, `pip uninstall`, and anything that hits an external service or shared system are not easily undone.
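The rule of thumb can be written down as a small classifier. This is a toy sketch under stated assumptions, not the permission system of any real tool and not a security boundary: the prefix lists and the `classify` function are hypothetical, and each actual tool has its own configuration format for the same idea.

```python
import shlex

# Hypothetical policy: command prefixes that always need a human "yes".
CONFIRM_PREFIXES = [
    ("git", "push"), ("rm",), ("pip", "uninstall"), ("curl",), ("ssh",),
]
# Hypothetical policy: read-only or easily reversible commands.
AUTO_PREFIXES = [
    ("git", "status"), ("git", "diff"), ("ls",), ("cat",), ("pytest",),
]
# Characters that let one command line hide another (pipes, chaining, etc.).
SHELL_META = set(";&|`$<>")


def classify(command):
    """Return "auto" or "confirm" for a proposed shell command."""
    if SHELL_META & set(command):
        return "confirm"              # compound commands are never automatic
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:                # e.g. unbalanced quotes
        return "confirm"
    for prefix in CONFIRM_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "confirm"
    for prefix in AUTO_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "auto"
    return "confirm"                  # unknown commands default to confirmation


for cmd in ["git status", "git push --force origin main",
            "ls && rm -rf build", "make"]:
    print(f"{cmd!r}: {classify(cmd)}")
```

Defaulting unknown commands to confirmation mirrors the advice above: start with more confirmation and loosen the policy as you learn what the agent does well.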
### Working directory and damage control
An agent pointed at a fresh sandbox can experiment freely. An agent pointed at your home directory can do real damage. Before starting:
- Be sure you are in the right directory
- Have a clean git state (or know what's uncommitted) so you can see what the agent changed
- Know what the agent has access to outside the project (secrets, environment variables, network)
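The first two checks are easy to automate. The sketch below assumes `git` is installed and relies on the `git status --porcelain` output format (two status characters, a space, then the path); `parse_porcelain` and `preflight` are hypothetical helper names, and the third check, what the agent can reach outside the project, still needs a human.

```python
import subprocess
from pathlib import Path


def parse_porcelain(text):
    """Extract paths from `git status --porcelain` output."""
    # Each line is two status characters, one space, then the path.
    return [line[3:] for line in text.splitlines() if line.strip()]


def preflight(project_dir):
    """Return a list of warnings to resolve before starting an agent."""
    project = Path(project_dir).resolve()
    problems = []
    if project == Path.home().resolve():
        problems.append("project directory is your home directory")
    if not (project / ".git").exists():
        problems.append("not a git repository root; changes will be hard to review")
        return problems
    try:
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=project, capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        problems.append("git executable not found")
        return problems
    dirty = parse_porcelain(status.stdout)
    if dirty:
        shown = ", ".join(dirty[:5])
        problems.append(f"{len(dirty)} uncommitted path(s), including: {shown}")
    return problems


# Parsing example on canned output; no repository is needed for this part.
print(parse_porcelain(" M analysis.py\n?? scratch.txt\n"))
```

Calling `preflight(".")` from the project root before launching an agent returns an empty list when the directory is a clean repository.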
### Review
The agent's report — *"I added the flag, updated the tests, and they pass"* — describes what it intended to do, not necessarily what it did. Always check:
- `git diff` (or the equivalent) — what actually changed?
- Any verification (tests, sanity-check scripts, known-answer comparisons): does it actually exercise the new behavior, or was it loosened to pass?
- Any new files — were they expected?
- Any commands run — were there surprises in the output?
Spot-checking is fast. Skipping it is how subtle bugs and security issues land in your codebase.
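Part of that spot-check can be scripted. The sketch below takes the text produced by `git diff --name-status` (a status letter, a tab, then the path) and flags new files and changed verification files for a closer look. The function `review_flags`, the file-naming heuristics, and the example paths are assumptions for illustration; adapt the patterns to however your project names its checks.

```python
from pathlib import PurePosixPath


def review_flags(name_status, verification_names=("expected_output.txt",)):
    """Flag new files and changed verification files in a name-status listing."""
    flags = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # For renames the line holds old and new paths; keep the new one.
        status, path = fields[0], fields[-1]
        p = PurePosixPath(path)
        is_check = (
            "tests" in p.parts
            or p.name.startswith("test_")
            or p.name in verification_names
        )
        if status.startswith("A"):
            flags.append(f"new file: {path}")
        elif is_check:
            flags.append(f"verification changed: {path}")
    return flags


# Canned example of `git diff --name-status` output (hypothetical paths).
example = "M\tsrc/migrate.py\nM\ttests/test_migrate.py\nA\tsrc/helpers_tmp.py\n"
for flag in review_flags(example):
    print(flag)
```

A flag is not a verdict. It only points at the files where reading the actual diff matters most.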
## Cost awareness
Agentic tools make many model calls per task. A task that takes one back-and-forth in chat can take thirty model calls in an agent. Watch for:
- **Long-running loops.** If an agent has been working for a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind.
- **Wide context.** Agents that read many files pay for that context on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo.
- **Wandering.** If the agent has drifted from the original goal, stopping and restarting with a tighter brief is usually cheaper than letting it wander back on its own.
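A rough way to see why wide context and long loops are expensive: if each step re-sends the starting context plus everything accumulated so far, input tokens grow roughly quadratically with the number of steps. The sketch below assumes that simple model. The function name and all token counts are made-up illustrative values, and real tools may lower the bill through caching or context compaction.

```python
def total_input_tokens(steps, base_context, growth_per_step):
    """Input tokens summed over all steps of a run.

    Assumes step k (counting from 0) sends the base context plus k steps'
    worth of accumulated actions and results.
    """
    return sum(base_context + k * growth_per_step for k in range(steps))


# Illustrative numbers only: 8,000 tokens of files read up front,
# 1,500 tokens of new actions and output added per step.
one_chat_turn = total_input_tokens(1, 8_000, 1_500)
agent_run = total_input_tokens(30, 8_000, 1_500)

print(one_chat_turn)                 # 8000
print(agent_run)                     # 892500
print(agent_run / one_chat_turn)     # roughly 112 times the single turn
```

Under these assumptions, thirty steps cost far more than thirty times a single turn, which is why intervening early in a stuck loop pays off.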
## Common failure modes
- **The agent does the wrong thing efficiently.** The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch this in review and brief better next time.
- **Checks get loosened rather than the code being fixed.** The agent finds a failing test or a sanity check that doesn't pass, decides the check was wrong, and weakens it rather than fixing what it was checking. Always look at what changed in your verification scripts and test files.
- **Cascading small edits.** The agent makes a small change, notices a knock-on, fixes that, notices another, fixes that... twenty edits later, half the codebase has been touched. Tight scopes and good initial briefs prevent this.
- **Confident hallucinations about a library or API.** The agent will use a function that doesn't exist with full confidence, then patch around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar.
- **Permissions creep.** "Just this once, allow this command unsupervised" turns into a default. Re-tighten when you change tasks.
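For the hallucinated-API failure mode in Python projects, a quick existence check can catch an invented function before you spend time debugging around it. This is a minimal sketch; `api_exists` is a hypothetical helper. It confirms only that a name exists, not that its signature or behavior matches what the agent claimed, and importing a module runs that module's top-level code.

```python
import importlib


def api_exists(module_name, attr_path):
    """Return True if module_name imports and the dotted attr_path resolves."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False                  # the module itself is missing
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False              # some link in the dotted path is invented
        obj = getattr(obj, part)
    return True


print(api_exists("json", "dumps"))         # True: a real function
print(api_exists("json", "dump_pretty"))   # False: plausible but nonexistent
print(api_exists("os", "path.join"))       # True: dotted paths are followed
```

When the check fails, that is the moment to point the agent at the library's documentation rather than let it patch around its own error.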
## Exercises
> **Exercise 1:** Pick a small, well-scoped task you've been putting off — a refactor, a chore, a small feature — and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot.
> **Exercise 2:** Compare an agentic run with a manual run of the same task on a small scale. Time both. Account not just for elapsed time but for the *quality* of the result and the time you spent reviewing.
> **Exercise 3:** Deliberately give an agent an under-specified brief and see what interpretation it picks. The point is to develop intuition for what the agent will assume when you leave room for assumption.