coding-with-ai/05-agentic-workflow/README.md
Eric Furst 4194680475 Align prose with STYLE.md across modules 01-07 and top-level README
Replace residual em-dashes, arrow-notation shorthand, and a handful of
filler intensifiers; fix two small typos. Add .gitignore to keep the
working CHANGES.md audit out of the repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:47:19 -04:00

134 lines
11 KiB
Markdown

# Agentic Workflow
## Key idea
An agentic tool is an AI that takes actions on its own (reading files, running commands, editing, testing, observing results, editing again) without you mediating each step. You set the goal and the agent runs the loop. That power is only useful when paired with judgment about *when* to deploy an agent and *how* to supervise it.
This section is about *using* agentic tools as an engineer or scientist (for modeling, data analysis, simulations, or coursework), not building production software for end users. For how tool use works under the hood, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop) section on tool use and agentic systems.
> **A note on overlap with [section 04](../04-conversations/):** the boundary between conversation and agentic work is fuzzy in practice. The same in-project chat panel becomes an agent the moment you give it a multi-step goal it executes on its own. This distinction is behavioral, not a separate tool to launch, and you will routinely cross between the two modes within a single session. This section focuses on what changes once the model is taking actions, but expect the framing to apply to many of your existing chat sessions too.
## Key goals
- Recognize when you have moved from in-project chat into agentic territory, and what changes when you do
- Identify the kinds of task where an agent is most useful
- Brief an agent in a way that produces good results
- Supervise effectively: scope, permissions, review
- Be aware of cost and failure modes
---
## What an agent actually does
An agentic tool wraps a chat model in a *loop* that lets it take actions in your environment. One step in the loop:
1. The model receives your goal and the current state (files, terminal output, etc.)
2. The model decides on the next action: read a file, run a command, write an edit
3. The action runs; the result is fed back to the model
4. Repeat until the model believes the goal is met (or it asks you a question)
What's new compared to a single chat message or autocomplete:
- **Actions are real.** The agent can run `rm`, `git push`, `pip install`, or hit external APIs. Permission models vary, but the capability is the defining feature.
- **The model owns the plan.** You don't write the steps; the model figures them out and runs them.
**Examples (early 2026):** Claude Code (CLI), Cursor (agent mode and Background Agents), Cline and Windsurf Cascade (VS Code), Microsoft Copilot agent, GitHub Copilot Workspace, Aider, and more autonomous platforms like Devin and Replit Agent.
## Variations on the basic loop
The read-act-observe-act cycle is the core, but the landscape expanded substantially through 2025:
- **Plan-then-execute modes.** The agent first produces a written plan; you review and edit it; only then does it execute. Claude Code's plan mode and Cursor's planning step fit here. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task.
- **Sub-agents and parallelism.** A primary agent spawns sub-agents for independent branches of work, such as searching different parts of a codebase at once or specializing roles (one writes, one reviews). The supervision burden shifts: you watch several loops, not one.
- **Async and background agents.** Some agents run while you do other things and report back when finished (Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace). You trade real-time visibility for parallelism with your own work, and you have to brief more carefully because mid-task course-correction is hard.
- **MCP and external tools.** The Model Context Protocol, introduced by Anthropic in late 2024, lets agents connect to external systems (Slack, Linear, GitHub, databases, dashboards, remote filesystems) through standardized servers. "Reads files and runs commands" is now a starting point, not a ceiling.
- **Sandboxed execution.** Some agents run inside isolated VMs or containers, so destructive actions only affect the sandbox. The downside is reduced access to your real environment; the upside is genuine room to experiment.
These variations don't change the supervision principles below, only how they're applied: plan mode shifts review from result to plan, sub-agents multiply the loops you watch, sandboxing lets review run looser because consequences are contained.
## When to use an agent
Agents are best for tasks where the *work between steps* is the expensive part for a human:
- **Multi-file changes that need verification.** *"Rename this concept across the codebase and make sure the tests still pass."* For scientific code without a formal test suite: *"...and make sure my analysis script still reproduces the expected numbers."* The agent reads, edits, re-runs verification, re-edits if needed. You would do the same thing manually with much more context-switching.
- **Exploring an unfamiliar codebase.** *"How is authentication handled in this project? Find the entry point and explain the flow."* The agent grep-walks the project; you read the summary.
- **Repetitive maintenance.** *"Update all the imports from `old_lib` to `new_lib` and adjust the calls that changed."* Mechanical, scoped, verifiable.
- **End-to-end small features in well-tested code.** *"Add an endpoint that does X, following the patterns in the existing endpoints. Update the tests."*
Agents are *less* useful for:
- **A single line you already know how to write.** Autocomplete or typing it yourself is faster.
- **A targeted edit you can describe in one sentence.** A single message to the in-project chat ([section 04](../04-conversations/)) is faster than spinning up an agent loop.
- **A design discussion.** Use chat. The agent has nowhere to act.
- **Anything where you don't know what "done" looks like.** The agent will reach a state and stop. If you can't tell whether it's the right state, you've shifted the problem rather than solved it.
## Briefing an agent well
A good brief looks more like a task description for a new teammate than a search query. Include:
- **The goal**, stated as outcome rather than steps. *"Add a `--dry-run` flag to the `migrate` command that prints what would change without writing anything."*
- **Constraints** the agent might not infer. *"Use the existing logging helper rather than `print`. Match the style of the other flags."*
- **What "done" means.** *"All existing tests still pass. There is a new test verifying the `--dry-run` output for the simple case."* For code without a formal test suite, substitute whichever verification you use (sanity-check runs, known-answer comparisons, regression scripts).
- **What to ask about, not assume.** *"If the migration step has side effects I can't easily reverse, stop and ask before running it."*
The biggest predictor of an agent doing the right thing is how well-bounded the task is. *"Improve this code"* is poorly bounded; the agent will improve it in directions you may not want. *"Reduce the duplication between `parse_csv` and `parse_tsv` by extracting a shared helper, preserving the existing return signatures"* is well-bounded.
## Supervision
Agentic tools work because they take real actions, which means real consequences when those actions are wrong. Three things to think about before letting one loose:
### Permissions
Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked. Default toward *more* confirmation with a new tool or a new codebase, and speed up as you learn what the agent does well.
Rule of thumb: **destructive or remote-affecting actions deserve confirmation.** Local edits to a project under version control are reversible. `git push --force`, `rm -rf`, `pip uninstall`, and anything that hits an external service or shared system are not.
### Working directory and damage control
An agent pointed at a fresh sandbox can experiment freely. An agent pointed at your home directory can do real damage! Before starting:
- Be sure you are in the right directory
- Have a clean git state (or know what's uncommitted) so you can see what the agent changed
- Know what the agent has access to outside the project (secrets, environment variables, network)
### Review
The agent's report (*"I added the flag, updated the tests, and they pass"*) describes what it intended to do, not necessarily what it did. Always check:
- `git diff`: what actually changed?
- Verification: did the tests or sanity checks actually exercise the new behavior, or did they get loosened to pass?
- New files: were they expected?
- Commands run: any surprises in the output?
Spot-checking is fast. Skipping it is how subtle bugs and security issues land in your codebase.
## Cost awareness
Agentic tools use many model calls per task. A task that takes one back-and-forth in chat can take thirty in an agent. Watch for:
- **Long-running loops.** If an agent has been working a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind.
- **Wide context.** Agents pay for every file they read on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo.
- **Wandering.** If the agent has drifted from the original goal, stop and restart with a tighter brief rather than letting it wander back on its own.
## Common failure modes
- **The agent does the wrong thing efficiently.** The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch in review and brief better next time.
- **Checks get loosened rather than the code being fixed.** The agent finds a failing check, decides the check was wrong, and weakens it. Always look at what changed in your verification scripts and test files.
- **Cascading small edits.** A small change triggers a knock-on, which triggers another, and twenty edits later half the codebase has been touched. Tight scope and a good brief prevent this.
- **Confident hallucinations about a library or API.** The agent uses a function that doesn't exist, then patches around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar.
- **Permissions creep.** "Just this once, allow this command unsupervised" turns into a default. Re-tighten when you change tasks.
## Exercises
> **Exercise 1:** Pick a small, well-scoped task you've been putting off (a refactor, a chore, a small feature) and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot.
> **Exercise 2:** Compare an agentic run with a manual run of the same task on a small scale. Time both. Account not just for elapsed time but for the *quality* of the result and the time you spent reviewing.
> **Exercise 3:** Deliberately give an agent an under-specified brief and see what interpretation it picks. The point is to develop intuition for what the agent will assume when you leave room for assumption.