commit 5780cdf097914577e79741bb7240bbd31bf0581b
Author: Eric Furst
Date:   Thu May 28 17:48:13 2026 -0400

    Initial commit: coding-with-ai

    A practical guide to working effectively with AI coding assistants (chat interfaces, in-editor extensions, agentic tools) for engineers and scientists solving problems with code rather than building production software.

    Seven sections:
    - 01-three-modes: web chat vs in-editor vs agentic, with heuristics for choosing and a framing of chat as natural-language programming.
    - 02-errors-and-logs: the canonical copy-paste case; framing the paste for useful answers.
    - 03-in-editor-workflow: autocomplete, inline edit, side panel, quick actions; habits that survive tool changes.
    - 04-conversations: multi-turn discussions, context-window awareness, opening well, prompt iteration, when to start fresh.
    - 05-agentic-workflow: variations on the basic loop (sub-agents, plan mode, async, MCP, sandboxing); briefing, supervision, damage control, cost and energy.
    - 06-verifying-and-citing: hallucinations and silent errors; privacy framed against the cloud-services baseline; proportional disclosure norms.
    - 07-local-models: local models as a cross-cutting alternative across all three modes; hardware tiers, tool support, capability gap.

    Tool-agnostic where possible; current tool examples are illustrative and expected to date.

    Co-Authored-By: Claude Opus 4.7

diff --git a/01-three-modes/README.md b/01-three-modes/README.md
new file mode 100644
index 0000000..1a64a53
--- /dev/null
+++ b/01-three-modes/README.md
@@ -0,0 +1,113 @@
+# Three Modes
+
+## Key idea
+
+There are three distinct ways to work with AI assistants today, and they suit different problems. Knowing which is which, and which one would actually help for a given problem, is the most important judgment that we would like to develop.
+
+## Key goals
+
+- Recognize the three modes: web chat, in-editor extension, and agentic tool
+- Understand the characteristic strengths and weaknesses of each
+- Develop heuristics for choosing the right mode for a given task
+
+---
+
+## The three modes
+
+### 1. Web chat
+
+Web chat is a browser- or app-based conversation with a model. You type, paste, or drag content in and the model responds in the same window.
+
+**Examples (early 2026):** ChatGPT, Claude.ai, Gemini, Microsoft Copilot (web). Each has a free tier with usage limits, a paid plan that removes those limits, and (often) institutional access through a university or employer agreement. You can also self-host a chat interface against a local model — see [section 07](../07-local-models/).
+
+**What it's good at:**
+
+- One-shot interpretation tasks: *explain this error, what does this log mean, what does this regex match*
+- Multi-turn design discussions: *"I'm choosing between approach A and B, what should I think about?"*
+- Non-code work: drafting documentation, writing commit messages, explaining a concept
+- Working with content you do not want the AI to "live in" — a snippet from a paper, output from a server you don't own, a script from an unfamiliar repo
+
+**Weaknesses:**
+
+- It's disconnected from your project. The model is stateless and has no idea what files exist, what your codebase conventions are, or what you changed five minutes ago, unless you paste it.
+- Round-trip friction: copy from terminal → paste into chat → wait → read → copy answer → paste back. Fine for one-shot, but it is painful for iterative editing.
+- You have to remember to bring the relevant context with you each time.
+
+
+### 2. In-editor extension
+
+An AI assistant living inside your editor (VS Code, JetBrains, Neovim, etc.) with awareness of your open files and project.
+
+**Examples (early 2026):** GitHub Copilot, Claude (VS Code extension), Codeium, Microsoft Copilot (in VS Code), Cursor and Windsurf (which are VS Code forks with deeper AI baked in).
+
+**What it's good at:**
+
+- Autocomplete while typing — the model suggests the next few lines as you write
+- Inline edits: highlight a block, ask for a refactor or fix, review a *diff* (a side-by-side view of the proposed changes — what is being added, removed, or modified), accept or reject changes
+- "Explain this" on a selection, function, or whole file without leaving the editor
+- Quick rename, extract function, add type hints — the kind of work that ought to happen in place
+
+**Weaknesses:**
+
+- Conversation UX is limited compared to web chat — fine for *"do X to this code,"* awkward for *"let's talk through the design."*
+- Easy to accept suggestions you did not fully read (the autocomplete habit).
+- The model's context is whatever the extension chooses to feed it — usually the open file plus some recently viewed files. Larger projects can confuse it.
+
+
+### 3. Agentic tools
+
+An AI that takes multi-step actions on its own: read files, run commands, edit, run tests, read the output, edit again. You set the goal; the agent runs the loop.
+
+**Examples (early 2026):** Claude Code (CLI), Cursor agent mode, Microsoft Copilot agent, Cline (VS Code extension), Aider.
+
+**What it's good at:**
+
+- Multi-file refactors with verification (*"rename this concept everywhere and make the tests pass"*)
+- Investigating an unfamiliar codebase (*"find where X is defined and tell me how it's wired"*)
+- Larger work units where you would otherwise be the one ferrying information between editor, terminal, and chat
+- Repetitive maintenance (*"update all these import paths," "add docstrings to this module"*)
+
+**Weaknesses:**
+
+- The agent will happily do the wrong thing efficiently. Supervision matters!
+- Permissioning matters, too: agents that can run arbitrary shell commands can do real damage if pointed at the wrong directory or given the wrong instructions.
+- Cost can scale faster than you expect — multi-step tasks consume many model calls.
+- For small, well-scoped edits, an agent is overkill compared to a simple inline edit.
+
+
+## How to choose
+
+### Editor vs agentic: what's the actual difference?
+
+The chat/editor split is usually obvious. The editor/agent split trips people up. Two questions clarify it:
+
+- **Who drives the loop?** With the editor extension, *you* do — you make one request, see the proposed change, accept or reject, then make the next request. With an agent, *the model* does — it decides the steps, runs them, and reports back at the end (or at checkpoints you've configured).
+- **How many actions does the task need?** A single targeted edit you can see in front of you is editor work. A task that needs a chain of actions, like reading several files, making changes in multiple places, or running tests and reacting to the output, is agent work.
+
+In other words, if you can point at the code on screen and say *"do X to this,"* then you want the editor. If the work is *"figure out how this codebase does X and change it consistently,"* then you want an agent.
+
+### Starting heuristic
+
+| If the work is... | Reach for... | Why |
+|---|---|---|
+| Explain an error, parse a log, interpret some output | Web chat | One-shot interpretation; the answer is words, not code-in-place |
+| A targeted edit you can see in front of you — refactor this function, add types here, rewrite this block | In-editor extension | One action, one diff, one accept/reject decision; you stay in the driver's seat |
+| A task that needs multiple steps — cross-file changes, run-tests-and-fix loops, "explore the project and then change it" | Agentic | The model owns the sequence; you set the goal and review the end state |
+| Deciding between two approaches; talking through a design | Web chat or editor side panel | Conversation UX is what matters; either works (see [section 03](../03-in-editor-workflow/) for venue choice) |
+| Writing a commit message, README, or documentation | Either chat or editor | Both work — chat if standalone, editor if it should live inline |
+
+
+## Two principles underneath the heuristic
+
+**Match the mode to the output target.** If the answer should *be code in a file*, use a tool that can put it there (editor or agent). If the answer should be a conversation or an explanation, use chat. Mode-mismatching is what leads to painful copy-paste loops.
+
+**Match the mode to the iteration speed.** Single-shot interpretation → use a chat. Tight feedback loop on a known file → use an editor. Multi-step plan you would rather not babysit step by step → use an agent.
+
+
+## Exercises
+
+> **Exercise 1:** Think of three recent times you used an AI assistant. For each, classify which mode you used and which mode this guide would suggest. Were any mismatched? If so, what did the mismatch cost you (time, friction, abandoned attempts)?
+
+> **Exercise 2:** Pick one tool from each mode you have access to. Use each one in the next week and keep a one-line note of what you used it for. After a week, look at your notes: are you using each mode for things it is genuinely good at?
+
+
diff --git a/02-errors-and-logs/README.md b/02-errors-and-logs/README.md
new file mode 100644
index 0000000..d7981b6
--- /dev/null
+++ b/02-errors-and-logs/README.md
@@ -0,0 +1,146 @@
+# Errors and Logs
+
+## Key idea
+
+Errors, stack traces, and log output are *exactly* the kind of thing chat models excel at parsing. A transformer's attention is built for finding the relevant token among noise. Use it!
+
+Errors and logs are the canonical copy-paste use case. The trick is pasting *enough* context but not *too much*, and being explicit about what you were trying to do.
+
+## Key goals
+
+- Recognize when a chat is the right tool for an error or log
+- Paste the right amount of context — neither too little nor too much
+- Frame the paste so the model can give a useful answer, not a guess
+- Use the answer as a starting point, not gospel
+
+---
+
+## Why chat is the right mode here
+
+A typical Python traceback is 10–40 lines of mostly-noise with one or two lines of actual signal. A 500-line server log has maybe three lines that matter. Two reasons chat works here:
+
+1. **The output is words, not code-in-place.** You are looking for an explanation or at least a pointer, not an edit to a file. Chat is the stronger choice.
+2. **The input is self-contained.** You can paste the whole error and the model can reason about it without needing your project layout, history, or build state.
+
+In-editor extensions can also handle errors (most have "explain this error" features), and that is fine for small ones. But for a long traceback or a multi-page log, chat's room to expand and your ability to copy-paste freely make it the better tool.
+
+## What to paste
+
+Three rules:
+
+### 1. Paste the whole relevant block, not just the last line
+
+The first frame of a Python traceback ("File X, line Y, in foo") matters as much as the last line ("ValueError: ..."). Most stack traces tell a small story: *here's the path the program took to reach the error.* The model needs the path, not just the destination.
+
+**Bad:**
+
+```
+ValueError: could not convert string to float
+```
+
+**Better:**
+
+```
+Traceback (most recent call last):
+  File "analyze.py", line 47, in <module>
+    df["temp"] = df["temp"].astype(float)
+  File ".../pandas/core/series.py", line 5816, in astype
+    ...
+ValueError: could not convert string to float: 'N/A'
+```
+
+The second version makes the cause (the string `'N/A'` in the data) obvious. The first leaves the model guessing.
+
+### 2. Trim noise that doesn't help
+
+A 2000-line log with five relevant lines is harder for the model (and you) than 50 lines centered on the relevant region. Use `grep`, `tail`, or your eyes to narrow it down. Include enough surrounding context that the model can see the lead-up to the failure, but cut the parts that are clearly unrelated (startup banners, unrelated services, repeated heartbeat lines).
+
+If you do not know which part is relevant, paste a reasonable chunk and *say so*: "I think the failure is somewhere in here but I'm not sure where to look."
+
+### 3. Include the command or code that triggered it
+
+The model can analyze a traceback in a vacuum, but answers improve a lot when it knows what you were running. Even a single sentence helps:
+
+> "I ran `python analyze.py temperatures.csv` and got this:"
+
+That tiny preamble lets the model relate the error to the code path you exercised.
+
+## What to say *around* the paste
+
+A useful pattern is three short lines plus the paste:
+
+> What I was trying to do.
+> What I expected to happen.
+> What actually happened (with the paste).
+
+For example:
+
+> I'm trying to load a CSV of furnace temperatures into a DataFrame and convert the temperature column to float. I expected the conversion to work. The file looks clean. Instead I'm getting this:
+>
+> ```
+> Traceback ...
+> ValueError: could not convert string to float: 'N/A'
+> ```
+
+This is enough for the model to land on the right answer (some rows have `'N/A'` strings) and suggest the right fix (handle the missing values explicitly).
+
+If you've already tried things, say so. *"I tried `pd.read_csv(..., na_values='N/A')` and it didn't change the error."* This saves the model from suggesting what you already ruled out.
+
+## What to do with the answer
+
+The model's first answer to an error is often *plausible but wrong about the root cause*. It will identify the right neighborhood — "there is a non-numeric value in your column" — but may guess wrong about the specific row or the fix that works in your case. Treat the answer as a **search query**, not a firm result:
+
+- Does the diagnosis match what you see in your data or code?
+- Is the suggested fix one you can verify quickly (one line, one test)?
+- If a suggested fix doesn't work, *say so* in the next turn and paste the new output. The second answer is usually much better because the first one ruled out a possibility.
+
+The biggest mistake students make with chat-based error help is **treating the first response as authoritative**. The right mental model is: the model gives you a focused hypothesis, but you do the verification.
+
+## Common pitfalls
+
+- **Pasting only the last line of the traceback.** The model can guess, but it's a guess. Paste the whole traceback.
+- **Pasting a 2000-line log unfiltered.** The model wastes attention on irrelevant material and the answer suffers. Trim.
+- **Pasting code with no error message.** "Why doesn't this work?" without the actual failure makes the model invent failure modes. Always run the code and paste what happened.
+- **Pasting proprietary code into a public chat.** See [section 06](../06-verifying-and-citing/) — what you paste, the service sees and may log. Match the chat to the sensitivity of the content.
+- **Not iterating.** If the first answer is wrong, the second is usually better. Treat the conversation as a debugging session, not a single oracle query.
+
+## A worked example
+
+You run a script and see:
+
+```
+$ python train.py
+Traceback (most recent call last):
+  File "train.py", line 12, in <module>
+    model = model.to("cuda")
+  File ".../torch/nn/modules/module.py", line 1145, in to
+    return self._apply(convert)
+  File ".../torch/nn/modules/module.py", line 797, in _apply
+    module._apply(fn)
+  ...
+RuntimeError: Found no NVIDIA driver on your system. Please check that you
+have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
+```
+
+**A bad prompt:**
+
+> Why doesn't my code work? `RuntimeError: Found no NVIDIA driver`
+
+**A better prompt:**
+
+> I'm running a PyTorch training script on my laptop. I expected it to fall back to CPU if no GPU was available, but instead I'm getting this. Is there a way to make the script run on whatever device is available? Here is the full traceback:
+>
+> ```
+> [full traceback as above]
+> ```
+
+The better prompt gets you a useful answer about the standard `device = "cuda" if torch.cuda.is_available() else "cpu"` pattern, with the specific fix for line 12 of your script. The bad prompt gets you a generic "install NVIDIA drivers" answer that doesn't help you on a laptop without an NVIDIA GPU.
+
+
+## Exercises
+
+> **Exercise 1:** Find a recent error you debugged. Reconstruct a "bad paste" (last line only, no context) and a "good paste" (whole traceback, what you were doing, what you expected). Try both in a chat. How different are the answers?
+
+> **Exercise 2:** Take a log file from a real system you have run — even just a `pip install` that warned about something. Paste 50 lines unfiltered, then paste a `grep`-narrowed version of the same log. Compare the responses for usefulness.
+
+> **Exercise 3:** Pick a real error and run a multi-turn debugging session. Paste the error, try the suggested fix, paste the new output, and continue until the issue is resolved or you understand why it cannot be. Note where the model's hypothesis shifted between turns.
diff --git a/03-in-editor-workflow/README.md b/03-in-editor-workflow/README.md
new file mode 100644
index 0000000..f4f2890
--- /dev/null
+++ b/03-in-editor-workflow/README.md
@@ -0,0 +1,88 @@
+# In-Editor Workflow
+
+## Key idea
+
+Editor extensions are best used as a tight feedback loop: small suggestions, surgical edits, fast accept/reject decisions. The point is not to "let the AI write code for you." The point is to remove the keystrokes that don't deserve your attention while keeping the keystrokes that do.
+
+## Key goals
+
+- Recognize the four common in-editor patterns: autocomplete, inline edit, side-panel chat, quick actions
+- Use each pattern for the kind of work it suits
+- Develop habits that keep you in control of what lands in your code
+- Know when to escalate to a chat, an agent, or just write it yourself
+
+---
+
+## Four patterns
+
+In-editor AI extensions (GitHub Copilot, Claude for VS Code, Codeium, Microsoft Copilot, Cursor's built-in features) vary in keystrokes and naming, but most expose the same four patterns. Learn the patterns; the keystrokes will follow.
+
+### 1. Autocomplete (ghost text)
+
+As you type, the extension proposes the next few tokens or lines as faint "ghost text." You accept with Tab (typically) or keep typing to ignore.
+
+**Best for:** boilerplate you would have typed anyway. Loop scaffolds, function signatures whose shape is obvious, import lines, the body of a getter, repetitive variations of the same pattern.
+
+**Habit to build:** *read the suggestion before accepting it.* The cost of accepting wrong code that looks right is high — you'll find the bug an hour later in a debugger when you could have caught it in 200 milliseconds. If a suggestion is more than a few lines, the right move is usually to read it, decide, and either accept or rewrite. Do not Tab-and-pray.
+
+**Habit to break:** *do not autocomplete your verification.* Whether your verification is a formal unit test, a sanity-check script, or a comparison against a known answer, it is supposed to be *your* expression of what the code should do. If the model writes the check based on the code, the check passes by construction and confirms nothing. Write your check yourself; let the model help with the implementation.
+
+### 2. Inline edit (edit-this-selection)
+
+You highlight a block of code and invoke the AI with a brief instruction: *"rewrite using a list comprehension,"* *"extract the inner loop into a helper,"* *"add type hints."* The extension shows a diff; you accept or reject.
+
+**Best for:** surgical edits with a clear before/after. Refactors, type hint additions, adding docstrings, converting between equivalent forms, applying a stylistic change consistently.
+
+**Habit to build:** *think of the instruction as a one-line spec.* The clearer your instruction, the better the diff. *"Make this better"* is a worse prompt than *"split this into a parse step and a validate step, keeping the same return signature."*
+
+**Habit to break:** *do not invoke inline edit on a block you do not understand.* If you cannot evaluate the diff, you cannot reject a bad one. Skim or read the block first.
+
+### 3. Side-panel chat (with file or project context)
+
+A chat window in your editor where you can ask questions and the extension attaches whatever files or selections you've referenced. Some extensions auto-attach the open file; others require you to add context explicitly.
+
+**Best for:** *"explain this function," "why is this slow," "how would I extend this to do Y,"* — questions where the answer is words, not a direct edit, but where you want the answer informed by your actual code rather than a generic snippet. Side panels have matured to the point where they also handle **multi-turn design discussions well**, especially when the discussion is anchored in files you have open — the one-click attachment of files and selections is a real advantage over alt-tabbing to a web chat.
+
+**Habit to build:** *be explicit about what context you want included.* If the extension lets you pin specific files or selections to the conversation, use it. The model can only reason about what it sees.
+
+**When to step out to a web chat instead:** the discussion needs to outlive the editor session (a record you want to return to days later), needs to include collaborators who don't share your editor, or pulls in lots of non-code context (papers, third-party docs, screenshots). See [section 04: Conversations](../04-conversations/) for those patterns.
+
+### 4. Quick actions (rename, extract, add types)
+
+Many extensions surface "intelligent" versions of classic IDE operations: rename a symbol across a file or project, extract a selection into a function, add type hints to a function signature.
+
+**Best for:** classic refactors you would otherwise do by hand, with an AI doing the renaming or signature work. These are usually safe because the change set is small and visible.
+
+**Habit to build:** *check the scope.* "Rename across project" can change more than you expect — make sure you have reviewed the file list, or are working from a version-controlled state you can roll back to.
+
+
+## When to escalate
+
+In-editor extensions are great inside their lane. Recognize when to step out of it:
+
+| Symptom | Escalate to | Why |
+|---|---|---|
+| The conversation needs to outlive this editor session, be shared with collaborators, or pull in non-code context | Web chat ([section 04](../04-conversations/)) | Web chats persist, share, and accept arbitrary content more easily |
+| The edit you want spans many files | Agent ([section 05](../05-agentic-workflow/)) | Inline edit is per-file; agents handle cross-file work |
+| You keep tabbing through bad suggestions for the same task | Write it yourself | The model doesn't have enough signal; you are faster |
+| The output is large and the result needs verification | Write it yourself or pair with your own checks | Trust-but-verify gets expensive at scale |
+
+
+## Habits that survive tool changes
+
+The tools will keep changing. These habits do not:
+
+- **Read every accepted suggestion.** Even autocompletes. Especially autocompletes.
+- **Keep the cycle tight.** If the model is producing more than ~20 lines at a time without your review, you are no longer in the loop.
+- **Use version control as a safety net.** Commit before any large AI-assisted change. `git diff` is the last line of defense.
+- **Verify with your own checks.** Whether that means a formal test, a script that compares against a known answer, a plot you eyeball, or a hand calculation depends on what you are writing. The check has to come from you, not from the AI that wrote the code.
+- **Be willing to write code yourself.** The AI is a tool, not a substitute for understanding what you're building.
+
+
+## Exercises
+
+> **Exercise 1:** For one work session, keep autocomplete on but make a conscious "accept / reject / rewrite" decision on every suggestion of more than one token. Note how often each happens. The exercise is not "reject more" — it is to make the choice visible.
+
+> **Exercise 2:** Take a function in a recent project and use inline edit three times with three different instructions: a vague one (*"make this better"*), a specific one (*"split into parse and validate steps"*), and a constraint one (*"refactor to remove the nested if"*). Compare the diffs.
+
+> **Exercise 3:** Try a "no AI for one task" experiment: pick a small feature, write it yourself with the extension disabled. Then re-enable and use it for a comparable second feature. Note where the AI saved you time, where it cost you time, and where the difference was negligible.
diff --git a/04-conversations/README.md b/04-conversations/README.md
new file mode 100644
index 0000000..a6656bd
--- /dev/null
+++ b/04-conversations/README.md
@@ -0,0 +1,113 @@
+# Conversations
+
+## Key idea
+
+A chat is at its best when you treat it as a *conversation*, not a search bar. Multi-turn discussions, such as design tradeoffs, exploring an unfamiliar library, or talking through a problem that you can't quite articulate, are where conversation outperforms single-shot edits or fire-and-forget agents. The patterns in this section apply whether you are in a dedicated web chat or your editor's side panel; see [section 03](../03-in-editor-workflow/) for choosing between those venues. The skill is steering the conversation so it stays useful.
+
+Another way to think about a chat with a capable model is as a kind of programming in natural language: you specify what you want, the model executes, you observe the output, and you refine the specification. The skills that make a programmer effective, including clarity, decomposition, anticipating ambiguity, and iterating, turn out to be the same skills that make a chat user effective. Until LLMs, natural language was almost never an executable specification, and this is one of the more remarkable shifts the technology has produced. This shift explains why "prompt engineering" became a buzzword. It's not magic words or incantations; it's specification quality. For the same reason that a vague programming spec produces buggy code, a vague prompt produces vague output.
+
+## Key goals
+
+- Recognize the kinds of problem that benefit from a multi-turn chat
+- Open conversations in a way that sets them up for success
+- Manage context across turns so the model stays grounded
+- Know when to start a fresh chat versus continuing the current one
+- Know when to end the conversation and just write the code
+
+---
+
+## When a conversation is the right tool
+
+The mode chart in [section 01](../01-three-modes/) tells you to reach for chat when the answer is words, not code-in-place. Within that bucket, multi-turn conversation is specifically valuable for:
+
+- **Design tradeoffs.** *"I'm choosing between approach A and B. What should I weigh?"* The model can lay out the dimensions and you can push back on weightings.
+- **Exploration of unfamiliar territory.** *"I've never used asyncio in Python. Walk me through what the event loop is actually doing."* Each follow-up question sharpens the model's answer.
+- **Talking through a problem you can't quite name.** *"Something feels wrong about this architecture but I can't put my finger on it. Here's the structure..."* The act of describing it often clarifies your own thinking, and the model's questions back can probe weak spots.
+- **Learning a new domain or library.** Conversations let you ask the dumb questions you'd be embarrassed to ask a colleague repeatedly.
+
+If you find yourself wanting the model to *produce a specific edit*, you have drifted out of conversation territory. Switch to the editor or an agent.
+
+
+## Opening well
+
+A useful prompt pattern is to open with three things:
+
+1. **What you're trying to do** (the goal, in one sentence)
+2. **What you've considered or tried** (the relevant constraints and prior work)
+3. **The specific question** (one question, not five)
+
+Compare:
+
+> "Should I use asyncio or threads?"
+
+against:
+
+> "I'm building a CLI tool that scrapes data from ~50 HTTP endpoints and writes them to disk. The endpoints are slow (1–3 seconds each) but the work per response is light. I'm choosing between `asyncio` with `aiohttp` and a `ThreadPoolExecutor`. Given that profile, which would you reach for, and why?"
+
+The first prompt gets you a generic comparison. The second gets you a specific recommendation grounded in your problem.
+
+
+## Managing context across turns
+
+Chat models do not have memory of your project. Instead, they have memory of *this conversation*, and that memory is bounded by the model's **context window** — the amount of recent conversation it can attend to at once. Current chat services handle tens to hundreds of thousands of tokens per session, but once a chat exceeds the limit the interface usually truncates or summarizes the oldest turns silently, and even within the limit the model attends more reliably to recent content than to material from many turns ago. The three strategies below work because they keep the most relevant material at the recent end of the window where the model can still see it clearly. As the conversation grows, three things matter:
+
+### Re-paste changed code rather than referring to "the function I sent earlier"
+
+If you've edited code based on the model's suggestion and want to continue the discussion, paste the *new* version. Saying "I tried your suggestion, but..." without showing what you actually did leaves the model guessing whether you implemented its suggestion correctly.
+
+### Summarize your own thinking back to the model occasionally
+
+Especially in longer conversations, a short *"so what I'm taking away is X, Y, Z — am I missing something?"* anchors the conversation and surfaces misunderstandings cheaply. It also forces you to articulate what you've learned, which is half the point.
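
To make the context-window point concrete, here is a minimal sketch of how an interface might silently drop the oldest turns once a budget is exceeded. It is an illustration under stated assumptions, not how any particular service works: `estimate_tokens` and `fit_to_window` are hypothetical names, counting one token per whitespace-separated word is a crude stand-in for a real tokenizer, and real services may summarize old turns rather than drop them.

```python
def estimate_tokens(text):
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(turns, budget):
    """Keep the most recent turns whose combined estimate fits in the budget."""
    kept = []
    used = 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                     # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))       # restore chronological order


# Hypothetical four-turn conversation; the oldest turn carries the code.
conversation = [
    "Here is my function parse_log and its 40 lines of code",
    "It fails on empty lines",
    "Try stripping each line first",
    "I tried that and now I get a KeyError",
]
visible = fit_to_window(conversation, budget=20)
print(visible)
```

In this toy run the oldest turn, the one that carried the code, no longer fits in the budget, so the model would be reasoning about a function it can no longer see. That is the situation re-pasting the current version of your code avoids.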
+
+### Watch for the model drifting
+
+If the model starts repeating itself, hedging more than answering, or generating increasingly generic advice, it has likely run out of useful things to say with the context it has. That is a signal to either (a) introduce new information (code, constraints, examples) or (b) end or clear the conversation.
+
+
+## When to start fresh
+
+Conversations accumulate context that can both help and hurt. Start a new chat when:
+
+- **You've shifted topics.** The new question has nothing to do with the previous discussion. Old context will leak in.
+- **The conversation went sideways early.** If the model misunderstood your first message and the next several turns were spent correcting course, the corrected understanding is buried under that wrong understanding. A fresh start with a better first message is often faster.
+- **The chat has become long enough that important details from early turns are out of recent attention.** Most chat interfaces handle this gracefully, but very long chats can have the model "forget" something you said ten turns ago. Restating it in a fresh chat is sometimes easier than fighting it in the existing one.
+
+There is no virtue in keeping a chat going longer than it needs to. Open a new one freely. Conversations are cheap. When you're deciding "should I keep this conversation going or start fresh?" you should be biased toward starting fresh.
+
+
+## Patterns that work
+
+- **Compare and contrast.** *"What are the practical differences between pandas `merge` and `join`? When would I reach for each?"* Models are good at structured comparisons.
+- **Devil's advocate.** *"I'm planning to use approach X. What would make that a bad choice? What's the strongest argument against it?"* Inverts the default "let me help you do what you said" tendency.
+- **Explain to a target audience.** *"Explain CRDTs to me as if I have 10 years of backend experience but no distributed-systems background."* The audience framing tightens the level of abstraction. +- **Critique my draft.** *"Here is my approach / commit message / README. What's confusing or weak?"* Models are surprisingly useful as a first-pass reviewer. +- **Walk me through.** *"Walk me through what happens when I call `requests.get(...)`. Don't skip the boring parts."* Good for building mental models of libraries you use but don't fully understand. +- **Iterate on the prompt itself.** *"What would I have to add to my question to get a better answer?"* or *"Help me rewrite this prompt to be more specific."* The model is often perceptive about its own failure modes, and the resulting prompt is sharper than what you started with. Especially valuable when you are crafting a prompt you will reuse — a template, a system prompt, or an agent's instruction. + +## Patterns that don't + +- **Asking for "the best" with no criteria.** *"What's the best Python plotting library?"* gets you a generic matplotlib-vs-seaborn-vs-plotly survey. Add criteria like *"for publication-quality figures with mathematical annotations, where I need fine control over axes and tick formatting"* and the answer becomes more useful. +- **Long preamble before the question.** Models read top-down, but the actual question is what they answer. If you bury it in paragraph three, the model may answer paragraph one. +- **Asking the model to "be honest."** It is already trying to be helpful. The frame that works better is "what's the case against," not "be honest about whether this is good." + + +## When to stop talking and write code + +A conversation has done its job when you can clearly articulate the next concrete action. 
At that point, more conversation is procrastination, and the work should move to whichever execution mode fits — an inline edit for a single function, an agent to draft a multi-file change you will then review, or your own keyboard for the parts that benefit from your hands-on judgment. If you get stuck, come back to the conversation. + +Watch for these stop signals: + +- You catch yourself nodding along to the model's responses without learning anything new +- You've already decided what to do and are looking for permission +- The conversation is now about minor details that you could resolve by trying the thing + +The point of the conversation was to get you *to* the work, not to replace it. + + +## Exercises + +> **Exercise 1:** Open a chat about a real design decision you're currently facing. Use the three-part opening (goal, what you've considered, specific question). Note how the answer differs from a one-line version of the same question. + +> **Exercise 2:** Take a conversation you had recently that felt unproductive. Reread it from the model's perspective: what context was missing? Was the question buried? Now imagine the version of the chat you'd have if you started over with what you know now. + +> **Exercise 3:** Try the "devil's advocate" pattern on a decision you've already made and feel confident about. The discomfort of hearing the strongest argument against is informative — sometimes the decision survives intact (now better justified), sometimes it doesn't. diff --git a/05-agentic-workflow/README.md b/05-agentic-workflow/README.md new file mode 100644 index 0000000..ea04f21 --- /dev/null +++ b/05-agentic-workflow/README.md @@ -0,0 +1,131 @@ +# Agentic Workflow + +## Key idea + +An agentic tool is an AI that takes actions on its own, whether it's reading a file, running a command, editing, testing, reading the output, editing again, without you mediating each step. You set the goal and the agent runs the loop. 
That power is only useful when paired with judgment about *when* to deploy an agent and *how* to supervise it. + +This section is about *using* agentic tools as an engineer or scientist solving problems with code — models, data analysis, simulations, coursework — rather than as someone building production software for end users. If you want to understand how tool use works under the hood and how to build a system like this, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop) section on tool use and agentic systems. + +## Key goals + +- Recognize agentic mode and what distinguishes it from an editor extension +- Identify the kinds of task where an agent is most useful +- Brief an agent in a way that produces good results +- Supervise effectively: scope, permissions, review +- Be aware of cost and failure modes + +--- + +## What an agent actually does + +A typical agentic tool is built on the same underlying model as the chat or editor extension, but wrapped in a *loop* that lets the model take actions in your environment. A simplified version of one step: + +1. The model receives your goal and the current state (files, terminal output, etc.) +2. The model decides on the next action: read a file, run a command, write an edit +3. The action runs; the result is fed back to the model +4. Repeat until the model believes the goal is met (or it asks you a question) + +The two things that make this different from in-editor edits: + +- **Actions are real.** The agent can run `rm`, `git push`, `pip install`, or hit external APIs. Permission models vary by tool, but the capability is the defining feature. +- **The model owns the plan.** You don't write the steps. Instead, the model figures out the steps. + +**Examples (early 2026):** Claude Code (CLI), Cursor (agent mode and Background Agents), Cline and Windsurf Cascade (VS Code), Microsoft Copilot agent, GitHub Copilot Workspace, Aider, and more autonomous platforms like Devin and Replit Agent. 
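
The four-step loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular tool is implemented: `scripted_model` is a hypothetical stand-in for a real model call, `execute` only pretends to act, and the action names (`read`, `run`, `done`) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """What the model sees on each step: the goal plus everything observed so far."""
    goal: str
    observations: list = field(default_factory=list)


def scripted_model(state):
    """Hypothetical stand-in for a model call: picks the next action from the state."""
    script = [("read", "analysis.py"), ("run", "python analysis.py"), ("done", "")]
    return script[min(len(state.observations), len(script) - 1)]


def execute(action, argument):
    """Pretend to act in the environment. A real agent would touch files or the shell here."""
    return f"{action}({argument!r}) -> ok"


def run_agent(goal, model=scripted_model, max_steps=10):
    """Run the decide/act/observe loop until the model says it is done."""
    state = AgentState(goal)
    for _ in range(max_steps):            # hard cap so a stuck loop cannot run forever
        action, argument = model(state)   # step 2: the model decides the next action
        if action == "done":              # step 4: the model believes the goal is met
            break
        # step 3: the action runs and its result is fed back to the model
        state.observations.append(execute(action, argument))
    return state.observations


print(run_agent("make analysis.py reproduce the expected numbers"))
```

The `max_steps` cap is the one design choice worth noticing: the loop ends either when the model declares itself done or when the budget runs out, which is roughly the control you are exercising when you interrupt a real agent.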
+ + +## Variations on the basic loop + +The read-act-observe-act cycle described above is the core, but the agentic landscape has expanded substantially through 2025 and into early 2026, and a working knowledge of the main variations helps you choose the right tool and supervise it well. + +- **Sub-agents and parallelism.** A primary agent can spawn sub-agents to handle independent branches of work — searching different parts of a codebase at once, running parallel investigations, or specializing roles such as one agent writing and another reviewing. Claude Code's `Agent` tool and similar features in other platforms enable this. The supervision burden shifts, though. You are no longer watching one loop but several. +- **Plan-then-execute modes.** Many agents now offer a mode where they first produce a written plan, you review and edit it, and only then do they execute. Claude Code's plan mode, Cursor's planning step, and similar features fit this pattern. It sits between "approve every action" (slow) and "let it run" (risky), and is often the right default for a non-trivial task. +- **Async and background agents.** Some agents run while you do other things and report back when finished — Cursor's Background Agents, Devin, Replit Agent, GitHub Copilot Workspace. The trade is real-time visibility for parallelism with your own work, and it changes how you brief the agent because you cannot easily course-correct mid-task. +- **MCP and external tools.** The Model Context Protocol, introduced by Anthropic in late 2024 and widely adopted since, lets agents connect to external systems through standardized servers — Slack, Linear, GitHub, databases, monitoring dashboards, file systems on remote machines. "The agent reads files and runs commands" is now a starting point rather than a ceiling; in practice, agents reach into whatever services your team uses. 
+- **Sandboxed execution.** Some agents run inside isolated virtual machines or containers, which limits what an agent can affect and means destructive actions only impact the sandbox. Devin and some Cursor modes work this way. The downside is reduced access to your real environment, but the upside is genuine freedom to experiment without risking your machine. + +These variations do not change the supervision principles below — clear briefs, permission control, review the result — but they do change how those principles are applied. A plan-mode agent shifts review from the result to the plan; a sub-agent setup means supervising several flows at once; a sandboxed agent means review can be looser because consequences are contained. + + +## When to use an agent + +Agents shine on tasks where the *work between steps* is the expensive part for a human: + +- **Multi-file changes that need verification.** *"Rename this concept across the codebase and make sure the tests still pass"* — or, for scientific code without a formal test suite, *"...and make sure my analysis script still reproduces the expected numbers."* The agent reads, edits, re-runs the verification, re-edits if needed. You would do the same thing manually, with much more context-switching. +- **Exploring an unfamiliar codebase.** *"How is authentication handled in this project? Find the entry point and explain the flow."* The agent grep-walks the project; you read the summary. +- **Repetitive maintenance.** *"Update all the imports from `old_lib` to `new_lib` and adjust the calls that changed."* Mechanical, scoped, verifiable. +- **End-to-end small features in well-tested code.** *"Add an endpoint that does X, following the patterns in the existing endpoints. Update the tests."* + +Agents are *less* useful for: + +- **A single line you already know how to write.** Inline edit is faster. +- **A design discussion.** Use chat. The agent has nowhere to act. 
+- **Anything where you don't know what "done" looks like.** The agent will reach a state and stop; if you can't tell whether it's the right state, you've shifted the problem rather than solving it. + + +## Briefing an agent well + +A good agent brief looks more like a task description for a new teammate than a search query. Include: + +- **The goal**, stated as outcome rather than steps. *"Add a `--dry-run` flag to the `migrate` command that prints what would change without writing anything."* +- **Constraints** the agent might not infer. *"Use the existing logging helper rather than `print`. Match the style of the other flags."* +- **What "done" means.** *"All existing tests still pass. There is a new test verifying the `--dry-run` output for the simple case."* For code without a formal test suite, substitute whichever form of verification you use — sanity-check runs, known-answer comparisons, a regression script's expected output. +- **What to ask about, not assume.** *"If the migration step has side effects I can't easily reverse, stop and ask before running it."* + +The single biggest predictor of an agent doing the right thing is how well-bounded the task is. *"Improve this code"* is poorly bounded; the agent will improve it in directions you may not want. *"Reduce the duplication between `parse_csv` and `parse_tsv` by extracting a shared helper, preserving the existing return signatures"* is well-bounded. + + +## Supervision + +Agentic tools work because they take real actions. That means real consequences if they take the wrong ones. Three things to think about before letting an agent loose: + +### Permissions + +Most tools have a permission model: which commands run automatically, which require confirmation, which are blocked outright. Default toward *more* confirmation when you are starting out with a new tool or a new codebase. Speed up later as you learn what the agent does well. 
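
A three-tier permission model of this kind can be sketched as a small classifier. The rule tables below are hypothetical and invented for illustration; real tools define their own rule formats and matching logic, so treat this as a picture of the idea rather than any tool's configuration.

```python
import shlex

# Hypothetical policy for illustration only.
AUTO_ALLOW = {"ls", "cat", "grep", "pytest"}
BLOCKED_PREFIXES = [("git", "push", "--force"), ("rm", "-rf")]
SHELL_METACHARACTERS = ";|&><$`"


def classify(command):
    """Return 'allow', 'confirm', or 'block' for a shell command string."""
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:                     # unbalanced quotes: do not guess
        return "confirm"
    if not tokens:
        return "confirm"
    for prefix in BLOCKED_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "block"
    # Chained or redirected commands can hide a second command, so never auto-allow them.
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return "confirm"
    if tokens[0] in AUTO_ALLOW:
        return "allow"
    return "confirm"                       # default toward confirmation


for cmd in ["ls -la", "pip install numpy", "git push --force origin main"]:
    print(cmd, "->", classify(cmd))
```

Prefix matching is crude (it would not catch `git push origin main --force`, which falls through to confirmation rather than being blocked), which is one reason the fallback is `confirm` rather than `allow`.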
+ +A useful rule of thumb: **destructive or remote-affecting actions deserve confirmation.** Local edits to a project under version control are reversible. `git push --force`, `rm -rf`, `pip uninstall`, and anything that hits an external service or shared system are not. + +### Working directory and damage control + +An agent pointed at a fresh sandbox can experiment freely. An agent pointed at your home directory can do real damage. Before starting: + +- Be sure you are in the right directory +- Have a clean git state (or know what's uncommitted) so you can see what the agent changed +- Know what the agent has access to outside the project (secrets, environment variables, network) + +### Review + +The agent's report — *"I added the flag, updated the tests, and they pass"* — describes what it intended to do, not necessarily what it did. Always check: + +- `git diff` (or the equivalent) — what actually changed? +- Any verification — tests, sanity-check scripts, known-answer comparisons — did it actually verify the new behavior, or did it get loosened to pass? +- Any new files — were they expected? +- Any commands run — were there surprises in the output? + +Spot-checking is fast. Skipping it is how subtle bugs and security issues land in your codebase. + + +## Cost awareness + +Agentic tools use many model calls per task. A task that takes one back-and-forth in chat can take thirty in an agent. Watch for: + +- **Long-running loops.** If an agent has been working for a long time without progress, it may be stuck in a try-fix-try cycle. Intervening early is cheaper than letting it grind. +- **Wide context.** Agents that read many files pay for that context on every step. Pointed work in a small subdirectory costs less than open-ended exploration of a large repo. +- **Wandering.** If the agent has drifted from the original goal, stopping and restarting with a tighter brief is usually cheaper than letting it wander back on its own. 
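
A back-of-the-envelope calculation shows why step count and context width multiply rather than add. The sketch below assumes the full context is re-sent on every step and grows by a fixed amount each time; every number is an illustrative assumption, not a measurement of any tool, and it ignores prompt caching, which some providers offer and which lowers the real cost.

```python
def input_tokens_billed(steps, base_context, growth_per_step):
    """Total input tokens if the whole (growing) context is re-sent on every step."""
    return sum(base_context + i * growth_per_step for i in range(steps))


# One chat turn versus a thirty-step agent run that reads about 1,500 new tokens per step.
chat = input_tokens_billed(steps=1, base_context=2_000, growth_per_step=0)
agent = input_tokens_billed(steps=30, base_context=2_000, growth_per_step=1_500)

print(f"chat:  {chat:,} input tokens")
print(f"agent: {agent:,} input tokens ({agent / chat:.0f}x)")
```

Under these made-up numbers the agent run bills several hundred times the input tokens of the single chat turn, not thirty times, because each step pays again for everything read so far. That is the arithmetic behind preferring pointed work in a small subdirectory.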
+ + +## Common failure modes + +- **The agent does the wrong thing efficiently.** The brief was ambiguous; the agent picked one interpretation and proceeded fast. Catch this in review and brief better next time. +- **Checks get loosened rather than the code being fixed.** The agent finds a failing test or a sanity check that doesn't pass, decides the check was wrong, and weakens it rather than fixing what it was checking. Always look at what changed in your verification scripts and test files. +- **Cascading small edits.** The agent makes a small change, notices a knock-on, fixes that, notices another, fixes that... twenty edits later, half the codebase has been touched. Tight scopes and good initial briefs prevent this. +- **Confident hallucinations about a library or API.** The agent will use a function that doesn't exist with full confidence, then patch around its own error when the test fails. Pin the agent to documentation or examples when the library is unfamiliar. +- **Permissions creep.** "Just this once, allow this command unsupervised" turns into a default. Re-tighten when you change tasks. + + +## Exercises + +> **Exercise 1:** Pick a small, well-scoped task you've been putting off — a refactor, a chore, a small feature — and brief an agent to do it. Write the brief first, before invoking the tool. Note how often you wanted to add a detail you forgot. + +> **Exercise 2:** Compare an agentic run with a manual run of the same task on a small scale. Time both. Account not just for elapsed time but for the *quality* of the result and the time you spent reviewing. + +> **Exercise 3:** Deliberately give an agent an under-specified brief and see what interpretation it picks. The point is to develop intuition for what the agent will assume when you leave room for assumption. 
diff --git a/06-verifying-and-citing/README.md b/06-verifying-and-citing/README.md new file mode 100644 index 0000000..d3d4c98 --- /dev/null +++ b/06-verifying-and-citing/README.md @@ -0,0 +1,160 @@ +# Verifying and Citing + +## Key idea + +AI assistants are useful because they generate plausible output fast. They are *dangerous* for the same reason. They can write plausible-but-wrong output that looks identical to plausible-and-right output until you check. This section is about the discipline of checking, plus the two other questions that come with using AI work: *what did you give the service?* and *how do you disclose what you used?* + +## Key goals + +- Develop habits for verifying AI output, especially the parts that look most confident +- Understand the privacy and intellectual-property implications of what you paste +- Know how to disclose AI use in academic and professional contexts + + +--- + +## Part 1: Verifying + +> **A note on terminology.** This section uses "check" and "test" with different meanings. A **unit test** is a specific software-development practice — a small piece of code (often written with a framework like `pytest`) that exercises a function with known inputs and confirms the output matches an expected result. Tests are automated and reusable, and they pay off when code will be edited many times by many people. A **check**, more broadly, is anything that verifies the code does what you intended: running on a known limit case, comparing to a published value, plotting and inspecting the shape, or hand-calculating a small input. Formal unit tests are one form of check, but for scientific code written for a single project they are often not the most natural form. Whenever this guide says "verification" or "check," any of these forms count; "test" appears only where an automated test is the right tool. + +### Why verification matters + +Hallucinations in AI-assisted coding fall into two broad categories: + +1. 
**Loud hallucinations** — code that fails to compile or run. Easy to catch; the tool tells you. +2. **Quiet hallucinations** — code that runs and produces a result, but the result is wrong. These are the dangerous ones. + +There is often a familiar pattern: a function that uses an API method that doesn't exist, a regex that handles all the cases you mentioned but fails silently on an edge case you didn't think to mention, a math expression that is dimensionally inconsistent but produces a number anyway. The output *looks like* an answer, so you accept it. Hours or weeks later, you discover the silent failure. + +The fix is not "don't use AI." The fix is to assume every AI-generated piece of code is plausible until verified, and to make verification cheap. + +### What to verify + +Not everything needs the same scrutiny. A reasonable triage: + +| The code is... | Verify by... | +|---|---| +| A one-line refactor in code you understand | Reading the diff | +| A new function with a clear contract | Running it on inputs whose answer you can compute or look up — a known limit case, a published value, or a hand calculation. (A formal unit test is one form; for scientific code, often not the most natural.) | +| Code that calls an unfamiliar library | Checking the library's docs for the methods and signatures the AI used | +| Code involving math, units, or numerical methods | Walking through the math on paper or with a known input | +| Multi-file changes from an agent | `git diff`, run whatever verification you normally use (tests, sanity-check scripts, known-answer comparisons), and *read* any verification scripts the agent touched to make sure they didn't get weakened | + +Two specific habits are worth building: + +**Check that named functions and methods actually exist.** A common hallucination is "the AI confidently uses `pd.read_excel_smart()`," which sounds real but does not exist. A quick `grep` of the library docs (or a 10-second IDE check) catches this. 
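
One way to make this check cheap is to ask the installed library directly. The helper below is a minimal sketch; `api_exists` is a name made up for this example, and it only confirms that a name resolves, not that the signature or behavior matches what the AI assumed.

```python
import importlib


def api_exists(module_name, attr_path):
    """Return True if module_name.attr_path resolves in the installed library."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False                      # the module itself is missing
    for part in attr_path.split("."):     # walk dotted paths such as "path.join"
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(api_exists("json", "loads"))        # a real standard-library function
print(api_exists("json", "load_smart"))   # a plausible-sounding name that is made up
```

With pandas installed, `api_exists("pandas", "read_excel_smart")` should come back `False` in the same way. For a name that does exist, `inspect.signature` or the library's documentation is the next stop for checking the arguments.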
+ +**Write your own checks, especially for edges.** If the AI writes the implementation *and* the check, the check verifies the AI's interpretation of your goal, not the goal itself. Run the function on a limit case you can verify by hand (zero, infinity, a published value), compare against an analytical answer where one exists, or write a formal test if that is the right form for your code. Whatever the form, at least one check should come from you. + +### Trust gradients + +Trust is calibrated by domain. Some quick heuristics: + +- **High trust:** Boilerplate, well-known patterns, idiomatic refactors, common-library usage at common signatures. +- **Medium trust:** Unfamiliar library usage, performance-sensitive code, security-adjacent code. +- **Low trust:** Numerical methods, statistics, anything involving units or domain-specific physics, anything safety-relevant, security-critical code, code where being subtly wrong is much worse than being obviously wrong. + +Trust is not "the model is bad at X." It's "the *consequence* of being wrong about X is high, so verify proportionately." + +--- + +## Part 2: Privacy and IP + +### Baseline: the same risk profile as other cloud services + +When you paste content into a chat, that content is sent to the service. Editor extensions and agents do the same with the files they read. This is the same baseline as Gmail, Google Drive, GitHub, OneDrive, Dropbox, or any other cloud service you already use. In these cases, your content lives on someone else's servers, subject to their data-handling policies. For most academic and research work, including coursework, classroom code, public datasets, open-source libraries, drafts of your own writing, the privacy risk is no different from what you accept every time you use those other services. + +### Three ways AI services *are* different from Gmail + +The genuine differences are narrower than the general "the cloud is watching" framing suggests, but they are real: + +1. 
**Training-data inclusion.** Gmail and Drive do not train on your content. AI services historically have, with the details varying by provider and plan. Defaults have been changing, and most paid and enterprise tiers now opt out, but the precedent is real and there is no Gmail equivalent. Check the current setting for the service you use. +2. **Aggregation richness.** A chat history reveals more than an email archive. What you are working on, what you do not know, and what you are puzzling over accumulate in conversations in a way they do not in inboxes. Aggregated chat history is potentially a richer and more sensitive record of your work than aggregated email is. +3. **Routine review of flagged content.** Most AI services explicitly reserve the right to have humans review conversations flagged by their safety systems. Gmail has no equivalent "if our spam filter trips, a person may read this" policy. In practice your conversations are almost certainly not reviewed, but the legal posture is different. + +A fourth, less concrete consideration is that AI services are two to three years old as widely adopted products. We have a shorter empirical track record of their behavior than we do for Gmail (20+ years) or AWS (15+). + +### When this matters in a research context + +For most academic work, the cloud-service baseline is the right mental model. You wouldn't upload an unpublished manuscript to a random web tool you'd never heard of, but you also do not panic about every email you send. Many researchers paste unpublished manuscripts into Overleaf without thinking about it. AI chat services from established providers are in a similar position. They are known entities with real privacy policies and enterprise tiers with stronger guarantees. The instinct that makes you cautious about a brand-new free tool you've never heard of is right, but it shouldn't make you treat Claude or ChatGPT as fundamentally different from the cloud tools your work already lives in.
A few categories do warrant extra care, though: + +- **Restricted research data.** Anything covered by your IRB protocol, your data-use agreement with a collaborator or industrial partner, or institutional policies around HIPAA, FERPA, export controls, or similar regimes. If a category of data is restricted on your computer, it is restricted in your chat too. +- **Unpublished work that isn't yours.** Collaborator drafts, manuscripts under review, code from a lab that hasn't been released. You don't own the right to share these regardless of how you happen to be sharing them. +- **NDA-covered or proprietary material.** From an industrial collaboration, an internship, an advisor's industry consulting work. Check the specific agreement. +- **Personally identifying information.** Participant data, survey responses, names attached to outcomes, even when "anonymized for internal use." If you need help analyzing it, paste a synthetic example with the same shape rather than the real thing. +- **Credentials, API keys, internal URLs.** Easy to leak by accident when pasting config files or logs. + +For most students, most of the time, the material is coursework, classroom exercises, your own scripts, public datasets, open-source libraries, and drafts of your own writing, and the answer is "the chat is fine, same risk as email." Graduate students and undergraduate researchers working with sensitive research data are the most common case for the categories above. If that's you, take the agreements that govern your data seriously, and when in doubt, ask your advisor or your IRB. + +### A practical checklist + +Before pasting anything non-trivial: + +1. Does this contain credentials, keys, or tokens? *(strip them)* +2. Does this contain restricted research data, sensitive participant information, or unpublished work that isn't yours? *(use a synthetic example or stop)* +3. Is this covered by an NDA or institutional agreement that restricts where it can go?
*(check the agreement)* +4. Would I be comfortable with this content appearing in an email I sent through Gmail? *(if yes, the chat is fine; if no, reconsider)* + +The check takes seconds and prevents most accidental disclosures. + +--- + +## Part 3: Citing and attribution + +### The landscape is in transition + +AI-use disclosure norms are still settling. Most universities, journals, and employers have adopted policies in the past two years, but the policies themselves often do not capture how the tools actually work. They may treat all AI use as equivalent, conflate autocomplete with substantive drafting, or be vague about what counts as "use." In practice, compliance is uneven, too. Many people don't disclose at all, others over-disclose in ways that aren't useful, and the gap between policy and behavior is wide. + +The advice in this section is therefore intended to be normative and proportional, not absolute. It is about what serves you and the integrity of your work when policies apply, not a claim that this is how the field universally behaves. + +### Why disclose anyway + +Two complementary reasons: + +1. **Compliance, when policy applies.** Many courses, programs, and journals require disclosure for completed work. The policies vary, enforcement is uneven, but not complying is a violation even if undetected. +2. **Self-protection.** If your work is later challenged by a reviewer, an examiner, a future employer, or a collaborator, having disclosed AI use is the defensible posture. Disclosure is the conservative choice for *your* future, regardless of what others do. + +### A proportional standard + +The realistic bar is not "note every Copilot autocomplete." That standard is impossible to meet in practice, and treating it as required is part of why disclosure norms feel unrealistic. 
A more useful distinction: + +- **Background assistance** that shaped *how* you worked, such as autocomplete, syntax help, name suggestions, quick lookups, debugging conversations. Usually no disclosure is needed unless your venue's policy is specific. +- **Substantive contribution** that shaped *what* you produced, such as AI drafted a section, generated significant chunks of code that you reviewed and accepted, planned the analytical approach, wrote the literature summary, debugged a critical reasoning step. These likely warrant a brief note. +- **Substituted work** where AI produced something you submitted as your own without meaningful engagement, including running an assignment through ChatGPT and turning in the output. This is the case policies are most worried about, and it sits closer to academic dishonesty than to the disclosure question. + +The distinction is *substantive contribution* versus *background assistance*, and it's roughly the same one you would use for human help. Casual tutoring from a peer doesn't go in your acknowledgments, but a significant intellectual contribution does. + +### Where and how to disclose + +The form depends on context: + +- **Course assignments.** Follow your course's stated policy. Most courses now have one; if yours doesn't, ask. When in doubt, a brief "I used [tool] for [purpose]" note at the top is the conservative choice. +- **Theses and dissertations.** Many programs now require an AI-use statement in the front matter. Check with your department. +- **Conference and journal papers.** Most major venues (ICML, NeurIPS, Nature family, ACS journals, etc.) have explicit policies as of 2024–2025. Common pattern: AI cannot be listed as an author; substantive AI use must be disclosed in methods or acknowledgements. +- **Code commits and PRs.** Practices vary here. Some teams use `Co-Authored-By: AI ` lines for substantial contributions. Others don't. Follow the team's convention; if there isn't one, ask. 
(This guide's own commits use `Co-Authored-By: Claude` lines where AI substantively contributed.) +- **Code review.** Reviewers benefit from knowing which sections of a change were AI-generated, not as a red flag, but to inform what kind of review the section needs. + +A useful pattern when you do disclose is to state three things: + +1. **What tool you used** (specific model and version if available — "Claude Opus 4.7," "ChatGPT-4o," "GitHub Copilot") +2. **What you used it for** ("debugging error messages," "drafting the introduction," "generating boilerplate code") +3. **What you did with the output** ("reviewed and edited," "used as a starting point and rewrote," "used as-is after verification") + +Example: + +> *I used Claude to help structure the introduction and to debug error messages I encountered while running the analysis. All AI-generated text was reviewed and edited. All AI-generated code was verified against known limit cases.* + +That's enough for most academic contexts. Specific journals may require a particular form. + +### What disclosure is not + +Disclosure is not an admission of wrongdoing. It is an accurate description of your process. The work is yours and the AI was a tool. Be matter-of-fact about it the same way you would acknowledge a specific Python library, a colleague's helpful conversation, or a textbook reference. + + +## Exercises + +> **Exercise 1:** Take a piece of AI-generated code you used recently. Apply the verification triage table — where on the spectrum was it, and did you verify proportionately? If not, do the verification now and note what (if anything) you find. + +> **Exercise 2:** Run through the pre-paste checklist on a recent chat session. Was there anything in the paste you would do differently now? + +> **Exercise 3:** Find the AI-use policy for one context that applies to you (a current course, your department, a journal you have submitted to, your employer). Read it. 
Note the differences from this guide's general advice — those differences are the ones that matter for you specifically. diff --git a/07-local-models/README.md b/07-local-models/README.md new file mode 100644 index 0000000..41b51fa --- /dev/null +++ b/07-local-models/README.md @@ -0,0 +1,146 @@ +# Using Local Models + +## Key idea + +You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware: no API, no per-token cost, no data leaving the machine. Local models are not a fourth *mode* on top of chat, editor, and agent — they cut across all three. The same workflow patterns apply; what changes is the tool that hosts the model and what you give up (and gain) by running it yourself. + +This section is about local models as a *user* of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop). + +## Key goals + +- Understand why you might prefer a local model to a cloud model +- Recognize which tools in each of the three modes support local models +- Calibrate expectations about capability and latency relative to frontier cloud models +- Identify the situations where local is the right choice and where cloud still wins + +--- + +## Why run a local model? + +Five reasons, ordered by how often they matter in practice: + +1. **Privacy and IP.** Code, data, and prompts never leave your machine. This is the deciding factor for proprietary work, IRB-constrained research, employer-restricted code, and anything covered by an NDA or government contract. *What you don't send, the service can't see.* +2. **Cost.** No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use — long agentic sessions, batch processing — the savings add up quickly. +3. 
**Offline operation.** Works on a plane, in a lab without internet, in a SCIF, on a remote field deployment. Cloud models simply don't. +4. **Control and reproducibility.** You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. Useful for reproducible research and long-lived pipelines. +5. **Learning.** Running a model yourself forces you to understand what it is, what it can do, and where it breaks. This is a real benefit for engineers and researchers who plan to work with these systems. + +These are *also* the reasons people use cloud models for the opposite of each: convenience, no setup, always-current, no local hardware burden. + + +## Hardware reality + +Local models are constrained by your hardware in a way cloud models are not. The dominant factor is **memory** — specifically VRAM on a GPU or unified memory on Apple Silicon. + +A rough sense of what runs comfortably where, as of early 2026: + +| Hardware | Practical model size | Example models | +|---|---|---| +| 8 GB RAM/VRAM | 1–3 B parameter models, heavily quantized | Gemma 2 2B, Phi 3 Mini | +| 16 GB | 7–8 B at moderate quantization | Llama 3.1 8B, Qwen 2.5 Coder 7B | +| 24–32 GB (high-end laptop GPU or Apple Silicon) | 13–32 B at moderate quantization | Qwen 2.5 Coder 32B, Mistral Small | +| 48–64 GB (Mac Studio, server GPU) | 70 B class at heavy quantization | Llama 3.3 70B, DeepSeek Coder V2 | +| 128 GB+ workstation | 70 B at lighter quantization, or multiple models | larger Qwen, Mixtral variants | + +**Quantization** (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. You trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization. 
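
The memory arithmetic behind the table is simple enough to do yourself. The sketch below estimates the size of the weights alone as parameters times bits per weight; it deliberately ignores the context cache, activations, and runtime overhead, so treat the result as a lower bound rather than a requirement. The function name is made up for this example.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    (params_billions * 1e9 weights) * (bits_per_weight / 8 bytes) / 1e9 bytes per GB
    simplifies to params_billions * bits_per_weight / 8.
    """
    return params_billions * bits_per_weight / 8


for params, bits in [(8, 16), (8, 4), (32, 4), (70, 4)]:
    print(f"{params} B parameters at {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

An 8 B model drops from roughly 16 GB at 16-bit to roughly 4 GB at 4-bit, which is why it fits the 16 GB tier with room left for context, and a 70 B model at 4-bit needs roughly 35 GB for weights, which is why it lands in the 48–64 GB tier.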
+ +If you took the time to fill out the spec table in [computing-setup section 01](https://lem.che.udel.edu/git/furst/computing-setup/src/branch/main/01-know-your-machine/), you already know what tier you're in. + + +## Local models across the three modes + +The three-mode framing from [section 01](../01-three-modes/) still applies — what changes is the host. + +### Local in *chat* mode + +You can have a private, local ChatGPT-style experience entirely on your laptop. + +| Tool | What it is | +|---|---| +| **Ollama** | A CLI + background service that downloads and runs models. Lowest friction; serves an OpenAI-compatible API on `localhost:11434`. | +| **LM Studio** | A polished desktop app for downloading, running, and chatting with models. Good for those who want a GUI from the start. | +| **Open WebUI** | A self-hosted web UI (like ChatGPT) that talks to Ollama or any OpenAI-compatible backend. Good if you want a familiar chat experience or want to share access on a LAN. | +| **Jan**, **GPT4All** | Other desktop chat apps with similar goals. | + +The Ollama-powered backends in particular are useful well beyond chat — most of the editor and agentic tools below can connect to an Ollama endpoint, which means setting up Ollama once unlocks every mode. + +### Local in *editor* mode + +Several VS Code extensions support local models. Notably, **GitHub Copilot, Microsoft Copilot, and the Claude extension do not** — they require their vendor's cloud service. If you want a local model in your editor, you need a different extension. + +| Extension | Notes | +|---|---| +| **Continue.dev** | Open-source, the flagship local-friendly extension. Works with Ollama, LM Studio, llama.cpp, and many cloud providers. Supports autocomplete, inline edit, and a chat panel. The first tool to try. | +| **Cody** (Sourcegraph) | Has a "local context" mode and can use local models via Ollama. Also has a strong cloud product. | +| **Llama Coder** | Ollama-focused, autocomplete-first. 
Lightweight. | +| **Tabby** | A self-hosted code completion server. Heavier setup but good for shared use within a team or lab. | + +For Neovim users: `codecompanion.nvim`, `avante.nvim`, and `gen.nvim` all support local backends. + +### Local in *agentic* mode + +Agentic tools are where local-vs-cloud differences are most visible. Multi-step tasks make many model calls, so latency and capability gaps compound. + +| Tool | Notes | +|---|---| +| **Aider** | Terminal-based pair programmer. Supports any OpenAI-compatible endpoint, including Ollama. Mature local support. | +| **Cline** (VS Code extension) | Agentic VS Code extension with broad provider support including local via Ollama. | +| **OpenHands** (formerly OpenDevin) | Open-source agentic platform. Works with local models with some setup. | + +Notable exclusions (as of early 2026): **Claude Code, Cursor agent mode, and Microsoft Copilot agent do not support local models.** They are tied to their respective cloud providers. + + +## The capability gap + +Frontier cloud models (Claude Opus, GPT-4o, Gemini Pro) are still better than local models at almost every coding task. Pretending otherwise sets you up for disappointment. Some honest framing: + +- **For autocomplete and short suggestions**, a good local 7–13 B model (Qwen 2.5 Coder, DeepSeek Coder Lite, Codestral) is genuinely useful and the gap to cloud is small. +- **For one-shot Q&A and short refactors**, the gap is noticeable but acceptable. You may need a second try where a frontier model would have nailed it the first time. +- **For long reasoning chains, multi-file work, or anything subtle**, the gap is large. Frontier cloud models still win clearly. +- **For agentic loops**, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating; on a 32–70 B model, they're usable. On a frontier cloud model, they're effective. + +The gap is narrowing every few months. 
The advice above will date faster than most of this guide. + +There is also a **latency gap**. A frontier cloud model returns a response in a second or two; a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and "wait, think about something else, come back." + + +## When local makes the most sense + +The clearest cases: + +- **Privacy- or IP-restricted code or data** that you genuinely cannot send to a third-party service. Employer policy, IRB constraints, NDAs, government contracts, classified work. +- **Heavy regular use** where the per-token cost would be prohibitive. Long agentic sessions, batch summarization of large datasets, internal tooling that hundreds of people query. +- **Reproducible research pipelines** where you need to pin a specific model version that won't change. +- **Offline or air-gapped environments**. +- **Learning and experimentation.** Running a model yourself is the most direct way to understand what it is and what it does. + +The cases where cloud still wins: + +- **You don't have the hardware.** Frontier cloud is cheaper than buying a workstation if you're not going to use it heavily. +- **You're at the frontier of difficulty** — the hardest reasoning, the longest contexts, the newest capabilities. The cloud has more parameters than your laptop. +- **You use AI occasionally and care more about ease than control.** Cloud is one click; local is one weekend. + + +## A practical starting setup + +If you want to try local models, the lowest-friction path is: + +1. Install [Ollama](https://ollama.com/) (`brew install ollama` on macOS; one-liner installer on Linux; native installer on Windows). +2. 
Pull a model sized for your hardware: + - 8 GB RAM: `ollama pull gemma2:2b` or `ollama pull phi3.5` + - 16 GB: `ollama pull llama3.1:8b` or `ollama pull qwen2.5-coder:7b` + - 24–32 GB+: `ollama pull qwen2.5-coder:32b` or `ollama pull llama3.3:70b` (the 70 B will be tight) +3. Try it in chat: `ollama run <model>` in the terminal, or point Open WebUI at it. +4. Try it in your editor: install **Continue.dev**, configure it to use Ollama as the provider, point it at your model. +5. Try it agentic: install **Aider** (`pip install aider-chat`), run `aider --model ollama/<model>` in a project directory. + +For deeper setup — choosing models, understanding quantization, building applications around local models — go to the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop). That repo covers how local models work and how to build with them, including RAG, semantic search, and tool use. + + +## Exercises + +> **Exercise 1:** Estimate which model size your hardware can run, using the table in *Hardware reality* and the spec table you filled out in computing-setup. If you have a machine that can run 7–32 B class models, install Ollama and pull a coder model. Run it in chat and ask it to explain a function from a recent project. + +> **Exercise 2:** Install **Continue.dev** in VS Code and configure it to use your Ollama model. Disable your cloud AI extension temporarily. Use the local model for a normal coding task for an hour. Note where the gap to cloud was noticeable and where it didn't matter. + +> **Exercise 3:** Pick a real task that involves code or data you would rather not send to a third party (a research script, a personal project, employer code you have permission to work on locally but not externally). Complete it using only local tools. Reflect on the tradeoff. diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..1a4f640 --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2026 Eric M.
Furst, University of Delaware + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..3a5c366 --- /dev/null +++ b/README.md @@ -0,0 +1,40 @@ +# Coding with AI + +A practical guide to working effectively with AI coding assistants — chat interfaces (ChatGPT, Claude, Gemini, Microsoft Copilot), in-editor extensions, and agentic tools. Our focus is on *workflow* and *judgment*: when to reach for which mode, what to paste, how to prompt, how to verify, and what to cite. + +AI tools change quickly, but the patterns change slowly. This guide aims at the patterns and uses current tools as examples. + +**A note on scope.** This guide is about *coding* — writing, editing, refactoring, and debugging software. 
Students and engineers also use AI tools heavily for *learning* tasks: explaining concepts, summarizing literature, generating practice problems, study quizzes, mnemonics, working through homework, finding the right vocabulary for a half-remembered idea. The three-mode framework here applies broadly, but the tools, examples, and tradeoffs for learning use cases are different enough to deserve their own guide. + +## Sections + +| # | Topic | Description | +|---|-------|-------------| +| [01](01-three-modes/) | **Three modes** | Web chat, in-editor, and agentic. When to use each one and the heuristics for choosing. | +| [02](02-errors-and-logs/) | **Errors and logs** | The canonical copy-paste case. How to frame what you paste so the assistant can actually help. | +| [03](03-in-editor-workflow/) | **In-editor workflow** | Autocomplete, inline edit, "explain this," refactor. Patterns that make the editor extension worth its slot. | +| [04](04-conversations/) | **Conversations** | Multi-turn design discussions, managing context, when to start a fresh chat. | +| [05](05-agentic-workflow/) | **Agentic workflow** | What agentic tools (Claude Code, Cursor agent, Microsoft Copilot agent mode) actually do, and how to supervise them. | +| [06](06-verifying-and-citing/) | **Verifying and citing** | Reviewing AI output for hallucinations and silent errors. Privacy and IP of what you paste. Attribution in academic and professional work. | +| [07](07-local-models/) | **Using local models** | Local models as a cross-cutting alternative — privacy, cost, offline operation. Which tools support local in each of the three modes, and where the capability gap to cloud still matters. | + +## Who this is for + +Students and practicing engineers who are already using AI assistants but want to use them more deliberately — including those whose default workflow is "ask ChatGPT, copy the answer back." 
There is nothing wrong with copy-paste, but our goal is to know *when* it is the right tool and when to use something else. + +## Prerequisites + +- A working development setup (editor, terminal, version control). See [computing-setup](https://lem.che.udel.edu/git/furst/computing-setup) and [cli-walkthrough](https://lem.che.udel.edu/git/furst/cli-walkthrough) for the underlying skills. +- Access to at least one AI tool. The examples use Claude and ChatGPT in chat form, and GitHub Copilot / Claude / Codeium / Microsoft Copilot interchangeably as editor extensions. University-provided access (e.g., Microsoft Copilot or Gemini through institutional agreements) works equally well for nearly everything covered here. + +## A note on tools and dates + +Tool capabilities, pricing, and policies change frequently. Where this guide names a specific feature ("Cursor's agent mode," "Claude Code"), the description reflects what those tools did as of the first half of 2026. The underlying patterns, including copy-paste versus in-editor versus agentic AI, are durable. Remember to treat any tool-specific advice as illustrative. + +## License + +MIT + +## Author + +Eric M. Furst, University of Delaware