Initial commit: coding-with-ai
A practical guide to working effectively with AI coding assistants (chat interfaces, in-editor extensions, agentic tools) for engineers and scientists solving problems with code rather than building production software. Seven sections: - 01-three-modes: web chat vs in-editor vs agentic, with heuristics for choosing and a framing of chat as natural-language programming. - 02-errors-and-logs: the canonical copy-paste case; framing the paste for useful answers. - 03-in-editor-workflow: autocomplete, inline edit, side panel, quick actions; habits that survive tool changes. - 04-conversations: multi-turn discussions, context-window awareness, opening well, prompt iteration, when to start fresh. - 05-agentic-workflow: variations on the basic loop (sub-agents, plan mode, async, MCP, sandboxing); briefing, supervision, damage control, cost and energy. - 06-verifying-and-citing: hallucinations and silent errors; privacy framed against the cloud-services baseline; proportional disclosure norms. - 07-local-models: local models as a cross-cutting alternative across all three modes; hardware tiers, tool support, capability gap. Tool-agnostic where possible; current tool examples are illustrative and expected to date. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
commit
5780cdf097
9 changed files with 958 additions and 0 deletions
146
07-local-models/README.md
Normal file
146
07-local-models/README.md
Normal file
|
|
@ -0,0 +1,146 @@
|
|||
# Using Local Models
|
||||
|
||||
## Key idea
|
||||
|
||||
You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware: no API, no per-token cost, no data leaving the machine. Local models are not a fourth *mode* on top of chat, editor, and agent — they cut across all three. The same workflow patterns apply; what changes is the tool that hosts the model and what you give up (and gain) by running it yourself.
|
||||
|
||||
This section is about local models as a *user* of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop).
|
||||
|
||||
## Key goals
|
||||
|
||||
- Understand why you might prefer a local model to a cloud model
|
||||
- Recognize which tools in each of the three modes support local models
|
||||
- Calibrate expectations about capability and latency relative to frontier cloud models
|
||||
- Identify the situations where local is the right choice and where cloud still wins
|
||||
|
||||
---
|
||||
|
||||
## Why run a local model?
|
||||
|
||||
Five reasons, ordered by how often they matter in practice:
|
||||
|
||||
1. **Privacy and IP.** Code, data, and prompts never leave your machine. This is the deciding factor for proprietary work, IRB-constrained research, employer-restricted code, and anything covered by an NDA or government contract. *What you don't send, the service can't see.*
|
||||
2. **Cost.** No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use — long agentic sessions, batch processing — the savings add up quickly.
|
||||
3. **Offline operation.** Works on a plane, in a lab without internet, in a SCIF, on a remote field deployment. Cloud models simply don't.
|
||||
4. **Control and reproducibility.** You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. Useful for reproducible research and long-lived pipelines.
|
||||
5. **Learning.** Running a model yourself forces you to understand what it is, what it can do, and where it breaks. This is a real benefit for engineers and researchers who plan to work with these systems.
|
||||
|
||||
These are *also* the reasons people use cloud models for the opposite of each: convenience, no setup, always-current, no local hardware burden.
|
||||
|
||||
|
||||
## Hardware reality
|
||||
|
||||
Local models are constrained by your hardware in a way cloud models are not. The dominant factor is **memory** — specifically VRAM on a GPU or unified memory on Apple Silicon.
|
||||
|
||||
A rough sense of what runs comfortably where, as of early 2026:
|
||||
|
||||
| Hardware | Practical model size | Example models |
|
||||
|---|---|---|
|
||||
| 8 GB RAM/VRAM | 1–3 B parameter models, heavily quantized | Gemma 2 2B, Phi 3 Mini |
|
||||
| 16 GB | 7–8 B at moderate quantization | Llama 3.1 8B, Qwen 2.5 Coder 7B |
|
||||
| 24–32 GB (high-end laptop GPU or Apple Silicon) | 13–32 B at moderate quantization | Qwen 2.5 Coder 32B, Mistral Small |
|
||||
| 48–64 GB (Mac Studio, server GPU) | 70 B class at heavy quantization | Llama 3.3 70B, DeepSeek Coder V2 |
|
||||
| 128 GB+ workstation | 70 B at lighter quantization, or multiple models | larger Qwen, Mixtral variants |
|
||||
|
||||
**Quantization** (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. You trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization.
|
||||
|
||||
If you took the time to fill out the spec table in [computing-setup section 01](https://lem.che.udel.edu/git/furst/computing-setup/src/branch/main/01-know-your-machine/), you already know what tier you're in.
|
||||
|
||||
|
||||
## Local models across the three modes
|
||||
|
||||
The three-mode framing from [section 01](../01-three-modes/) still applies — what changes is the host.
|
||||
|
||||
### Local in *chat* mode
|
||||
|
||||
You can have a private, local ChatGPT-style experience entirely on your laptop.
|
||||
|
||||
| Tool | What it is |
|
||||
|---|---|
|
||||
| **Ollama** | A CLI + background service that downloads and runs models. Lowest friction; serves an OpenAI-compatible API on `localhost:11434`. |
|
||||
| **LM Studio** | A polished desktop app for downloading, running, and chatting with models. Good for those who want a GUI from the start. |
|
||||
| **Open WebUI** | A self-hosted web UI (like ChatGPT) that talks to Ollama or any OpenAI-compatible backend. Good if you want a familiar chat experience or want to share access on a LAN. |
|
||||
| **Jan**, **GPT4All** | Other desktop chat apps with similar goals. |
|
||||
|
||||
The Ollama-powered backends in particular are useful well beyond chat — most of the editor and agentic tools below can connect to an Ollama endpoint, which means setting up Ollama once unlocks every mode.
|
||||
|
||||
### Local in *editor* mode
|
||||
|
||||
Several VS Code extensions support local models. Notably, **GitHub Copilot, Microsoft Copilot, and the Claude extension do not** — they require their vendor's cloud service. If you want a local model in your editor, you need a different extension.
|
||||
|
||||
| Extension | Notes |
|
||||
|---|---|
|
||||
| **Continue.dev** | Open-source, the flagship local-friendly extension. Works with Ollama, LM Studio, llama.cpp, and many cloud providers. Supports autocomplete, inline edit, and a chat panel. The first tool to try. |
|
||||
| **Cody** (Sourcegraph) | Has a "local context" mode and can use local models via Ollama. Also has a strong cloud product. |
|
||||
| **Llama Coder** | Ollama-focused, autocomplete-first. Lightweight. |
|
||||
| **Tabby** | A self-hosted code completion server. Heavier setup but good for shared use within a team or lab. |
|
||||
|
||||
For Neovim users: `codecompanion.nvim`, `avante.nvim`, and `gen.nvim` all support local backends.
|
||||
|
||||
### Local in *agentic* mode
|
||||
|
||||
Agentic tools are where local-vs-cloud differences are most visible. Multi-step tasks make many model calls, so latency and capability gaps compound.
|
||||
|
||||
| Tool | Notes |
|
||||
|---|---|
|
||||
| **Aider** | Terminal-based pair programmer. Supports any OpenAI-compatible endpoint, including Ollama. Mature local support. |
|
||||
| **Cline** (VS Code extension) | Agentic VS Code extension with broad provider support including local via Ollama. |
|
||||
| **OpenHands** (formerly OpenDevin) | Open-source agentic platform. Works with local models with some setup. |
|
||||
|
||||
Notable exclusions (as of early 2026): **Claude Code, Cursor agent mode, and Microsoft Copilot agent do not support local models.** They are tied to their respective cloud providers.
|
||||
|
||||
|
||||
## The capability gap
|
||||
|
||||
Frontier cloud models (Claude Opus, GPT-4o, Gemini Pro) are still better than local models at almost every coding task. Pretending otherwise sets you up for disappointment. Some honest framing:
|
||||
|
||||
- **For autocomplete and short suggestions**, a good local 7–13 B model (Qwen 2.5 Coder, DeepSeek Coder Lite, Codestral) is genuinely useful and the gap to cloud is small.
|
||||
- **For one-shot Q&A and short refactors**, the gap is noticeable but acceptable. You may need a second try where a frontier model would have nailed it the first time.
|
||||
- **For long reasoning chains, multi-file work, or anything subtle**, the gap is large. Frontier cloud models still win clearly.
|
||||
- **For agentic loops**, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating; on a 32–70 B model, they're usable. On a frontier cloud model, they're effective.
|
||||
|
||||
The gap is narrowing every few months. The advice above will date faster than most of this guide.
|
||||
|
||||
There is also a **latency gap**. A frontier cloud model returns a response in a second or two; a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and "wait, think about something else, come back."
|
||||
|
||||
|
||||
## When local makes the most sense
|
||||
|
||||
The clearest cases:
|
||||
|
||||
- **Privacy- or IP-restricted code or data** that you genuinely cannot send to a third-party service. Employer policy, IRB constraints, NDAs, government contracts, classified work.
|
||||
- **Heavy regular use** where the per-token cost would be prohibitive. Long agentic sessions, batch summarization of large datasets, internal tooling that hundreds of people query.
|
||||
- **Reproducible research pipelines** where you need to pin a specific model version that won't change.
|
||||
- **Offline or air-gapped environments**.
|
||||
- **Learning and experimentation.** Running a model yourself is the most direct way to understand what it is and what it does.
|
||||
|
||||
The cases where cloud still wins:
|
||||
|
||||
- **You don't have the hardware.** Frontier cloud is cheaper than buying a workstation if you're not going to use it heavily.
|
||||
- **You're at the frontier of difficulty** — the hardest reasoning, the longest contexts, the newest capabilities. The cloud has more parameters than your laptop.
|
||||
- **You use AI occasionally and care more about ease than control.** Cloud is one click; local is one weekend.
|
||||
|
||||
|
||||
## A practical starting setup
|
||||
|
||||
If you want to try local models, the lowest-friction path is:
|
||||
|
||||
1. Install [Ollama](https://ollama.com/) (`brew install ollama` on macOS; one-liner installer on Linux; native installer on Windows).
|
||||
2. Pull a model sized for your hardware:
|
||||
- 8 GB RAM: `ollama pull gemma2:2b` or `ollama pull phi3.5`
|
||||
- 16 GB: `ollama pull llama3.1:8b` or `ollama pull qwen2.5-coder:7b`
|
||||
- 24–32 GB+: `ollama pull qwen2.5-coder:32b` or `ollama pull llama3.3:70b` (the 70 B will be tight)
|
||||
3. Try it in chat: `ollama run <model-name>` in the terminal, or point Open WebUI at it.
|
||||
4. Try it in your editor: install **Continue.dev**, configure it to use Ollama as the provider, point it at your model.
|
||||
5. Try it agentic: install **Aider** (`pip install aider-chat`), run `aider --model ollama/<model-name>` in a project directory.
|
||||
|
||||
For deeper setup — choosing models, understanding quantization, building applications around local models — go to the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop). That repo covers how local models work and how to build with them, including RAG, semantic search, and tool use.
|
||||
|
||||
|
||||
## Exercises
|
||||
|
||||
> **Exercise 1:** Estimate which model size your hardware can run, using the table in *Hardware reality* and the spec table you filled out in computing-setup. If you have a machine that can run 7–32 B class models, install Ollama and pull a coder model. Run it in chat and ask it to explain a function from a recent project.
|
||||
|
||||
> **Exercise 2:** Install **Continue.dev** in VS Code and configure it to use your Ollama model. Disable your cloud AI extension temporarily. Use the local model for a normal coding task for an hour. Note where the gap to cloud was noticeable and where it didn't matter.
|
||||
|
||||
> **Exercise 3:** Pick a real task that involves code or data you would rather not send to a third party (a research script, a personal project, employer code you have permission to work on locally but not externally). Complete it using only local tools. Reflect on the tradeoff.
|
||||
Loading…
Add table
Add a link
Reference in a new issue