Restructures section 01 from "web chat / in-editor / agentic" into "web chat vs. tools that live with your code," with the autocomplete / in-project chat / agentic spectrum as a sub-structure of the latter. Inline edits are reduced to a historical note tied to the 2023 instruction-tuned LLM era. - Rename 01-three-modes -> 01-two-worlds and 03-in-editor-workflow -> 03-autocomplete; section 03 narrows to autocomplete (ghost text habits, the autocomplete-your-verification trap) - Section 04 reframes in-project chat as the default venue, web chat as a special-case venue; adds "Carrying context across sessions" covering dev-log.md, CLAUDE.md, .cursorrules - Section 05 reworks intro to contrast against in-project chat instead of "editor extension"; tightens prose and removes em-dashes - Update cross-references and tool-mode language in 02, 06, 07, and the root README to match the new framing - Swap the CRDT example in section 04 for finite-volume methods, fitting the CHEG audience - Minor typo/wording fixes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| README.md | ||
Using Local Models
Key idea
You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware: no API, no per-token cost, no data leaving the machine. Local models cut across every workflow we've covered — web chat, autocomplete, in-project chat, and agentic — rather than being a separate mode. The same workflow patterns apply; what changes is the tool that hosts the model and what you give up (and gain) by running it yourself.
This section is about local models as a user of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see the llm-workshop.
Key goals
- Understand why you might prefer a local model to a cloud model
- Recognize which tools across the autocomplete/chat/agent spectrum support local models
- Calibrate expectations about capability and latency relative to frontier cloud models
- Identify the situations where local is the right choice and where cloud still wins
Why run a local model?
Five reasons, ordered by how often they matter in practice:
- Privacy and IP. Code, data, and prompts never leave your machine. This is the deciding factor for proprietary work, IRB-constrained research, employer-restricted code, and anything covered by an NDA or government contract. What you don't send, the service can't see.
- Cost. No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use — long agentic sessions, batch processing — the savings add up quickly.
- Offline operation. Works on a plane, in a lab without internet, in a SCIF, on a remote field deployment. Cloud models simply don't.
- Control and reproducibility. You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. Useful for reproducible research and long-lived pipelines.
- Learning. Running a model yourself forces you to understand what it is, what it can do, and where it breaks. This is a real benefit for engineers and researchers who plan to work with these systems.
These are also the reasons people use cloud models for the opposite of each: convenience, no setup, always-current, no local hardware burden.
Hardware reality
Local models are constrained by your hardware in a way cloud models are not. The dominant factor is memory — specifically VRAM on a GPU or unified memory on Apple Silicon.
A rough sense of what runs comfortably where, as of early 2026:
| Hardware | Practical model size | Example models |
|---|---|---|
| 8 GB RAM/VRAM | 1–3 B parameter models, heavily quantized | Gemma 2 2B, Phi 3 Mini |
| 16 GB | 7–8 B at moderate quantization | Llama 3.1 8B, Qwen 2.5 Coder 7B |
| 24–32 GB (high-end laptop GPU or Apple Silicon) | 13–32 B at moderate quantization | Qwen 2.5 Coder 32B, Mistral Small |
| 48–64 GB (Mac Studio, server GPU) | 70 B class at heavy quantization | Llama 3.3 70B, DeepSeek Coder V2 |
| 128 GB+ workstation | 70 B at lighter quantization, or multiple models | larger Qwen, Mixtral variants |
Quantization (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. You trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization.
If you took the time to fill out the spec table in computing-setup section 01, you already know what tier you're in.
Local models across the workflow
The framing from section 01 still applies — what changes is the host. Below, we walk through where local models fit in each kind of work.
Local in web-chat style
You can have a private, local ChatGPT-style experience entirely on your laptop.
| Tool | What it is |
|---|---|
| Ollama | A CLI + background service that downloads and runs models. Lowest friction; serves an OpenAI-compatible API on localhost:11434. |
| LM Studio | A polished desktop app for downloading, running, and chatting with models. Good for those who want a GUI from the start. |
| Open WebUI | A self-hosted web UI (like ChatGPT) that talks to Ollama or any OpenAI-compatible backend. Good if you want a familiar chat experience or want to share access on a LAN. |
| Jan, GPT4All | Other desktop chat apps with similar goals. |
The Ollama-powered backends in particular are useful well beyond chat — most of the in-editor and agentic tools below can connect to an Ollama endpoint, which means setting up Ollama once unlocks every other use case.
Local for autocomplete and in-project chat
Several VS Code extensions support local models for autocomplete and side-panel chat. Notably, GitHub Copilot, Microsoft Copilot, and the Claude (legacy) extension do not — they require their vendor's cloud service. If you want a local model in your editor, you need a different extension.
| Extension | Notes |
|---|---|
| Continue.dev | Open-source, the flagship local-friendly extension. Works with Ollama, LM Studio, llama.cpp, and many cloud providers. Supports autocomplete and a chat panel. The first tool to try. |
| Cody (Sourcegraph) | Has a "local context" mode and can use local models via Ollama. Also has a strong cloud product. |
| Llama Coder | Ollama-focused, autocomplete-first. Lightweight. |
| Tabby | A self-hosted code completion server. Heavier setup but good for shared use within a team or lab. |
For Neovim users: codecompanion.nvim, avante.nvim, and gen.nvim all support local backends.
Local in agentic mode
Agentic tools are where local-vs-cloud differences are most visible. Multi-step tasks make many model calls, so latency and capability gaps compound.
| Tool | Notes |
|---|---|
| Aider | Terminal-based pair programmer. Supports any OpenAI-compatible endpoint, including Ollama. Mature local support. |
| Cline (VS Code extension) | Agentic VS Code extension with broad provider support including local via Ollama. |
| OpenHands (formerly OpenDevin) | Open-source agentic platform. Works with local models with some setup. |
Notable exclusions (as of early 2026): Claude Code, Cursor agent mode, and Microsoft Copilot agent do not support local models. They are tied to their respective cloud providers.
The capability gap
Frontier cloud models (Claude Opus, GPT-4o, Gemini Pro) are still better than local models at almost every coding task. Pretending otherwise sets you up for disappointment. Some honest framing:
- For autocomplete and short suggestions, a good local 7–13 B model (Qwen 2.5 Coder, DeepSeek Coder Lite, Codestral) is genuinely useful and the gap to cloud is small.
- For one-shot Q&A and short refactors, the gap is noticeable but acceptable. You may need a second try where a frontier model would have nailed it the first time.
- For long reasoning chains, multi-file work, or anything subtle, the gap is large. Frontier cloud models still win clearly.
- For agentic loops, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating; on a 32–70 B model, they're usable. On a frontier cloud model, they're effective.
The gap is narrowing every few months. The advice above will date faster than most of this guide.
There is also a latency gap. A frontier cloud model returns a response in a second or two; a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and "wait, think about something else, come back."
When local makes the most sense
The clearest cases:
- Privacy- or IP-restricted code or data that you genuinely cannot send to a third-party service. Employer policy, IRB constraints, NDAs, government contracts, classified work.
- Heavy regular use where the per-token cost would be prohibitive. Long agentic sessions, batch summarization of large datasets, internal tooling that hundreds of people query.
- Reproducible research pipelines where you need to pin a specific model version that won't change.
- Offline or air-gapped environments.
- Learning and experimentation. Running a model yourself is the most direct way to understand what it is and what it does.
The cases where cloud still wins:
- You don't have the hardware. Frontier cloud is cheaper than buying a workstation if you're not going to use it heavily.
- You're at the frontier of difficulty — the hardest reasoning, the longest contexts, the newest capabilities. The cloud has more parameters than your laptop.
- You use AI occasionally and care more about ease than control. Cloud is one click; local is one weekend.
A practical starting setup
If you want to try local models, the lowest-friction path is:
- Install Ollama (
brew install ollamaon macOS; one-liner installer on Linux; native installer on Windows). - Pull a model sized for your hardware:
- 8 GB RAM:
ollama pull gemma2:2borollama pull phi3.5 - 16 GB:
ollama pull llama3.1:8borollama pull qwen2.5-coder:7b - 24–32 GB+:
ollama pull qwen2.5-coder:32borollama pull llama3.3:70b(the 70 B will be tight)
- 8 GB RAM:
- Try it in chat:
ollama run <model-name>in the terminal, or point Open WebUI at it. - Try it in your editor: install Continue.dev, configure it to use Ollama as the provider, point it at your model.
- Try it agentic: install Aider (
pip install aider-chat), runaider --model ollama/<model-name>in a project directory.
For deeper setup — choosing models, understanding quantization, building applications around local models — go to the llm-workshop. That repo covers how local models work and how to build with them, including RAG, semantic search, and tool use.
Exercises
Exercise 1: Estimate which model size your hardware can run, using the table in Hardware reality and the spec table you filled out in computing-setup. If you have a machine that can run 7–32 B class models, install Ollama and pull a coder model. Run it in chat and ask it to explain a function from a recent project.
Exercise 2: Install Continue.dev in VS Code and configure it to use your Ollama model. Disable your cloud AI extension temporarily. Use the local model for a normal coding task for an hour. Note where the gap to cloud was noticeable and where it didn't matter.
Exercise 3: Pick a real task that involves code or data you would rather not send to a third party (a research script, a personal project, employer code you have permission to work on locally but not externally). Complete it using only local tools. Reflect on the tradeoff.