coding-with-ai/07-local-models/README.md
Eric Furst 4194680475 Align prose with STYLE.md across modules 01-07 and top-level README
Replace residual em-dashes, arrow-notation shorthand, and a handful of
filler intensifiers; fix two small typos. Add .gitignore to keep the
working CHANGES.md audit out of the repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:47:19 -04:00

14 KiB
Raw Blame History

Using Local Models

Key idea

You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware with no API, no per-token cost, and no data leaving the machine. Local models cut across every workflow we've covered (web chat, autocomplete, in-project chat, and agentic) rather than being a separate mode. The same workflow patterns we've reviewed in this tutorial apply to local models. What changes is the tool that hosts the model and what you give up (and gain) by running it yourself.

This section is about local models as a user of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see llm-workshop.

Key goals

  • Understand why you might prefer a local model to a cloud model
  • Recognize which tools across the autocomplete/chat/agent spectrum support local models
  • Calibrate expectations about capability and latency relative to frontier cloud models
  • Identify the situations where local is the right choice and where cloud still wins

Why run a local model?

Here are six reasons to run a local model, ordered by how often they matter in practice:

  1. Privacy and IP. Code, data, and prompts never leave your machine. This is the deciding factor for proprietary work, IRB-constrained research, employer-restricted code, and anything covered by an NDA or government contract. What you don't send, the service can't see.
  2. Cost. No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use (long agentic sessions, batch processing) the savings add up quickly.
  3. Offline operation. Works on a plane, in a lab without internet, in a SCIF, on a remote field deployment. Cloud models simply don't.
  4. Control and reproducibility. You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. This is useful for reproducible research and long-lived pipelines.
  5. Learning. Running a model yourself forces you to understand what it is, what it can do, and where it breaks. This is a real benefit for engineers and researchers who plan to work with these systems.
  6. Future-proofing against vendor decisions. Pricing structures, rate limits, terms of service, available model lineups, and the surrounding tools (CLIs, IDE extensions, SDKs) all change on the vendor's schedule, not yours. A workflow built around a local model is insulated from price hikes, deprecated APIs, retired models, regional availability changes, and the slow drift of vendor lock-in. This matters most for work you expect to maintain for years.

These are also the reasons people use cloud models for the opposite of each: performance, convenience, no setup, always-current, no local hardware burden, and someone else worrying about keeping the model up to date.

Hardware reality

Local models are constrained by your hardware in a way cloud models are not. The dominant factor is memory. Specifically, VRAM on a GPU or unified memory on Apple Silicon.

A rough sense of what runs comfortably where (snapshot as of early 2026; the specific models below will date within a year, but the size tiers will not):

Hardware Practical model size Example models
8 GB RAM/VRAM 14 B parameter models, heavily quantized Gemma 4 e4b, Phi-4 mini, Qwen3 4B
16 GB 714 B at moderate quantization Qwen3 8B, Qwen3.5 9B, Gemma 3 12B
2432 GB (high-end laptop GPU or Apple Silicon) 1332 B at moderate quantization Qwen3.6 27B, Qwen3-Coder 30B, Mistral Small 3.2, Phi-4, Gemma 3 27B
4864 GB (Mac Studio, server GPU) 70 B class at heavy quantization, or smaller MoE Llama 3.3 70B, Qwen3.6 35B, Qwen3-Coder 30B (lighter quantization)
128 GB+ workstation 70 B at lighter quantization, MoE models, or multiple models in parallel Llama 4 Scout (16x17b), DeepSeek V4 Flash, GLM-5.1, Qwen3 235B

For an up-to-date view of what's available and how it ranks, treat the Ollama model library as the catalog and combine two kinds of signal:

  • Benchmark aggregators for the quantitative picture: Artificial Analysis (composite Intelligence Index), LMArena (human-preference Elo, including a Code Arena), and Vellum's open LLM leaderboard (deliberately excludes saturated benchmarks like MMLU).
  • Practitioner signal from r/LocalLLaMA, which is where quantization quality, VRAM-fit reports, tokens/sec on specific GPUs, and reactions to new model drops actually surface. Treat the boards as the baseline and the subreddit as the field report.

The original Hugging Face Open LLM Leaderboard was retired in 2025; its v1 and v2 result archives remain downloadable but no new models are scored against it. The table above will drift; those live sources track the frontier.

A note on licenses, since "open weights" is not the same as "open source." Models are released under widely varying terms. GLM-5.1 ships under a plain MIT license with no field-of-use restrictions, which is the cleanest of the current frontier-tier open releases. Meta's Llama licenses include a usage-scale cap. Several research releases restrict commercial use. If you plan to deploy a model in a classroom, lab pipeline, or research-group setting, check the license on the model's Hugging Face or Ollama page before you build on it.

Quantization (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. We trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization.

If you took the time to fill out the spec table in computing-setup section 01, you already know what tier you're in.

Local models across the workflow

The framing from section 01 still applies. What changes is the host. Below, we walk through where local models fit in each kind of work.

Local in web-chat style

You can have a private, local ChatGPT-style experience entirely on your laptop.

Tool What it is
Ollama A CLI + background service that downloads and runs models. Lowest friction; serves an OpenAI-compatible API on localhost:11434.
LM Studio A polished desktop app for downloading, running, and chatting with models. Good for those who want a GUI from the start.
Open WebUI A self-hosted web UI (like ChatGPT) that talks to Ollama or any OpenAI-compatible backend. Good if you want a familiar chat experience or want to share access on a LAN.
Jan, GPT4All Other desktop chat apps with similar goals.

The Ollama-powered backends in particular are useful well beyond chat — most of the in-editor and agentic tools below can connect to an Ollama endpoint, which means setting up Ollama once unlocks every other use case.

Local for autocomplete and in-project chat

Several VS Code extensions support local models for autocomplete and side-panel chat. Notably, GitHub Copilot, Microsoft Copilot, and the Claude (legacy) extension do not. They require their vendor's cloud service. If you want a local model in your editor, you need a different extension.

Extension Notes
Continue.dev Open-source, the flagship local-friendly extension. Works with Ollama, LM Studio, llama.cpp, and many cloud providers. Supports autocomplete and a chat panel. The first tool to try.
Cody (Sourcegraph) Has a "local context" mode and can use local models via Ollama. Also has a strong cloud product.
Llama Coder Ollama-focused, autocomplete-first. Lightweight.
Tabby A self-hosted code completion server. Heavier setup but good for shared use within a team or lab.

For Neovim users: codecompanion.nvim, avante.nvim, and gen.nvim all support local backends.

Local in agentic mode

Agentic tools are where local-vs-cloud differences are most visible. Multi-step tasks make many model calls, so latency and capability gaps compound.

Tool Notes
Aider Terminal-based pair programmer. Supports any OpenAI-compatible endpoint, including Ollama. Mature local support.
Cline (VS Code extension) Agentic VS Code extension with broad provider support including local via Ollama.
OpenHands (formerly OpenDevin) Open-source agentic platform. Works with local models with some setup.

Notable exclusions (as of early 2026): Claude Code, Cursor agent mode, and Microsoft Copilot agent do not support local models. They are tied to their respective cloud providers.

The capability gap

Frontier cloud models (Claude Opus, GPT-4o, Gemini Pro) are still better than local models at almost every coding task. Pretending otherwise sets you up for disappointment. Some honest framing:

  • For autocomplete and short suggestions, a good local 713 B model (Qwen3 8B, DeepSeek-Coder-V2 16B, Codestral) is genuinely useful and the gap with cloud models is small.
  • For one-shot Q&A and short refactors, the gap is noticeable but acceptable. You may need a second try where a frontier model would have nailed it the first time.
  • For long reasoning chains, multi-file work, or anything subtle, the gap is large. Frontier cloud models still win clearly.
  • For agentic loops, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating, and on a 3270 B model, they're usable. On a frontier cloud model, they can be fairly effective.

The gap is narrowing every few months, so this advice above will date faster than most of this guide!

There is also a latency gap. A frontier cloud model returns a response in a second or two, while a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and waiting or doing something else in the meantime.

When local makes the most sense

The clearest cases:

  • Privacy- or IP-restricted code or data that you genuinely cannot send to a third-party service. Employer policy, IRB constraints, NDAs, government contracts, classified work.
  • Heavy regular use where the per-token cost would be prohibitive. Long agentic sessions, batch summarization of large datasets, internal tooling that hundreds of people query.
  • Reproducible research pipelines where you need to pin a specific model version that won't change.
  • Offline or air-gapped environments.
  • Learning and experimentation. Running a model yourself is the most direct way to understand what it is and what it does.

The cases where cloud still wins:

  • You don't have the hardware. Frontier cloud is cheaper than buying a workstation if you're not going to use it heavily.
  • You're at the frontier of difficulty: the hardest reasoning, the longest contexts, the newest capabilities. A cloud model has (many) more parameters than your laptop or workstation.
  • You use AI occasionally and care more about ease than control. Cloud access is one (or two) clicks.

A practical starting setup

If you want to try local models, the lowest-friction path is:

  1. Install Ollama (brew install ollama on macOS; one-liner installer on Linux; native installer on Windows).
  2. Pull a model sized for your hardware (verify current names against the Ollama library; these tags drift):
    • 8 GB RAM: ollama pull gemma4:e4b or ollama pull phi4-mini
    • 16 GB: ollama pull qwen3:8b or ollama pull gemma3:12b
    • 2432 GB+: ollama pull qwen3:32b, ollama pull qwen3-coder:30b, or ollama pull llama3.3:70b (the 70 B will be tight)
  3. Try it in chat: ollama run <model-name> in the terminal, or point Open WebUI at it.
  4. Try it in your editor: install Continue.dev, configure it to use Ollama as the provider, point it at your model.
  5. Try it agentic: install Aider (pip install aider-chat), run aider --model ollama/<model-name> in a project directory.

For deeper setup — choosing models, understanding quantization, building applications around local models — go to the llm-workshop. That repo covers how local models work and how to build with them, including RAG, semantic search, and tool use.

Exercises

Exercise 1: Estimate which model size your hardware can run, using the table in Hardware reality and the spec table you filled out in computing-setup. If you have a machine that can run 732 B class models, install Ollama and pull a coder model. Run it in chat and ask it to explain a function from a recent project.

Exercise 2: Install Continue.dev in VS Code and configure it to use your Ollama model. Disable your cloud AI extension temporarily. Use the local model for a normal coding task for an hour. Note where the gap to cloud was noticeable and where it didn't matter.

Exercise 3: Pick a real task that involves code or data you would rather not send to a third party (a research script, a personal project, employer code you have permission to work on locally but not externally). Complete it using only local tools. Reflect on the tradeoff.