Align prose with STYLE.md across modules 01-07 and top-level README

Replace residual em-dashes, arrow-notation shorthand, and a handful of filler intensifiers; fix two small typos. Add .gitignore to keep the working CHANGES.md audit out of the repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 08:47:19 -04:00 · 2026-05-29 08:47:19 -04:00 · 4194680475
commit 4194680475
parent d2ca02bd90
9 changed files with 102 additions and 82 deletions
--- a/07-local-models/README.md
+++ b/07-local-models/README.md
@@ -2,9 +2,9 @@

 ## Key idea

-You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware: no API, no per-token cost, no data leaving the machine. Local models cut across every workflow we've covered — web chat, autocomplete, in-project chat, and agentic — rather than being a separate mode. The same workflow patterns apply; what changes is the tool that hosts the model and what you give up (and gain) by running it yourself.
+You do not have to use a frontier cloud model to use AI in your work. A "local" model runs entirely on your own hardware with no API, no per-token cost, and no data leaving the machine. Local models cut across every workflow we've covered (web chat, autocomplete, in-project chat, and agentic) rather than being a separate mode. The same workflow patterns we've reviewed in this tutorial apply to local models. What changes is the tool that hosts the model and what you give up (and gain) by running it yourself.

-This section is about local models as a *user* of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see the [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop).
+This section is about local models as a *user* of AI coding tools. If you want to understand how local models work under the hood, train your own, or build the infrastructure around them, see [llm-workshop](https://lem.che.udel.edu/git/furst/llm-workshop).

 ## Key goals

@@ -17,39 +17,49 @@ This section is about local models as a *user* of AI coding tools. If you want t

 ## Why run a local model?

-Five reasons, ordered by how often they matter in practice:
+Here are six reasons to run a local model, ordered by how often they matter in practice:

 1. **Privacy and IP.** Code, data, and prompts never leave your machine. This is the deciding factor for proprietary work, IRB-constrained research, employer-restricted code, and anything covered by an NDA or government contract. *What you don't send, the service can't see.*
-2. **Cost.** No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use — long agentic sessions, batch processing — the savings add up quickly.
+2. **Cost.** No per-token billing. After the hardware cost (which you may already have paid for), inference is effectively free. For heavy use (long agentic sessions, batch processing) the savings add up quickly.
 3. **Offline operation.** Works on a plane, in a lab without internet, in a SCIF, on a remote field deployment. Cloud models simply don't.
-4. **Control and reproducibility.** You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. Useful for reproducible research and long-lived pipelines.
+4. **Control and reproducibility.** You pin a specific model version. It doesn't get retired, deprecated, or silently updated under you. This is useful for reproducible research and long-lived pipelines.
 5. **Learning.** Running a model yourself forces you to understand what it is, what it can do, and where it breaks. This is a real benefit for engineers and researchers who plan to work with these systems.
+6. **Future-proofing against vendor decisions.** Pricing structures, rate limits, terms of service, available model lineups, and the surrounding tools (CLIs, IDE extensions, SDKs) all change on the vendor's schedule, not yours. A workflow built around a local model is insulated from price hikes, deprecated APIs, retired models, regional availability changes, and the slow drift of vendor lock-in. This matters most for work you expect to maintain for years.

-These are *also* the reasons people use cloud models for the opposite of each: convenience, no setup, always-current, no local hardware burden.
+Each of these has a flip side, and those flip sides are why people use cloud models: performance, convenience, no setup, no local hardware burden, and someone else keeping the model current.


 ## Hardware reality

-Local models are constrained by your hardware in a way cloud models are not. The dominant factor is **memory** — specifically VRAM on a GPU or unified memory on Apple Silicon.
+Local models are constrained by your hardware in a way cloud models are not. The dominant factor is **memory**, specifically VRAM on a GPU or unified memory on Apple Silicon.

-A rough sense of what runs comfortably where, as of early 2026:
+A rough sense of what runs comfortably where (snapshot as of early 2026; the specific models below will date within a year, but the size tiers will not):

 | Hardware | Practical model size | Example models |
 |---|---|---|
-| 8 GB RAM/VRAM | 1–3 B parameter models, heavily quantized | Gemma 2 2B, Phi 3 Mini |
-| 16 GB | 7–8 B at moderate quantization | Llama 3.1 8B, Qwen 2.5 Coder 7B |
-| 24–32 GB (high-end laptop GPU or Apple Silicon) | 13–32 B at moderate quantization | Qwen 2.5 Coder 32B, Mistral Small |
-| 48–64 GB (Mac Studio, server GPU) | 70 B class at heavy quantization | Llama 3.3 70B, DeepSeek Coder V2 |
-| 128 GB+ workstation | 70 B at lighter quantization, or multiple models | larger Qwen, Mixtral variants |
+| 8 GB RAM/VRAM | 1–4 B parameter models, heavily quantized | Gemma 4 `e4b`, Phi-4 mini, Qwen3 4B |
+| 16 GB | 7–14 B at moderate quantization | Qwen3 8B, Qwen3.5 9B, Gemma 3 12B |
+| 24–32 GB (high-end laptop GPU or Apple Silicon) | 13–32 B at moderate quantization | Qwen3.6 27B, Qwen3-Coder 30B, Mistral Small 3.2, Phi-4, Gemma 3 27B |
+| 48–64 GB (Mac Studio, server GPU) | 70 B class at heavy quantization, or smaller MoE | Llama 3.3 70B, Qwen3.6 35B, Qwen3-Coder 30B (lighter quantization) |
+| 128 GB+ workstation | 70 B at lighter quantization, MoE models, or multiple models in parallel | Llama 4 Scout (`16x17b`), DeepSeek V4 Flash, GLM-5.1, Qwen3 235B |

-**Quantization** (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. You trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization.
+For an up-to-date view of what's available and how it ranks, treat the [Ollama model library](https://ollama.com/library) as the catalog and combine two kinds of signal:
+
+- **Benchmark aggregators** for the quantitative picture: [Artificial Analysis](https://artificialanalysis.ai/) (composite Intelligence Index), [LMArena](https://lmarena.ai/) (human-preference Elo, including a Code Arena), and [Vellum's open LLM leaderboard](https://www.vellum.ai/llm-leaderboard) (deliberately excludes saturated benchmarks like MMLU).
+- **Practitioner signal** from [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/), which is where quantization quality, VRAM-fit reports, tokens/sec on specific GPUs, and reactions to new model drops actually surface. Treat the boards as the baseline and the subreddit as the field report.
+
+The original Hugging Face Open LLM Leaderboard was retired in 2025; its v1 and v2 result archives remain downloadable but no new models are scored against it. The table above will drift; those live sources track the frontier.
+
+A note on **licenses**, since "open weights" is not the same as "open source." Models are released under widely varying terms. GLM-5.1 ships under a plain MIT license with no field-of-use restrictions, which is the cleanest of the current frontier-tier open releases. Meta's Llama licenses include a usage-scale cap. Several research releases restrict commercial use. If you plan to deploy a model in a classroom, lab pipeline, or research-group setting, check the license on the model's Hugging Face or Ollama page before you build on it.
+
+**Quantization** (compressing model weights from 16-bit floats down to 4-bit or 5-bit integers) is what makes large models fit on consumer hardware. We trade a small amount of quality for a large amount of memory savings. Most local-model tools default to a sensible quantization.

 If you took the time to fill out the spec table in [computing-setup section 01](https://lem.che.udel.edu/git/furst/computing-setup/src/branch/main/01-know-your-machine/), you already know what tier you're in.


 ## Local models across the workflow

-The framing from [section 01](../01-two-worlds/) still applies — what changes is the host. Below, we walk through where local models fit in each kind of work.
+The framing from [section 01](../01-two-worlds/) still applies. What changes is the host. Below, we walk through where local models fit in each kind of work.

 ### Local in *web-chat* style

@@ -66,7 +76,7 @@ The Ollama-powered backends in particular are useful well beyond chat — most o

 ### Local for autocomplete and in-project chat

-Several VS Code extensions support local models for autocomplete and side-panel chat. Notably, **GitHub Copilot, Microsoft Copilot, and the Claude (legacy) extension do not** — they require their vendor's cloud service. If you want a local model in your editor, you need a different extension.
+Several VS Code extensions support local models for autocomplete and side-panel chat. Notably, **GitHub Copilot, Microsoft Copilot, and the Claude (legacy) extension do not**. They require their vendor's cloud service. If you want a local model in your editor, you need a different extension.

 | Extension | Notes |
 |---|---|
@@ -94,14 +104,14 @@ Notable exclusions (as of early 2026): **Claude Code, Cursor agent mode, and Mic

 Frontier cloud models (Claude Opus, GPT-4o, Gemini Pro) are still better than local models at almost every coding task. Pretending otherwise sets you up for disappointment. Some honest framing:

-- **For autocomplete and short suggestions**, a good local 7–13 B model (Qwen 2.5 Coder, DeepSeek Coder Lite, Codestral) is genuinely useful and the gap to cloud is small.
+- **For autocomplete and short suggestions**, a good local 7–13 B model (Qwen3 8B, DeepSeek-Coder-V2 16B, Codestral) is genuinely useful and the gap with cloud models is small.
 - **For one-shot Q&A and short refactors**, the gap is noticeable but acceptable. You may need a second try where a frontier model would have nailed it the first time.
 - **For long reasoning chains, multi-file work, or anything subtle**, the gap is large. Frontier cloud models still win clearly.
-- **For agentic loops**, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating; on a 32–70 B model, they're usable. On a frontier cloud model, they're effective.
+- **For agentic loops**, the gap compounds: each step has slightly worse output, errors propagate, and you spend more time supervising. Local agents on a 7 B model are frustrating, and on a 32–70 B model, they're usable. On a frontier cloud model, they can be fairly effective.

-The gap is narrowing every few months. The advice above will date faster than most of this guide.
+The gap is narrowing every few months, so the advice above will date faster than most of this guide.

-There is also a **latency gap**. A frontier cloud model returns a response in a second or two; a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and "wait, think about something else, come back."
+There is also a **latency gap**. A frontier cloud model returns a response in a second or two, while a 70 B local model on a typical workstation might take fifteen to thirty seconds for the same prompt. For autocomplete this is the difference between "helpful" and "in the way." For longer answers it's the difference between "fluid" and waiting or doing something else in the meantime.


 ## When local makes the most sense
@@ -117,8 +127,8 @@ The clearest cases:
 The cases where cloud still wins:

 - **You don't have the hardware.** Frontier cloud is cheaper than buying a workstation if you're not going to use it heavily.
-- **You're at the frontier of difficulty** — the hardest reasoning, the longest contexts, the newest capabilities. The cloud has more parameters than your laptop.
-- **You use AI occasionally and care more about ease than control.** Cloud is one click; local is one weekend.
+- **You're at the frontier of difficulty:** the hardest reasoning, the longest contexts, the newest capabilities. A cloud model has (many) more parameters than your laptop or workstation.
+- **You use AI occasionally and care more about ease than control.** Cloud access is one (or two) clicks.


 ## A practical starting setup
@@ -126,10 +136,10 @@ The cases where cloud still wins:
 If you want to try local models, the lowest-friction path is:

 1. Install [Ollama](https://ollama.com/) (`brew install ollama` on macOS; one-liner installer on Linux; native installer on Windows).
-2. Pull a model sized for your hardware:
-   - 8 GB RAM: `ollama pull gemma2:2b` or `ollama pull phi3.5`
-   - 16 GB: `ollama pull llama3.1:8b` or `ollama pull qwen2.5-coder:7b`
-   - 24–32 GB+: `ollama pull qwen2.5-coder:32b` or `ollama pull llama3.3:70b` (the 70 B will be tight)
+2. Pull a model sized for your hardware (verify current names against the [Ollama library](https://ollama.com/library); these tags drift):
+   - 8 GB RAM: `ollama pull gemma4:e4b` or `ollama pull phi4-mini`
+   - 16 GB: `ollama pull qwen3:8b` or `ollama pull gemma3:12b`
+   - 24–32 GB+: `ollama pull qwen3:32b`, `ollama pull qwen3-coder:30b`, or `ollama pull llama3.3:70b` (the 70 B will be tight)
 3. Try it in chat: `ollama run <model-name>` in the terminal, or point Open WebUI at it.
 4. Try it in your editor: install **Continue.dev**, configure it to use Ollama as the provider, point it at your model.
 5. Try it agentic: install **Aider** (`pip install aider-chat`), run `aider --model ollama/<model-name>` in a project directory.
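
Before pulling a model in step 2, a back-of-the-envelope check can show whether it is likely to fit. The sketch below is not part of the README or of Ollama. It estimates weight memory as parameter count times bits per weight, then applies an assumed 20% overhead for the KV cache and runtime buffers. The function names and the overhead factor are illustrative assumptions, and real 4-bit formats often store slightly more than 4 bits per weight, so treat the results as rough lower bounds.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         overhead: float = 1.2) -> bool:
    """True if weights plus an ASSUMED overhead factor fit in memory_gb.

    The 1.2 default is a guess covering KV cache and runtime buffers;
    long contexts can need considerably more.
    """
    return weights_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    # (label, parameters in billions, available memory in GB): illustrative cases only
    cases = [("8B on 16 GB", 8, 16), ("32B on 32 GB", 32, 32),
             ("70B on 32 GB", 70, 32), ("70B on 64 GB", 70, 64)]
    for label, params, memory in cases:
        need = weights_gb(params, 4) * 1.2
        verdict = "fits" if fits(params, 4, memory) else "does not fit"
        print(f"{label}: ~{need:.1f} GB needed at 4-bit, {verdict}")
```

Under these assumptions a 70 B model at 4-bit needs about 35 GB for weights alone, which is consistent with the hardware table placing the 70 B class in the 48–64 GB tier.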