Update module docs: fix arXiv URL, uv setup, nanoGPT clone path

- Use HTTPS for arXiv API (was returning 301 on HTTP)
- Point module 01 preliminaries to root uv sync instead of separate venv
- Clone nanoGPT into 01-nanogpt/ and add to .gitignore
- Add llama3.1:8B to module 02 models table
- Various editorial updates to modules 01 and 02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eric 2026-04-01 22:25:42 -04:00
commit e10e411e41
3 changed files with 41 additions and 34 deletions


@@ -17,9 +17,9 @@ Learn how to run LLMs locally without a cloud-based API.
---
-Our work with LLMs so far focused on `nanoGPT`, a Python-based code that can train and run inference on a simple GPT implementation. In this handout, we will explore running something between it and API-based models like ChatGPT. Specifically, we will try `ollama`. This is a local runtime environment and model manager that is designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to access LLMs to build and experiment with but don't want to rely on cloud-based APIs. (An API — Application Programming Interface — is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.)
+Our work with LLMs so far focused on `nanoGPT`, a Python-based code that can train and run inference on a simple GPT implementation. In this section, we will explore running something between it and API-based models like ChatGPT. Specifically, we will try `ollama`. This is a local runtime environment and model manager that is designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to access LLMs to build and experiment with but don't want to rely on cloud-based APIs. (An API — Application Programming Interface — is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.)
-`Ollama` is written in Go and `llama.cpp` is a C++ library for running LLMs. Both are cross-platform and can be run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level with more control over loading models, quantization, memory usage, batching, and token streaming.
+`Ollama` is written in the language *Go* and `llama.cpp` is a C++ library. Both are cross-platform and can be run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level with more control over loading models, quantization, memory usage, batching, and token streaming.
Both tools support a **GGUF** model format. This is a format suitable for running models efficiently on CPUs and lower-end GPUs. GGUF is a versioned binary specification that embeds the:
@@ -93,7 +93,7 @@ Other sources of models include Huggingface:
There are so many models! The LLM ecosystem is growing rapidly, with many use-cases steering models toward different specialized tasks.
-There are a few ways to download a model from different registries. Running `ollama` with the `run` command and a model file will download the model if a local version isn't available (we will do this in the next section). You can also `pull` a model without running it.
+There are a few ways to download a model from the registries. Running `ollama` with the `run` command and a model file will download the model if a local version isn't available (we will do this in the next section). You can also `pull` a model without running it.
### Launch ollama from the command line
@@ -116,7 +116,7 @@ There you go! The model will interact with you just like the chatbots we use in
> 2. `llama3.2:latest`
> 3. `gemma3:1b`
>
-> Experiment with other models.
+> Note the size of the model, the quantization used, the context length, and other parameters. Experiment with other models.
Here's an interaction with the gemma3 model:
@@ -268,7 +268,7 @@ One advantage of running models locally is that your data never leaves your mach
You can incorporate `ollama` directly into your command line by passing a prompt as an argument:
```bash
-ollama run llama3.2 "Summarize this file: $(cat README.md)"
+ollama run llama3.1:8B "Summarize this file: $(cat README.md)"
```
The `$(cat ...)` substitution injects the file contents into the prompt. Now you can incorporate LLMs into shell scripts!
@@ -280,13 +280,13 @@ The `data/` directory contains 10 emails from the University of Delaware preside
Summarize a single email:
```bash
-ollama run llama3.2 "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)"
+ollama run llama3.1:8B "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)"
```
Summarize several at once:
```bash
-cat data/*.txt | ollama run llama3.2 "Summarize the following collection of emails. What are the major themes?"
+cat data/*.txt | ollama run llama3.1:8B "Summarize the following collection of emails. What are the major themes?"
```
You can also save the output to a file:
@@ -296,25 +296,29 @@ cat data/*.txt | ollama run command-r7b:latest \
"Summarize these emails:" > summary.txt
```
-> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.2` and `command-r7b`). How do the summaries differ in length, style, and accuracy?
+> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.1:8B` and `command-r7b`). How do the summaries differ in length, style, and accuracy?
### Summarizing arXiv abstracts
We can pull abstracts directly from arXiv using `curl`. The following command fetches the 20 most recent abstracts in Computation and Language (cs.CL):
```bash
-curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml
+curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml
```
Take a look at the XML with `less arxiv_cl.xml`. Now ask a model to summarize it:
```bash
-ollama run llama3.2 "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)"
+ollama run llama3.1:8B "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)"
```
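Before handing the raw XML to a model, it can help to strip the feed down to just titles and abstracts, since the full Atom markup wastes context-window tokens. A minimal sketch using only Python's standard library; the inline `sample` feed is a stand-in for the real `arxiv_cl.xml` you downloaded with `curl`:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # arXiv's API returns an Atom feed

def extract_abstracts(xml_text):
    """Return (title, abstract) pairs from an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="").strip()
        summary = entry.findtext(f"{ATOM}summary", default="").strip()
        entries.append((title, summary))
    return entries

# Tiny stand-in feed with the same structure as arxiv_cl.xml
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Paper A</title><summary>Abstract A.</summary></entry>
  <entry><title>Paper B</title><summary>Abstract B.</summary></entry>
</feed>"""

for title, abstract in extract_abstracts(sample):
    print(f"{title}: {abstract}")
```

Piping the extracted pairs, rather than the whole XML file, into the prompt keeps the input well under the model's context length.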
-> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you?
+> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you? How large are the files compared to the model context length?
-> **Exercise 7:** Experiment with running local models on your own documents or data.
+> **Exercise 7:** Run the same model multiple times on a single file. Note the differences in the model output each time.
+> **Exercise 8:** Experiment with running local models on your own documents or data.
+> **Exercise 9:** Use a local model to write a prompt for a cloud-based service.
### Code generation
@@ -342,11 +346,11 @@ Other coding models to try: `codellama:7b`, `deepseek-coder-v2:latest`, `starcod
**A word of caution.** When I tried the van der Waals prompt above, the model returned a confident response with correct-looking LaTeX, a well-structured Python function, and code that ran without errors. But the derivation was wrong. The rearrangement of the van der Waals equation didn't follow from the original, and the code implemented the wrong math. The function converged to *an* answer, but not a correct one.
-**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it.
+**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. (Ask yourself why as you consider the underlying transformer architecture we studied in module 01.) Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it.
-> **Exercise 8:** Compare the output of a general-purpose model (`llama3.2`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output?
+> **Exercise 10:** Compare the output of a general-purpose model (`llama3.1:8B`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output?
-> **Exercise 9:** Ask a coding model to solve a problem where you already know the answer — a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps.
+> **Exercise 11:** Ask a coding model to solve a problem where you already know the answer, such as a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps.
### Customize ollama
@@ -393,13 +397,13 @@ Set system message.
## 5. Concluding remarks
-Running inference locally on a large language model is surprisingly good. Using (relatively) simple hardware, our machines generate language that is coherent and it does a good job parsing prompts. The experience demonstrates that the majority of computational effort with LLMs is in training the model — a process that is rapidly becoming increasingly sophisticated and tailored for different uses.
+Running inference locally on a large language model is surprisingly good. Using (relatively) simple hardware, our machines generate language that is coherent, and they do a good job parsing prompts. The experience demonstrates that the majority of computational effort with LLMs is in training the model — a process that is rapidly becoming increasingly sophisticated and tailored for different uses.
With local models (as well as cloud-based APIs), we can build new tools that make use of natural language processing. With `ollama` acting as a local server, the model can be run with Python, giving us the ability to implement its features in our own programs. For one Python library, see:
- https://github.com/ollama/ollama-python
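As an alternative to the library above, `ollama`'s local server also exposes a plain REST endpoint (`/api/generate` on port 11434), which can be called with nothing but the standard library. A minimal sketch; the payload builder is separated out so the request body can be inspected without a running server:

```python
import json
import urllib.request

def build_payload(prompt, model="llama3.2"):
    """Assemble the JSON body for ollama's /api/generate endpoint."""
    # stream=False asks for one complete JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama3.2", host="http://localhost:11434"):
    """Send a one-shot generation request to a local ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Requires `ollama serve` (or the desktop app) running locally
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(build_payload("Why is the sky blue?"))
```

The `ollama-python` package wraps this same API with a friendlier interface; the raw version is useful for seeing exactly what crosses the wire.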
-In class, I demonstrated a simple thermodynamics assistant based on a simple Retrieval-Augmented Generation strategy. This code takes a query from the user, encodes it with an embedding model, compares it to previously embedded statements (in my case the index of a thermodynamics book), and returns the information by generating a response with a decoding GPT (one of the models we used above).
+Recently, we used this approach to build a thermodynamics assistant based on a simple Retrieval-Augmented Generation (RAG) strategy (the next topic in this series). The assistant code takes a query from the user in natural language, encodes it with an embedding model, compares it to previously embedded statements (in my case the index of a thermodynamics book), and returns the information by generating a response with a decoding GPT (one of the models we used above).
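The retrieval step of that Retrieval-Augmented Generation loop can be sketched in a few lines: embed the query, rank the stored passages by cosine similarity, and hand the best matches to the generator. A toy version with made-up 3-dimensional "embeddings" standing in for a real embedding model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy index of (passage, embedding) pairs. A real system would embed the
# book-index entries with an actual embedding model served by ollama.
index = [
    ("Entropy and the second law", [0.9, 0.1, 0.0]),
    ("Ideal gas equation of state", [0.1, 0.9, 0.1]),
    ("Phase equilibria and Gibbs energy", [0.2, 0.3, 0.9]),
]

def retrieve(query_vec, k=1):
    """Return the k passages whose embeddings best match the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query whose (toy) embedding points toward the entropy passage
print(retrieve([1.0, 0.0, 0.0]))  # → ['Entropy and the second law']
```

The retrieved passages are then pasted into the prompt for a generative model, exactly as we pasted file contents into prompts with `$(cat ...)` above.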
## Additional resources and references
@@ -431,6 +435,7 @@ Model registry:
| Model | Size | Type | Used for |
|-------|------|------|----------|
| `llama3:latest` | 4.7 GB | General purpose | Chat, comparison |
+| `llama3.1:8B` | 4.9 GB | General purpose | arXiv summarization, comparison |
| `llama3.2:latest` | 2.0 GB | General purpose | Chat, summarization, comparison |
| `gemma3:1b` | 815 MB | General purpose | Chat, comparison |
| `command-r7b:latest` | 4.7 GB | RAG-optimized | Document summarization |