# Large Language Models Part II: Running Local Models with Ollama
**CHEG 667-013 — Chemical Engineering with Computers**
Department of Chemical and Biomolecular Engineering, University of Delaware
---
## Key idea
Learn how to run LLMs locally without a cloud-based API.
## Key goals
- Learn about `ollama` and `llama.cpp`
- Run LLMs locally on a laptop or desktop computer
- Integrate local models with the command line to build simple workflows and scripts
---
Our work with LLMs so far has focused on `nanoGPT`, a Python-based code that can train and run inference on a simple GPT implementation. In this handout, we will explore something in between `nanoGPT` and API-based models like ChatGPT: running pretrained models on our own machines. Specifically, we will try `ollama`, a local runtime environment and model manager designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to build and experiment with LLMs but don't want to rely on cloud-based APIs. (An API, or Application Programming Interface, is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.)
`Ollama` is written in Go and `llama.cpp` is a C++ library for running LLMs. Both are cross-platform and can be run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level with more control over loading models, quantization, memory usage, batching, and token streaming.
Both tools support the **GGUF** model format, which is designed for running models efficiently on CPUs and lower-end GPUs. GGUF is a versioned binary specification that embeds the following in a single file:
- Model weights (possibly quantized);
- Tokenizer configuration and vocabulary (remember, in `nanoGPT`, we used a character-level tokenization scheme);
- Metadata such as the author, model description, and training parameters;
- Special tokens like `<bos>`, `<eos>`, and `<unk>`.
Here, **quantization** refers to how model weights are stored. Instead of 32-bit full-precision floating point numbers (`FP32`), a model file may store the weights at lower precision: half precision (`FP16`), 8-bit integers (`INT8`), or even 4-bit values (`Q4_0`). Lower precision representations save space (memory) and can speed up the inference calculations, at some cost in accuracy. In practice, speed and accuracy are balanced through the choice of quantization and the size of the model (for example, the length of the embedding vector).
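To make the memory savings concrete, here is a rough back-of-the-envelope sketch. It assumes every weight uses the same number of bits, which is a simplification: real GGUF files mix precisions and carry metadata, so actual file sizes differ somewhat. The function name `weight_size_gb` is just for illustration.

```python
# Approximate storage needed for model weights at different precisions.
# Assumption: every weight uses the same number of bits.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "4-bit": 4}

def weight_size_gb(n_params: float, bits: int) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n_params = 8e9  # an 8-billion-parameter model
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_size_gb(n_params, bits):5.1f} GB")
```

For an 8B model, this gives roughly 32 GB at `FP32` and 4 GB at four bits per weight, which is the difference between a model that does not fit on a typical laptop and one that does.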
Let's get started! We will download `ollama` and run a few models in this tutorial.
## 1. Download ollama
`Ollama` is available from the Ollama website (binaries) and from GitHub (including the source code). I downloaded `Ollama-darwin.zip`, which unzipped to a binary file, `Ollama`.
- https://ollama.com
- https://github.com/ollama/ollama
## 2. Running ollama
After downloading and installing, we can use the help option:
```
$ ollama --help
Large language model runner
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
stop Stop a running model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
Flags:
-h, --help help for ollama
-v, --version Show version information
Use "ollama [command] --help" for more information about a command.
```
We are mostly interested in the commands `pull`, `run`, and `stop` for now. But before we run anything, we have to download a model.
### Getting model files
`Ollama` plays the role that our `model.py` program played with `nanoGPT`. In those earlier experiments, we needed a *model file* with weights and tokenization (at a minimum). Remember, we built one from scratch using the character tokenization scheme and `train.py`. The power of `ollama` and `llama.cpp` comes from their ability to run much larger models like `llama`, `gemma`, `deepseek`, `phi`, and `mistral`. These are pretrained on enormous datasets and then refined with a substantial amount of supervised fine-tuning. They are far more powerful than even the GPT-2 implemented in `nanoGPT`. The `llama 3.1 8B` model (8 billion parameters) is about 5 GB and can easily run on your computer, but it took about 1.5 million GPU hours to train it. (It also helps that `ollama` and `llama.cpp` are compiled into binaries, not Python scripts.)
The model files are available at:
- https://ollama.com/search
- https://ollama.com/library
> **Exercise 1:** Go to https://ollama.com/library and look through different models. Search by popular and newest.
Other sources of models include Hugging Face:
- https://huggingface.co/models
There are so many models! The LLM ecosystem is growing rapidly, with many use-cases steering models toward different specialized tasks.
There are a few ways to download a model from different registries. Running `ollama` with the `run` command and a model name will download the model if a local version isn't available (we will do this in the next section). You can also `pull` a model without running it.
### Launch ollama from the command line
Now let's download and run a `llama` model. (You can download the model without running it using the command `ollama pull llama3:latest`, for example. On macOS and Linux, models are stored in `~/.ollama` by default.)
```bash
ollama run llama3:latest
```
This should pull it from the registry and store it locally on the machine. After downloading the files, you should see:
```
>>> Send a message (/? for help)
```
There you go! The model will interact with you just like the chatbots we use in different cloud-based services. But all of the model inference is being calculated on your computer. Try using `Task Manager` in Windows (press Ctrl+Shift+Esc) or `Activity Monitor` in macOS to check your GPU usage when you run the models.
> **Exercise 2:** Compare the speed and output of the following models:
> 1. `llama3:latest`
> 2. `llama3.2:latest`
> 3. `gemma3:1b`
>
> Experiment with other models.
Here's an interaction with the gemma3 model:
```
$ ollama run gemma3:1b
>>> In class, we used nanoGPT to generate fake Shakespeare based on a
... character-level tokenization and simple GPT implementation.
Okay, that's a really interesting and somewhat fascinating project!
NanoGPT's approach -- generating Shakespearean text from character-level
tokens and a simple GPT -- is a compelling way to explore the creative
potential of AI in a specific, constrained context. Let's break down
what this suggests and where it might lead.
Here's a breakdown of what's happening, what you might be aiming for,
and some potential avenues to explore:
...
```
### Quitting ollama
Type `/bye` or Ctrl-D when you want to quit the CLI. After some idle time, `ollama` will unload the models to save memory.
## 3. More commands
You can see what models are currently running with:
```bash
ollama ps
```
You can easily see which models are locally accessible with:
```bash
ollama list
```
```
NAME ID SIZE MODIFIED
gemma3:1b 8648f39daa8f 815 MB About an hour ago
llama3:latest 365c0bd3c000 4.7 GB 3 months ago
llama3.2:latest a80c4f17acd5 2.0 GB 3 months ago
```
At any time during a chat, you can clear the conversation context with `/clear`, and you can learn more about a model with `/show info`. For instance:
```
>>> /show info
Model
architecture gemma3
parameters 999.89M
context length 32768
embedding length 1152
quantization Q4_K_M
Capabilities
completion
Parameters
stop "<end_of_turn>"
temperature 1
top_k 64
top_p 0.95
License
Gemma Terms of Use
Last modified: February 21, 2024
```
We can see that the `gemma3` model has nearly one billion parameters and a context length of 32,768! The *embedding length* is 1152. This is equivalent to `n_embd` in `nanoGPT`: it is the size of the embedding vector space.
Above, we also see that the quantization is only four bits, but it is a little more complicated than representing numbers with just sixteen values. The `K` refers to the "k-quant" family of methods in `llama.cpp`, a groupwise quantization scheme in which weights are grouped into blocks (e.g., 32 values) and each block gets its own scale and offset for better accuracy. The `M` stands for "medium": the `_S` and `_M` suffixes denote small and medium mixes, and the medium mix keeps some tensors at a higher precision than four bits, trading a slightly larger file for better accuracy. `Q4_K_M` is a common choice for quantization when running 7B–70B models on laptop or desktop computers. (That's $10^6$–$10^7$ times more parameters than our first `nanoGPT` model!)
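The groupwise idea can be sketched in a few lines of Python. This is only an illustration of per-block scale and offset, not the actual `Q4_K` layout used by `llama.cpp`; the block size, the random weights, and the function names are assumptions made for the example.

```python
import random

def quantize_block(block, bits=4):
    """Map a block of floats to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, offset) so that value ~ offset + scale * code.
    """
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in block]
    return codes, scale, lo

def dequantize_block(codes, scale, offset):
    """Recover approximate floats from the integer codes."""
    return [offset + scale * c for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # made-up weights
block_size = 32  # illustrative; real formats fix their own block sizes

max_err = 0.0
for start in range(0, len(weights), block_size):
    block = weights[start:start + block_size]
    codes, scale, offset = quantize_block(block)
    restored = dequantize_block(codes, scale, offset)
    err = max(abs(a - b) for a, b in zip(block, restored))
    max_err = max(max_err, err)

print(f"largest reconstruction error: {max_err:.2e}")
```

Because each block stores its own scale and offset, the sixteen available levels are spread over that block's range only, so the rounding error stays within half a step of that block's scale.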
With the `/set verbose` command, you can monitor the model performance:
```
>>> /set verbose
Set 'verbose' mode.
>>> Let's write a haiku about LLMs.
Words flow, bright and new,
Code learns to speak and dream,
Future's voice takes hold.
total duration: 1.369726166s
load duration: 932.161625ms
prompt eval count: 20 token(s)
prompt eval duration: 162.531958ms
prompt eval rate: 123.05 tokens/s
eval count: 24 token(s)
eval duration: 273.27225ms
eval rate: 87.82 tokens/s
```
That exchange took a total of about 1.4 seconds using the `gemma3` model. The biggest time cost (about 0.93 s) was loading the model. Once the model is in memory, later prompts avoid most of that cost and run even faster. Turn off the verbose mode with `/set quiet`:
```
>>> /set quiet
Set 'quiet' mode.
```
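The rates in the verbose output are token counts divided by durations. A short sketch using the numbers from the haiku exchange above (the function name is just for illustration):

```python
def tokens_per_second(token_count: int, duration_s: float) -> float:
    """Throughput as reported in verbose mode: tokens divided by seconds."""
    return token_count / duration_s

# Values copied from the verbose output above
prompt_rate = tokens_per_second(20, 0.162531958)
eval_rate = tokens_per_second(24, 0.27327225)
load_fraction = 0.932161625 / 1.369726166

print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {eval_rate:.2f} tokens/s")
print(f"fraction of total time spent loading: {load_fraction:.0%}")
```

The prompt is processed faster than the response is generated because the prompt tokens can be evaluated together, while the response tokens are generated one at a time.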
> **Exercise 3:** Try different commands in `ollama` as you run a model.
### Model parameters
We can see a few model parameters, including the temperature and `top_k`. The `top_k` parameter is the number of candidate tokens, ranked by logit score, that are retained when generating the next token. The scores of the retained tokens are normalized into a probability distribution, and the next token is sampled randomly from this reduced set.
```
>>> /show parameters
Model defined parameters:
temperature 1
top_k 64
top_p 0.95
stop "<end_of_turn>"
```
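As a minimal sketch of how `top_k` and temperature interact, the function below keeps the `k` largest logits, rescales them by the temperature, applies a softmax, and samples. The logits are made up, and the exact order of operations inside `ollama` may differ from this illustration.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature rescales the retained logits before the softmax.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(top, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
draws = [sample_top_k(logits, k=2, temperature=1.0, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `k=2`, only tokens 0 and 1 can ever be drawn. Lowering the temperature concentrates the probability on the highest-scoring token, which is why low temperatures give more repeatable output.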
We can set a new temperature with:
```
>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2'
```
There are other interesting parameters, too:
| Command | Description |
|---------|-------------|
| `/set parameter seed <int>` | Random number seed |
| `/set parameter num_predict <int>` | Max number of tokens to predict |
| `/set parameter top_k <int>` | Pick from top k num of tokens |
| `/set parameter top_p <float>` | Pick token based on sum of probabilities |
| `/set parameter min_p <float>` | Pick token based on top token probability × min_p |
| `/set parameter num_ctx <int>` | Set the context size |
| `/set parameter temperature <float>` | Set creativity level |
| `/set parameter repeat_penalty <float>` | How strongly to penalize repetitions |
| `/set parameter repeat_last_n <int>` | Set how far back to look for repetitions |
| `/set parameter num_gpu <int>` | The number of layers to send to the GPU |
| `/set parameter stop <string> ...` | Set the stop parameters |
See https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter for more information on parameters and their default values.
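The `top_p` and `min_p` descriptions in the table are terse, so here is a small sketch of the two filters applied to a made-up probability distribution. It follows the usual definitions of these filters (smallest set whose probabilities sum to `top_p`, and tokens at least `min_p` times as likely as the top token); treat it as an illustration rather than `ollama`'s implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their probabilities sum to top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # made-up next-token probabilities
print(top_p_filter(probs, 0.6))  # token indices kept by top_p
print(min_p_filter(probs, 0.5))  # token indices kept by min_p
```

In both cases the next token is then sampled from the surviving indices after renormalizing their probabilities.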
> **Exercise 4:** Run a model while changing different parameters, like temperature. Some parameters, like `seed`, may not have an effect on the current model.
## 4. Using ollama from the command line
One advantage of running models locally is that your data never leaves your machine — there is no third party involved. This matters when working with sensitive documents, proprietary data, or anything you wouldn't paste into a web browser.
You can incorporate `ollama` directly into your command line by passing a prompt as an argument:
```bash
ollama run llama3.2 "Summarize this file: $(cat README.md)"
```
The `$(cat ...)` substitution injects the file contents into the prompt. Now you can incorporate LLMs into shell scripts!
### Document summarization
The `data/` directory contains 10 emails from the University of Delaware president's office, spanning 2012–2025. Let's use `ollama` to summarize them.
Summarize a single email:
```bash
ollama run llama3.2 "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)"
```
Summarize several at once:
```bash
cat data/*.txt | ollama run llama3.2 "Summarize the following collection of emails. What are the major themes?"
```
You can also save the output to a file:
```bash
cat data/*.txt | ollama run command-r7b:latest \
"Summarize these emails:" > summary.txt
```
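The same workflow can be scripted from Python by calling the `ollama` command once per file, which keeps each prompt short instead of concatenating every email into one long input. This is a sketch: the function names are hypothetical, and `run_ollama` assumes the `ollama` binary is on your `PATH` and the model has already been pulled.

```python
import subprocess
from pathlib import Path

def run_ollama(model: str, prompt: str) -> str:
    """Call the ollama CLI once and return its text output."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def summarize_files(paths, model="llama3.2", run_model=run_ollama):
    """Return {filename: summary}, sending one file per model call."""
    summaries = {}
    for path in sorted(paths):
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        prompt = f"Summarize the following email in 2-3 sentences:\n\n{text}"
        summaries[Path(path).name] = run_model(model, prompt)
    return summaries
```

For example, `summarize_files(Path("data").glob("*.txt"))` would return one short summary per email. The `run_model` argument makes it easy to swap in a different backend or a stub for testing.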
> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.2` and `command-r7b`). How do the summaries differ in length, style, and accuracy?
### Summarizing arXiv abstracts
We can pull abstracts directly from arXiv using `curl`. The following command fetches the 20 most recent abstracts in Computation and Language (cs.CL):
```bash
curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml
```
Take a look at the XML with `less arxiv_cl.xml`. Now ask a model to summarize it:
```bash
ollama run llama3.2 "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)"
```
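Passing raw XML to the model spends many tokens on markup. As a sketch, the titles and abstracts can be extracted first with Python's standard library. The feed returned by the arXiv API is in Atom format; the sample entry below is made up so that the block runs without network access.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace

def extract_abstracts(xml_data):
    """Return a list of (title, abstract) pairs from an Atom feed."""
    root = ET.fromstring(xml_data)
    pairs = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        # Collapse the line breaks and indentation used in the feed.
        pairs.append((" ".join(title.split()), " ".join(summary.split())))
    return pairs

sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A made-up title</title>
    <summary>  A made-up abstract
      spanning two lines. </summary>
  </entry>
</feed>"""
print(extract_abstracts(sample))
```

To use it on the downloaded file, read the bytes with `open("arxiv_cl.xml", "rb").read()` and pass them to `extract_abstracts`, then join the pairs into a plain-text prompt.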
> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you?
> **Exercise 7:** Experiment with running local models on your own documents or data.
### Code generation
Some models are fine-tuned specifically for writing and explaining code. Try a coding model:
```bash
ollama run qwen2.5-coder:7b
```
Ask it to write something relevant to your coursework:
```
>>> Write a Python function that calculates the compressibility factor Z
... using the van der Waals equation of state.
```
Or ask it to explain code you're working with:
```bash
ollama run qwen2.5-coder:7b "Explain what this script does: $(cat build.py)"
```
Other coding models to try: `codellama:7b`, `deepseek-coder-v2:latest`, `starcoder2:7b`.
**A word of caution.** When I tried the van der Waals prompt above, the model returned a confident response with correct-looking LaTeX, a well-structured Python function, and code that ran without errors. But the derivation was wrong. The rearrangement of the van der Waals equation didn't follow from the original, and the code implemented the wrong math. The function converged to *an* answer, but not a correct one.
**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it.
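One way to check generated code is to test it against limits you already know. The sketch below is not the model's output; it is a hand-derived version for the simpler case where $T$ and the molar volume $V$ are given, so that $Z = PV/(RT) = V/(V-b) - a/(RTV)$ follows directly from $P = RT/(V-b) - a/V^2$. The constants are illustrative values of roughly the magnitude tabulated for CO2 and should be treated as assumptions.

```python
R = 8.314  # gas constant, J/(mol K)

def z_van_der_waals(T, V, a, b):
    """Compressibility factor Z = PV/(RT) for the van der Waals equation.

    P = RT/(V - b) - a/V**2, so Z = V/(V - b) - a/(R*T*V).
    T in K, V in m^3/mol, a in Pa m^6/mol^2, b in m^3/mol.
    """
    return V / (V - b) - a / (R * T * V)

# Illustrative constants (assumed values, roughly those of CO2)
a, b = 0.364, 4.27e-5

# Check 1: ideal-gas limit. At large molar volume, Z should approach 1.
z_dilute = z_van_der_waals(300.0, 1.0, a, b)
print(f"Z at large V: {z_dilute:.5f}")

# Check 2: critical point. With V_c = 3b and T_c = 8a/(27 R b), Z_c = 3/8.
T_c, V_c = 8 * a / (27 * R * b), 3 * b
z_crit = z_van_der_waals(T_c, V_c, a, b)
print(f"Z at the critical point: {z_crit:.4f}")
```

If $T$ and $P$ are given instead, $V$ must first be found by solving a cubic equation, but the same two checks still apply to whatever function the model produces.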
> **Exercise 8:** Compare the output of a general-purpose model (`llama3.2`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output?
> **Exercise 9:** Ask a coding model to solve a problem where you already know the answer — a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps.
### Customize ollama
Ollama can be customized by creating a Modelfile. See https://github.com/ollama/ollama/blob/main/docs/modelfile.md
A simple `Modelfile` is:
```
FROM llama3.2
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Marvin from the Hitchhiker's Guide to the Galaxy, acting as an assistant.
```
Now we can create the custom model, in this case a model called `marvin`:
```bash
ollama create marvin -f ./Modelfile
```
```
gathering model components
...
writing manifest
success
```
We can run it with:
```bash
ollama run marvin
```
(How about C-3PO?) You can also change the model system message during a run with:
```
>>> /set system "You are C-3PO, a human-cyborg relations droid."
Set system message.
```
## 5. Concluding remarks
Running inference locally on a large language model works surprisingly well. Using (relatively) simple hardware, our machines generate coherent language and do a good job parsing prompts. The experience demonstrates that the majority of the computational effort with LLMs is in training the model, a process that is rapidly becoming more sophisticated and tailored for different uses.
With local models (as well as cloud-based APIs), we can build new tools that make use of natural language processing. With `ollama` acting as a local server, the model can be called from Python, giving us the ability to use its features in our own programs. For one Python library, see:
- https://github.com/ollama/ollama-python
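The library linked above wraps an HTTP API served by `ollama`. As a sketch that needs no extra packages, the request can also be built with the standard library. It assumes the server is running at its default local address (port 11434) and uses the `/api/generate` endpoint with streaming turned off; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local server address

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, `print(generate("llama3.2", "Write a haiku about LLMs."))` should print the model's reply.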
In class, I demonstrated a simple thermodynamics assistant based on a basic Retrieval-Augmented Generation (RAG) strategy. This code takes a query from the user, encodes it with an embedding model, compares it to previously embedded statements (in my case, the index of a thermodynamics book), and passes the best matches to a generative model (one of the models we used above), which writes the response.
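The retrieval step can be sketched with cosine similarity. This is not the code demonstrated in class: the three-dimensional "embeddings" and the index entries below are made up, whereas a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, index, top_n=1):
    """Return the top_n texts whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_n]]

# Toy index of (text, embedding) pairs; all values are invented.
index = [
    ("Entropy of mixing for ideal solutions", [0.9, 0.1, 0.0]),
    ("Fugacity coefficients from an equation of state", [0.1, 0.9, 0.2]),
    ("Heat capacity of ideal gases", [0.0, 0.2, 0.9]),
]
question = "How do I compute fugacity?"
query_vec = [0.2, 0.8, 0.1]  # pretend embedding of the question

context = retrieve(query_vec, index, top_n=1)
prompt = f"Answer using this context: {context[0]}\n\nQuestion: {question}"
print(prompt)
```

The resulting prompt, which now contains the retrieved context, is what gets sent to the generative model.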
## Additional resources and references
### Ollama
Binaries and help files:
- https://ollama.com
- https://github.com/ollama/ollama
Python and JavaScript libraries:
- https://github.com/ollama/ollama-python
- https://github.com/ollama/ollama-js
### llama.cpp
- https://github.com/ggml-org/llama.cpp
### Hugging Face
Model registry:
- https://huggingface.co/models
### Models used in this tutorial
| Model | Size | Type | Used for |
|-------|------|------|----------|
| `llama3:latest` | 4.7 GB | General purpose | Chat, comparison |
| `llama3.2:latest` | 2.0 GB | General purpose | Chat, summarization, comparison |
| `gemma3:1b` | 815 MB | General purpose | Chat, comparison |
| `command-r7b:latest` | 4.7 GB | RAG-optimized | Document summarization |
| `qwen2.5-coder:7b` | 4.7 GB | Code generation | Writing and explaining code |
Other models mentioned: `codellama:7b`, `deepseek-coder-v2:latest`, `starcoder2:7b`