# Large Language Models Part II: Running Local Models with Ollama

**CHEG 667-013 — Chemical Engineering with Computers**

Department of Chemical and Biomolecular Engineering, University of Delaware

---

## Key idea

Learn how to run LLMs locally without a cloud-based API.

## Key goals

- Learn about `ollama` and `llama.cpp`
- Run LLMs locally on a laptop or desktop computer
- Integrate local models with the command line to build simple workflows and scripts

---

Our work with LLMs so far has focused on `nanoGPT`, a Python-based code that can train and run inference on a simple GPT implementation. In this handout, we will explore something in between `nanoGPT` and API-based models like ChatGPT. Specifically, we will try `ollama`, a local runtime environment and model manager designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to build and experiment with LLMs but don't want to rely on cloud-based APIs. (An API — Application Programming Interface — is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.)

`Ollama` is written in Go, and `llama.cpp` is a C++ library for running LLMs. Both are cross-platform and run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level, with more control over loading models, quantization, memory usage, batching, and token streaming.

Both tools support the **GGUF** model format, which is designed for running models efficiently on CPUs and lower-end GPUs. GGUF is a versioned binary specification that embeds:

- Model weights (possibly quantized);
- Tokenizer configuration and vocabulary (remember, in `nanoGPT`, we used a character-level tokenization scheme);
- Metadata such as the author, model description, and training parameters;
- Special tokens like `<bos>`, `<eos>`, and `<unk>`.
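
To make the format a little more concrete, here is a minimal Python sketch (not part of the course materials) that reads the fixed header at the front of a GGUF file: a 4-byte magic string `GGUF`, a version number, the tensor count, and the metadata key/value count. The `model.gguf` path is a placeholder; everything after the header (the metadata and the tensors themselves) needs a full parser, such as the `gguf` package that ships with `llama.cpp`.

```python
# peek_gguf.py -- read the fixed-size header at the start of a GGUF file.
import struct

path = "model.gguf"  # placeholder: point at any .gguf file you have

with open(path, "rb") as f:
    magic = f.read(4)                             # should be b"GGUF"
    version, = struct.unpack("<I", f.read(4))     # uint32, little-endian
    n_tensors, = struct.unpack("<Q", f.read(8))   # uint64: number of weight tensors
    n_metadata, = struct.unpack("<Q", f.read(8))  # uint64: number of metadata key/value pairs

print(magic, version, n_tensors, n_metadata)
```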

Here, **quantization** refers to how model weights are stored. Instead of using full-precision 32-bit floating point numbers (`FP32`), a model may store its weights at lower precision: half precision (`FP16`), 8-bit integers (`INT8`), or even 4-bit values (`Q4_0`). Lower-precision representations save memory and can speed up inference. Speed and accuracy are balanced through the choice of quantization and the size of the embedding vector.
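
The memory savings are easy to estimate. Here is a back-of-envelope sketch in Python, assuming an 8-billion-parameter model and counting the weights only (ignoring the KV cache and other runtime overhead):

```python
# Rough memory footprint of the weights alone at different precisions.
n_params = 8e9  # assume an 8-billion-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("Q4", 4)]:
    gigabytes = n_params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB")

# prints roughly 32, 16, 8, and 4 GB, consistent with the ~5 GB
# llama3.1 8B download mentioned later in this handout.
```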

Let's get started! We will download `ollama` and run a few models in this tutorial.
## 1. Download ollama

`Ollama` is available on GitHub (including the source code) or from the Ollama website as a binary. I downloaded `Ollama-darwin.zip`, which unzipped to a binary file, `Ollama`.

- https://ollama.com
- https://github.com/ollama/ollama
## 2. Running ollama

After downloading and installing, we can use the help option:

```
$ ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
```

We are mostly interested in the commands `pull`, `run`, and `stop` for now. But before we run anything, we have to download a model.
### Getting model files

`Ollama` is like the `model.py` program we used with `nanoGPT`. In those earlier experiments, we needed a *model file* with weights and tokenization (at a minimum); remember, we built one from scratch using the character tokenization scheme and `train.py`. The power of `ollama` and `llama.cpp` comes from their ability to run much larger models like `llama`, `gemma`, `deepseek`, `phi`, and `mistral`. These are trained on enormous datasets with a substantial amount of supervised fine-tuning, and they are far more powerful than even the GPT-2 implemented in `nanoGPT`. The `llama3.1` 8B model (8 billion parameters) is about 5 GB and runs easily on your computer, but it took about 1.5 million GPU hours to train. (It also helps that `ollama` and `llama.cpp` are compiled into binaries, not Python scripts.)

The model files are available at:

- https://ollama.com/search
- https://ollama.com/library

> **Exercise 1:** Go to https://ollama.com/library and look through different models. Sort by most popular and by newest.

Other sources of models include Hugging Face:

- https://huggingface.co/models

There are so many models! The LLM ecosystem is growing rapidly, with many use cases steering models toward different specialized tasks.

There are a few ways to download a model from different registries. Running `ollama` with the `run` command and a model name will download the model if a local copy isn't available (we will do this in the next section). You can also `pull` a model without running it.
### Launch ollama from the command line

Now let's download and run a `llama` model. (You can download a model without running it with, for example, `ollama pull llama3:latest`. On Linux and macOS, models are stored in `~/.ollama`.)

```bash
ollama run llama3:latest
```

This should pull it from the registry and store it locally on the machine. After downloading the files, you should see:

```
>>> Send a message (/? for help)
```

There you go! The model will interact with you just like the chatbots we use in cloud-based services, but all of the inference is being computed on your own computer. Try using `Task Manager` in Windows (press Ctrl+Shift+Esc) or `Activity Monitor` in macOS to check your GPU usage while you run the models.

> **Exercise 2:** Compare the speed and output of the following models:
> 1. `llama3:latest`
> 2. `llama3.2:latest`
> 3. `gemma3:1b`
>
> Experiment with other models.

Here's an interaction with the `gemma3` model:

```
$ ollama run gemma3:1b
>>> In class, we used nanoGPT to generate fake Shakespeare based on a
... character-level tokenization and simple GPT implementation.
Okay, that's a really interesting and somewhat fascinating project!
NanoGPT's approach -- generating Shakespearean text from character-level
tokens and a simple GPT -- is a compelling way to explore the creative
potential of AI in a specific, constrained context. Let's break down
what this suggests and where it might lead.

Here's a breakdown of what's happening, what you might be aiming for,
and some potential avenues to explore:
...
```
### Quitting ollama

Type `/bye` or Ctrl-D when you want to quit the CLI. After some idle time, `ollama` will unload the models to save memory.
## 3. More commands

You can see what models are currently running with:

```bash
ollama ps
```

You can easily see which models are locally accessible with:

```bash
ollama list
```

```
NAME               ID              SIZE      MODIFIED
gemma3:1b          8648f39daa8f    815 MB    About an hour ago
llama3:latest      365c0bd3c000    4.7 GB    3 months ago
llama3.2:latest    a80c4f17acd5    2.0 GB    3 months ago
```

At any time during a chat, you can reset the model with `/clear`, and you can learn more about a model with `/show info`. For instance:

```
>>> /show info
  Model
    architecture        gemma3
    parameters          999.89M
    context length      32768
    embedding length    1152
    quantization        Q4_K_M

  Capabilities
    completion

  Parameters
    stop           "<end_of_turn>"
    temperature    1
    top_k          64
    top_p          0.95

  License
    Gemma Terms of Use
    Last modified: February 21, 2024
```

We can see that the `gemma3` model has nearly one billion parameters and a context length of 32,768 tokens! The *embedding length* is 1152. This is the equivalent of `n_embd` in `nanoGPT`: the dimension of the embedding vector space.

Above, we also see that the quantization is only four bits, but it is a little more complicated than representing numbers with just sixteen values. The `K` and `M` refer to optimizations. `K` is the "K-block" quantization method, a groupwise scheme in which weights are grouped into blocks (e.g., 32 or 64 values) and each block gets its own scale and offset for better accuracy. `M` denotes the "medium" variant of `Q4_K` (there are also small and large variants), which controls how many of the model's tensors are kept at higher precision. `Q4_K` is a common choice for quantization when running 7B–70B models on laptop or desktop computers. (That's many orders of magnitude more parameters than our first `nanoGPT` model!)
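
To see what groupwise quantization does, here is a small Python sketch (an illustration of the idea, not the actual `Q4_K` algorithm, which packs bits and chooses scales more carefully) that quantizes one block of weights to 4-bit integers with a single scale and offset, then reconstructs them:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=64)   # one block of 64 weights

# 4 bits -> integer levels 0..15, with one scale and offset per block
lo, hi = weights.min(), weights.max()
scale = (hi - lo) / 15
q = np.round((weights - lo) / scale).astype(np.uint8)   # quantized block (values 0..15)

reconstructed = q * scale + lo                          # dequantize
print("max error:", np.abs(weights - reconstructed).max())
print("stored as:", q.dtype, "plus one float scale and offset per block")
```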

With the `/set verbose` command, you can monitor the model performance:

```
>>> /set verbose
Set 'verbose' mode.
>>> Let's write a haiku about LLMs.
Words flow, bright and new,
Code learns to speak and dream,
Future's voice takes hold.

total duration:       1.369726166s
load duration:        932.161625ms
prompt eval count:    20 token(s)
prompt eval duration: 162.531958ms
prompt eval rate:     123.05 tokens/s
eval count:           24 token(s)
eval duration:        273.27225ms
eval rate:            87.82 tokens/s
```

It looks like that exchange took a total of about 1.4 seconds with the `gemma3` model. The biggest cost was loading the model (about 0.9 s); once a model is loaded, later exchanges skip that step and run faster. Turn off verbose mode with `/set quiet`:

```
>>> /set quiet
Set 'quiet' mode.
```

> **Exercise 3:** Try different commands in `ollama` as you run a model.
### Model parameters

We can see a few model parameters, including the temperature and `top_k`, the number of tokens, ranked by logit score, that are retained as candidates for the next token. The retained scores are renormalized into a probability distribution, and the next token is sampled from this reduced set.

```
>>> /show parameters
Model defined parameters:
temperature    1
top_k          64
top_p          0.95
stop           "<end_of_turn>"
```
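
To make temperature and `top_k` concrete, here is a small Python sketch of one way a next token can be sampled from raw logit scores (a simplified illustration, not `ollama`'s exact sampling code, which also applies `top_p`, `min_p`, and repetition penalties):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=64, rng=np.random.default_rng()):
    """Sample one token id from raw logit scores."""
    scaled = logits / temperature          # temperature < 1 sharpens, > 1 flattens
    top = np.argsort(scaled)[-top_k:]      # keep the top_k highest-scoring tokens
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()                   # renormalize over the reduced set
    return rng.choice(top, p=probs)

logits = np.random.default_rng(0).normal(size=32_000)  # pretend vocabulary of 32,000 tokens
print(sample_next_token(logits, temperature=0.2, top_k=64))
```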

We can set a new temperature with:

```
>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2'
```

There are other interesting parameters, too:

| Command | Description |
|---------|-------------|
| `/set parameter seed <int>` | Random number seed |
| `/set parameter num_predict <int>` | Max number of tokens to predict |
| `/set parameter top_k <int>` | Sample only from the top k highest-scoring tokens |
| `/set parameter top_p <float>` | Sample from the smallest set of tokens whose cumulative probability exceeds p |
| `/set parameter min_p <float>` | Discard tokens whose probability is below min_p times the top token's probability |
| `/set parameter num_ctx <int>` | Set the context size |
| `/set parameter temperature <float>` | Set creativity level |
| `/set parameter repeat_penalty <float>` | How strongly to penalize repetitions |
| `/set parameter repeat_last_n <int>` | Set how far back to look for repetitions |
| `/set parameter num_gpu <int>` | The number of layers to send to the GPU |
| `/set parameter stop <string> ...` | Set the stop sequences |

See https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter for more information on parameters and their default values.

> **Exercise 4:** Run a model while changing different parameters, like temperature. Some parameters, like `seed`, may not have an effect on the current model.
## 4. Using ollama from the command line

One advantage of running models locally is that your data never leaves your machine — there is no third party involved. This matters when working with sensitive documents, proprietary data, or anything you wouldn't paste into a web browser.

You can incorporate `ollama` directly into your command line by passing a prompt as an argument:

```bash
ollama run llama3.2 "Summarize this file: $(cat README.md)"
```

The `$(cat ...)` substitution injects the file contents into the prompt. Now you can incorporate LLMs into shell scripts!
### Document summarization

The `data/` directory contains 10 emails from the University of Delaware president's office, spanning 2012–2025. Let's use `ollama` to summarize them.

Summarize a single email:

```bash
ollama run llama3.2 "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)"
```

Summarize several at once:

```bash
cat data/*.txt | ollama run llama3.2 "Summarize the following collection of emails. What are the major themes?"
```

You can also save the output to a file:

```bash
cat data/*.txt | ollama run command-r7b:latest \
    "Summarize these emails:" > summary.txt
```
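
You can also drive the same workflow from Python using the `ollama-python` package mentioned at the end of this handout (a sketch, assuming the package is installed with `pip install ollama` and that `llama3.2` has already been pulled; check the package documentation if the call signature has changed):

```python
# summarize_emails.py -- summarize each email in data/ with a local model.
from pathlib import Path
import ollama

for path in sorted(Path("data").glob("*.txt")):
    text = path.read_text()
    result = ollama.generate(
        model="llama3.2",
        prompt=f"Summarize the following email in 2-3 sentences:\n\n{text}",
    )
    print(f"--- {path.name} ---")
    print(result["response"])
```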

> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.2` and `command-r7b`). How do the summaries differ in length, style, and accuracy?
### Summarizing arXiv abstracts

We can pull abstracts directly from arXiv using `curl`. The following command fetches the 20 most recent abstracts in Computation and Language (cs.CL):

```bash
curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml
```

Take a look at the XML with `less arxiv_cl.xml`. Now ask a model to summarize it:

```bash
ollama run llama3.2 "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)"
```
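
Feeding raw XML to the model works, but it wastes context on markup. A small Python script can extract just the titles and abstracts first (a sketch using only the standard library; the feed is Atom XML, so the entries live in the `http://www.w3.org/2005/Atom` namespace):

```python
# arxiv_titles.py -- pull titles and abstracts out of the arXiv Atom feed.
import xml.etree.ElementTree as ET

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.parse("arxiv_cl.xml").getroot()

for entry in root.findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text.strip()
    abstract = entry.find("atom:summary", ns).text.strip()
    print(f"TITLE: {title}\nABSTRACT: {abstract}\n")
```

You can then pipe the script's output into a model just as before, e.g., `python arxiv_titles.py | ollama run llama3.2 "Summarize the major research themes."`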

> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you?

> **Exercise 7:** Experiment with running local models on your own documents or data.
### Code generation

Some models are fine-tuned specifically for writing and explaining code. Try a coding model:

```bash
ollama run qwen2.5-coder:7b
```

Ask it to write something relevant to your coursework:

```
>>> Write a Python function that calculates the compressibility factor Z
... using the van der Waals equation of state.
```

Or ask it to explain code you're working with:

```bash
ollama run qwen2.5-coder:7b "Explain what this script does: $(cat build.py)"
```

Other coding models to try: `codellama:7b`, `deepseek-coder-v2:latest`, `starcoder2:7b`.

**A word of caution.** When I tried the van der Waals prompt above, the model returned a confident response with correct-looking LaTeX, a well-structured Python function, and code that ran without errors. But the derivation was wrong. The rearrangement of the van der Waals equation didn't follow from the original, and the code implemented the wrong math. The function converged to *an* answer, but not a correct one.

**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it.
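
One way to build that check is to code the solution yourself directly from the equation of state. For the prompt above, the van der Waals equation can be rearranged into a cubic in the compressibility factor, $Z^3 - (1 + B)Z^2 + AZ - AB = 0$, with $A = aP/(RT)^2$ and $B = bP/(RT)$. The sketch below solves that cubic with `numpy`; it is a reference check of my own, not the model's output, and the CO2 critical constants are used only as an example.

```python
import numpy as np

R = 8.314  # J/(mol K)

def z_vdw(T, P, Tc, Pc):
    """Compressibility factor from the van der Waals EOS (largest real root = vapor-like)."""
    a = 27 * R**2 * Tc**2 / (64 * Pc)
    b = R * Tc / (8 * Pc)
    A = a * P / (R * T) ** 2
    B = b * P / (R * T)
    roots = np.roots([1, -(1 + B), A, -A * B])     # Z^3 - (1+B)Z^2 + AZ - AB = 0
    real = roots.real[abs(roots.imag) < 1e-9]      # keep only the real roots
    return real.max()

# CO2 at 320 K and 10 bar, using Tc = 304.1 K and Pc = 73.8e5 Pa
print(z_vdw(T=320.0, P=10e5, Tc=304.1, Pc=73.8e5))  # ~0.96, slightly below 1 as expected
```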

> **Exercise 8:** Compare the output of a general-purpose model (`llama3.2`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output?

> **Exercise 9:** Ask a coding model to solve a problem where you already know the answer — a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps.
### Customize ollama

Ollama can be customized by creating a `Modelfile`. See https://github.com/ollama/ollama/blob/main/docs/modelfile.md

A simple `Modelfile` is:

```
FROM llama3.2

# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Marvin from the Hitchhiker's Guide to the Galaxy, acting as an assistant.
```

Now we can create the custom model, in this case a model called `marvin`:

```bash
ollama create marvin -f ./Modelfile
```

```
gathering model components
...
writing manifest
success
```

We can run it with:

```bash
ollama run marvin
```

(How about C-3PO?) You can also change the model system message during a run with:

```
>>> /set system "You are C-3PO, a human-cyborg relations droid."
Set system message.
```
## 5. Concluding remarks

Running inference locally on a large language model works surprisingly well. Using (relatively) simple hardware, our machines generate coherent language and do a good job of parsing prompts. The experience demonstrates that the majority of the computational effort with LLMs goes into training the model, a process that is rapidly becoming more sophisticated and tailored to different uses.

With local models (as well as cloud-based APIs), we can build new tools that make use of natural language processing. With `ollama` acting as a local server, the model can be driven from Python, giving us the ability to use its features in our own programs. For one Python library, see:

- https://github.com/ollama/ollama-python
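
A minimal chat call from Python looks like this (a sketch, assuming the package is installed and the `ollama` server is running; see the repository above for the current API):

```python
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "In one sentence, what does an equation of state describe?"},
    ],
)
print(response["message"]["content"])
```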

In class, I demonstrated a simple thermodynamics assistant based on a Retrieval-Augmented Generation (RAG) strategy. The code takes a query from the user, encodes it with an embedding model, compares it to previously embedded passages (in my case, the index of a thermodynamics book), and then generates a response with a decoder-style GPT (one of the models we used above), using the retrieved passages as context.
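
The retrieval step in that demonstration boils down to a cosine-similarity search over embedding vectors. Here is a stripped-down sketch of the idea (my own illustration, not the class code; it assumes the `ollama` package and an embedding model such as `nomic-embed-text` have been pulled, and the exact embedding call may differ between library versions):

```python
import numpy as np
import ollama

passages = [
    "The compressibility factor Z measures deviation from ideal-gas behavior.",
    "Fugacity generalizes pressure for non-ideal systems.",
    "Raoult's law relates vapor- and liquid-phase mole fractions for ideal solutions.",
]

def embed(text):
    # assumed embedding call; returns one vector per passage or query
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

index = np.array([embed(p) for p in passages])   # embed the "knowledge base" once

query = "How do I quantify non-ideal gas behavior?"
q = embed(query)
scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))  # cosine similarity
best = passages[int(np.argmax(scores))]          # retrieve the closest passage

answer = ollama.generate(
    model="llama3.2",
    prompt=f"Use this context to answer the question.\nContext: {best}\nQuestion: {query}",
)
print(answer["response"])
```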
## Additional resources and references

### Ollama

Binaries and help files:

- https://ollama.com
- https://github.com/ollama/ollama

Python and JavaScript libraries:

- https://github.com/ollama/ollama-python
- https://github.com/ollama/ollama-js

### llama.cpp

- https://github.com/ggml-org/llama.cpp

### Hugging Face

Model registry:

- https://huggingface.co/models

### Models used in this tutorial

| Model | Size | Type | Used for |
|-------|------|------|----------|
| `llama3:latest` | 4.7 GB | General purpose | Chat, comparison |
| `llama3.2:latest` | 2.0 GB | General purpose | Chat, summarization, comparison |
| `gemma3:1b` | 815 MB | General purpose | Chat, comparison |
| `command-r7b:latest` | 4.7 GB | RAG-optimized | Document summarization |
| `qwen2.5-coder:7b` | 4.7 GB | Code generation | Writing and explaining code |

Other models mentioned: `codellama:7b`, `deepseek-coder-v2:latest`, `starcoder2:7b`