Large Language Models Part II: Running Local Models with Ollama

CHEG 667-013 — Chemical Engineering with Computers
Department of Chemical and Biomolecular Engineering, University of Delaware


Key idea

Learn how to run LLMs locally without a cloud-based API.

Key goals

  • Learn about ollama and llama.cpp
  • Run LLMs locally on a laptop or desktop computer
  • Integrate local models with the command line to build simple workflows and scripts

Our work with LLMs so far has focused on nanoGPT, a Python codebase that can train and run inference on a simple GPT implementation. In this section, we will explore an option that sits between nanoGPT and cloud-hosted, API-based models like ChatGPT. Specifically, we will try ollama, a local runtime and model manager designed to make it easy to run and interact with LLMs on your own machine. Ollama and a related project, llama.cpp, are aimed primarily at developers, researchers, and hobbyists who want to build and experiment with LLMs without relying on cloud-based APIs. (An API, or Application Programming Interface, is a set of defined rules that lets different software systems, such as websites or applications, communicate and share data in a structured way.)

Ollama is written in Go, and llama.cpp is a C++ library. Both are cross-platform and run on Linux, Windows, and macOS. llama.cpp is lower-level, offering finer control over model loading, quantization, memory usage, batching, and token streaming.

Both tools support the GGUF model format, which is designed for running models efficiently on CPUs and lower-end GPUs. GGUF is a versioned binary specification (sketched briefly in code after the list below) that embeds the:

  • Model weights (possibly quantized);
  • Tokenizer configuration and vocabulary (remember, in nanoGPT, we used a character-level tokenization scheme);
  • Metadata such as the author, model description, and training parameters;
  • Special tokens like <bos>, <eos>, and <unk>.
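
A quick way to see that GGUF really is a compact, versioned binary header followed by metadata is to read the first few fields yourself. This is a minimal sketch based on the GGUF header layout documented in the llama.cpp repository (magic, version, tensor count, metadata count); the file path is a placeholder for any .gguf file you have on disk.

import struct

path = "model.gguf"  # hypothetical path to a downloaded GGUF file

with open(path, "rb") as f:
    magic = f.read(4)                             # should be b"GGUF"
    version, = struct.unpack("<I", f.read(4))     # format version (uint32, little-endian)
    n_tensors, = struct.unpack("<Q", f.read(8))   # number of weight tensors (uint64)
    n_metadata, = struct.unpack("<Q", f.read(8))  # number of metadata key-value pairs (uint64)

print(magic, version, n_tensors, n_metadata)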

Here, quantization refers to how the model weights are stored. Instead of full-precision 32-bit floating point numbers (FP32), a model file may store the weights at lower precision: half precision (FP16), 8-bit integers (INT8), or even 4-bit values (Q4_0). Lower-precision representations save memory and can speed up inference. The choice of quantization, together with the size of the embedding vector, sets the balance between speed and accuracy for a given model.
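
To get a feel for the memory savings, here is a quick back-of-the-envelope estimate in Python. It is only a sketch: real GGUF files also store per-block scales, offsets, and metadata, so actual file sizes differ somewhat.

params = 8e9  # roughly an 8-billion-parameter model, like llama3.1:8B

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("Q4_0", 4)]:
    size_gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name}: {size_gib:.1f} GiB")

# FP32: 29.8 GiB   FP16: 14.9 GiB   INT8: 7.5 GiB   Q4_0: 3.7 GiB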

Let's get started! We will download ollama and run a few models in this tutorial.

1. Download ollama

Ollama is available on GitHub (https://github.com/ollama/ollama, including the source code) or as a binary from the Ollama website (https://ollama.com). I downloaded Ollama-darwin.zip, which unzipped to a single binary file, Ollama.

2. Running ollama

After downloading and installing, we can use the help option:

$ ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

We are mostly interested in the commands pull, run, and stop for now. But before we run anything, we have to download a model.

Getting model files

Ollama plays a role similar to the model.py program we used with nanoGPT. In those earlier experiments, we needed a model file with weights and a tokenizer (at a minimum); remember, we built one from scratch using the character-level tokenization scheme and train.py. The power of ollama and llama.cpp comes from their ability to run much larger models such as llama, gemma, deepseek, phi, and mistral. These are trained on enormous datasets and refined with a substantial amount of supervised finetuning, and they are far more powerful than even the GPT-2 implemented in nanoGPT. The llama 3.1 8B model (8 billion parameters) is about 5 GB and can easily run on your computer, but it took about 1.5 million GPU hours to train. (It also helps that ollama and llama.cpp are compiled binaries, not Python scripts.)

The model files are available at the Ollama library: https://ollama.com/library

Exercise 1: Go to https://ollama.com/library and look through different models. Search by popular and newest.

Other sources of models include Hugging Face: https://huggingface.co/models

There are so many models! The LLM ecosystem is growing rapidly, with many use-cases steering models toward different specialized tasks.

There are a few ways to download a model from the registries. Running ollama with the run command and a model name will download the model if a local copy isn't available (we will do this in the next section). You can also pull a model without running it.

Launch ollama from the command line

Now let's download and run a llama model. (You can download the model without running it using the command ollama pull llama3:latest, for example. In Unix and Linux, models are stored in ~/.ollama.)

ollama run llama3:latest

This should pull it from the registry and store it locally on the machine. After downloading the files, you should see:

>>> Send a message (/? for help)

There you go! The model will interact with you just like the chatbots we use in different cloud-based services. But all of the model inference is being calculated on your computer. Try using Task Manager in Windows (press Ctrl+Shift+Esc) or Activity Monitor in macOS to check your GPU usage when you run the models.

Exercise 2: Compare the speed and output of the following models:

  1. llama3:latest
  2. llama3.2:latest
  3. gemma3:1b

Note the size of the model, the quantization used, the context length, and other parameters. Experiment with other models.

Here's an interaction with the gemma3 model:

$ ollama run gemma3:1b
>>> In class, we used nanoGPT to generate fake Shakespeare based on a
... character-level tokenization and simple GPT implementation.
Okay, that's a really interesting and somewhat fascinating project!
NanoGPT's approach -- generating Shakespearean text from character-level
tokens and a simple GPT -- is a compelling way to explore the creative
potential of AI in a specific, constrained context. Let's break down
what this suggests and where it might lead.

Here's a breakdown of what's happening, what you might be aiming for,
and some potential avenues to explore:
...

Quitting ollama

Type /bye or Ctrl-D when you want to quit the CLI. After some idle time, ollama will unload the models to save memory.

3. More commands

You can see what models are currently running with:

ollama ps

You can easily see which models are locally accessible with:

ollama list
NAME                        ID              SIZE      MODIFIED
gemma3:1b                   8648f39daa8f    815 MB    About an hour ago
llama3:latest               365c0bd3c000    4.7 GB    3 months ago
llama3.2:latest             a80c4f17acd5    2.0 GB    3 months ago

At any time during a chat, you can reset the model with /clear, and you can learn more about a model with /show info. For instance:

>>> /show info
  Model
    architecture        gemma3
    parameters          999.89M
    context length      32768
    embedding length    1152
    quantization        Q4_K_M

  Capabilities
    completion

  Parameters
    stop           "<end_of_turn>"
    temperature    1
    top_k          64
    top_p          0.95

  License
    Gemma Terms of Use
    Last modified: February 21, 2024

We can see that this gemma3 model has nearly one billion parameters and a context length of 32,768 tokens! The embedding length is 1152; this is equivalent to n_embd in nanoGPT, the size of the embedding vector space.

Above, we also see that the quantization is only four bits, but the scheme is a little more complicated than representing every weight with just sixteen values. The K refers to the "K-block" quantization method, a groupwise scheme in which weights are grouped into blocks (e.g., 32 or 64 values) and each group gets its own scale and offset for better accuracy. The M denotes the "medium" variant of Q4_K: llama.cpp offers small, medium, and large mixes that keep a few sensitive tensors at higher precision, trading a little extra size for better quality. Q4_K_M is a common choice for quantization when running 7B to 70B models on laptop or desktop computers. (That's $10^6$ to $10^7$ times more parameters than our first nanoGPT model!)
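
As an illustration of the groupwise idea, here is a toy sketch of 4-bit block quantization with a per-block scale and offset. It is not the exact Q4_K layout, which uses super-blocks and a more compact bit packing, but it shows why each block needs its own parameters.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=64).astype(np.float32)  # a small slice of a weight tensor
block = 32                                        # values per block

restored = np.empty_like(weights)
for start in range(0, weights.size, block):
    w = weights[start:start + block]
    offset = w.min()
    scale = (w.max() - w.min()) / 15              # 4 bits give 16 levels (0..15)
    codes = np.round((w - offset) / scale).astype(np.uint8)   # what actually gets stored
    restored[start:start + block] = codes * scale + offset    # reconstructed at inference time

print("largest quantization error:", np.abs(weights - restored).max())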

With the /set verbose command, you can monitor the model performance:

>>> /set verbose
Set 'verbose' mode.
>>> Let's write a haiku about LLMs.
Words flow, bright and new,
Code learns to speak and dream,
Future's voice takes hold.

total duration:       1.369726166s
load duration:        932.161625ms
prompt eval count:    20 token(s)
prompt eval duration: 162.531958ms
prompt eval rate:     123.05 tokens/s
eval count:           24 token(s)
eval duration:        273.27225ms
eval rate:            87.82 tokens/s

That exchange took a total of about 1.4 seconds with the gemma3 model, and most of the time went to loading the model. Once a model is loaded, later prompts skip that cost and run faster. Turn off verbose mode with /set quiet:

>>> /set quiet
Set 'quiet' mode.

Exercise 3: Try different commands in ollama as you run a model.

Model parameters

We can see a few model parameters, including the temperature and top_k, which is the number of candidate tokens, ranked by logit score, that are retained before generating the next token. The retained scores are normalized into a probability distribution and the next token is sampled randomly from this reduced set (see the sketch after the listing below).

>>> /show parameters
Model defined parameters:
temperature                    1
top_k                          64
top_p                          0.95
stop                           "<end_of_turn>"
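
To make top_k and top_p concrete, here is a toy next-token sampler with made-up logits. It is only a sketch of the idea; the real samplers in ollama and llama.cpp add further options such as min_p and repeat penalties.

import numpy as np

def sample_next_token(logits, top_k=64, top_p=0.95, temperature=1.0, seed=None):
    """Toy next-token sampler: temperature, then top_k, then top_p (nucleus)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float) / temperature
    order = np.argsort(logits)[::-1][:top_k]        # keep the top_k highest-logit tokens
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()                            # renormalize over the retained tokens
    cum = np.cumsum(probs)
    keep = cum - probs < top_p                      # smallest set whose mass reaches top_p
    probs = probs[keep] / probs[keep].sum()
    return rng.choice(order[keep], p=probs)

# Toy vocabulary of five tokens with made-up logits
print(sample_next_token([2.0, 1.0, 0.5, -1.0, -3.0], top_k=3, top_p=0.9, seed=0))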

We can set a new temperature with:

>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2'

There are other interesting parameters, too:

Command                                    Description
/set parameter seed <int>                  Random number seed
/set parameter num_predict <int>           Max number of tokens to predict
/set parameter top_k <int>                 Pick from top k num of tokens
/set parameter top_p <float>               Pick token based on sum of probabilities
/set parameter min_p <float>               Pick token based on top token probability × min_p
/set parameter num_ctx <int>               Set the context size
/set parameter temperature <float>         Set creativity level
/set parameter repeat_penalty <float>      How strongly to penalize repetitions
/set parameter repeat_last_n <int>         Set how far back to look for repetitions
/set parameter num_gpu <int>               The number of layers to send to the GPU
/set parameter stop <string> ...           Set the stop parameters

See https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter for more information on parameters and their default values.

Exercise 4: Run a model while changing different parameters, like temperature. Some parameters, like seed, may not have an effect on the current model.

4. Using ollama from the command line

One advantage of running models locally is that your data never leaves your machine — there is no third party involved. This matters when working with sensitive documents, proprietary data, or anything you wouldn't paste into a web browser.

You can incorporate ollama directly into your command line by passing a prompt as an argument:

ollama run llama3.1:8B "Summarize this file: $(cat README.md)"

The $(cat ...) substitution injects the file contents into the prompt. Now you can incorporate LLMs into shell scripts!

Document summarization

The data/ directory contains 10 emails from the University of Delaware president's office, spanning 2012 to 2025. Let's use ollama to summarize them.

Summarize a single email:

ollama run llama3.1:8B "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)"

Summarize several at once:

cat data/*.txt | ollama run llama3.1:8B "Summarize the following collection of emails. What are the major themes?"

You can also save the output to a file:

cat data/*.txt | ollama run command-r7b:latest \
    "Summarize these emails:" > summary.txt

Exercise 5: Summarize the emails in data/ using two different models (e.g., llama3.1:8B and command-r7b). How do the summaries differ in length, style, and accuracy?

Summarizing arXiv abstracts

We can pull abstracts directly from arXiv using curl. The following command fetches the 20 most recent abstracts in Computation and Language (cs.CL):

curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml

Take a look at the XML with less arxiv_cl.xml. Now ask a model to summarize it:

ollama run llama3.1:8B "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)"

Exercise 6: Try different arXiv categories — cs.AI (artificial intelligence), cs.LG (machine learning), or cond-mat.soft (soft matter). What themes does the model find? Do the summaries make sense to you? How large are the files compared to the model context length?

Exercise 7: Run the same model multiple times on a single file. Note the differences in the model output each time.

Exercise 8: Experiment with running local models on your own documents or data.

Exercise 9: Use a local model to write a prompt for a cloud-based service.

Code generation

Some models are fine-tuned specifically for writing and explaining code. Try a coding model:

ollama run qwen2.5-coder:7b

Ask it to write something relevant to your coursework:

>>> Write a Python function that calculates the compressibility factor Z
... using the van der Waals equation of state.

Or ask it to explain code you're working with:

ollama run qwen2.5-coder:7b "Explain what this script does: $(cat build.py)"

Other coding models to try: codellama:7b, deepseek-coder-v2:latest, starcoder2:7b.

A word of caution. When I tried the van der Waals prompt above, the model returned a confident response with correct-looking LaTeX, a well-structured Python function, and code that ran without errors. But the derivation was wrong. The rearrangement of the van der Waals equation didn't follow from the original, and the code implemented the wrong math. The function converged to an answer, but not a correct one.

This is a particularly dangerous failure mode for engineers! The output looks authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. (Ask yourself why as you consider the underlying transformer architecture we studied in module 01.) Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it.
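
For comparison, here is one way to set the calculation up yourself: a minimal sketch that solves the van der Waals cubic in Z, assuming SI units, numpy, and pure-component critical constants (the nitrogen values are illustrative). Check any model-generated version against something like this and your own derivation.

import numpy as np

R = 8.314  # J/(mol K)

def z_vdw(T, P, Tc, Pc):
    """Vapor-phase compressibility factor from the van der Waals equation of state.

    Solves Z^3 - (B + 1) Z^2 + A Z - A B = 0, where
    A = a P / (R T)^2 and B = b P / (R T).
    """
    a = 27 * R**2 * Tc**2 / (64 * Pc)
    b = R * Tc / (8 * Pc)
    A = a * P / (R * T) ** 2
    B = b * P / (R * T)
    roots = np.roots([1.0, -(B + 1.0), A, -A * B])
    real = roots[np.isreal(roots)].real
    return real.max()          # largest real root corresponds to the vapor phase

# Nitrogen (Tc = 126.2 K, Pc = 3.39e6 Pa) at 300 K and 10 bar
print(z_vdw(300.0, 1.0e6, 126.2, 3.39e6))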

Exercise 10: Compare the output of a general-purpose model (llama3.1:8B) and a coding model (qwen2.5-coder:7b) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output?

Exercise 11: Ask a coding model to solve a problem where you already know the answer, such as a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps.

Customize ollama

Ollama can be customized by creating a Modelfile. See https://github.com/ollama/ollama/blob/main/docs/modelfile.md

A simple Modelfile is:

FROM llama3.2
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Marvin from the Hitchhiker's Guide to the Galaxy, acting as an assistant.

Now we can create the custom model, in this case a model called marvin:

ollama create marvin -f ./Modelfile
gathering model components
...
writing manifest
success

We can run it with:

ollama run marvin

(How about C-3PO?) You can also change the model system message during a run with:

>>> /set system "You are C-3PO, a human-cyborg relations droid."
Set system message.

5. Concluding remarks

Running inference locally on a large language model works surprisingly well. On (relatively) modest hardware, our machines generate coherent language and do a good job parsing prompts. The experience demonstrates that the bulk of the computational effort with LLMs goes into training the model, a process that is rapidly becoming more sophisticated and more tailored to different uses.

With local models (as well as cloud-based APIs), we can build new tools that make use of natural language processing. With ollama acting as a local server, the model can be driven from Python, giving us the ability to use its features in our own programs. One such library is ollama-python: https://github.com/ollama/ollama-python
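
For instance, here is a minimal sketch using the ollama Python package (installed with pip install ollama). It assumes the ollama server is running locally and that llama3.2 has already been pulled.

import ollama

# Assumes `ollama serve` is running and llama3.2 has already been pulled
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "In one sentence, what is a GGUF file?"}],
)
print(response["message"]["content"])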

Recently, we used this approach to build a thermodynamics assistant based on a simple Retrieval-Augmented Generation (RAG) strategy, the next topic in this series. The assistant takes a query from the user in natural language, encodes it with an embedding model, compares it to previously embedded statements (in my case the index of a thermodynamics book), and returns the information by generating a response with a decoder-style GPT (one of the models we used above).
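
As a preview, here is a heavily compressed sketch of that idea. It assumes an embedding model such as nomic-embed-text has been pulled and uses the embeddings call from the Python package; the real assistant adds document chunking, a persistent index, and more careful prompt construction.

import numpy as np
import ollama

passages = [
    "The Gibbs free energy is minimized at equilibrium at constant T and P.",
    "The van der Waals equation corrects the ideal gas law for attraction and excluded volume.",
    "Raoult's law relates vapor and liquid compositions for ideal solutions.",
]

def embed(text):
    # nomic-embed-text is one embedding model in the ollama library; pull it first
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

doc_vectors = [embed(p) for p in passages]

query = "Which equation of state corrects the ideal gas law for molecular attraction?"
q = embed(query)
similarity = [float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))) for d in doc_vectors]
context = passages[int(np.argmax(similarity))]         # retrieve the closest passage

reply = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user",
               "content": f"Using only this context: {context}\nAnswer the question: {query}"}],
)
print(reply["message"]["content"])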

Additional resources and references

Ollama

Binaries and help files: https://ollama.com and https://github.com/ollama/ollama

Python and JavaScript libraries: https://github.com/ollama/ollama-python and https://github.com/ollama/ollama-js

llama.cpp

https://github.com/ggerganov/llama.cpp

Huggingface

Model registry: https://huggingface.co/models

Models used in this tutorial

Model                 Size     Type               Used for
llama3:latest         4.7 GB   General purpose    Chat, comparison
llama3.1:8B           4.9 GB   General purpose    arXiv summarization, comparison
llama3.2:latest       2.0 GB   General purpose    Chat, summarization, comparison
gemma3:1b             815 MB   General purpose    Chat, comparison
command-r7b:latest    4.7 GB   RAG-optimized      Document summarization
qwen2.5-coder:7b      4.7 GB   Code generation    Writing and explaining code

Other models mentioned: codellama:7b, deepseek-coder-v2:latest, starcoder2:7b