From e10e411e4172f6f25b0b6929aa90bd7a9728c18c Mon Sep 17 00:00:00 2001 From: Eric Date: Wed, 1 Apr 2026 22:25:42 -0400 Subject: [PATCH] Update module docs: fix arXiv URL, uv setup, nanoGPT clone path - Use HTTPS for arXiv API (was returning 301 on HTTP) - Point module 01 preliminaries to root uv sync instead of separate venv - Clone nanoGPT into 01-nanogpt/ and add to .gitignore - Add llama3.1:8B to module 02 models table - Various editorial updates to modules 01 and 02 Co-Authored-By: Claude Opus 4.6 (1M context) --- .gitignore | 3 +++ 01-nanogpt/README.md | 33 ++++++++++++++++----------------- 02-ollama/README.md | 39 ++++++++++++++++++++++----------------- 3 files changed, 41 insertions(+), 34 deletions(-) diff --git a/.gitignore b/.gitignore index d62165c..f8ad3b8 100644 --- a/.gitignore +++ b/.gitignore @@ -33,6 +33,9 @@ store/ # Personal notes (not part of the workshop) *-notes.md +# Cloned repositories +01-nanogpt/nanoGPT/ + # Legacy directories (not part of the workshop) handouts/ class_demo/ diff --git a/01-nanogpt/README.md b/01-nanogpt/README.md index 057d027..0cc3968 100644 --- a/01-nanogpt/README.md +++ b/01-nanogpt/README.md @@ -18,30 +18,22 @@ We will study how Large Language Models (LLMs) work and discuss some of their us --- -Large Language Models (LLMs) have rapidly integrated into our daily lives. Our goal is to learn a bit about how LLMs work. As you have probably become well aware of throughout your studies, engineers often don't take technical solutions for granted. We generally like to "look under the hood" and see how a system, process, or tool does its job — and whether it is giving us accurate and useful solutions. The material we will cover is largely inspired by the rapid adoption of LLMs to help us solve problems in our engineering practice. +Large Language Models (LLMs) have rapidly become part of our lives. Our goal is to learn a bit about how LLMs work. 
As you have probably become well aware throughout your studies, engineers often don't take technical solutions for granted. We generally like to "look under the hood" and see how a system, process, or tool does its job — and whether it is giving us accurate and useful solutions. The material we will cover is largely inspired by the rapid adoption of LLMs to help us solve problems in our engineering practice. We will use a code repository published by Andrej Karpathy called nanoGPT. GPT stands for **G**enerative **P**re-trained **T**ransformer. A transformer is a neural network architecture designed to handle sequences of data using self-attention, which allows it to weigh the importance of different words in a context. The neural network's weights and biases are created beforehand using training and validation datasets (these constitute the training and fine-tuning steps, which often require considerable computational effort, depending on the model size). Generative refers to a model's ability to create new content, rather than just analyzing or classifying existing data. When we generate text, we are running an *inference* on the model. Inference requires much less computational effort. NanoGPT can replicate the function of the GPT-2 model. Building the model from scratch to that level of performance (which is far lower than the current models) would still require a significant investment in computational effort — Karpathy reports using eight NVIDIA A100 GPUs for four days on the task — or 768 GPU-hours. In this introduction, our aspirations will be far lower. We should be able to do simpler work with only a CPU. -Hoave you wondered why LLMs tend to use GPUs? The math underlying the transformer architecture is largely based on matrix calculations. Originally, GPUs were developed to quickly calculate matrix transformations associated with high-performance graphics applications. (It's all linear algebra!) 
These processors have since been adapted into general-purpose engines for the parallel computations used in modern AI algorithms. +Have you wondered why LLMs tend to use GPUs? If you dig deeper into the models, you will find that the math underlying the transformer architecture is largely based on matrix calculations. Originally, GPUs were developed to quickly calculate matrix transformations associated with high-performance graphics applications. (It's all linear algebra!) These processors have since been adapted into general-purpose engines for the parallel computations used in modern AI and machine learning algorithms. ## 1. Preliminaries -Dust off those command line skills! There will be no GUI where we're going. I recommend making a new directory (under WSL if you're using a Windows machine) and setting up a Python virtual environment: +Dust off those command-line skills! There will be no GUI where we're going. Set up the Python environment as described in the main [README](../README.md). If you haven't already: ```bash -python -m venv llm -source llm/bin/activate -``` - -You will need to install packages like `numpy` and `pytorch`. If you have [uv](https://docs.astral.sh/uv/) installed, you can use it instead: - -```bash -uv venv llm -source llm/bin/activate -uv pip install numpy torch +uv sync +source .venv/bin/activate ``` @@ -49,9 +41,10 @@ uv pip install numpy torch Karpathy's code is at https://github.com/karpathy/nanoGPT -Download the code using `git`. An alternative is to download a `zip` file from the Github page. (Look for the green `Code` button on the site. Clicking this, you will see `Download ZIP` in the dropdown menu.) +From the `01-nanogpt/` directory, download the code using `git`. An alternative is to download a `zip` file from the GitHub page. (Look for the green `Code` button on the site. Clicking this, you will see `Download ZIP` in the dropdown menu.) 
```bash +cd 01-nanogpt git clone https://github.com/karpathy/nanoGPT ``` @@ -59,16 +52,22 @@ You should now have a nanoGPT directory: ```bash $ ls -nanoGPT/ +README.md nanoGPT/ ``` ## 3. A quick tour -List the directory contents of `./nanoGPT`. You should see something like: +Change into the nanoGPT directory — the remaining commands in this module are run from here: + +```bash +cd nanoGPT +``` + +List the directory contents. You should see something like: ``` -$ ls -l nanoGPT +$ ls -l total 696 -rw-r--r-- 1 furst staff 1072 Apr 17 12:44 LICENSE -rw-r--r-- 1 furst staff 13576 Apr 17 12:44 README.md diff --git a/02-ollama/README.md b/02-ollama/README.md index f3ece37..2f553d4 100644 --- a/02-ollama/README.md +++ b/02-ollama/README.md @@ -17,9 +17,9 @@ Learn how to run LLMs locally without a cloud-based API. --- -Our work with LLMs so far focused on `nanoGPT`, a Python-based code that can train and run inference on a simple GPT implementation. In this handout, we will explore running something between it and API-based models like ChatGPT. Specifically, we will try `ollama`. This is a local runtime environment and model manager that is designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to access LLMs to build and experiment with but don't want to rely on cloud-based APIs. (An API — Application Programming Interface — is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.) +Our work with LLMs so far has focused on `nanoGPT`, a Python codebase that can train and run inference on a simple GPT implementation. In this section, we will explore running something between it and API-based models like ChatGPT. Specifically, we will try `ollama`. 
This is a local runtime environment and model manager that is designed to make it easy to run and interact with LLMs on your own machine. `Ollama` and another environment, `llama.cpp`, are programs primarily targeted at developers, researchers, and hobbyists who want to access LLMs to build and experiment with but don't want to rely on cloud-based APIs. (An API — Application Programming Interface — is a set of defined rules that enables different software systems, such as websites or applications, to communicate with each other and share data in a structured way.) -`Ollama` is written in Go and `llama.cpp` is a C++ library for running LLMs. Both are cross-platform and can be run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level with more control over loading models, quantization, memory usage, batching, and token streaming. +`Ollama` is written in the language *Go* and `llama.cpp` is a C++ library. Both are cross-platform and can be run on Linux, Windows, and macOS. `llama.cpp` is a bit lower-level with more control over loading models, quantization, memory usage, batching, and token streaming. Both tools support a **GGUF** model format. This is a format suitable for running models efficiently on CPUs and lower-end GPUs. GGUF is a versioned binary specification that embeds the: @@ -93,7 +93,7 @@ Other sources of models include Huggingface: There are so many models! The LLM ecosystem is growing rapidly, with many use-cases steering models toward different specialized tasks. -There are a few ways to download a model from different registries. Running `ollama` with the `run` command and a model file will download the model if a local version isn't available (we will do this in the next section). You can also `pull` a model without running it. +There are a few ways to download a model from the registries. 
Running `ollama` with the `run` command and a model name will download the model if a local version isn't available (we will do this in the next section). You can also `pull` a model without running it. ### Launch ollama from the command line @@ -116,7 +116,7 @@ There you go! The model will interact with you just like the chatbots we use in > 2. `llama3.2:latest` > 3. `gemma3:1b` > -> Experiment with other models. +> Note the size of the model, the quantization used, the context length, and other parameters. Experiment with other models. Here's an interaction with the gemma3 model: @@ -268,7 +268,7 @@ One advantage of running models locally is that your data never leaves your mach You can incorporate `ollama` directly into your command line by passing a prompt as an argument: ```bash -ollama run llama3.2 "Summarize this file: $(cat README.md)" +ollama run llama3.1:8B "Summarize this file: $(cat README.md)" ``` The `$(cat ...)` substitution injects the file contents into the prompt. Now you can incorporate LLMs into shell scripts! @@ -280,13 +280,13 @@ The `data/` directory contains 10 emails from the University of Delaware preside Summarize a single email: ```bash -ollama run llama3.2 "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)" +ollama run llama3.1:8B "Summarize the following email in 2-3 sentences: $(cat data/2020_03_29_141635.txt)" ``` Summarize several at once: ```bash -cat data/*.txt | ollama run llama3.2 "Summarize the following collection of emails. What are the major themes?" +cat data/*.txt | ollama run llama3.1:8B "Summarize the following collection of emails. What are the major themes?" ``` You can also save the output to a file: @@ -296,25 +296,29 @@ cat data/*.txt | ollama run command-r7b:latest \ "Summarize these emails:" > summary.txt -> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.2` and `command-r7b`). How do the summaries differ in length, style, and accuracy? 
+> **Exercise 5:** Summarize the emails in `data/` using two different models (e.g., `llama3.1:8B` and `command-r7b`). How do the summaries differ in length, style, and accuracy? ### Summarizing arXiv abstracts We can pull abstracts directly from arXiv using `curl`. The following command fetches the 20 most recent abstracts in Computation and Language (cs.CL): ```bash -curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml +curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending&max_results=20" > arxiv_cl.xml ``` Take a look at the XML with `less arxiv_cl.xml`. Now ask a model to summarize it: ```bash -ollama run llama3.2 "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)" +ollama run llama3.1:8B "Here are 20 recent arXiv abstracts in computational linguistics. Summarize the major research themes and trends: $(cat arxiv_cl.xml)" ``` -> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you? +> **Exercise 6:** Try different arXiv categories — `cs.AI` (artificial intelligence), `cs.LG` (machine learning), or `cond-mat.soft` (soft matter). What themes does the model find? Do the summaries make sense to you? How large are the files compared to the model context length? -> **Exercise 7:** Experiment with running local models on your own documents or data. +> **Exercise 7:** Run the same model multiple times on a single file. Note the differences in the model output each time. + +> **Exercise 8:** Experiment with running local models on your own documents or data. + +> **Exercise 9:** Use a local model to write a prompt for a cloud-based service. 
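The raw Atom XML is mostly markup, so much of the prompt budget goes to tags rather than content. One way to explore the context-length question in Exercise 6 is to strip the feed down to titles and abstracts before prompting the model. Here is a minimal sketch using only the Python standard library (the function name `extract_abstracts` is ours, not part of any tool above; the namespace is the standard Atom namespace that arXiv's API feed uses):

```python
import xml.etree.ElementTree as ET

# Namespace used by the arXiv API's Atom feed.
ATOM = "{http://www.w3.org/2005/Atom}"

def extract_abstracts(xml_text: str) -> list[str]:
    """Return 'title: summary' strings for each entry in an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="").strip()
        summary = entry.findtext(f"{ATOM}summary", default="").strip()
        results.append(f"{title}: {summary}")
    return results
```

Feeding the joined output of `extract_abstracts(open("arxiv_cl.xml").read())` to the model instead of the raw XML shrinks the prompt considerably and makes it easier to stay within the context window.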
### Code generation @@ -342,11 +346,11 @@ Other coding models to try: `codellama:7b`, `deepseek-coder-v2:latest`, `starcod **A word of caution.** When I tried the van der Waals prompt above, the model returned a confident response with correct-looking LaTeX, a well-structured Python function, and code that ran without errors. But the derivation was wrong. The rearrangement of the van der Waals equation didn't follow from the original, and the code implemented the wrong math. The function converged to *an* answer, but not a correct one. -**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it. +**This is a particularly dangerous failure mode for engineers!** The output *looks* authoritative, uses proper notation, and even runs. But the physics is wrong. LLMs are very good at producing plausible-looking text; they are not reliable at mathematical derivation. (Ask yourself why as you consider the underlying transformer architecture we studied in module 01.) Always verify generated code against your own understanding of the problem. If you can't check it, you shouldn't trust it. -> **Exercise 8:** Compare the output of a general-purpose model (`llama3.2`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output? +> **Exercise 10:** Compare the output of a general-purpose model (`llama3.1:8B`) and a coding model (`qwen2.5-coder:7b`) on the same coding task. Which produces better code? Which gives a better explanation? Can you find errors in either output? 
-> **Exercise 9:** Ask a coding model to solve a problem where you already know the answer — a homework problem you've already completed, or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps. +> **Exercise 11:** Ask a coding model to solve a problem where you already know the answer, such as a homework problem you've already completed or a textbook example. Does the model get it right? Where does it go wrong? Try breaking the problem down into smaller steps. ### Customize ollama @@ -393,13 +397,13 @@ Set system message. ## 5. Concluding remarks -Running inference locally on a large language model is surprisingly good. Using (relatively) simple hardware, our machines generate language that is coherent and it does a good job parsing prompts. The experience demonstrates that the majority of computational effort with LLMs is in training the model — a process that is rapidly becoming increasingly sophisticated and tailored for different uses. +Running inference locally on a large language model works surprisingly well. Using (relatively) simple hardware, our machines generate language that is coherent, and they do a good job parsing prompts. The experience demonstrates that the majority of computational effort with LLMs is in training the model — a process that is becoming increasingly sophisticated and tailored for different uses. With local models (as well as cloud-based APIs), we can build new tools that make use of natural language processing. With `ollama` acting as a local server, the model can be run with Python, giving us the ability to implement its features in our own programs. For one Python library, see: - https://github.com/ollama/ollama-python -In class, I demonstrated a simple thermodynamics assistant based on a simple Retrieval-Augmented Generation strategy. 
This code takes a query from the user, encodes it with an embedding model, compares it to previously embedded statements (in my case the index of a thermodynamics book), and returns the information by generating a response with a decoding GPT (one of the models we used above). +Recently, we used this approach to build a thermodynamics assistant based on a simple Retrieval-Augmented Generation (RAG) strategy (the next topic in this series). The assistant code takes a query from the user in natural language, encodes it with an embedding model, compares it to previously embedded statements (in my case the index of a thermodynamics book), and returns the information by generating a response with a decoding GPT (one of the models we used above). ## Additional resources and references @@ -431,6 +435,7 @@ Model registry: | Model | Size | Type | Used for | |-------|------|------|----------| | `llama3:latest` | 4.7 GB | General purpose | Chat, comparison | +| `llama3.1:8B` | 4.9 GB | General purpose | arXiv summarization, comparison | | `llama3.2:latest` | 2.0 GB | General purpose | Chat, summarization, comparison | | `gemma3:1b` | 815 MB | General purpose | Chat, comparison | | `command-r7b:latest` | 4.7 GB | RAG-optimized | Document summarization |
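The retrieval step of the Retrieval-Augmented Generation strategy mentioned in the concluding remarks can be sketched in a few lines. The embedding function below is a hypothetical stand-in (a real assistant would get embeddings from a model served by `ollama`, for example through the ollama-python library); only the ranking logic is illustrated:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in embedding: a bag-of-words count vector.
    A real RAG pipeline would call an embedding model served by ollama."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]
```

A full assistant would then append the retrieved passages to the prompt sent to a chat model, so the generation step is grounded in the user's own documents.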