From 59e5f86884ebc53533313bf0ac2107b423bd8041 Mon Sep 17 00:00:00 2001
From: Eric Furst
Date: Tue, 28 Apr 2026 12:05:08 -0400
Subject: [PATCH] Sync RAG and semantic-search updates from che-computing

- 03-rag, 04-semantic-search: env-var-before-imports fix in build/query scripts
- 03-rag: new libraries section, fetch_arxiv.py, exercises for larger corpus and finding current SOTA models, formal references (Lewis, Booth)
- 04-semantic-search: libraries pointer back to Part III, larger corpus subsection, model-update exercise, formal references
- 06-neural-networks: add Nielsen reference (recommended by student)
- README: vocab.md link, agentic systems in description, Ollama prereq for 02-05
- New: vocab.md (glossary organized by section)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 03-rag/README.md                  |  76 ++++++++++++++++++--
 03-rag/build.py                   |   8 +++
 03-rag/fetch_arxiv.py             | 112 ++++++++++++++++++++++++++++++
 03-rag/query.py                   |  15 ++--
 04-semantic-search/README.md      |  43 +++++++++++-
 04-semantic-search/build_store.py |   9 ++-
 06-neural-networks/README.md      |   1 +
 README.md                         |   2 +-
 vocab.md                          | 110 +++++++++++++++++++++++++++++
 9 files changed, 359 insertions(+), 17 deletions(-)
 create mode 100644 03-rag/fetch_arxiv.py
 create mode 100644 vocab.md

diff --git a/03-rag/README.md b/03-rag/README.md
index 951c0a4..36b6974 100644
--- a/03-rag/README.md
+++ b/03-rag/README.md
@@ -100,9 +100,46 @@ Save this as `cache_model.py` and run it:
 ```bash
 python cache_model.py
 ```
+(This script is also saved in the GitHub repository.) Each script that uses the model sets environment variables to prevent checking for updates. You can update the model manually either by running `cache_model.py` or by editing the scripts themselves.
+
+## 2. The libraries we use
+
+A RAG system is built from three independent layers, each handled by a different library:
+
+| Layer | Library | What it does |
+|-------|---------|--------------|
+| **Orchestration** | [LlamaIndex](https://docs.llamaindex.ai/) | Glues the pieces together: chunking, indexing, retrieval, prompt assembly, response synthesis |
+| **Embeddings** | [Hugging Face](https://huggingface.co/) (via `sentence-transformers`) | Provides the model that converts text into vectors |
+| **Generation** | [Ollama](https://ollama.com/) | Runs the LLM that produces the final answer |
+
+LlamaIndex used to be a single package; since version 0.10 it has been split into a small `llama-index-core` plus dedicated integration packages. That is why our `pip install` line includes several `llama-index-*` packages -- one for each external thing we plug in (Ollama for the LLM, Hugging Face for embeddings, file readers for local documents). If you find older tutorials online that import from `llama_index` (no `.core`), they predate the split and will not work.
+
+The two key patterns to recognize in `build.py` and `query.py`:
+
+**1. Global `Settings`.** Instead of passing the LLM and embedding model into every call, LlamaIndex uses a global `Settings` object:
+
+```python
+Settings.llm = Ollama(model="command-r7b", request_timeout=360.0)
+Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
+```
+
+After these two lines, every component (index, query engine, retriever) automatically uses the configured models. This replaced the older `ServiceContext` pattern, which has been removed.
+
+**2. 
Environment variables before imports.** At the top of each script: + +```python +import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" +os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models" +os.environ["HF_HUB_OFFLINE"] = "1" +``` + +These must come *before* `from llama_index...` imports, because the Hugging Face libraries read the environment at import time. `HF_HUB_OFFLINE=1` tells the libraries not to check the Hub for updates on every run -- without it, you will see a "sending unauthenticated requests to the HF Hub" warning and the script may slow down or stall on a poor connection. `SENTENCE_TRANSFORMERS_HOME` controls where embedding models are cached. + +If you ever swap to a different embedding model and need a fresh download, temporarily remove `HF_HUB_OFFLINE` for one run or use a standalone script like `cache_model.py`. -## 2. The documents +## 3. The documents The `data/` directory contains 10 emails from the University of Delaware president's office, spanning 2012–2025 (the same set from Part II). Each is a plain text file with a subject line, date, and body text. @@ -124,7 +161,7 @@ python clean_eml.py This extracts the subject, date, and body from each email and writes a dated `.txt` file to `./data`. -## 3. Building the vector store +## 4. Building the vector store The script `build.py` does the heavy lifting: @@ -154,7 +191,7 @@ We can't embed an entire document as a single vector — it would lose too much > **Exercise 1:** Look at `build.py`. What would happen if you made the chunks much smaller (e.g., 100 tokens)? Much larger (e.g., 2000 tokens)? Think about the tradeoff between precision and context. -## 4. Querying the vector store +## 5. Querying the vector store The script `query.py` loads the stored index, takes your question, and returns a response grounded in the documents: @@ -207,7 +244,7 @@ Notice the **similarity scores** — these are cosine similarities between the q > **Exercise 2:** Run the same query twice. Do you get exactly the same output? Why or why not? -## 5. Understanding the pieces +## 6. Understanding the pieces ### The embedding model @@ -236,7 +273,7 @@ Our custom prompt in `query.py` is more detailed — it asks for structured outp > **Exercise 3:** Modify the prompt in `query.py`. For example, ask the model to respond in the style of a news reporter, or to focus only on dates and events. How does the output change? -## 6. Exercises +## 7. Exercises > **Exercise 4:** Try different embedding models. Replace `BAAI/bge-large-en-v1.5` with `sentence-transformers/all-mpnet-base-v2` in both `build.py` and `query.py`. Rebuild the vector store and compare the results. @@ -246,6 +283,30 @@ Our custom prompt in `query.py` is more detailed — it asks for structured outp > **Exercise 7:** Bring your own documents. Find a collection of text files — research paper abstracts, class notes, or a downloaded text from Project Gutenberg — and build a RAG system over them. What questions can you answer that a plain LLM cannot? +> **Exercise 8 (optional, sets up Part IV):** Build a larger corpus. Ten emails is small enough that retrieval is barely selective — the system returns most of the corpus on every query. 
The script `fetch_arxiv.py` pulls 100 recent abstracts from a chosen arXiv category and writes one text file per abstract:
+>
+> ```bash
+> python fetch_arxiv.py --category cs.LG --max 100 --output data_arxiv
+> ```
+>
+> Try other categories: `physics.chem-ph` (chemical physics), `cond-mat.soft` (soft matter), `cs.AI` (artificial intelligence), `cs.CL` (computational linguistics), `physics.flu-dyn` (fluid dynamics). Then update `build.py` to point at your new directory (or symlink it as `./data`), rebuild the vector store, and query it. With a 100-document corpus, retrieval becomes meaningfully selective and the choice of embedding model matters more.
+>
+> Other corpora to consider:
+> - **CCPS process safety case studies** — https://www.aiche.org/ccps/resources (some are openly available as text or PDF)
+> - **US Chemical Safety Board incident reports** — https://www.csb.gov/investigations/
+> - **NIST chemistry data sheets** — https://webbook.nist.gov/
+> - **AIChE journal abstracts** — many publishers expose abstracts via their APIs
+>
+> If your sources are PDFs, install `llama-index-readers-file` (already in `requirements.txt`) and use `SimpleDirectoryReader` — it picks up `.pdf` automatically.
+
+> **Exercise 9 (optional):** The embedding model `BAAI/bge-large-en-v1.5` and the LLM `command-r7b` were both released in 2024. By the time you read this, newer and likely better models exist. Find the current state of the art:
+>
+> - Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models
+> - Browse [Ollama's model library](https://ollama.com/library) sorted by recent or popular for current LLMs
+> - Replace one model at a time in `build.py` and `query.py`, rebuild if the embedding model changes, and compare retrieval quality
+>
+> Document the model versions and dates in your machine log. Models that "feel old" are part of the engineering reality of working with this stack — what was best last year may not be best today.
+
 
 ## Additional resources and references
@@ -270,5 +331,6 @@ Other embedding model mentioned: `sentence-transformers/all-mpnet-base-v2`
 
 ### Further reading
 
-- NIST IR 8579, [*Developing the NCCoE Chatbot: Technical and Security Learnings from the Initial Implementation*](https://csrc.nist.gov/pubs/ir/8579/ipd) ([PDF](https://nvlpubs.nist.gov/nistpubs/ir/2025/NIST.IR.8579.ipd.pdf)) — practical guidance on building a RAG-based chatbot, including architecture and security considerations
-- Open WebUI (https://openwebui.com) — a turnkey local RAG interface if you want a GUI
+- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In *Advances in Neural Information Processing Systems*. Curran Associates, Inc., 9459–9474. https://proceedings.neurips.cc/paper_files/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html — the foundational RAG paper that introduced the retrieve-and-generate framework we use here.
+- Harold Booth. 2025. *Development and Implementation of the NCCoE Chatbot: A Comprehensive Report*. National Institute of Standards and Technology, Gaithersburg, MD. https://doi.org/10.6028/NIST.IR.8579.ipd — practical guidance on building a RAG-based chatbot, including architecture and security considerations. 
+- Open WebUI (https://openwebui.com) — a turnkey local RAG interface if you want a GUI. diff --git a/03-rag/build.py b/03-rag/build.py index 84ae988..93e0765 100644 --- a/03-rag/build.py +++ b/03-rag/build.py @@ -6,6 +6,14 @@ # August 2025 # E. M. Furst +# Environment vars must be set before importing huggingface/transformers +# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE +# at import time. +import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" +os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models" +os.environ["HF_HUB_OFFLINE"] = "1" + from llama_index.core import ( SimpleDirectoryReader, VectorStoreIndex, diff --git a/03-rag/fetch_arxiv.py b/03-rag/fetch_arxiv.py new file mode 100644 index 0000000..c5c56bc --- /dev/null +++ b/03-rag/fetch_arxiv.py @@ -0,0 +1,112 @@ +# fetch_arxiv.py +# +# Fetch arXiv abstracts and write each as a separate text file in ./data. +# This builds a larger, more interesting corpus for RAG experiments. +# +# Default: 100 most recent abstracts in cs.LG (machine learning). +# Try other categories: physics.chem-ph, cond-mat.soft, cs.AI, cs.CL, +# physics.flu-dyn, etc. +# +# Usage: +# python fetch_arxiv.py # default: cs.LG, 100 papers +# python fetch_arxiv.py --category cs.AI +# python fetch_arxiv.py --category cs.LG --max 200 --output data_arxiv +# +# CHEG 667-013 + +import argparse +import os +import re +import time +import urllib.parse +import urllib.request +import xml.etree.ElementTree as ET + + +def fetch_abstracts(category, max_results, batch_size=50): + """Fetch arXiv abstracts in batches via the API.""" + base_url = "https://export.arxiv.org/api/query" + ns = {"atom": "http://www.w3.org/2005/Atom"} + entries = [] + + for start in range(0, max_results, batch_size): + n = min(batch_size, max_results - start) + params = { + "search_query": f"cat:{category}", + "sortBy": "submittedDate", + "sortOrder": "descending", + "start": start, + "max_results": n, + } + url = f"{base_url}?{urllib.parse.urlencode(params)}" + print(f" fetching {start+1}-{start+n}...") + with urllib.request.urlopen(url) as resp: + data = resp.read() + root = ET.fromstring(data) + batch = root.findall("atom:entry", ns) + if not batch: + break + entries.extend(batch) + time.sleep(3) # arXiv asks for 3-second delay between requests + return entries + + +def safe_filename(s, max_len=80): + """Convert a title to a filesystem-safe filename.""" + s = re.sub(r"\s+", "_", s.strip()) + s = re.sub(r"[^A-Za-z0-9._-]", "", s) + return s[:max_len] + + +def write_abstract(entry, outdir): + """Extract title, authors, date, abstract; write to a text file.""" + ns = {"atom": "http://www.w3.org/2005/Atom"} + title = entry.find("atom:title", ns).text.strip() + summary = entry.find("atom:summary", ns).text.strip() + published = entry.find("atom:published", ns).text.strip()[:10] + authors = [ + a.find("atom:name", ns).text + for a in entry.findall("atom:author", ns) + ] + arxiv_id = entry.find("atom:id", ns).text.strip().split("/")[-1] + + fname = f"{published}_{safe_filename(title)}.txt" + path = os.path.join(outdir, fname) + body = ( + f"Title: {title}\n" + f"Authors: {', '.join(authors)}\n" + f"Date: {published}\n" + f"arXiv: {arxiv_id}\n" + f"\n" + f"{summary}\n" + ) + with open(path, "w") as f: + f.write(body) + return fname + + +def main(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--category", default="cs.LG", + help="arXiv category (default: cs.LG)") + parser.add_argument("--max", type=int, default=100, + help="number of abstracts to fetch 
(default: 100)")
+    parser.add_argument("--output", default="data",
+                        help="output directory (default: data)")
+    args = parser.parse_args()
+
+    os.makedirs(args.output, exist_ok=True)
+    print(f"Fetching {args.max} abstracts from arXiv:{args.category} -> {args.output}/")
+    entries = fetch_abstracts(args.category, args.max)
+    print(f"Got {len(entries)} entries. Writing to {args.output}/...")
+    for e in entries:
+        try:
+            fname = write_abstract(e, args.output)
+        except Exception as exc:
+            print(f"  skipped one: {exc}")
+            continue
+    print(f"Done. {len(os.listdir(args.output))} files in {args.output}/")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/03-rag/query.py b/03-rag/query.py
index 42eb740..0d36ba8 100644
--- a/03-rag/query.py
+++ b/03-rag/query.py
@@ -5,6 +5,14 @@
 # August 2025
 # E. M. Furst
 
+# Environment vars must be set before importing huggingface/transformers
+# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
+# at import time.
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
+os.environ["HF_HUB_OFFLINE"] = "1"
+
 from llama_index.core import (
     load_index_from_storage,
     StorageContext,
@@ -13,12 +21,7 @@ from llama_index.core import (
 from llama_index.embeddings.huggingface import HuggingFaceEmbedding
 from llama_index.llms.ollama import Ollama
 from llama_index.core.prompts import PromptTemplate
-import os, time
-
-#
-# Globals
-#
-os.environ["TOKENIZERS_PARALLELISM"] = "false"
+import time
 
 # Embedding model used in vector store (this should match the one in build.py)
 embed_model = HuggingFaceEmbedding(cache_folder="./models",
diff --git a/04-semantic-search/README.md b/04-semantic-search/README.md
index 908a478..d11e5e6 100644
--- a/04-semantic-search/README.md
+++ b/04-semantic-search/README.md
@@ -103,9 +103,45 @@ Make sure `ollama` is running and `command-r7b` is available:
 ollama pull command-r7b
 ```
 
+### Libraries and environment variables
+
+This section uses the same three-layer architecture introduced in [Part III](../03-rag/README.md#2-the-libraries-we-use): LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation. It adds one new piece, `llama-index-retrievers-bm25`, for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.
+
+The same `Settings`-based configuration applies, and the same environment-variable pattern is used at the top of every script:
+
+```python
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
+os.environ["HF_HUB_OFFLINE"] = "1"
+```
+
+These come *before* any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. `HF_HUB_OFFLINE=1` prevents the "sending unauthenticated requests to the HF Hub" warning and keeps every run fully offline. Remove it temporarily if you want to download a fresh model. See [Part III, section 2](../03-rag/README.md#2-the-libraries-we-use) for the full explanation.
+
 ## 3. Building the vector store
 
+### A larger corpus
+
+The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.
+
+If you did Exercise 8 in Part III, you already have one. 
Otherwise, run: + +```bash +python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data +``` + +This populates `./data` with 100 recent papers from `cs.LG` (machine learning). Other relevant categories: + +- `physics.chem-ph` -- chemical physics +- `cond-mat.soft` -- soft matter +- `physics.flu-dyn` -- fluid dynamics +- `cs.AI` -- artificial intelligence + +You can also drop in your own collection: NIST data sheets, CCPS process safety case studies (https://www.aiche.org/ccps/resources), US Chemical Safety Board incident reports (https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- `SimpleDirectoryReader` reads them automatically when `llama-index-readers-file` is installed. + +### Building the index + The `build_store.py` script works like the one in Part III, with a few differences: - **Smaller chunks**: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap @@ -249,6 +285,8 @@ This is a fallback when you know exactly what you're looking for and don't need > **Exercise 8:** Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents? +> **Exercise 9:** The cross-encoder we use (`cross-encoder/ms-marco-MiniLM-L-12-v2`) and the embedding model (`BAAI/bge-large-en-v1.5`) date from 2022-2024. Newer models are likely available. Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today. + ## Additional resources and references @@ -272,5 +310,6 @@ This is a fallback when you know exactly what you're looking for and don't need ### Further reading -- Robertson & Zaragoza, *The Probabilistic Relevance Framework: BM25 and Beyond* (2009) — the theory behind BM25 -- Nogueira & Cho, *Passage Re-ranking with BERT* (2019) — cross-encoder re-ranking applied to information retrieval +- Stephen Robertson and Hugo Zaragoza. 2009. *The Probabilistic Relevance Framework: BM25 and Beyond*. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25. +- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in `query_hybrid.py`. +- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for `sentence-transformers`, the library behind both our embedding and cross-encoder models. diff --git a/04-semantic-search/build_store.py b/04-semantic-search/build_store.py index add3db3..0631622 100644 --- a/04-semantic-search/build_store.py +++ b/04-semantic-search/build_store.py @@ -9,6 +9,14 @@ # E. M. Furst # Used Sonnet 4.5 to suggest changes; Opus 4.6 for incremental update +# Environment vars must be set before importing huggingface/transformers +# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE +# at import time. 
+import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" +os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models" +os.environ["HF_HUB_OFFLINE"] = "1" + from llama_index.core import ( SimpleDirectoryReader, StorageContext, @@ -21,7 +29,6 @@ from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.core.node_parser import SentenceSplitter import argparse import datetime -import os import time # Shared constants diff --git a/06-neural-networks/README.md b/06-neural-networks/README.md index 2eb8263..ec492e0 100644 --- a/06-neural-networks/README.md +++ b/06-neural-networks/README.md @@ -254,5 +254,6 @@ The fundamental loop — forward pass, compute loss, backpropagate, update weigh ### Reading - Zhang, Lipton, Li & Smola, *Dive into Deep Learning* — interactive, with runnable code in PyTorch: https://d2l.ai +- Nielsen, *Neural Networks and Deep Learning* — clear introduction with interactive plots and exercises: http://neuralnetworksanddeeplearning.com/ - Goodfellow, Bengio & Courville, *Deep Learning* (2016), freely available at https://www.deeplearningbook.org/ - 3Blue1Brown, *Neural Networks* video series: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi — excellent visual intuition for how neural networks learn diff --git a/README.md b/README.md index c14fb2b..1bf8274 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ git clone https://lem.che.udel.edu/git/furst/llm-workshop.git cd llm-workshop ``` -Each section has its own `README.md` with a full walkthrough, exercises, and any code or data needed. +Each section has its own `README.md` with a full walkthrough, exercises, and any code or data needed. See [vocab.md](vocab.md) for a glossary of key terms organized by section. ### Python environment diff --git a/vocab.md b/vocab.md new file mode 100644 index 0000000..a2689af --- /dev/null +++ b/vocab.md @@ -0,0 +1,110 @@ +# Vocabulary + +Key terms organized by the section where they are first introduced. + +--- + +## Section 01: nanoGPT + +| Term | Definition | +|------|-----------| +| **GPT** | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. | +| **Transformer** | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. | +| **Self-attention** | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. | +| **Token** | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. | +| **Tokenization** | Breaking text into tokens. Character-level (each letter is a token) or subword (byte pair encoding) are common approaches. | +| **Byte pair encoding (BPE)** | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. | +| **Vocabulary size** | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. | +| **Embedding** | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. 
| +| **Context window** | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). Larger context allows richer understanding but costs more memory and leads to longer calculations. | +| **Attention head** | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. | +| **Parameters** | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. | +| **Weights and biases** | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. | +| **Training** | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. | +| **Inference** | Running a trained model to generate output. Much cheaper than training. | +| **Loss** | A number measuring how wrong the model's predictions are. Training aims to minimize it. | +| **Validation** | Testing the model on data it was not trained on, to check whether it generalizes or has memorized the training set. | +| **Epoch** | One complete pass through the training dataset. | +| **Iteration** | One training step, typically on a batch (subset) of data. | +| **Checkpoint** | A saved snapshot of the model's parameters at a particular point during training. | +| **Temperature** | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. | +| **Seed** | A value that initializes random number generation. Same seed produces same output, useful for reproducibility. | +| **Dropout** | A regularization technique that randomly disables neurons during training to prevent overfitting. | +| **Fine-tuning** | Additional training on a pre-trained model using a smaller, specialized dataset. | + +## Section 02: Ollama + +| Term | Definition | +|------|-----------| +| **Ollama** | A local runtime for running LLMs on your own machine without cloud APIs. | +| **llama.cpp** | A C++ library for efficient local LLM inference. Ollama builds on it. | +| **GGUF** | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. | +| **Quantization** | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. | +| **FP32 / FP16 / INT8 / Q4** | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. | +| **Logit** | The raw score a model assigns to each possible next token before converting to probabilities. | +| **Top-k sampling** | A decoding strategy that considers only the k highest-scoring tokens when generating. | +| **Top-p sampling** | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). | +| **System prompt** | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. | +| **Modelfile** | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. | +| **API** | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. 
| + +## Section 03: RAG + +| Term | Definition | +|------|-----------| +| **RAG (Retrieval-Augmented Generation)** | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. | +| **Chunking** | Splitting documents into shorter segments for embedding. Typical sizes: 256--500 tokens. | +| **Chunk overlap** | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. | +| **Vector store** | An indexed collection of embedded chunks, searchable by vector similarity. | +| **Cosine similarity** | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. | +| **Semantic search** | Search based on meaning rather than exact keyword matching, enabled by embeddings. | +| **LlamaIndex** | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. | +| **Node** | In LlamaIndex, a parsed text segment ready for embedding and indexing. | +| **Context** | The retrieved chunks passed to the LLM as background information for answering a query. | +| **Generator** | The LLM component in a RAG system that reads retrieved context and composes a response. | + +## Section 04: Semantic Search + +| Term | Definition | +|------|-----------| +| **Hybrid retrieval** | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. | +| **Dense retrieval** | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. | +| **Sparse retrieval** | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. | +| **BM25** | "Best Matching 25." A classical algorithm that scores documents by term frequency, adjusted for document length. | +| **Cross-encoder** | A model that reads query and document together to produce a relevance score. More accurate than embeddings alone, but slower. | +| **Re-ranking** | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. | +| **Candidate pool** | The initial set of retrieved chunks before re-ranking narrows them down. | + +## Section 05: Tool Use and Agentic Systems + +| Term | Definition | +|------|-----------| +| **Agentic system** | A program where an LLM serves as a natural-language interface to tools, data, and actions. What you are using when you use ChatGPT, Claude, or Copilot. | +| **Tool calling (function calling)** | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself. | +| **Orchestration** | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back, repeat until done. | +| **Memory** | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. | +| **Type hints** | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. | +| **Docstring** | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. | + +## Section 06: Neural Networks + +| Term | Definition | +|------|-----------| +| **Neural network** | A model made of layers of neurons connected by weights and biases. 
Learns by adjusting these parameters to minimize a loss function. | +| **Machine learning (ML)** | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. | +| **Forward pass** | Computing the output of a network from its inputs, layer by layer. | +| **Pre-activation** | The weighted sum plus bias before the activation function is applied: z = w*x + b. | +| **Activation function** | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. | +| **Hidden layer** | A layer between input and output. Called "hidden" because its values are not directly observed. | +| **Backpropagation** | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of compute per training step. | +| **Gradient** | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. | +| **Gradient descent** | The algorithm for updating parameters: w = w - learning_rate * gradient. | +| **Learning rate** | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. | +| **Mean squared error (MSE)** | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. | +| **Cross-entropy loss** | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. | +| **Overfitting** | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. | +| **Train/validation split** | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. | +| **Early stopping** | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. This is what nanoGPT's `train.py` does. | +| **Normalization** | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. | +| **Automatic differentiation** | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. | +| **Adam optimizer** | An adaptive learning rate optimizer that adjusts step sizes per parameter. Converges faster than plain gradient descent. |
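+
+### Putting the Section 06 terms together
+
+The sketch below is illustrative only and is not part of the workshop scripts: a tiny PyTorch model showing a forward pass, mean squared error, backpropagation via automatic differentiation, and a gradient descent update. It assumes PyTorch is installed; the data and learning rate are made up for the example.
+
+```python
+import torch
+
+# Made-up, normalized data for illustration (targets are roughly y = 2x).
+x = torch.tensor([[0.1], [0.5], [0.9]])
+y = torch.tensor([[0.2], [1.0], [1.8]])
+
+w = torch.randn(1, 1, requires_grad=True)   # weight (a trainable parameter)
+b = torch.zeros(1, requires_grad=True)      # bias
+learning_rate = 0.1
+
+for step in range(100):
+    y_hat = x @ w + b                       # forward pass
+    loss = ((y_hat - y) ** 2).mean()        # mean squared error
+    loss.backward()                         # backpropagation (automatic differentiation)
+    with torch.no_grad():
+        w -= learning_rate * w.grad         # gradient descent: w = w - learning_rate * gradient
+        b -= learning_rate * b.grad
+        w.grad.zero_()                      # clear gradients before the next iteration
+        b.grad.zero_()
+
+print(f"learned w={w.item():.2f}, b={b.item():.2f}, final loss={loss.item():.4f}")
+```
+
+Replacing the hand-written update with an optimizer such as `torch.optim.Adam` gives the usual PyTorch training-loop shape.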