# Vocabulary

Key terms organized by the section where they are first introduced.


## Section 01: nanoGPT

| Term | Definition |
| --- | --- |
| GPT | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| Transformer | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| Self-attention | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| Token | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| Tokenization | Breaking text into tokens. Character-level (each letter is a token) and subword (byte pair encoding) are common approaches. |
| Byte pair encoding (BPE) | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| Vocabulary size | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| Embedding | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. |
| Context window | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). A larger context allows richer understanding but costs more memory and compute. |
| Attention head | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. |
| Parameters | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| Weights and biases | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| Training | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| Inference | Running a trained model to generate output. Much cheaper than training. |
| Loss | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| Validation | Testing the model on data it was not trained on, to check whether it generalizes or has merely memorized the training set. |
| Epoch | One complete pass through the training dataset. |
| Iteration | One training step, typically on a batch (subset) of the data. |
| Checkpoint | A saved snapshot of the model's parameters at a particular point during training. |
| Temperature | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| Seed | A value that initializes random number generation. The same seed produces the same output, which is useful for reproducibility. |
| Dropout | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| Fine-tuning | Additional training of a pre-trained model on a smaller, specialized dataset. |
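
Character-level tokenization can be sketched in a few lines of plain Python. This is a simplified illustration in the spirit of nanoGPT's Shakespeare data preparation; the `encode`/`decode` helper names are illustrative, not any library's API.

```python
# Character-level tokenization: each unique character in the corpus
# becomes one token ID, so the vocabulary is just the sorted character set.
text = "To be, or not to be"

chars = sorted(set(text))                     # the vocabulary (9 unique characters here)
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer token ID
itos = {i: ch for ch, i in stoi.items()}      # integer token ID -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)      # text as a list of token IDs
assert decode(ids) == text  # encoding round-trips losslessly
```

A subword tokenizer like BPE follows the same encode/decode contract, just with merged multi-character tokens and a far larger vocabulary.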

## Section 02: Ollama

| Term | Definition |
| --- | --- |
| Ollama | A local runtime for running LLMs on your own machine without cloud APIs. |
| llama.cpp | A C++ library for efficient local LLM inference. Ollama builds on it. |
| GGUF | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| Quantization | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| FP32 / FP16 / INT8 / Q4 | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. |
| Logit | The raw score a model assigns to each possible next token before converting to probabilities. |
| Top-k sampling | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| Top-p sampling | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| System prompt | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| Modelfile | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| API | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. |
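
The sampling terms above can be illustrated in plain Python. This is a toy sketch over a list of logits; real runtimes like llama.cpp do the same filtering on tensors.

```python
import math

def softmax(logits):
    # Convert raw logits to probabilities (subtract max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def top_p(probs, p):
    # Keep the most probable tokens until their cumulative probability
    # reaches p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]
```

After filtering, the generator samples the next token from the renormalized distribution; temperature (Section 01) would divide the logits before the softmax.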

## Section 03: RAG

| Term | Definition |
| --- | --- |
| RAG (Retrieval-Augmented Generation) | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| Chunking | Splitting documents into shorter segments for embedding. Typical sizes: 256–500 tokens. |
| Chunk overlap | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| Vector store | An indexed collection of embedded chunks, searchable by vector similarity. |
| Cosine similarity | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. |
| Semantic search | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| LlamaIndex | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. |
| Node | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| Context | The retrieved chunks passed to the LLM as background information for answering a query. |
| Generator | The LLM component in a RAG system that reads retrieved context and composes a response. |
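
Two of the mechanics above, chunking with overlap and cosine similarity, fit in a short sketch. Frameworks like LlamaIndex handle this internally; the helper names here are illustrative.

```python
import math

def chunk(tokens, size, overlap):
    # Split a token sequence into chunks of `size`, each sharing
    # `overlap` tokens with the previous chunk so boundary sentences survive.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|): 1.0 means same direction, 0.0 orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A vector store embeds every chunk once, then answers a query by embedding it and returning the chunks with the highest cosine similarity.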

## Section 04: Semantic Search

| Term | Definition |
| --- | --- |
| Hybrid retrieval | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. |
| Dense retrieval | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| Sparse retrieval | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| BM25 | "Best Matching 25." A classical algorithm that scores documents by term frequency, adjusted for document length. |
| Cross-encoder | A model that reads the query and a document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| Re-ranking | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. |
| Candidate pool | The initial set of retrieved chunks before re-ranking narrows them down. |
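
One common way to merge the dense and sparse result lists in hybrid retrieval is reciprocal rank fusion (RRF). RRF is not covered above, so treat this as an illustrative sketch of the general idea, not the workshop's method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of document IDs, e.g. one from vector search
    # and one from BM25. Each document earns 1/(k + rank) per list it
    # appears in; k=60 is the damping constant from the original RRF paper.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first: documents ranked well by both retrievers win.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that cosine scores and BM25 scores live on incompatible scales; the fused list then becomes the candidate pool for re-ranking.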

## Section 05: Tool Use and Agentic Systems

| Term | Definition |
| --- | --- |
| Agentic system | A program where an LLM serves as a natural-language interface to tools, data, and actions. ChatGPT, Claude, and Copilot are all agentic systems. |
| Tool calling (function calling) | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself. |
| Orchestration | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back; repeat until done. |
| Memory | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| Type hints | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| Docstring | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |
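
To see why type hints and docstrings matter, here is a sketch of how a tool-calling system might describe a Python function to the LLM using only introspection. The `get_weather` function and the schema shape are hypothetical, not any specific library's format.

```python
import inspect

def get_weather(city: str, units: str = "celsius") -> str:
    """Look up the current temperature for a city."""
    return f"22 degrees {units} in {city}"  # hypothetical stub, no real API call

def tool_schema(fn):
    # Build a tool description from the function's signature and docstring:
    # the type hints give parameter types, the docstring explains the tool.
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {
            "type": p.annotation.__name__ if p.annotation is not inspect.Parameter.empty else "any",
            "required": p.default is inspect.Parameter.empty,  # no default = required
        }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": params,
    }
```

The schema is what the LLM actually sees; when it emits a structured call like `{"name": "get_weather", "arguments": {"city": "Boston"}}`, the orchestration loop runs the real function and feeds the result back.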

## Section 06: Neural Networks

| Term | Definition |
| --- | --- |
| Neural network | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| Machine learning (ML) | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| Forward pass | Computing the output of a network from its inputs, layer by layer. |
| Pre-activation | The weighted sum plus bias before the activation function is applied: `z = w*x + b`. |
| Activation function | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| Hidden layer | A layer between input and output. Called "hidden" because its values are not directly observed. |
| Backpropagation | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of the compute per training step. |
| Gradient | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| Gradient descent | The algorithm for updating parameters: `w = w - learning_rate * gradient`. |
| Learning rate | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| Mean squared error (MSE) | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| Cross-entropy loss | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| Overfitting | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| Train/validation split | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| Early stopping | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. This is what nanoGPT's train.py does. |
| Normalization | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| Automatic differentiation | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. |
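
The gradient-descent terms above become concrete in a one-parameter example: fitting `y = w*x` to three points, with the gradient of the MSE worked out by the chain rule by hand. This is a toy sketch; real training computes the same gradients via automatic differentiation.

```python
# Fit y = w * x to data generated by y = 2x, using gradient descent on MSE.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(w):
    # Mean squared error between predictions w*x and targets y.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    # dL/dw = 2 * mean(x * (w*x - y)): the chain rule applied by hand.
    return 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                  # initial guess
learning_rate = 0.05
for _ in range(200):     # 200 iterations of the update rule
    w = w - learning_rate * grad(w)
```

After 200 steps `w` has converged very close to the true slope of 2; raising `learning_rate` too far makes the updates overshoot and oscillate, which is the tradeoff named in the Learning rate entry above.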