# Vocabulary
Key terms organized by the section where they are first introduced.

---
## Section 01: nanoGPT
| Term | Definition |
|------|-----------|
| **GPT** | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| **Transformer** | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| **Self-attention** | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| **Token** | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| **Tokenization** | Breaking text into tokens. Character-level (each letter is a token) or subword (byte pair encoding) are common approaches; a character-level sketch follows this table. |
| **Byte pair encoding (BPE)** | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| **Vocabulary size** | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| **Embedding** | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. |
| **Context window** | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). A larger context allows richer understanding but costs more memory and compute. |
| **Attention head** | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. |
| **Parameters** | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| **Weights and biases** | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| **Training** | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| **Inference** | Running a trained model to generate output. Much cheaper than training. |
| **Loss** | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| **Validation** | Testing the model on data it was not trained on, to check whether it generalizes or has memorized the training set. |
| **Epoch** | One complete pass through the training dataset. |
| **Iteration** | One training step, typically on a batch (subset) of data. |
| **Checkpoint** | A saved snapshot of the model's parameters at a particular point during training. |
| **Temperature** | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| **Seed** | A value that initializes random number generation. Same seed produces same output, useful for reproducibility. |
| **Dropout** | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| **Fine-tuning** | Additional training on a pre-trained model using a smaller, specialized dataset. |
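To make the tokenization terms concrete, here is a minimal character-level tokenizer in the spirit of nanoGPT's Shakespeare preparation step. The names (`stoi`, `itos`, `encode`, `decode`) follow common nanoGPT conventions, but the snippet is an illustrative sketch rather than the workshop's actual code.

```python
# Minimal character-level tokenizer sketch (illustrative, not nanoGPT's prepare.py).
text = "To be, or not to be, that is the question."

chars = sorted(set(text))                       # unique characters form the vocabulary
vocab_size = len(chars)                         # 65 for the full Shakespeare corpus
stoi = {ch: i for i, ch in enumerate(chars)}    # character -> token id
itos = {i: ch for i, ch in enumerate(chars)}    # token id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text                      # round-trip check
print(vocab_size, ids[:10])
```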
## Section 02: Ollama
| Term | Definition |
|------|-----------|
| **Ollama** | A local runtime for running LLMs on your own machine without cloud APIs. |
| **llama.cpp** | A C++ library for efficient local LLM inference. Ollama builds on it. |
| **GGUF** | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| **Quantization** | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| **FP32 / FP16 / INT8 / Q4** | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, and 4-bit quantized. Lower precision = smaller model, faster inference. |
| **Logit** | The raw score a model assigns to each possible next token before converting to probabilities. |
| **Top-k sampling** | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| **Top-p sampling** | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| **System prompt** | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| **Modelfile** | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| **API** | Application Programming Interface. A defined way for programs to communicate. Ollama provides an HTTP API for sending prompts and receiving responses; see the sketch after this table. |
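To show how the pieces above fit together, the snippet below sends one prompt to a locally running Ollama server over its HTTP API, setting temperature, top-k, and top-p in the `options` field. It assumes Ollama is listening on the default port (11434) and that a model such as `llama3.1` has already been pulled; check the request shape against your installed version.

```python
# Hedged sketch: one prompt/response round trip against a local Ollama server.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",                    # any model you have pulled locally
        "prompt": "Explain quantization in one sentence.",
        "stream": False,                        # return a single JSON object, not a stream
        "options": {"temperature": 0.2, "top_k": 40, "top_p": 0.95},
    },
    timeout=120,
)
print(resp.json()["response"])                  # the generated text
```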
## Section 03: RAG
| Term | Definition |
|------|-----------|
| **RAG (Retrieval-Augmented Generation)** | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| **Chunking** | Splitting documents into shorter segments for embedding. Typical sizes: 256–500 tokens. |
| **Chunk overlap** | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| **Vector store** | An indexed collection of embedded chunks, searchable by vector similarity. |
| **Cosine similarity** | A measure of similarity between two vectors based on the angle between them. Used to find the chunks most relevant to a query; see the retrieval sketch after this table. |
| **Semantic search** | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| **LlamaIndex** | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. |
| **Node** | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| **Context** | The retrieved chunks passed to the LLM as background information for answering a query. |
| **Generator** | The LLM component in a RAG system that reads retrieved context and composes a response. |
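The sketch below walks through the retrieval half of RAG: embed the chunks, embed the query, rank chunks by cosine similarity, and paste the best one into the prompt as context for the generator. A toy bag-of-words vector stands in for a learned embedding model so the example runs with only NumPy; a real pipeline (e.g., LlamaIndex) would use dense embeddings and a proper vector store, but the lookup works the same way.

```python
# Hedged RAG retrieval sketch with a toy bag-of-words "embedding".
import numpy as np

chunks = [
    "Water boils at 100 degrees Celsius at one atmosphere.",
    "Transformers use self-attention over the context window.",
    "BM25 scores documents by term frequency and document length.",
]

# Shared vocabulary and one vector per chunk: a toy stand-in for a vector store.
vocab = sorted({w for c in chunks for w in c.lower().split()})

def to_vector(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

store = [to_vector(c) for c in chunks]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

query = "At what temperature does water boil?"
q = to_vector(query)
scores = [cosine_similarity(q, v) for v in store]
best_chunk = chunks[int(np.argmax(scores))]      # most relevant chunk

# The retrieved chunk becomes the context the generator LLM reads.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer using only the context above."
print(prompt)
```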
## Section 04: Semantic Search
| Term | Definition |
|------|-----------|
| **Hybrid retrieval** | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches; see the score-fusion sketch after this table. |
| **Dense retrieval** | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| **Sparse retrieval** | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| **BM25** | "Best Matching 25." A classical algorithm that scores documents by how often query terms appear (term frequency), weighted by how rare each term is across the corpus and adjusted for document length. |
| **Cross-encoder** | A model that reads query and document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| **Re-ranking** | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. |
| **Candidate pool** | The initial set of retrieved chunks before re-ranking narrows them down. |
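One simple way to implement hybrid retrieval is to put the dense and sparse scores on a common scale, take a weighted sum, and keep the top results as the candidate pool for a cross-encoder to re-rank. The chunk names, score values, and `alpha` weight below are invented to show the mechanics only; this is a sketch, not the section's retrieval code.

```python
# Hedged sketch: fuse dense (embedding) and sparse (BM25-style) scores per chunk.
import numpy as np

chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]

dense_scores = np.array([0.82, 0.40, 0.77, 0.15])   # e.g., cosine similarities (made up)
sparse_scores = np.array([1.2, 4.8, 0.3, 2.1])      # e.g., BM25 scores (made up)

def minmax(x: np.ndarray) -> np.ndarray:
    # Rescale to [0, 1] so the two score types are comparable before combining.
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5                                          # weight between dense and sparse
hybrid = alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse_scores)

pool_size = 3
candidate_pool = [chunks[i] for i in np.argsort(hybrid)[::-1][:pool_size]]
print(candidate_pool)                                # what a cross-encoder would then re-rank
```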
## Section 05: Tool Use and Agentic Systems
| Term | Definition |
|------|-----------|
| **Agentic system** | A program where an LLM serves as a natural-language interface to tools, data, and actions. This is what you are using when you interact with ChatGPT, Claude, or Copilot. |
| **Tool calling (function calling)** | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself; see the loop sketch after this table. |
| **Orchestration** | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back, repeat until done. |
| **Memory** | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| **Type hints** | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| **Docstring** | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |
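The loop below sketches tool calling and orchestration with the model replaced by a stub, so the division of labor stands out: the "LLM" only emits a structured request, and the surrounding system runs the matching Python function and feeds the result back. `llm_decide`, `get_weather`, and the message format are hypothetical stand-ins, not any particular framework's API.

```python
# Hedged sketch of a tool-calling orchestration loop with a stubbed "LLM".
def get_weather(city: str) -> str:
    """Return a short weather report for the given city."""
    return f"It is 21 C and sunny in {city}."        # canned demo data

TOOLS = {"get_weather": get_weather}                  # tool registry the system controls

def llm_decide(user_message: str, tool_result: str | None = None) -> dict:
    # Stand-in for the model: first turn requests a tool call, second turn answers.
    if tool_result is None:
        return {"type": "tool_call", "name": "get_weather",
                "arguments": {"city": "Cleveland"}}
    return {"type": "answer", "text": f"Based on the tool result: {tool_result}"}

# Orchestration: decide -> execute -> feed the result back -> repeat until an answer.
message = "What's the weather in Cleveland?"
step = llm_decide(message)
while step["type"] == "tool_call":
    fn = TOOLS[step["name"]]                          # look up the requested tool
    result = fn(**step["arguments"])                  # the system, not the LLM, runs it
    step = llm_decide(message, tool_result=result)
print(step["text"])
```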
## Section 06: Neural Networks
| Term | Definition |
|------|-----------|
| **Neural network** | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| **Machine learning (ML)** | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| **Forward pass** | Computing the output of a network from its inputs, layer by layer. |
| **Pre-activation** | The weighted sum plus bias before the activation function is applied: `z = w*x + b`. |
| **Activation function** | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| **Hidden layer** | A layer between input and output. Called "hidden" because its values are not directly observed. |
| **Backpropagation** | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of compute per training step. |
| **Gradient** | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| **Gradient descent** | The algorithm for updating parameters: `w = w - learning_rate * gradient`. See the sketch after this table. |
| **Learning rate** | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| **Mean squared error (MSE)** | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| **Cross-entropy loss** | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| **Overfitting** | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| **Train/validation split** | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| **Early stopping** | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. This is what nanoGPT's `train.py` does. |
| **Normalization** | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| **Automatic differentiation** | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. |
| **Adam optimizer** | An adaptive learning rate optimizer that adjusts step sizes per parameter. Converges faster than plain gradient descent. |
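To tie several of these terms together, here is a single-neuron fit written with PyTorch: a forward pass, a mean squared error loss, automatic differentiation via `loss.backward()`, and a plain gradient descent update. The input, target, and hyperparameters are arbitrary; treat it as a sketch of the mechanics, not the section's notebook code.

```python
# Hedged sketch: fit one neuron to a single target value with gradient descent.
import torch

x = torch.tensor([0.5])                         # input
y_target = torch.tensor([0.8])                  # target output

w = torch.tensor([0.1], requires_grad=True)     # weight
b = torch.tensor([0.0], requires_grad=True)     # bias

learning_rate = 0.1
for step in range(200):
    z = w * x + b                               # pre-activation
    y_pred = torch.tanh(z)                      # activation function
    loss = ((y_pred - y_target) ** 2).mean()    # mean squared error

    loss.backward()                             # backpropagation via automatic differentiation
    with torch.no_grad():
        w -= learning_rate * w.grad             # gradient descent update
        b -= learning_rate * b.grad
        w.grad.zero_()                          # clear gradients for the next iteration
        b.grad.zero_()

print(float(loss))                              # approaches zero as the neuron fits the target
```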