# Vocabulary
Key terms organized by the section where they are first introduced.
---
## Section 01: nanoGPT
| Term | Definition |
|------|-----------|
| **GPT** | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| **Transformer** | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| **Self-attention** | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| **Token** | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| **Tokenization** | Breaking text into tokens. Common approaches are character-level (each character is a token) and subword (byte pair encoding); see the sketch after this table. |
| **Byte pair encoding (BPE)** | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| **Vocabulary size** | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| **Embedding** | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. |
| **Context window** | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). A larger context allows richer understanding but costs more memory and compute. |
| **Attention head** | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. |
| **Parameters** | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| **Weights and biases** | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| **Training** | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| **Inference** | Running a trained model to generate output. Much cheaper than training. |
| **Loss** | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| **Validation** | Testing the model on data it was not trained on, to check whether it generalizes or has memorized the training set. |
| **Epoch** | One complete pass through the training dataset. |
| **Iteration** | One training step, typically on a batch (subset) of data. |
| **Checkpoint** | A saved snapshot of the model's parameters at a particular point during training. |
| **Temperature** | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| **Seed** | A value that initializes random number generation. Same seed produces same output, useful for reproducibility. |
| **Dropout** | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| **Fine-tuning** | Additional training on a pre-trained model using a smaller, specialized dataset. |
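
As a concrete illustration of the tokenization terms above, here is a minimal character-level tokenizer in the spirit of nanoGPT's Shakespeare preparation. It is a sketch, not nanoGPT's actual `prepare.py`, and `input.txt` is a placeholder filename.

```python
# Build a character-level tokenizer from a text corpus.
text = open("input.txt", encoding="utf-8").read()  # e.g. the tiny Shakespeare file

chars = sorted(set(text))          # every unique character becomes a token
vocab_size = len(chars)            # ~65 for the Shakespeare corpus
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> string

def encode(s: str) -> list[int]:
    """Turn text into a list of token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Turn a list of token ids back into text."""
    return "".join(itos[i] for i in ids)

print(vocab_size)
print(encode("To be"))
print(decode(encode("To be")))
```

A subword tokenizer such as GPT-2's BPE replaces this fixed per-character mapping with learned merges of frequent pairs, which is why its vocabulary is ~50,000 tokens instead of 65.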
## Section 02: Ollama
| Term | Definition |
|------|-----------|
| **Ollama** | A local runtime for running LLMs on your own machine without cloud APIs. |
| **llama.cpp** | A C++ library for efficient local LLM inference. Ollama builds on it. |
| **GGUF** | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| **Quantization** | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| **FP32 / FP16 / INT8 / Q4** | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. |
| **Logit** | The raw score a model assigns to each possible next token before it is converted to a probability (see the sampling sketch after this table). |
| **Top-k sampling** | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| **Top-p sampling** | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| **System prompt** | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| **Modelfile** | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| **API** | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. |
| **Embedding length** | The dimensionality of a model's internal vector representation of each token. Same idea as `n_embd` in nanoGPT. Larger embedding length captures more meaning at the cost of memory. |
| **Repeat penalty** | A parameter that discourages the model from repeating tokens it has recently produced. Helps avoid loops. |
| **Min-p sampling** | A sampling strategy that keeps tokens whose probability is at least `min_p` times the top token's probability. |
| **Hallucination** | When a model produces confident-looking output that is factually wrong. The base model is doing what it always does (predicting plausible tokens); grounding via retrieval or tool use reduces it. |
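
To see how the sampling knobs above interact, here is a rough NumPy sketch that turns a made-up logit vector into a sampled token using temperature, top-k, and top-p filtering. It is illustrative only; real runtimes such as llama.cpp implement these steps more carefully.

```python
import numpy as np

rng = np.random.default_rng(seed=0)              # fixed seed for reproducibility

logits = np.array([4.0, 3.5, 2.0, 0.5, -1.0])    # raw scores for 5 candidate tokens

def sample(logits, temperature=0.8, top_k=3, top_p=0.95):
    # Temperature rescales the logits: <1 sharpens, >1 flattens the distribution.
    scaled = logits / temperature

    # Softmax converts logits to probabilities.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-k: keep only the k highest-probability tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # Top-p: within those, keep tokens until cumulative probability reaches p.
    cumulative = np.cumsum(probs[keep])
    keep = keep[: int(np.searchsorted(cumulative, top_p)) + 1]

    # Renormalize and sample one token id.
    final = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=final))

print(sample(logits))
```

Min-p sampling would instead drop tokens whose probability falls below `min_p` times the top token's probability before sampling.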
## Section 03: RAG
| Term | Definition |
|------|-----------|
| **RAG (Retrieval-Augmented Generation)** | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| **Chunking** | Splitting documents into shorter segments for embedding. Typical sizes: 256–500 tokens. |
| **Chunk overlap** | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| **Vector store** | An indexed collection of embedded chunks, searchable by vector similarity. |
| **Cosine similarity** | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. |
| **Semantic search** | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| **LlamaIndex** | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. Split since v0.10 into `llama-index-core` plus integration packages. |
| **Settings** | LlamaIndex's global configuration object. Setting `Settings.llm` and `Settings.embed_model` once configures all downstream components (see the sketch after this table). Replaced the deprecated `ServiceContext`. |
| **Node** | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| **Context** | The retrieved chunks passed to the LLM as background information for answering a query. |
| **Generator** | The LLM component in a RAG system that reads retrieved context and composes a response. |
| **Embedding model** | A model whose job is to convert text to vectors. Different from the generator (LLM). We use `BAAI/bge-large-en-v1.5`. |
| **Hugging Face Hub** | A registry of open-source models (embeddings, LLMs, cross-encoders). Models download automatically on first use. |
| **`sentence-transformers`** | A Python library that loads and runs sentence/embedding models from Hugging Face. Used under the hood by LlamaIndex's `HuggingFaceEmbedding`. |
| **`HF_HUB_OFFLINE`** | An environment variable that tells Hugging Face libraries not to check the Hub for updates. Set it (along with `TOKENIZERS_PARALLELISM` and `SENTENCE_TRANSFORMERS_HOME`) *before* importing LlamaIndex, because the libraries read the environment at import time. |
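
Putting the pieces above together, a minimal LlamaIndex (v0.10+) sketch. The `data/` directory, Ollama model name, and timeout are placeholders rather than the workshop's exact settings, and `SENTENCE_TRANSFORMERS_HOME` would be set the same way as the other environment variables.

```python
import os

# These must be set BEFORE importing LlamaIndex: the HF libraries read them at import time.
os.environ["HF_HUB_OFFLINE"] = "1"             # requires the models to be cached already
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Global configuration: embedding model (bi-encoder) and generator (LLM), set once.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)

# Load documents, chunk them into nodes, embed them, and build the vector store.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieval + generation: the top-scoring chunks become the context for the answer.
response = index.as_query_engine(similarity_top_k=4).query("What does the dataset describe?")
print(response)
```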
## Section 04: Semantic Search
| Term | Definition |
|------|-----------|
| **Hybrid retrieval** | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. |
| **Dense retrieval** | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| **Sparse retrieval** | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| **BM25** | "Best Matching 25." A classical keyword-scoring algorithm that ranks documents by term frequency and inverse document frequency, adjusted for document length. |
| **Cross-encoder** | A model that reads query and document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| **Bi-encoder** | A model that encodes query and document separately into vectors, then compares them. Embedding models are bi-encoders. Fast at scale; less accurate per pair than a cross-encoder. |
| **Re-ranking** | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality; see the sketch after this table. |
| **Candidate pool** | The initial set of retrieved chunks before re-ranking narrows them down. |
| **MTEB** | Massive Text Embedding Benchmark. A public leaderboard at https://huggingface.co/spaces/mteb/leaderboard for comparing embedding and re-ranking models. Useful for finding current state-of-the-art. |
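
A rough sketch of the retrieve-then-re-rank pattern using `sentence-transformers` directly. The cross-encoder checkpoint named here is a common public model, not necessarily the one this workshop uses, and the three documents are made up.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "boiling point of ethanol"
docs = [
    "Ethanol boils at 78.37 degrees C at standard atmospheric pressure.",
    "Methanol is a common laboratory solvent.",
    "Distillation separates components by boiling point.",
]

# Stage 1 (bi-encoder / dense retrieval): encode query and documents separately,
# then rank by cosine similarity to form the candidate pool.
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_vecs = bi_encoder.encode(docs, convert_to_tensor=True)
query_vec = bi_encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]
candidates = scores.argsort(descending=True)[:3].tolist()

# Stage 2 (cross-encoder re-ranking): read query and each candidate together
# for a more accurate relevance score, then re-order the pool.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[i]) for i in candidates]
rerank_scores = cross_encoder.predict(pairs)
reranked = [i for _, i in sorted(zip(rerank_scores, candidates), reverse=True)]

print("candidate pool:", candidates)
print("after re-ranking:", reranked)
```

In practice the candidate pool would come from the hybrid retriever (dense + BM25) and be much larger than three documents.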
## Section 05: Tool Use and Agentic Systems
| Term | Definition |
|------|-----------|
| **Agentic system** | A program in which an LLM serves as a natural-language interface to tools, data, and actions. This is what you are using when you interact with ChatGPT, Claude, or Copilot. |
| **Tool calling (function calling)** | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back (see the sketch after this table). The LLM never runs code itself. |
| **Orchestration** | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back, repeat until done. |
| **Memory** | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| **Type hints** | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| **Docstring** | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |
| **LLM-as-interface** | The framing that an LLM in a modern agentic system is the natural-language interface to tools and data, not the engine that produces final answers. The LLM interprets requests and orchestrates; the tools do the work. |
| **Reasoning layer** | The LLM's role in interpreting ambiguous requests, deciding which tool to use, handling unexpected results, and explaining outcomes. Reasoning here is *in language*, not in mathematics. |
| **ReAct** | "Reasoning + Acting." A pattern where the LLM alternates between reasoning steps (in natural language) and tool actions, observing each result before deciding the next step. The default agent type for local models in LlamaIndex. |
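
A minimal sketch of how type hints and docstrings turn a Python function into a tool an agent can call. It assumes LlamaIndex's `FunctionTool` and `ReActAgent` (import paths vary between versions) and an Ollama-served model; the vapor-pressure helper is a made-up example tool.

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.ollama import Ollama

def antoine_pressure(a: float, b: float, c: float, temperature_c: float) -> float:
    """Return vapor pressure in mmHg from Antoine coefficients and temperature in deg C."""
    return 10 ** (a - b / (c + temperature_c))

# The type hints and docstring are what the LLM "sees": they become the tool's schema.
tool = FunctionTool.from_defaults(fn=antoine_pressure)

llm = Ollama(model="llama3.1", request_timeout=120.0)

# ReAct loop: the model reasons in text, emits a tool call, observes the result, repeats.
agent = ReActAgent.from_tools([tool], llm=llm, verbose=True)
print(agent.chat("Use Antoine coefficients A=8.07131, B=1730.63, C=233.426 "
                 "to get water's vapor pressure at 25 degrees C."))
```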
## Section 06: Neural Networks
| Term | Definition |
|------|-----------|
| **Neural network** | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| **Machine learning (ML)** | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| **Forward pass** | Computing the output of a network from its inputs, layer by layer. |
| **Pre-activation** | The weighted sum plus bias before the activation function is applied: `z = w*x + b`. |
| **Activation function** | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| **Hidden layer** | A layer between input and output. Called "hidden" because its values are not directly observed. |
| **Backpropagation** | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of compute per training step. |
| **Gradient** | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| **Gradient descent** | The algorithm for updating parameters: `w = w - learning_rate * gradient` (see the sketch at the end of this section). |
| **Learning rate** | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| **Mean squared error (MSE)** | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| **Cross-entropy loss** | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| **Overfitting** | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| **Train/validation split** | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| **Early stopping** | Saving the model at its lowest validation loss and stopping training there, to prevent overfitting. nanoGPT's `train.py` does the saving half of this: it writes a checkpoint whenever validation loss improves. |
| **Normalization** | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| **Automatic differentiation** | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. |
| **Adam optimizer** | An adaptive learning rate optimizer that adjusts step sizes per parameter. Converges faster than plain gradient descent. |
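
To tie these terms together, a small PyTorch sketch (the toy cubic dataset and layer sizes are arbitrary): a forward pass, MSE loss, automatic differentiation via `loss.backward()`, Adam updates, and a train/validation split with checkpointing at the lowest validation loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                             # seed for reproducibility

# Toy dataset: a noisy cubic, with inputs already normalized to roughly [-1, 1].
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = x**3 - 0.5 * x + 0.05 * torch.randn_like(x)

# Train/validation split.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

# One hidden layer with a tanh activation (the nonlinearity between linear maps).
model = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

best_val = float("inf")
for step in range(2000):
    # Forward pass and loss on the training split.
    loss = loss_fn(model(x_train), y_train)

    # Backpropagation: autograd fills in the gradient of the loss w.r.t. every parameter.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # gradient descent step (Adam variant)

    # Early-stopping style check: keep the parameters with the lowest validation loss.
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val = val_loss
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)
print(f"best validation MSE: {best_val:.4f}")
```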