# Vocabulary
Key terms organized by the section where they are first introduced.
## Section 01: nanoGPT
| Term | Definition |
|---|---|
| GPT | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| Transformer | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| Self-attention | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| Token | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| Tokenization | Breaking text into tokens. Character-level (each letter is a token) or subword (byte pair encoding) are common approaches. |
| Byte pair encoding (BPE) | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| Vocabulary size | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| Embedding | A vector representation of a token. Each token ID is mapped to a vector of fixed size (n_embd). Similar tokens end up with similar vectors. |
| Context window | The number of tokens the model can "see" when predicting the next one (block_size in nanoGPT). A larger context allows richer understanding but costs more memory and compute, since attention compares every token against every other. |
| Attention head | One of several parallel attention mechanisms in a transformer layer (n_head). Each head can learn to attend to different patterns. |
| Parameters | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| Weights and biases | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| Training | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| Inference | Running a trained model to generate output. Much cheaper than training. |
| Loss | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| Validation | Testing the model on data it was not trained on, to check whether it generalizes or has memorized the training set. |
| Epoch | One complete pass through the training dataset. |
| Iteration | One training step, typically on a batch (subset) of data. |
| Checkpoint | A saved snapshot of the model's parameters at a particular point during training. |
| Temperature | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| Seed | A value that initializes random number generation. Same seed produces same output, useful for reproducibility. |
| Dropout | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| Fine-tuning | Additional training on a pre-trained model using a smaller, specialized dataset. |
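Temperature is easy to see in a few lines of Python. This is a minimal sketch, not nanoGPT's actual sampling code: it scales made-up token scores by the temperature before converting them to probabilities.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up raw scores for three tokens
cold = softmax_with_temperature(scores, 0.1)  # near-deterministic
warm = softmax_with_temperature(scores, 2.0)  # flatter, more varied
```

At low temperature nearly all the probability mass lands on the top token; at high temperature the distribution flattens and generation gets more varied.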
## Section 02: Ollama
| Term | Definition |
|---|---|
| Ollama | A local runtime for running LLMs on your own machine without cloud APIs. |
| llama.cpp | A C++ library for efficient local LLM inference. Ollama builds on it. |
| GGUF | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| Quantization | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| FP32 / FP16 / INT8 / Q4 | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. |
| Logit | The raw score a model assigns to each possible next token before converting to probabilities. |
| Top-k sampling | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| Top-p sampling | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| System prompt | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| Modelfile | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| API | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. |
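Quantization can be sketched in a few lines. This is a toy symmetric integer scheme, not the actual GGUF formats, but it shows the core trade: store one float scale factor plus small integers instead of full-precision floats, and accept a small rounding error.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: store weights as ints in
    [-127, 127] plus one float scale factor for the group."""
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored ints."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.01]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

The round-trip error is bounded by half the scale factor per weight, which is the accuracy tradeoff the glossary entry describes.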
## Section 03: RAG
| Term | Definition |
|---|---|
| RAG (Retrieval-Augmented Generation) | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| Chunking | Splitting documents into shorter segments for embedding. Typical sizes: 256–500 tokens. |
| Chunk overlap | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| Vector store | An indexed collection of embedded chunks, searchable by vector similarity. |
| Cosine similarity | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. |
| Semantic search | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| LlamaIndex | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. |
| Node | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| Context | The retrieved chunks passed to the LLM as background information for answering a query. |
| Generator | The LLM component in a RAG system that reads retrieved context and composes a response. |
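Two of the mechanics above, chunk overlap and cosine similarity, fit in a short sketch. This is illustrative only (LlamaIndex handles both for you); the chunk sizes and vectors are made up.

```python
import math

def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into chunks that share `overlap` tokens
    at each boundary, so sentences straddling a cut are not lost."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

def cosine_similarity(a, b):
    """Similarity between two vectors, based on the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

chunks = chunk_tokens(list(range(600)))  # 600 fake token IDs
```

Consecutive chunks share exactly `overlap` tokens, and parallel vectors score 1.0 while orthogonal vectors score 0.0, which is how the vector store ranks chunks for a query.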
## Section 04: Semantic Search
| Term | Definition |
|---|---|
| Hybrid retrieval | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. |
| Dense retrieval | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| Sparse retrieval | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| BM25 | "Best Matching 25." A classical algorithm that scores documents by term frequency and inverse document frequency, adjusted for document length. |
| Cross-encoder | A model that reads query and document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| Re-ranking | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. |
| Candidate pool | The initial set of retrieved chunks before re-ranking narrows them down. |
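The idea behind hybrid retrieval can be sketched as a score blend. This is a simplified illustration with made-up scores, not a real BM25 implementation: normalize the dense and sparse scores so they are comparable, then rank by a weighted combination.

```python
def normalize(scores):
    """Min-max normalize so dense and sparse scores share a [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(dense_scores, sparse_scores, alpha=0.5):
    """Rank documents by a weighted blend of dense (vector) and
    sparse (keyword) scores. alpha=1.0 is pure dense retrieval."""
    dense = normalize(dense_scores)
    sparse = normalize(sparse_scores)
    combined = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

# made-up scores for three documents: doc 0 wins on meaning alone,
# doc 1 on keywords alone, doc 2 is solid on both
ranking = hybrid_rank([0.9, 0.2, 0.7], [0, 4, 3])
```

With an even blend, the document that does reasonably well on both signals tops the ranking, which is the point of combining dense and sparse retrieval.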
## Section 05: Tool Use and Agentic Systems
| Term | Definition |
|---|---|
| Agentic system | A program in which an LLM serves as a natural-language interface to tools, data, and actions. Chat products such as ChatGPT, Claude, and Copilot are built this way. |
| Tool calling (function calling) | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself. |
| Orchestration | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back, repeat until done. |
| Memory | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| Type hints | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| Docstring | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |
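The tool-calling pattern can be shown in miniature. This sketch hard-codes the structured request that a model would normally emit (the tool name and canned weather string are invented for illustration); the key point is that the surrounding system, not the LLM, executes the function.

```python
import json

def get_weather(city: str) -> str:
    """Return the weather for a city (a canned stand-in for a real API)."""
    return f"Sunny in {city}"

# the system's tool registry: tool name -> Python function
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: str) -> str:
    """Execute a structured tool request of the form an LLM emits."""
    call = json.loads(tool_call)
    return TOOLS[call["tool"]](**call["args"])

# in a real orchestration loop this JSON would come from the model's output,
# and the returned string would be fed back into the next prompt
result = dispatch('{"tool": "get_weather", "args": {"city": "Oslo"}}')
```

The function's type hints and docstring are exactly what tool-calling systems serialize and show to the model so it knows when and how to request the call.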
## Section 06: Neural Networks
| Term | Definition |
|---|---|
| Neural network | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| Machine learning (ML) | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| Forward pass | Computing the output of a network from its inputs, layer by layer. |
| Pre-activation | The weighted sum plus bias before the activation function is applied: z = w*x + b. |
| Activation function | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| Hidden layer | A layer between input and output. Called "hidden" because its values are not directly observed. |
| Backpropagation | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of compute per training step. |
| Gradient | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| Gradient descent | The algorithm for updating parameters: w = w - learning_rate * gradient. |
| Learning rate | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| Mean squared error (MSE) | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| Cross-entropy loss | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| Overfitting | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| Train/validation split | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| Early stopping | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. nanoGPT's train.py applies a related idea: it checkpoints whenever validation loss improves, so the best model is preserved even as training continues. |
| Normalization | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| Automatic differentiation | PyTorch's ability to compute all gradients automatically via loss.backward(), replacing hand-coded backpropagation. |
| Adam optimizer | An adaptive learning rate optimizer that adjusts step sizes per parameter. Converges faster than plain gradient descent. |
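Several of the terms above (forward pass, gradient, learning rate, MSE, gradient descent) come together in a tiny example. This is a hand-derived sketch for a one-parameter linear fit, the kind of computation that loss.backward() and an optimizer automate in PyTorch; the data and learning rate are made up.

```python
# fit y = w * x to toy data by gradient descent on MSE; the true slope is 2
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # the single trainable parameter
learning_rate = 0.05

for _ in range(200):
    # forward pass: predictions from the current parameter
    preds = [w * x for x in xs]
    # gradient of MSE wrt w, by the chain rule: mean of 2 * (pred - target) * x
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # gradient descent update: step against the gradient
    w = w - learning_rate * grad
```

After a few hundred iterations w converges to the true slope. Too large a learning rate would make the updates overshoot and oscillate; too small a one would need far more iterations, which is the tradeoff the learning-rate entry describes.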