# Vocabulary

Key terms organized by the section where they are first introduced.

---

## Section 01: nanoGPT

| Term | Definition |
|------|-----------|
| **GPT** | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| **Transformer** | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| **Self-attention** | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| **Token** | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| **Tokenization** | Breaking text into tokens. Character-level (each letter is a token) and subword (byte pair encoding) are common approaches. |
| **Byte pair encoding (BPE)** | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| **Vocabulary size** | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| **Embedding** | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. |
| **Context window** | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). Larger context allows richer understanding but costs more memory and computation. |
| **Attention head** | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. |
| **Parameters** | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| **Weights and biases** | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| **Training** | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| **Inference** | Running a trained model to generate output. Much cheaper than training. |
| **Loss** | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| **Validation** | Testing the model on data it was not trained on, to check whether it generalizes or has memorized the training set. |
| **Epoch** | One complete pass through the training dataset. |
| **Iteration** | One training step, typically on a batch (subset) of data. |
| **Checkpoint** | A saved snapshot of the model's parameters at a particular point during training. |
| **Temperature** | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| **Seed** | A value that initializes random number generation. The same seed produces the same output, useful for reproducibility. |
| **Dropout** | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| **Fine-tuning** | Additional training on a pre-trained model using a smaller, specialized dataset. |

## Section 02: Ollama

| Term | Definition |
|------|-----------|
| **Ollama** | A local runtime for running LLMs on your own machine without cloud APIs. |
| **llama.cpp** | A C++ library for efficient local LLM inference. Ollama builds on it. |
| **GGUF** | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| **Quantization** | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| **FP32 / FP16 / INT8 / Q4** | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. |
| **Logit** | The raw score a model assigns to each possible next token before converting to probabilities. |
| **Top-k sampling** | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| **Top-p sampling** | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| **System prompt** | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| **Modelfile** | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| **API** | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. |

## Section 03: RAG

| Term | Definition |
|------|-----------|
| **RAG (Retrieval-Augmented Generation)** | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| **Chunking** | Splitting documents into shorter segments for embedding. Typical sizes: 256--500 tokens. |
| **Chunk overlap** | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| **Vector store** | An indexed collection of embedded chunks, searchable by vector similarity. |
| **Cosine similarity** | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. |
| **Semantic search** | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| **LlamaIndex** | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. |
| **Node** | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| **Context** | The retrieved chunks passed to the LLM as background information for answering a query. |
| **Generator** | The LLM component in a RAG system that reads retrieved context and composes a response. |

## Section 04: Semantic Search

| Term | Definition |
|------|-----------|
| **Hybrid retrieval** | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. |
| **Dense retrieval** | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| **Sparse retrieval** | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| **BM25** | "Best Matching 25." A classical algorithm that scores documents by term frequency, adjusted for document length. |
| **Cross-encoder** | A model that reads query and document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| **Re-ranking** | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. |
| **Candidate pool** | The initial set of retrieved chunks before re-ranking narrows them down. |

## Section 05: Tool Use and Agentic Systems

| Term | Definition |
|------|-----------|
| **Agentic system** | A program where an LLM serves as a natural-language interface to tools, data, and actions. This is what you are interacting with when you use ChatGPT, Claude, or Copilot. |
| **Tool calling (function calling)** | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself. |
| **Orchestration** | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back, repeat until done. |
| **Memory** | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| **Type hints** | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| **Docstring** | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |

## Section 06: Neural Networks

| Term | Definition |
|------|-----------|
| **Neural network** | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| **Machine learning (ML)** | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| **Forward pass** | Computing the output of a network from its inputs, layer by layer. |
| **Pre-activation** | The weighted sum plus bias before the activation function is applied: z = w*x + b. |
| **Activation function** | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| **Hidden layer** | A layer between input and output. Called "hidden" because its values are not directly observed. |
| **Backpropagation** | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of compute per training step. |
| **Gradient** | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| **Gradient descent** | The algorithm for updating parameters: w = w - learning_rate * gradient. |
| **Learning rate** | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| **Mean squared error (MSE)** | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| **Cross-entropy loss** | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| **Overfitting** | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| **Train/validation split** | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| **Early stopping** | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. This is what nanoGPT's `train.py` does. |
| **Normalization** | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| **Automatic differentiation** | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. |
| **Adam optimizer** | An adaptive learning rate optimizer that adjusts step sizes per parameter. Converges faster than plain gradient descent. |
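Many of the Section 06 terms (forward pass, pre-activation, MSE, gradient, gradient descent, learning rate) fit together in one short loop. Here is a minimal sketch for a single linear "neuron" with an identity activation and hand-derived gradients; the data and learning rate are illustrative choices, not values from the course:

```python
# One parameter pair (w, b) trained by plain gradient descent on MSE.
# PyTorch's loss.backward() would automate the gradient lines below.

def train(xs, ys, learning_rate=0.1, steps=200):
    w, b = 0.0, 0.0  # parameters: one weight, one bias
    n = len(xs)
    loss = float("inf")
    for _ in range(steps):
        # forward pass: pre-activation z = w*x + b (no nonlinearity here)
        preds = [w * x + b for x in xs]
        # mean squared error between predictions and targets
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # gradients of the loss w.r.t. w and b (chain rule, by hand)
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # gradient descent: step opposite the gradient direction
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b, loss

# Data drawn from y = 2x + 1; training should recover w near 2, b near 1.
w, b, loss = train([0, 1, 2, 3], [1, 3, 5, 7])
```

A learning rate of 0.1 converges here; raising it well past 0.2 makes the updates overshoot and oscillate, which is the tradeoff the **Learning rate** entry describes.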