# Vocabulary

Key terms organized by the section where they are first introduced.


## Section 01: nanoGPT

| Term | Definition |
| --- | --- |
| GPT | Generative Pre-trained Transformer. A model architecture that generates text by predicting the next token in a sequence. |
| Transformer | A neural network architecture that uses self-attention to weigh the importance of different tokens in a sequence. |
| Self-attention | A mechanism that lets the model consider relationships between all tokens in the context window when making predictions. |
| Token | The basic unit a language model operates on. In nanoGPT's Shakespeare experiment, each character is a token (vocab size 65). GPT-2 uses byte pair encoding to create subword tokens (vocab size ~50,000). Llama 3.1 also uses BPE but with a much larger vocabulary of 128,000 tokens. |
| Tokenization | Breaking text into tokens. Character-level (each letter is a token) and subword (byte pair encoding) are common approaches. |
| Byte pair encoding (BPE) | A tokenization method that merges frequently occurring character pairs into single tokens, building an efficient vocabulary. Used by GPT-2, Llama 3, and most modern LLMs. |
| Vocabulary size | The number of unique tokens in the tokenization scheme. Ranges from 65 (character-level Shakespeare) to ~128,000 (Llama 3.1). Larger vocabularies represent text more efficiently but require larger embedding tables. |
| Embedding | A vector representation of a token. Each token ID is mapped to a vector of fixed size (`n_embd`). Similar tokens end up with similar vectors. |
| Context window | The number of tokens the model can "see" when predicting the next one (`block_size` in nanoGPT). A larger context allows richer understanding but costs more memory and compute. |
| Attention head | One of several parallel attention mechanisms in a transformer layer (`n_head`). Each head can learn to attend to different patterns. |
| Parameters | The trainable numbers in a model (weights and biases). A small nanoGPT has ~800K; GPT-2 has 124M; modern LLMs have billions. |
| Weights and biases | The two types of parameters. Weights scale inputs; biases shift them. Together they define what the model has learned. |
| Training | The process of adjusting parameters to minimize loss on a dataset. Requires significant compute. |
| Inference | Running a trained model to generate output. Much cheaper than training. |
| Loss | A number measuring how wrong the model's predictions are. Training aims to minimize it. |
| Validation | Testing the model on data it was not trained on, to check whether it generalizes or has merely memorized the training set. |
| Epoch | One complete pass through the training dataset. |
| Iteration | One training step, typically on a batch (subset) of the data. |
| Checkpoint | A saved snapshot of the model's parameters at a particular point during training. |
| Temperature | A parameter controlling randomness in text generation. Higher values produce more varied output; lower values are more predictable. |
| Seed | A value that initializes random number generation. The same seed produces the same output, which is useful for reproducibility. |
| Dropout | A regularization technique that randomly disables neurons during training to prevent overfitting. |
| Fine-tuning | Additional training of a pre-trained model on a smaller, specialized dataset. |
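
Character-level tokenization can be sketched in a few lines of plain Python. This is a simplified illustration in the spirit of nanoGPT's Shakespeare data preparation; the `encode`/`decode` helper names are illustrative, not any library's API.

```python
# Character-level tokenization: each unique character in the corpus
# becomes one token ID, so the vocabulary is just the sorted character set.
text = "To be, or not to be"

chars = sorted(set(text))                     # the vocabulary (9 unique characters here)
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer token ID
itos = {i: ch for ch, i in stoi.items()}      # integer token ID -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)      # text as a list of token IDs
assert decode(ids) == text  # encoding round-trips losslessly
```

A subword tokenizer like BPE follows the same encode/decode contract, just with merged multi-character tokens and a far larger vocabulary.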

## Section 02: Ollama

| Term | Definition |
| --- | --- |
| Ollama | A local runtime for running LLMs on your own machine without cloud APIs. |
| llama.cpp | A C++ library for efficient local LLM inference. Ollama builds on it. |
| GGUF | A binary model format that packages weights, tokenizer, and metadata into a single file optimized for local inference. |
| Quantization | Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to save memory and speed up inference, with some accuracy tradeoff. |
| FP32 / FP16 / INT8 / Q4 | Precision levels for storing weights: 32-bit float, 16-bit float, 8-bit integer, 4-bit. Lower precision = smaller model, faster inference. |
| Logit | The raw score a model assigns to each possible next token before converting to probabilities. |
| Top-k sampling | A decoding strategy that considers only the k highest-scoring tokens when generating. |
| Top-p sampling | A decoding strategy that considers tokens until their cumulative probability reaches p (e.g., 0.95). |
| System prompt | Instructions that shape the model's behavior, role, or constraints. Set in a Modelfile or at runtime. |
| Modelfile | A configuration file for Ollama that defines a custom model: base model, parameters, and system prompt. |
| API | Application Programming Interface. A defined way for programs to communicate. Ollama provides an API for sending prompts and receiving responses. |
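
The sampling terms above can be illustrated in plain Python. This is a toy sketch over a list of logits; real runtimes like llama.cpp do the same filtering on tensors.

```python
import math

def softmax(logits):
    # Convert raw logits to probabilities (subtract max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def top_p(probs, p):
    # Keep the most probable tokens until their cumulative probability
    # reaches p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]
```

After filtering, the generator samples the next token from the renormalized distribution; temperature (Section 01) would divide the logits before the softmax.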

## Section 03: RAG

| Term | Definition |
| --- | --- |
| RAG (Retrieval-Augmented Generation) | A strategy where relevant documents are retrieved and placed in the prompt before the LLM generates a response, grounding it in specific data. |
| Chunking | Splitting documents into shorter segments for embedding. Typical sizes: 256–500 tokens. |
| Chunk overlap | Tokens shared between consecutive chunks, so sentences at boundaries are not lost. |
| Vector store | An indexed collection of embedded chunks, searchable by vector similarity. |
| Cosine similarity | A measure of similarity between two vectors based on the angle between them. Used to find the most relevant chunks for a query. |
| Semantic search | Search based on meaning rather than exact keyword matching, enabled by embeddings. |
| LlamaIndex | A Python framework for building RAG systems: chunking, embedding, indexing, and querying. |
| Node | In LlamaIndex, a parsed text segment ready for embedding and indexing. |
| Context | The retrieved chunks passed to the LLM as background information for answering a query. |
| Generator | The LLM component in a RAG system that reads retrieved context and composes a response. |
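
Two of the mechanics above, chunking with overlap and cosine similarity, fit in a short sketch. Frameworks like LlamaIndex handle this internally; the helper names here are illustrative.

```python
import math

def chunk(tokens, size, overlap):
    # Split a token sequence into chunks of `size`, each sharing
    # `overlap` tokens with the previous chunk so boundary sentences survive.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|): 1.0 means same direction, 0.0 orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A vector store embeds every chunk once, then answers a query by embedding it and returning the chunks with the highest cosine similarity.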

## Section 04: Semantic Search

| Term | Definition |
| --- | --- |
| Hybrid retrieval | Combining vector similarity (semantic) and keyword matching (BM25) to catch both meaning-based and exact-term matches. |
| Dense retrieval | Vector-based search using embeddings. Good at finding semantically similar text even with different wording. |
| Sparse retrieval | Keyword-based search (like BM25). Good at finding exact names, dates, and technical terms. |
| BM25 | "Best Matching 25." A classical algorithm that scores documents by term frequency, adjusted for document length. |
| Cross-encoder | A model that reads the query and a document together to produce a relevance score. More accurate than embeddings alone, but slower. |
| Re-ranking | A second pass that scores a candidate pool more carefully (typically with a cross-encoder) to improve retrieval quality. |
| Candidate pool | The initial set of retrieved chunks before re-ranking narrows them down. |
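
One common way to merge the dense and sparse result lists in hybrid retrieval is reciprocal rank fusion (RRF). RRF is not covered above, so treat this as an illustrative sketch of the general idea, not the workshop's method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of document IDs, e.g. one from vector search
    # and one from BM25. Each document earns 1/(k + rank) per list it
    # appears in; k=60 is the damping constant from the original RRF paper.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first: documents ranked well by both retrievers win.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that cosine scores and BM25 scores live on incompatible scales; the fused list then becomes the candidate pool for re-ranking.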

## Section 05: Tool Use and Agentic Systems

| Term | Definition |
| --- | --- |
| Agentic system | A program where an LLM serves as a natural-language interface to tools, data, and actions. ChatGPT, Claude, and Copilot are all agentic systems. |
| Tool calling (function calling) | The LLM generates a structured request to call a function; the surrounding system executes it and feeds the result back. The LLM never runs code itself. |
| Orchestration | The control loop in an agentic system: the LLM decides what to do, the system does it, the result comes back; repeat until done. |
| Memory | Stored conversation history re-injected into prompts to maintain context across turns. The LLM itself is stateless; memory is managed by the system. |
| Type hints | Python annotations specifying parameter and return types. Used by tool-calling systems to understand function signatures. |
| Docstring | Documentation inside a Python function describing what it does. Tool-calling systems use docstrings to explain tools to the LLM. |
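
To see why type hints and docstrings matter, here is a sketch of how a tool-calling system might describe a Python function to the LLM using only introspection. The `get_weather` function and the schema shape are hypothetical, not any specific library's format.

```python
import inspect

def get_weather(city: str, units: str = "celsius") -> str:
    """Look up the current temperature for a city."""
    return f"22 degrees {units} in {city}"  # hypothetical stub, no real API call

def tool_schema(fn):
    # Build a tool description from the function's signature and docstring:
    # the type hints give parameter types, the docstring explains the tool.
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {
            "type": p.annotation.__name__ if p.annotation is not inspect.Parameter.empty else "any",
            "required": p.default is inspect.Parameter.empty,  # no default = required
        }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": params,
    }
```

The schema is what the LLM actually sees; when it emits a structured call like `{"name": "get_weather", "arguments": {"city": "Boston"}}`, the orchestration loop runs the real function and feeds the result back.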

## Section 06: Neural Networks

| Term | Definition |
| --- | --- |
| Neural network | A model made of layers of neurons connected by weights and biases. Learns by adjusting these parameters to minimize a loss function. |
| Machine learning (ML) | Training models to learn patterns from data rather than programming rules by hand. LLMs are one example. |
| Forward pass | Computing the output of a network from its inputs, layer by layer. |
| Pre-activation | The weighted sum plus bias before the activation function is applied: `z = w*x + b`. |
| Activation function | A nonlinear function (tanh, ReLU, sigmoid) applied after the pre-activation. Without it, stacking layers would just produce another linear function. |
| Hidden layer | A layer between input and output. Called "hidden" because its values are not directly observed. |
| Backpropagation | Computing how each parameter affects the loss by applying the chain rule backward through the network. About 2/3 of the compute per training step. |
| Gradient | The partial derivative of the loss with respect to a parameter. Points in the direction of steepest increase; we step the opposite way. |
| Gradient descent | The algorithm for updating parameters: `w = w - learning_rate * gradient`. |
| Learning rate | How big each gradient descent step is. Too large: training oscillates. Too small: training is slow. |
| Mean squared error (MSE) | A loss function: the average squared difference between predictions and targets. The same metric used in curve fitting. |
| Cross-entropy loss | A loss function for classification (predicting one of many categories). Used in LLMs for next-token prediction. |
| Overfitting | When a model memorizes training data (including noise) instead of learning the underlying pattern. Detected by rising validation loss. |
| Train/validation split | Holding out some data to test generalization. The model trains on one set and is evaluated on the other. |
| Early stopping | Saving the model at the lowest validation loss and stopping training there. Prevents overfitting. This is what nanoGPT's train.py does. |
| Normalization | Scaling inputs and outputs to a standard range (e.g., [0, 1]) before training, so gradients are well-behaved across features. |
| Automatic differentiation | PyTorch's ability to compute all gradients automatically via `loss.backward()`, replacing hand-coded backpropagation. |
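
The gradient-descent terms above become concrete in a one-parameter example: fitting `y = w*x` to three points, with the gradient of the MSE worked out by the chain rule by hand. This is a toy sketch; real training computes the same gradients via automatic differentiation.

```python
# Fit y = w * x to data generated by y = 2x, using gradient descent on MSE.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse(w):
    # Mean squared error between predictions w*x and targets y.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    # dL/dw = 2 * mean(x * (w*x - y)): the chain rule applied by hand.
    return 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                  # initial guess
learning_rate = 0.05
for _ in range(200):     # 200 iterations of the update rule
    w = w - learning_rate * grad(w)
```

After 200 steps `w` has converged very close to the true slope of 2; raising `learning_rate` too far makes the updates overshoot and oscillate, which is the tradeoff named in the Learning rate entry above.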