Cleanup edits to module 01 and 05 walkthroughs.
This commit is contained in:
Eric 2026-04-02 12:55:14 -04:00
commit 896570f71c
2 changed files with 17 additions and 19 deletions

@ -85,7 +85,7 @@ drwxr-xr-x 5 furst staff 160 Apr 17 12:44 data/
Here's a quick run-down on some of the files and directories:
- `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into units that a machine learning model can process.)
- `/config` — scripts to train or finetune the model, depending on the tokenization method used.
- `train.py` — a Python script that trains the model. This will build the weights and biases of the transformer.
- `sample.py` — a Python script that runs inference on the model. This is a "prompt" script that will cause the model to begin generating text.
@ -124,7 +124,7 @@ total 6576
-rw-r--r-- 1 furst staff 223080 Apr 17 14:54 val.bin
```
The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format. It can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
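To make the pickle caveat concrete, here is a minimal round-trip sketch. The dictionary keys and values below are hypothetical stand-ins; inspect the real `meta.pkl` to see what it actually stores.

```python
import pickle

# Hypothetical metadata; the real meta.pkl may use different keys.
meta = {"vocab_size": 65, "stoi": {"a": 0}, "itos": {0: "a"}}

with open("meta_demo.pkl", "wb") as f:
    pickle.dump(meta, f)          # serialize the dict to a binary file

with open("meta_demo.pkl", "rb") as f:
    loaded = pickle.load(f)       # only do this with files you trust!

print(loaded["vocab_size"])
```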
> **Exercise 1:** The `prepare.py` script downloads and tokenizes a version of *Tiny Shakespeare*. How big is the text file? Use the command `wc` to find the number of lines, words, and characters. Examine the text with the command `less`.

@ -19,24 +19,20 @@ Build a neural network from scratch to understand the core mechanics behind LLMs
---
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML, built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting* — often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have many millions (even billions) of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
In this section, we step back from language and build a neural network ourselves — small enough to understand every weight, but powerful enough to learn a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.
Our task: fit the ideal gas heat capacity $C^*_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
$$C^*_p(T) = a + bT + cT^2 + dT^3$$
Can a neural network learn this relationship directly from data?
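For comparison, the textbook polynomial fit takes a few lines with `numpy`. The coefficients below are made up for illustration, not NIST values:

```python
import numpy as np

# Illustrative only: invented coefficients, not fitted NIST data.
a, b, c, d = 1.04, 1.0e-5, 2.0e-7, -5.0e-11
T = np.linspace(300.0, 1500.0, 35)         # temperatures in K
Cp = a + b * T + c * T**2 + d * T**3       # exact cubic "data"

# np.polyfit returns coefficients highest degree first: [d, c, b, a]
coeffs = np.polyfit(T, Cp, deg=3)
fit = np.polyval(coeffs, T)
print(np.max(np.abs(fit - Cp)))            # residuals are tiny: the cubic is recovered
```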
## 1. Setup
All dependencies (`numpy`, `torch`, `matplotlib`) are installed by `uv sync`. (See the main [README](../README.md).)
## 2. The data
@ -54,7 +50,7 @@ T_K,Cp_kJ_per_kgK
...
```
The curve is smooth and nonlinear. $C_p$ increases with temperature as molecular vibrational modes become active. This is a good test case. It is simple enough for a small network, but not a straight line.
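Since the file is plain two-column CSV, it loads in one call with `numpy`. The rows below are made-up stand-ins for the NIST data, embedded as a string so the snippet is self-contained:

```python
import io
import numpy as np

# Made-up rows in the same two-column format as the real file.
csv_text = """T_K,Cp_kJ_per_kgK
300,1.041
600,1.075
900,1.122
1200,1.204
"""
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)
print(data["T_K"])            # temperature column
print(data["Cp_kJ_per_kgK"])  # heat capacity column
```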
## 3. Architecture of a one-hidden-layer network
@ -73,13 +69,13 @@ Here's what happens at each step:
$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) introduces **nonlinearity**. Without it, stacking layers would just produce another linear function, no matter how many layers we use.
**Step 2: Output layer.** The output is a weighted sum of the hidden activations:
$$\hat{y} = \sum_j W_j \cdot a_j + b_{\text{out}}$$
This is a linear combination. There is no activation on the output, since we want to predict a continuous value.
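The two steps above can be sketched in a few lines of `numpy`, using randomly initialized parameters (so the output is meaningless until training):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 10

w = rng.normal(size=n_hidden)   # hidden-layer weights w_j
b = rng.normal(size=n_hidden)   # hidden-layer biases b_j
W = rng.normal(size=n_hidden)   # output weights W_j
b_out = rng.normal()            # output bias

def forward(x):
    z = w * x + b               # Step 1: linear transform for each neuron
    a = np.tanh(z)              # Step 1: tanh nonlinearity
    return W @ a + b_out        # Step 2: linear output, no activation

print(forward(0.5))             # a single (untrained) prediction
```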
### Counting parameters
@ -90,9 +86,9 @@ With 10 hidden neurons:
- `b2`: 1 bias (output)
- **Total: 31 parameters**
That's 31 parameters for 35 data points, or almost a 1:1 ratio, which might make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
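A quick sanity check of the count:

```python
n_hidden = 10
n_params = (
    n_hidden      # w1: one weight per hidden neuron (single input)
    + n_hidden    # b1: hidden biases
    + n_hidden    # w2: hidden-to-output weights
    + 1           # b2: output bias
)
print(n_params)  # 31
```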
Compare: the small nanoGPT model from Part I had ~800,000 parameters. GPT-2 has 124 million. The architecture is the same idea, layers of weights and activations, just scaled enormously in order to "fit" language.
## 4. Training
@ -101,7 +97,7 @@ Training means finding the values of all 31 parameters that make the network's p
### Loss function
We need a number that says "how wrong is the network?" for a given set of parameters. The **mean squared error** (MSE) is a natural choice here:
$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
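In code, MSE is a one-liner; a tiny worked example:

```python
import numpy as np

def mse(y_hat, y):
    # average of the squared prediction errors
    return np.mean((y_hat - y) ** 2)

# Two predictions: one exact, one off by 1 -> errors 0 and 1, mean 0.5
print(mse(np.array([1.0, 2.0]), np.array([1.0, 3.0])))  # 0.5
```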
@ -115,6 +111,8 @@ $$\frac{\partial L}{\partial W_j} = \frac{1}{N} \sum_i 2(\hat{y}_i - y_i) \cdot
The numpy implementation in `nn_numpy.py` computes every gradient explicitly. This is the part that PyTorch automates.
It's worth noting that backprop is roughly 2/3 of the compute per training step. This ratio holds fairly consistently from small networks up through large transformers, and is one reason inference (forward pass only) is so much cheaper than training.
### Gradient descent
Once we have the gradients, we update each weight:
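The update rule is easiest to see on a one-parameter toy problem. Here gradient descent minimizes $(w-3)^2$, whose minimum is at $w = 3$:

```python
# Minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
w = 0.0
lr = 0.1
for _ in range(100):
    dw = 2 * (w - 3.0)
    w -= lr * dw     # the update rule: parameter minus lr times gradient
print(w)             # converges toward 3.0
```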
@ -177,7 +175,7 @@ Compare the two scripts side by side. The key differences:
| Weight update | `W -= lr * dW` | `optimizer.step()` |
| Lines of code | ~80 | ~40 |
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand, automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
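A minimal sketch of autograd on a single parameter (not the tutorial's network, just the mechanism):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x - 1.0) ** 2   # L = (wx - 1)^2
loss.backward()             # autograd computes dL/dw = 2*(wx - 1)*x

print(w.grad)               # 2 * (2*3 - 1) * 3 = 30
```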
The `nn.Sequential` definition:
@ -189,7 +187,7 @@ model = nn.Sequential(
)
```
looks simple here, but it uses the same PyTorch building blocks as nanoGPT's `model.py` (`nn.Linear` layers and activation functions), just with more layers, attention mechanisms, and a much larger vocabulary.
> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?
@ -226,7 +224,7 @@ In practice, we combat overfitting with:
## 9. Connecting back to LLMs
Everything you've built in this section scales up to large language models:
| This tutorial | nanoGPT / LLMs |
|---|---|