Cleanup edits to module 01 and 05 walkthroughs.
This commit is contained in:
Eric 2026-04-02 12:55:14 -04:00
commit 896570f71c
2 changed files with 17 additions and 19 deletions


@ -19,24 +19,20 @@ Build a neural network from scratch to understand the core mechanics behind LLMs
---
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML, built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting* — often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have millions of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML, built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting* — often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have many millions (even billions) of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
In this section, we step back from language and build a neural network ourselves — small enough to understand every weight, but powerful enough to learn a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.
Our task: fit the heat capacity $C_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
Our task: fit the ideal gas heat capacity $C^*_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
$$C_p(T) = a + bT + cT^2 + dT^3$$
$$C^*_p(T) = a + bT + cT^2 + dT^3$$
Can a neural network learn this relationship directly from data?
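For comparison, the textbook-style cubic fit is a one-liner with numpy. This sketch uses a few hypothetical $(T, C_p)$ points standing in for the NIST table:

```python
import numpy as np

# Hypothetical (T, Cp) points standing in for the NIST table
T = np.array([300.0, 500.0, 700.0, 900.0, 1100.0])   # K
Cp = np.array([1.040, 1.056, 1.098, 1.146, 1.187])   # kJ/(kg·K)

# np.polyfit returns coefficients highest-degree first: [d, c, b, a]
d, c, b, a = np.polyfit(T, Cp, deg=3)
Cp_fit = a + b * T + c * T**2 + d * T**3
```

The neural network will have to discover an equivalent smooth curve without being told the functional form.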
## 1. Setup
Use the virtual environment from Part I — `numpy` and `torch` are already installed. You may need to add `matplotlib`:
```bash
pip install matplotlib
```
All dependencies (`numpy`, `torch`, `matplotlib`) are installed by `uv sync`. (See the main [README](../README.md).)
## 2. The data
@ -54,7 +50,7 @@ T_K,Cp_kJ_per_kgK
...
```
The curve is smooth and nonlinear $C_p$ increases with temperature as molecular vibrational modes become active. This is a good test case: simple enough for a small network, but not a straight line.
The curve is smooth and nonlinear. $C_p$ increases with temperature as molecular vibrational modes become active. This is a good test case. It is simple enough for a small network, but not a straight line.
## 3. Architecture of a one-hidden-layer network
@ -73,13 +69,13 @@ Here's what happens at each step:
$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) introduces **nonlinearity** — without it, stacking layers would just produce another linear function, no matter how many layers we use.
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) introduces **nonlinearity**. Without it, stacking layers would just produce another linear function, no matter how many layers we use.
**Step 2: Output layer.** The output is a weighted sum of the hidden activations:
$$\hat{y} = \sum_j W_j \cdot a_j + b_{\text{out}}$$
This is a linear combination no activation on the output, since we want to predict a continuous value.
This is a linear combination. There is no activation on the output, since we want to predict a continuous value.
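The two steps can be sketched in a few lines of numpy. The weights here are random placeholders; training (below) is what finds good values:

```python
import numpy as np

# Forward pass with hypothetical weights (training would learn these)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 10)), np.zeros(10)   # hidden layer: 10 neurons
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)    # output layer

x = np.array([[300.0]])       # one temperature input, shape (1, 1)
a = np.tanh(x @ W1 + b1)      # Step 1: z_j = w_j·x + b_j, then tanh
y_hat = a @ W2 + b2           # Step 2: linear combination, no activation
```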
### Counting parameters
@ -90,9 +86,9 @@ With 10 hidden neurons:
- `b2`: 1 bias (output)
- **Total: 31 parameters**
That's 31 parameters for 35 data points — almost a 1:1 ratio, which should make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
That's 31 parameters for 35 data points, or almost a 1:1 ratio, which might make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
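You can confirm the count by iterating over the parameters of a PyTorch model with the same layout:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
n_params = sum(p.numel() for p in model.parameters())
# (1·10 + 10) + (10·1 + 1) = 31
```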
Compare: the small nanoGPT model from Part I had ~800,000 parameters. GPT-2 has 124 million. The architecture is the same idea — layers of weights and activations — just scaled enormously.
Compare: the small nanoGPT model from Part I had ~800,000 parameters. GPT-2 has 124 million. The architecture is the same idea (layers of weights and activations), just scaled enormously in order to "fit" language.
## 4. Training
@ -101,7 +97,7 @@ Training means finding the values of all 31 parameters that make the network's p
### Loss function
We need a number that says "how wrong is the network?" The **mean squared error** (MSE) is a natural choice:
We need a number that says "how wrong is the network?" for a given set of parameters. The **mean squared error** (MSE) is a natural choice here:
$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
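In code, the MSE is one line. The values here are hypothetical:

```python
import numpy as np

y = np.array([1.04, 1.06, 1.10])      # measured Cp values (hypothetical)
y_hat = np.array([1.05, 1.04, 1.12])  # network predictions (hypothetical)
L = np.mean((y_hat - y) ** 2)         # mean squared error
```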
@ -115,6 +111,8 @@ $$\frac{\partial L}{\partial W_j} = \frac{1}{N} \sum_i 2(\hat{y}_i - y_i) \cdot
The numpy implementation in `nn_numpy.py` computes every gradient explicitly. This is the part that PyTorch automates.
It's worth noting that backprop is roughly 2/3 of the compute per training step. This ratio holds fairly consistently from small networks up through large transformers, and is one reason inference (forward pass only) is so much cheaper than training.
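A standard sanity check for hand-derived gradients is a finite-difference comparison. This sketch checks the output-weight gradient formula on a tiny hypothetical example:

```python
import numpy as np

# Tiny hypothetical example: 2 samples, 2 hidden activations
a = np.array([[0.5, -0.3], [0.1, 0.8]])   # hidden activations a_j per sample
y = np.array([1.0, 2.0])                  # targets
W = np.array([0.4, -0.2])                 # output weights W_j

def loss(W):
    return np.mean((a @ W - y) ** 2)

# Analytic gradient: (1/N) Σ_i 2(ŷ_i − y_i) a_j
grad = (2 / len(y)) * ((a @ W - y) @ a)

# Finite-difference check on W_0
eps = 1e-6
W_plus = W.copy()
W_plus[0] += eps
fd = (loss(W_plus) - loss(W)) / eps   # should match grad[0] closely
```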
### Gradient descent
Once we have the gradients, we update each weight:
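The update rule itself is one line per parameter. A minimal sketch with hypothetical values:

```python
import numpy as np

lr = 0.01                     # learning rate (a hypothetical value)
W = np.array([0.4, -0.2])     # one parameter vector
dW = np.array([0.1, -0.3])    # its gradient from backprop
W -= lr * dW                  # step opposite the gradient, scaled by lr
```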
@ -177,7 +175,7 @@ Compare the two scripts side by side. The key differences:
| Weight update | `W -= lr * dW` | `optimizer.step()` |
| Lines of code | ~80 | ~40 |
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand, automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
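Here is `loss.backward()` in miniature, on a single scalar rather than the full network:

```python
import torch

# Autodiff on a scalar function: torch builds the gradient for us
x = torch.tensor([2.0], requires_grad=True)
loss = (3 * x ** 2).sum()   # L = 3x², so dL/dx = 6x
loss.backward()             # fills x.grad with dL/dx = 12 at x = 2
```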
The `nn.Sequential` definition:
@ -189,7 +187,7 @@ model = nn.Sequential(
)
```
looks simple here, but it's the same API used in nanoGPT's `model.py` just with more layers, attention mechanisms, and a much larger vocabulary.
looks simple here, but it uses the same PyTorch building blocks as nanoGPT's `model.py` (`nn.Linear` layers and activation functions), just with more layers, attention mechanisms, and a much larger vocabulary.
> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?
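One way to set up Exercise 3, sketched (the training loop itself is unchanged and not shown):

```python
import torch.nn as nn

# Build the same network once per candidate activation, then retrain each
activations = [nn.Tanh(), nn.ReLU(), nn.Sigmoid()]
models = [
    nn.Sequential(nn.Linear(1, 10), act, nn.Linear(10, 1))
    for act in activations
]
# ...retrain each model with the same loop and compare the fitted curves
```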
@ -226,7 +224,7 @@ In practice, we combat overfitting with:
## 9. Connecting back to LLMs
Everything you've built here scales up to large language models:
Everything you've built in this section scales up to large language models:
| This tutorial | nanoGPT / LLMs |
|---|---|