NN lecture updates
- add noisy data fit to README
- add noisy data notebook
- add noisy standalone python script
- References and edits to README
This commit is contained in: parent 896570f71c, commit 2902e34256
3 changed files with 388 additions and 44 deletions
Build a neural network from scratch to understand the core mechanics behind LLMs.
---
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting*, often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have many millions (even billions) of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
In this section, we step back from language and build a neural network ourselves. It will be small enough to understand every weight, but powerful enough to "learn" a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.
Our task: fit the ideal gas heat capacity $C^*_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
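As an illustrative baseline (a sketch with synthetic stand-in data, since the textbook's coefficients and the exact polynomial are not reproduced here), a cubic polynomial can be fit with `numpy.polyfit`:

```python
import numpy as np

# Hypothetical stand-in for the tabulated Cp*(T) values (J/mol-K);
# the real numbers come from the NIST WebBook, not from this formula.
T = np.linspace(300.0, 2000.0, 35)
cp = 28.9 + 2.5e-3 * T + 1.2e-6 * T**2

# Fit a cubic, the same functional form the textbooks use
coeffs = np.polyfit(T, cp, deg=3)   # highest-degree coefficient first
cp_fit = np.polyval(coeffs, T)

max_err = np.max(np.abs(cp_fit - cp))
print(f"max fit error: {max_err:.2e} J/mol-K")
```

Because the stand-in data is itself polynomial, the fit is essentially exact; on the real NIST data a cubic leaves small but visible residuals.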
Can a neural network learn this relationship directly from data?
All dependencies (`numpy`, `torch`, `matplotlib`) are installed by `uv sync`. (See the main [README](../README.md).)
### Notebooks and scripts
The hands-on work for this section lives in two Jupyter notebooks:
- **`nn_workshop.ipynb`** — build and train the network (polynomial baseline, numpy from scratch, PyTorch)
- **`nn_noisy_workshop.ipynb`** — add noise, observe overfitting, learn about train/validation splits and early stopping
Open them with `jupyter notebook` or in VS Code. The notebooks are designed to be worked through in class, with discussion prompts at key points.
Standalone Python scripts (`nn_numpy.py`, `nn_torch.py`, `nn_noisy.py`) contain the same code as the notebooks in a clean, single-file format. These are useful as a reference.
## 2. The data
The file `data/n2_cp.csv` contains 35 data points: the isobaric heat capacity of N₂ gas at 1 bar from 300 K to 2000 K, from the NIST WebBook.
Here's what happens at each step:
$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) acts on the pre-activation value $z_j$ and introduces **nonlinearity**. Without it, stacking layers would just produce another linear function, no matter how many layers we use.
**Step 2: Output layer.** The output is a weighted sum of the hidden activations:
This is a linear combination. There is no activation on the output, since we want the prediction to be able to take any value.
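The two steps can be sketched directly in numpy (shapes assumed from the text: one input feature, H hidden neurons, one output):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                                     # hidden neurons

# Parameters, small random initialization
W1 = rng.normal(scale=0.5, size=(1, H))    # input -> hidden weights
b1 = np.zeros(H)                           # hidden biases
W2 = rng.normal(scale=0.5, size=(H, 1))    # hidden -> output weights
b2 = np.zeros(1)                           # output bias

def forward(x):
    """x has shape (N, 1); returns predictions of shape (N, 1)."""
    z = x @ W1 + b1        # pre-activations z_j, shape (N, H)
    a = np.tanh(z)         # hidden activations a_j
    return a @ W2 + b2     # linear output, no activation

y_hat = forward(np.linspace(0, 1, 5).reshape(-1, 1))
print(y_hat.shape)         # (5, 1)
```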
### Counting parameters
With 10 hidden neurons:
- `W1`: 10 weights, $w_j$ (input -> hidden)
- `b1`: 10 biases (hidden)
- `W2`: 10 weights, $W_j$ (hidden -> output)
- `b2`: 1 bias (output)
- **Total: 31 parameters**
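The count follows directly from the array shapes (a quick sanity check):

```python
import numpy as np

H = 10
W1 = np.zeros((1, H))   # input -> hidden weights
b1 = np.zeros(H)        # hidden biases
W2 = np.zeros((H, 1))   # hidden -> output weights
b2 = np.zeros(1)        # output bias

total = sum(p.size for p in (W1, b1, W2, b2))
print(total)            # 31
```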
Training means finding the values of all 31 parameters that make the network's predictions match the data as closely as possible.
### Loss function
We need a number that says "how wrong is the network?" for a given set of parameters. The **mean squared error** (MSE) is a natural choice here:
$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
This is the same MSE you've used in non-linear curve fitting — the sum of squared residuals divided by the number of points. The only difference is that here the "model" is a neural network instead of a polynomial or equation of state. It is also the same kind of loss we watched decrease during nanoGPT training in Part I (though nanoGPT uses cross-entropy loss, which is appropriate for classification over a vocabulary).
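In code, the loss is one line; comparing it against the hand-computed sum of squared residuals makes the equivalence explicit (the data values here are illustrative, not from the CSV):

```python
import numpy as np

y     = np.array([29.1, 29.2, 29.5])   # data (illustrative values)
y_hat = np.array([29.0, 29.4, 29.3])   # network predictions

mse = np.mean((y_hat - y) ** 2)

# Same thing, written as in curve fitting: SSR / N
ssr = np.sum((y_hat - y) ** 2)
assert np.isclose(mse, ssr / len(y))
print(mse)
```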
### Backpropagation
One full pass through these three steps (forward -> loss -> backward -> update) is called one **epoch** of training.
In nanoGPT, the training loop in `train.py` does exactly the same thing, but with the AdamW optimizer (a fancier version of gradient descent) and batches of data instead of the full dataset.
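Putting the pieces together, a minimal training loop looks like this (a from-scratch sketch in the spirit of the workshop code, not the exact file contents; the data here is a synthetic normalized stand-in for the $(T, C_p)$ points):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic normalized data standing in for (T, Cp); shape (N, 1)
x = np.linspace(0, 1, 35).reshape(-1, 1)
y = 0.3 + 0.5 * x**2                      # smooth target, illustrative only

H, lr = 10, 0.2
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

N = len(x)
for epoch in range(5000):
    # Forward pass
    z1 = x @ W1 + b1
    a1 = np.tanh(z1)
    y_hat = a1 @ W2 + b2

    # Loss (MSE)
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass (chain rule, by hand)
    d_yhat = 2.0 * (y_hat - y) / N        # dL/dy_hat
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (1.0 - a1**2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.6f}")
```

Every line of the backward pass is just the chain rule applied to the forward pass above it; this is exactly what `loss.backward()` automates in the PyTorch version.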
## 5. The numpy version
Work through sections 1–3 of `nn_workshop.ipynb` to build and train the network from scratch in numpy. You should see the training loss drop rapidly in the first 1000 epochs before leveling off, and the network's prediction closely tracking the NIST data points.
```
Epoch 0 Loss: 0.283941
Epoch 500 Loss: 0.001253
Epoch 1000 Loss: 0.000412
Epoch 1500 Loss: 0.000178
Epoch 2000 Loss: 0.000082
Epoch 2500 Loss: 0.000040
Epoch 3000 Loss: 0.000021
Epoch 3500 Loss: 0.000012
Epoch 4000 Loss: 0.000008
Epoch 4500 Loss: 0.000005
...
Epoch 4999 Loss: 0.000004

Final loss: 0.000004
Network: 1 input -> 10 hidden (tanh) -> 1 output
Total parameters: 31
```
> **Exercise 1:** Read through the numpy training cells carefully. Identify where each of the following happens: (a) forward pass, (b) loss calculation, (c) backpropagation, (d) gradient descent update.
> **Exercise 2:** Change the number of hidden neurons `H`. Try 2, 5, 10, 20, 50. How does the fit change? How many parameters does each network have? At what point does adding more neurons stop helping?
## 6. The PyTorch version
Now work through section 4 of `nn_workshop.ipynb`. The same network, but in about half the code. Compare the numpy and PyTorch cells side by side. The key differences:
| | numpy version | PyTorch version |
|---|---|---|
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand, automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
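A minimal PyTorch loop in that style (a sketch; the variable names and the synthetic data are illustrative, not copied from the notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

H = 10
model = nn.Sequential(nn.Linear(1, H), nn.Tanh(), nn.Linear(H, 1))

# Synthetic normalized stand-in for the (T, Cp) data
x = torch.linspace(0, 1, 35).reshape(-1, 1)
y = 0.3 + 0.5 * x**2

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)

for epoch in range(5000):
    optimizer.zero_grad()         # clear old gradients
    loss = loss_fn(model(x), y)   # forward pass + MSE loss
    loss.backward()               # autodiff: all 31 gradients at once
    optimizer.step()              # gradient descent update

print(f"final loss: {loss.item():.6f}")
```

Note how the by-hand backward pass of the numpy version collapses into the single `loss.backward()` call.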
The `nn.Sequential` definition uses the same PyTorch building blocks as nanoGPT's `model.py` (`nn.Linear` layers and activation functions), just with more layers, attention mechanisms, and a much larger vocabulary.
```python
model = nn.Sequential(
    nn.Linear(1, H),  # input -> hidden (W1, b1)
    nn.Tanh(),        # activation
    nn.Linear(H, 1),  # hidden -> output (W2, b2)
)
```
Section 5 of the notebook compares all three approaches (polynomial, numpy NN, PyTorch NN) on the same plot, and section 6 tests how they extrapolate outside the training range.
> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?
Both scripts normalize the input ($T$) and output ($C_p$) to the range [0, 1] before training.
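Min-max normalization and its inverse are a few lines (the `T_raw`/`Cp_raw` names follow the notebook's convention; the values here are illustrative):

```python
import numpy as np

T_raw  = np.linspace(300.0, 2000.0, 35)     # K, illustrative
Cp_raw = 29.0 + 0.004 * (T_raw - 300.0)     # J/mol-K, illustrative

def normalize(v):
    """Map values to [0, 1]; also return (min, range) to undo the scaling."""
    lo, span = v.min(), v.max() - v.min()
    return (v - lo) / span, (lo, span)

T, (T_lo, T_span) = normalize(T_raw)
Cp, (Cp_lo, Cp_span) = normalize(Cp_raw)

# Undo the scaling to plot predictions in physical units
Cp_back = Cp * Cp_span + Cp_lo
assert np.allclose(Cp_back, Cp_raw)
print(T.min(), T.max())   # 0.0 1.0
```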
Try it yourself:
> **Exercise 5:** In the notebook, comment out the normalization (use `T_raw` and `Cp_raw` directly). What happens to the training loss? Can you fix it by changing the learning rate?
## 8. Overfitting
With 31 parameters and 35 data points, our network is close to the edge. What happens when a model has enough capacity to memorize its training data?
This is **overfitting** — the network memorizes the training data but fails to generalize. It's the same concept we discussed in Part I when nanoGPT's validation loss started increasing while the training loss kept decreasing.
### Overfitting with noisy data
The clean NIST data masks the overfitting problem. The network learns a smooth function because the data *is* smooth. Real experimental data has noise. What happens then?
Open **`nn_noisy_workshop.ipynb`** to explore this. The notebook adds Gaussian noise to the $C_p$ data and introduces a **train/validation split**: 26 points for training, 9 held out for validation.
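The noise injection and the split can be sketched like this (illustrative; the notebook's exact handling of `noise_scale` and the split may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized stand-ins for the 35 (T, Cp) points
x = np.linspace(0, 1, 35)
y = 0.3 + 0.5 * x**2

noise_scale = 0.02
y_noisy = y + rng.normal(scale=noise_scale, size=y.shape)

# Random train/validation split: 26 train, 9 held out
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:26], idx[26:]
x_train, y_train = x[train_idx], y_noisy[train_idx]
x_val, y_val = x[val_idx], y_noisy[val_idx]

print(len(x_train), len(x_val))   # 26 9
```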
Watch the two loss curves as you work through it. Training loss keeps dropping as the network gets better and better at fitting the noisy training points. But at some point, the **validation loss stops decreasing and starts increasing**. This is the overfitting signal: the network is learning the noise, not the underlying physics.
The epoch where validation loss is lowest is where you'd want to stop training. This is **early stopping**, and it's exactly what nanoGPT's `train.py` does in our LLM lesson. The program saves a checkpoint whenever the validation loss reaches a new minimum. If training runs too long past that point, the model gets worse at predicting new data, even as it gets better at memorizing the training data.
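The checkpoint-at-minimum idea is simple to express. In this sketch the per-epoch validation losses are a synthetic stand-in (first improving, then worsening); in the real notebook they come from evaluating the network each epoch:

```python
import numpy as np

# Stand-in validation-loss curve: decreases, then starts rising (overfitting)
epochs = np.arange(300)
val_losses = (0.01 + 0.5 * np.exp(-epochs / 40)
              + 1e-4 * np.maximum(0, epochs - 150))

best_loss, best_epoch = np.inf, -1
for epoch, vl in enumerate(val_losses):
    if vl < best_loss:                # new minimum: save a checkpoint here
        best_loss, best_epoch = vl, epoch

print(f"early-stopping point: epoch {best_epoch}")
```

The "best model" is whatever was checkpointed at `best_epoch`; training past that point only improves the fit to the noisy training set.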
> **Exercise 7:** Work through `nn_noisy_workshop.ipynb` with the default `noise_scale = 0.02`. Where does the validation loss start increasing? How does the best-model fit compare to the true NIST data?
> **Exercise 8:** Increase `noise_scale` to 0.05 and then 0.1. How does the fit change? At what noise level does the network produce clearly unphysical predictions?
> **Exercise 9:** With `noise_scale = 0.05`, try increasing `H` to 50. The network now has 151 parameters for 26 training points. Does overfitting get better or worse? Why?
> **Exercise 10:** Compare the final model (trained to the end) with the best model (saved at the lowest validation loss). The notebook does this in section 6. Which is closer to the true curve? Why?
In practice, we combat overfitting with:
- More data
- Regularization (dropout — remember this parameter from nanoGPT?)
The fundamental loop — forward pass, compute loss, backpropagate, update weights — is the same one that trains LLMs with billions of parameters.
### Reading
- Zhang, Lipton, Li & Smola, *Dive into Deep Learning* — interactive, with runnable code in PyTorch: https://d2l.ai
- Goodfellow, Bengio & Courville, *Deep Learning* (2016), freely available at https://www.deeplearningbook.org/
- 3Blue1Brown, *Neural Networks* video series: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi — excellent visual intuition for how neural networks learn