Reorder: tool use is now 05, neural networks is 06
The LLM arc completes at section 05 (agentic systems), with neural networks as a standalone ML deep-dive in section 06. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
aee8ecd7b8
commit
cab2ebfd9d
11 changed files with 384 additions and 4 deletions
258
06-neural-networks/README.md
Normal file
@@ -0,0 +1,258 @@
# Large Language Models Part VI: Building a Neural Network

**CHEG 667-013 — Chemical Engineering with Computers**

Department of Chemical and Biomolecular Engineering, University of Delaware

---

## Key idea

Build a neural network from scratch to understand the core mechanics behind LLMs.

## Key goals

- See concretely what "weights and biases" are and how they're organized
- Understand the forward pass, loss function, and gradient descent
- Implement backpropagation by hand in numpy
- See how PyTorch automates the same process
- Connect these concepts to what you've already seen in nanoGPT

---

Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting*, often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have many millions (even billions) of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.

In this section, we step back from language and build a neural network ourselves. It will be small enough to understand every weight, but powerful enough to "learn" a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.

Our task: fit the ideal gas heat capacity $C^*_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:

$$C^*_p(T) = a + bT + cT^2 + dT^3$$

Can a neural network learn this relationship directly from data?

## 1. Setup

All dependencies (`numpy`, `torch`, `matplotlib`) are installed by `uv sync`. (See the main [README](../README.md).)

### Notebooks and scripts

The hands-on work for this section lives in two Jupyter notebooks:

- **`nn_workshop.ipynb`** — build and train the network (polynomial baseline, numpy from scratch, PyTorch)
- **`nn_noisy_workshop.ipynb`** — add noise, observe overfitting, learn about train/validation splits and early stopping

Open them with `jupyter notebook` or in VS Code. The notebooks are designed to be worked through in class, with discussion prompts at key points.

Standalone Python scripts (`nn_numpy.py`, `nn_torch.py`, `nn_noisy.py`) contain the same code as the notebooks in a clean, single-file format. These are useful as a reference.

## 2. The data

The file `data/n2_cp.csv` contains 35 data points: the isobaric heat capacity of N₂ gas at 1 bar from 300 K to 2000 K, from the NIST WebBook.

```bash
head data/n2_cp.csv
```

```
T_K,Cp_kJ_per_kgK
300.00,1.0413
350.00,1.0423
400.00,1.0450
...
```

The curve is smooth and nonlinear: $C_p$ increases with temperature as molecular vibrational modes become active. This makes it a good test case — simple enough for a small network to fit, but not a straight line.

## 3. Architecture of a one-hidden-layer network

Our network has three layers:

```
Input (1 neuron: T) -> Hidden (10 neurons) -> Output (1 neuron: Cp)
```

Here's what happens at each step:

### Forward pass

**Step 1: Hidden layer.** Each of the 10 hidden neurons computes a weighted sum of the input plus a bias, then applies an *activation function*:

$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$

where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) acts on the pre-activation value $z_j$ and introduces **nonlinearity**. Without it, stacking layers would just produce another linear function, no matter how many layers we use.

**Step 2: Output layer.** The output is a weighted sum of the hidden activations:

$$\hat{y} = \sum_j W_j \cdot a_j + b_{\text{out}}$$

This is a linear combination. There is no activation on the output, since we want to predict a continuous value.
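
The two steps can be sketched in a few lines of numpy (a minimal sketch; the weight values here are random placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                              # hidden neurons

# Parameters, randomly initialized as they would be before training
w = rng.normal(size=H)              # input -> hidden weights w_j
b = np.zeros(H)                     # hidden biases b_j
W = rng.normal(size=H)              # hidden -> output weights W_j
b_out = 0.0                         # output bias

def forward(x):
    z = w * x + b                   # pre-activations z_j
    a = np.tanh(z)                  # hidden activations a_j
    return np.sum(W * a) + b_out    # y_hat: linear combination

print(forward(0.5))                 # a single scalar prediction
```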

### Counting parameters

With 10 hidden neurons:

- `W1`: 10 weights, $w_j$ (input -> hidden)
- `b1`: 10 biases (hidden)
- `W2`: 10 weights, $W_j$ (hidden -> output)
- `b2`: 1 bias (output)
- **Total: 31 parameters**

That's 31 parameters for 35 data points, almost a 1:1 ratio, which might make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
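
For any hidden-layer size $H$, the count is $H + H + H + 1 = 3H + 1$, which you can verify with a quick sketch:

```python
def n_params(H):
    # W1: H weights, b1: H biases, W2: H weights, b2: 1 bias
    return H + H + H + 1

print(n_params(10))   # 31, the network above
print(n_params(100))  # 301, the size used in the overfitting exercise
```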

Compare: the small nanoGPT model from Part I had ~800,000 parameters, and GPT-2 has 124 million. The architecture is the same idea, layers of weights and activations, just scaled enormously in order to "fit" language.

## 4. Training

Training means finding the values of all 31 parameters that make the network's predictions match the data. This requires three things:

### Loss function

We need a number that says "how wrong is the network?" for a given set of parameters. The **mean squared error** (MSE) is a natural choice here:

$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

This is the same MSE you've used in nonlinear curve fitting: the sum of squared residuals divided by the number of points. The only difference is that here the "model" is a neural network instead of a polynomial or equation of state. A loss of this kind is also what we watched decrease during nanoGPT training in Part I, although nanoGPT uses cross-entropy loss, which is appropriate for classification over a vocabulary.
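
As a quick sketch (the numbers below are illustrative, not the workshop data):

```python
import numpy as np

def mse(y_hat, y):
    # mean of squared residuals: the loss L above
    return np.mean((y_hat - y) ** 2)

y = np.array([1.0, 2.0, 3.0])       # "measurements"
y_hat = np.array([1.1, 1.9, 3.2])   # model predictions
print(mse(y_hat, y))                # (0.01 + 0.01 + 0.04) / 3 ≈ 0.02
```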

### Backpropagation

To improve the weights, we need to know how each weight affects the loss. **Backpropagation** computes these gradients by applying the chain rule, working backward from the loss through each layer. For example, the gradient of the loss with respect to an output weight $W_j$ is:

$$\frac{\partial L}{\partial W_j} = \frac{1}{N} \sum_i 2(\hat{y}_i - y_i) \cdot a_{ij}$$

The numpy implementation in `nn_numpy.py` computes every gradient explicitly. This is the part that PyTorch automates.

It's worth noting that the backward pass is roughly 2/3 of the compute per training step (about twice the cost of the forward pass). This ratio holds fairly consistently from small networks up through large transformers, and is one reason inference (forward pass only) is so much cheaper than training.
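
A good way to trust a hand-coded gradient is to check it against a finite difference. This sketch (random placeholder data, not the $C_p$ set) verifies the $\partial L/\partial W_j$ formula above for one output weight:

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 35, 10
x = rng.random((N, 1))
y = rng.random((N, 1))
W1 = rng.normal(size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(size=(H, 1)); b2 = np.zeros(1)

def loss(W2_):
    a = np.tanh(x @ W1 + b1)      # hidden activations a_ij
    y_hat = a @ W2_ + b2          # forward pass
    return np.mean((y_hat - y) ** 2), a, y_hat

L0, a, y_hat = loss(W2)

# Analytic gradient from the chain rule (the formula above)
dW2 = (2.0 / N) * a.T @ (y_hat - y)

# Finite-difference check: nudge one weight and recompute the loss
eps = 1e-6
W2_p = W2.copy(); W2_p[3, 0] += eps
L1, _, _ = loss(W2_p)
numeric = (L1 - L0) / eps

print(abs(dW2[3, 0] - numeric) < 1e-4)  # the two gradients agree
```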

### Gradient descent

Once we have the gradients, we update each weight:

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$

where $\eta$ is the **learning rate** — a small number (0.01 in our code) that controls how big each step is. Too large and training oscillates; too small and it's painfully slow.

One full pass through the loop (forward -> loss -> backward -> update) over the whole dataset is one **epoch**. We train for 5000 epochs.

In nanoGPT, the training loop in `train.py` does exactly the same thing, but with the AdamW optimizer (a fancier version of gradient descent) and batches of data instead of the full dataset.
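
The loop is easiest to see on a one-parameter model (an illustrative sketch with synthetic data; the workshop applies the same update to all 31 parameters):

```python
import numpy as np

# Minimal gradient descent on the one-parameter model y_hat = w * x
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x                  # synthetic "data" with true slope 3
w = 0.0                      # initial guess
lr = 0.1                     # learning rate eta

for epoch in range(200):
    y_hat = w * x                          # forward pass
    grad = np.mean(2.0 * (y_hat - y) * x)  # dL/dw for the MSE loss
    w -= lr * grad                         # gradient descent update

print(round(w, 3))  # 3.0: converges to the true slope
```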

## 5. The numpy version

Work through sections 1–3 of `nn_workshop.ipynb` to build and train the network from scratch in numpy. You should see the training loss drop rapidly in the first 1000 epochs before leveling off, and the network's prediction closely tracking the NIST data points.

```
Epoch    0  Loss: 0.283941
Epoch  500  Loss: 0.001253
Epoch 1000  Loss: 0.000412
...
Epoch 4999  Loss: 0.000004
```

> **Exercise 1:** Read through the numpy training cells carefully. Identify where each of the following happens: (a) forward pass, (b) loss calculation, (c) backpropagation, (d) gradient descent update.

> **Exercise 2:** Change the number of hidden neurons `H`. Try 2, 5, 10, 20, 50. How does the fit change? How many parameters does each network have? At what point does adding more neurons stop helping?

## 6. The PyTorch version

Now work through section 4 of `nn_workshop.ipynb`. The same network, but in about half the code. Compare the numpy and PyTorch cells side by side. The key differences:

| | numpy version | PyTorch version |
|---|---|---|
| Define layers | Manual weight matrices | `nn.Linear(1, H)` |
| Forward pass | `X @ W1 + b1`, `np.tanh(...)` | `model(X)` |
| Backprop | Hand-coded chain rule | `loss.backward()` |
| Weight update | `W -= lr * dW` | `optimizer.step()` |
| Lines of code | ~80 | ~40 |

PyTorch's `loss.backward()` computes all the gradients we wrote out by hand, automatically. This is called **automatic differentiation**, and it's what makes training networks with millions of parameters feasible.
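
The whole table collapses into a loop like this (a minimal sketch with synthetic stand-in data; the notebook trains on the normalized $C_p$ data instead):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 10
model = nn.Sequential(nn.Linear(1, H), nn.Tanh(), nn.Linear(H, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Synthetic stand-in data on [0, 1]
X = torch.linspace(0, 1, 35).reshape(-1, 1)
Y = X ** 2

for epoch in range(2000):
    Y_pred = model(X)            # forward pass
    loss = loss_fn(Y_pred, Y)    # loss
    optimizer.zero_grad()
    loss.backward()              # backprop via automatic differentiation
    optimizer.step()             # weight update

print(f"final loss: {loss.item():.2e}")
```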

The `nn.Sequential` definition uses the same PyTorch building blocks as nanoGPT's `model.py` (`nn.Linear` layers and activation functions), just with more layers, attention mechanisms, and a much larger vocabulary.

Section 5 of the notebook compares all three approaches (polynomial, numpy NN, PyTorch NN) on the same plot, and section 6 tests how they extrapolate outside the training range.

> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?

> **Exercise 4:** Replace the Adam optimizer with plain SGD: `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare? Try increasing the learning rate. What happens?

## 7. Normalization

Both scripts normalize the input ($T$) and output ($C_p$) to the range [0, 1] before training. This is important:

- Raw $T$ values range from 300 to 2000, while $C_p$ ranges from 1.04 to 1.28
- With unnormalized data, the gradients for the input weights would be hundreds of times larger than for the output weights
- The network would struggle to learn — or need a much smaller learning rate
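
A sketch of the min-max scaling the scripts use (the sample values below are taken from `n2_cp.csv`):

```python
import numpy as np

T_raw = np.array([300.0, 650.0, 1000.0, 1500.0, 2000.0])     # K
Cp_raw = np.array([1.0413, 1.0863, 1.1674, 1.2439, 1.2841])  # kJ/kg/K

# Min-max normalization to [0, 1]
T = (T_raw - T_raw.min()) / (T_raw.max() - T_raw.min())
Cp = (Cp_raw - Cp_raw.min()) / (Cp_raw.max() - Cp_raw.min())

print(T.min(), T.max())    # 0.0 1.0
print(Cp.min(), Cp.max())  # 0.0 1.0
```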

Try it yourself:

> **Exercise 5:** In the notebook, comment out the normalization (use `T_raw` and `Cp_raw` directly). What happens to the training loss? Can you fix it by changing the learning rate?
## 8. Overfitting

With 31 parameters and 35 data points, our network is close to the edge. What happens with more parameters than data?

> **Exercise 6:** Increase `H` to 100 (giving 301 parameters — nearly 10× the number of data points). Train for 20,000 epochs. Plot the fit. Does it match the training data well? Now generate predictions at $T$ = 275 K and $T$ = 2100 K (outside the training range). Are they reasonable?

This is **overfitting** — the network memorizes the training data but fails to generalize. It's the same concept we discussed in Part I when nanoGPT's validation loss started increasing while the training loss kept decreasing.

### Overfitting with noisy data

The clean NIST data masks the overfitting problem: the network learns a smooth function because the data *is* smooth. Real experimental data has noise. What happens then?

Open **`nn_noisy_workshop.ipynb`** to explore this. The notebook adds Gaussian noise to the $C_p$ data and introduces a **train/validation split**: 26 points for training, 9 held out for validation.

Watch the two loss curves as you work through it. Training loss keeps dropping as the network gets better and better at fitting the noisy training points. But at some point, the **validation loss stops decreasing and starts increasing**. This is the overfitting signal: the network is learning the noise, not the underlying physics.

The epoch where validation loss is lowest is where you'd want to stop training. This is **early stopping**, and it's exactly what nanoGPT's `train.py` does in our LLM lesson: it saves a checkpoint whenever the validation loss reaches a new minimum. If training runs too long past that point, the model gets worse at predicting new data, even as it gets better at memorizing the training data.

> **Exercise 7:** Work through `nn_noisy_workshop.ipynb` with the default `noise_scale = 0.02`. Where does the validation loss start increasing? How does the best-model fit compare to the true NIST data?

> **Exercise 8:** Increase `noise_scale` to 0.05 and then 0.1. How does the fit change? At what noise level does the network produce clearly unphysical predictions?

> **Exercise 9:** With `noise_scale = 0.05`, try increasing `H` to 50. The network now has 151 parameters for 26 training points. Does overfitting get better or worse? Why?

> **Exercise 10:** Compare the final model (trained to the end) with the best model (saved at the lowest validation loss). The notebook does this in section 6. Which is closer to the true curve? Why?

In practice, we combat overfitting with:

- More data
- Regularization (dropout — remember this parameter from nanoGPT?)
- Early stopping (stop training when validation loss starts increasing)
- Keeping the model appropriately sized for the data
## 9. Connecting back to LLMs

Everything you've built in this section scales up to large language models:

| This tutorial | nanoGPT / LLMs |
|---|---|
| 31 parameters | 800K – 70B+ parameters |
| 1 hidden layer | 4 – 96+ layers |
| tanh activation | GELU activation |
| MSE loss | Cross-entropy loss |
| Plain gradient descent | AdamW optimizer |
| Numpy arrays | PyTorch tensors (on GPU) |
| Fitting $C_p(T)$ | Predicting next tokens |

The fundamental loop — forward pass, compute loss, backpropagate, update weights — is identical. The difference is scale: more layers, more data, more compute, and architectural innovations like self-attention.

## Additional resources and references

### NIST Chemistry WebBook

- https://webbook.nist.gov/ — thermophysical property data used in this tutorial

### PyTorch

- Tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html
- `nn.Module` documentation: https://pytorch.org/docs/stable/nn.html

### Reading

- Zhang, Lipton, Li & Smola, *Dive into Deep Learning* — interactive, with runnable code in PyTorch: https://d2l.ai
- Goodfellow, Bengio & Courville, *Deep Learning* (2016), freely available at https://www.deeplearningbook.org/
- 3Blue1Brown, *Neural Networks* video series: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi — excellent visual intuition for how neural networks learn
36
06-neural-networks/data/n2_cp.csv
Normal file
@@ -0,0 +1,36 @@
T_K,Cp_kJ_per_kgK
300.00,1.0413
350.00,1.0423
400.00,1.0450
450.00,1.0497
500.00,1.0564
550.00,1.0650
600.00,1.0751
650.00,1.0863
700.00,1.0981
750.00,1.1102
800.00,1.1223
850.00,1.1342
900.00,1.1457
950.00,1.1568
1000.0,1.1674
1050.0,1.1774
1100.0,1.1868
1150.0,1.1957
1200.0,1.2040
1250.0,1.2118
1300.0,1.2191
1350.0,1.2260
1400.0,1.2323
1450.0,1.2383
1500.0,1.2439
1550.0,1.2491
1600.0,1.2540
1650.0,1.2586
1700.0,1.2630
1750.0,1.2670
1800.0,1.2708
1850.0,1.2744
1900.0,1.2778
1950.0,1.2810
2000.0,1.2841
163
06-neural-networks/nn_noisy.py
Normal file
@@ -0,0 +1,163 @@
# nn_noisy.py
#
# What happens when we train a neural network on noisy data?
# This script adds Gaussian noise to the Cp data, trains with a
# train/validation split, and plots both loss curves to show overfitting.
#
# CHEG 667-013
# E. M. Furst

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# ── Load data ─────────────────────────────────────────────────

data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T_raw = data[:, 0]
Cp_raw = data[:, 1]

# ── Add noise ─────────────────────────────────────────────────

noise_scale = 0.02  # kJ/kg/K — try 0.01, 0.02, 0.05, 0.1
rng = np.random.default_rng(seed=42)
Cp_noisy = Cp_raw + rng.normal(scale=noise_scale, size=Cp_raw.size)

# ── Train/validation split ────────────────────────────────────
#
# Hold out every 4th point for validation. This gives us 26 training
# points and 9 validation points — enough to see the overfitting signal.

val_mask = np.zeros(len(T_raw), dtype=bool)
val_mask[::4] = True
train_mask = ~val_mask

T_train, Cp_train = T_raw[train_mask], Cp_noisy[train_mask]
T_val, Cp_val = T_raw[val_mask], Cp_noisy[val_mask]

# ── Normalize to [0, 1] using training set statistics ─────────

T_min, T_max = T_train.min(), T_train.max()
Cp_min, Cp_max = Cp_train.min(), Cp_train.max()

def normalize_T(T):
    return (T - T_min) / (T_max - T_min)

def normalize_Cp(Cp):
    return (Cp - Cp_min) / (Cp_max - Cp_min)

def denormalize_Cp(Cp_norm):
    return Cp_norm * (Cp_max - Cp_min) + Cp_min

X_train = torch.tensor(normalize_T(T_train), dtype=torch.float32).reshape(-1, 1)
Y_train = torch.tensor(normalize_Cp(Cp_train), dtype=torch.float32).reshape(-1, 1)
X_val = torch.tensor(normalize_T(T_val), dtype=torch.float32).reshape(-1, 1)
Y_val = torch.tensor(normalize_Cp(Cp_val), dtype=torch.float32).reshape(-1, 1)

# ── Define the network ────────────────────────────────────────

H = 10  # try 10, 20, 50 — watch what happens

model = nn.Sequential(
    nn.Linear(1, H),
    nn.Tanh(),
    nn.Linear(H, 1),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"Network: 1 -> {H} (tanh) -> 1")
print(f"Parameters: {n_params}")
print(f"Training points: {len(T_train)}")
print(f"Validation points: {len(T_val)}")
print(f"Noise scale: {noise_scale} kJ/kg/K\n")

# ── Training ──────────────────────────────────────────────────

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

epochs = 10000
log_interval = 1000
train_losses = []
val_losses = []
best_val_loss = float('inf')
best_epoch = 0

for epoch in range(epochs):
    # --- Training step ---
    model.train()
    Y_pred = model(X_train)
    train_loss = loss_fn(Y_pred, Y_train)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    # --- Validation step (no gradient computation) ---
    model.eval()
    with torch.no_grad():
        val_pred = model(X_val)
        val_loss = loss_fn(val_pred, Y_val)

    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())

    # Track the best validation loss — same idea as nanoGPT's train.py
    if val_loss.item() < best_val_loss:
        best_val_loss = val_loss.item()
        best_epoch = epoch

    if epoch % log_interval == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d} Train: {train_loss.item():.6f} "
              f"Val: {val_loss.item():.6f}")

print(f"\nBest validation loss: {best_val_loss:.6f} at epoch {best_epoch}")

# ── Results ───────────────────────────────────────────────────

T_fine = torch.linspace(0, 1, 200).reshape(-1, 1)
model.eval()
with torch.no_grad():
    Cp_pred_norm = model(T_fine)

T_fine_K = T_fine.numpy() * (T_max - T_min) + T_min
Cp_pred = denormalize_Cp(Cp_pred_norm.numpy())

# ── Plot ──────────────────────────────────────────────────────

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Left: the fit
ax = axes[0]
ax.plot(T_train, Cp_train, 'ko', markersize=6, label='Train (noisy)')
ax.plot(T_val, Cp_val, 'bs', markersize=6, label='Validation (noisy)')
ax.plot(T_raw, Cp_raw, 'g--', linewidth=1, alpha=0.7, label='True (NIST)')
ax.plot(T_fine_K, Cp_pred, 'r-', linewidth=2, label=f'NN ({H} neurons)')
ax.set_xlabel('Temperature (K)')
ax.set_ylabel('$C_p$ (kJ/kg/K)')
ax.set_title(f'Noisy $C_p(T)$ — noise = {noise_scale}')
ax.legend(fontsize=8)

# Middle: training loss
ax = axes[1]
ax.semilogy(train_losses, label='Train loss')
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE')
ax.set_title('Training Loss')
ax.legend()

# Right: train vs. validation loss
ax = axes[2]
ax.semilogy(train_losses, label='Train loss')
ax.semilogy(val_losses, label='Validation loss')
ax.axvline(best_epoch, color='gray', linestyle='--', alpha=0.5,
           label=f'Best val (epoch {best_epoch})')
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE')
ax.set_title('Train vs. Validation Loss')
ax.legend(fontsize=8)

plt.tight_layout()
plt.savefig('nn_fit_noisy.png', dpi=150)
plt.show()
179
06-neural-networks/nn_noisy_workshop.ipynb
Normal file
@@ -0,0 +1,179 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6e917878",
"source": "# Overfitting and Noisy Data\n\n**CHEG 667-013 — LLMs for Engineers**\n\nIn the previous notebook, we fit clean NIST data for $C_p(T)$ and everything worked beautifully. But real experimental data has noise. What happens to our neural network then?\n\nThis notebook explores:\n1. What noisy data looks like compared to the true signal\n2. Why a neural network can memorize noise instead of learning physics\n3. How a **train/validation split** reveals overfitting\n4. The connection to **early stopping** in nanoGPT's `train.py`",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "d115b276",
"source": "## 1. Load the clean data and add noise\n\nWe start from the same NIST data as before, then corrupt it with Gaussian noise to simulate experimental error.",
"metadata": {}
},
{
"cell_type": "code",
"id": "102ce412",
"source": "import numpy as np\nimport matplotlib.pyplot as plt\nimport torch\nimport torch.nn as nn\n\n# Load clean NIST data\ndata = np.loadtxt(\"data/n2_cp.csv\", delimiter=\",\", skiprows=1)\nT_raw = data[:, 0]\nCp_raw = data[:, 1]\n\n# Add noise — try changing this value: 0.01, 0.02, 0.05, 0.1\nnoise_scale = 0.02 # kJ/kg/K\n\nrng = np.random.default_rng(seed=42)\nCp_noisy = Cp_raw + rng.normal(scale=noise_scale, size=Cp_raw.size)\n\nplt.figure(figsize=(8, 5))\nplt.plot(T_raw, Cp_raw, 'g--', linewidth=1.5, label='True (NIST)')\nplt.plot(T_raw, Cp_noisy, 'ko', markersize=6, label=f'Noisy (σ = {noise_scale})')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Clean vs. noisy data')\nplt.legend()\nplt.show()\n\nprint(f\"Cp range: {Cp_raw.min():.4f} – {Cp_raw.max():.4f} kJ/kg/K\")\nprint(f\"Noise scale: {noise_scale} kJ/kg/K ({noise_scale / (Cp_raw.max() - Cp_raw.min()) * 100:.1f}% of signal range)\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "35ff0d60",
"source": "### Pause and discuss\n\nLook at the plot. The noise is small compared to the overall trend — you can still clearly see the shape of $C_p(T)$. But it's enough to cause problems, as we'll see.\n\n**Question:** If you were fitting a polynomial to this data, would you expect it to work well? What about a very high-degree polynomial?",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "41f358d1",
"source": "## 2. Train/validation split\n\nHere's the key idea: if we train on *all* the data, we have no way to tell whether the network learned the real trend or just memorized the noise. We need to **hold out** some data that the network never sees during training, then check whether its predictions are good on that held-out data.\n\nWe'll use every 4th point for validation (9 points) and the rest for training (26 points).",
"metadata": {}
},
{
"cell_type": "code",
"id": "88dbe11a",
"source": "# Split: every 4th point is validation\nval_mask = np.zeros(len(T_raw), dtype=bool)\nval_mask[::4] = True\ntrain_mask = ~val_mask\n\nT_train, Cp_train = T_raw[train_mask], Cp_noisy[train_mask]\nT_val, Cp_val = T_raw[val_mask], Cp_noisy[val_mask]\n\nplt.figure(figsize=(8, 5))\nplt.plot(T_train, Cp_train, 'ko', markersize=6, label=f'Training ({len(T_train)} pts)')\nplt.plot(T_val, Cp_val, 'bs', markersize=8, label=f'Validation ({len(T_val)} pts)')\nplt.plot(T_raw, Cp_raw, 'g--', linewidth=1, alpha=0.5, label='True (NIST)')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Train/validation split')\nplt.legend()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "0bbc02c5",
"source": "## 3. Normalize and prepare tensors\n\nAs before, we normalize to [0, 1] — but using only the **training set** statistics. The validation set must be normalized with the same min/max values so the network sees a consistent scale.",
"metadata": {}
},
{
"cell_type": "code",
"id": "01974a6f",
"source": "# Normalize using training set statistics only\nT_min, T_max = T_train.min(), T_train.max()\nCp_min, Cp_max = Cp_train.min(), Cp_train.max()\n\ndef normalize_T(T):\n return (T - T_min) / (T_max - T_min)\n\ndef normalize_Cp(Cp):\n return (Cp - Cp_min) / (Cp_max - Cp_min)\n\ndef denormalize_Cp(Cp_norm):\n return Cp_norm * (Cp_max - Cp_min) + Cp_min\n\nX_train = torch.tensor(normalize_T(T_train), dtype=torch.float32).reshape(-1, 1)\nY_train = torch.tensor(normalize_Cp(Cp_train), dtype=torch.float32).reshape(-1, 1)\nX_val = torch.tensor(normalize_T(T_val), dtype=torch.float32).reshape(-1, 1)\nY_val = torch.tensor(normalize_Cp(Cp_val), dtype=torch.float32).reshape(-1, 1)\n\nprint(f\"Training: {X_train.shape[0]} points\")\nprint(f\"Validation: {X_val.shape[0]} points\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "3cffc860",
"source": "## 4. Train the network — watching both loss curves\n\nThis is the critical part. We track *two* loss values at every epoch:\n- **Training loss** — how well the network fits the data it's learning from\n- **Validation loss** — how well it predicts data it has *never seen*\n\nWe also save the best validation loss and the epoch where it occurred. This is the same strategy used in nanoGPT's `train.py` — it saves a checkpoint at the lowest validation loss.\n\n**Before running this cell, make a prediction:** Will both curves keep going down together? Or will they diverge?",
"metadata": {}
},
{
"cell_type": "code",
"id": "7cf0f639",
"source": "# Define network — same architecture as before\nH = 10\n\nmodel = nn.Sequential(\n nn.Linear(1, H),\n nn.Tanh(),\n nn.Linear(H, 1),\n)\n\nn_params = sum(p.numel() for p in model.parameters())\nprint(f\"Network: 1 -> {H} (tanh) -> 1 ({n_params} parameters)\")\nprint(f\"Training on {X_train.shape[0]} points\\n\")\n\n# Training setup\noptimizer = torch.optim.Adam(model.parameters(), lr=0.01)\nloss_fn = nn.MSELoss()\n\nepochs = 10000\nlog_interval = 1000\ntrain_losses = []\nval_losses = []\nbest_val_loss = float('inf')\nbest_epoch = 0\nbest_state = None\n\nfor epoch in range(epochs):\n # Training step\n model.train()\n Y_pred = model(X_train)\n train_loss = loss_fn(Y_pred, Y_train)\n\n optimizer.zero_grad()\n train_loss.backward()\n optimizer.step()\n\n # Validation step (no gradients needed)\n model.eval()\n with torch.no_grad():\n val_pred = model(X_val)\n val_loss = loss_fn(val_pred, Y_val)\n\n train_losses.append(train_loss.item())\n val_losses.append(val_loss.item())\n\n # Save the best model — just like nanoGPT's train.py\n if val_loss.item() < best_val_loss:\n best_val_loss = val_loss.item()\n best_epoch = epoch\n best_state = {k: v.clone() for k, v in model.state_dict().items()}\n\n if epoch % log_interval == 0 or epoch == epochs - 1:\n print(f\"Epoch {epoch:5d} Train: {train_loss.item():.6f} Val: {val_loss.item():.6f}\")\n\nprint(f\"\\nBest validation loss: {best_val_loss:.6f} at epoch {best_epoch}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "57c169d3",
"source": "## 5. The overfitting signal\n\nNow let's plot the two loss curves side by side. This is the most important plot in the notebook.",
"metadata": {}
},
{
"cell_type": "code",
"id": "ad150ddc",
"source": "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n\n# Left: training loss only\nax1.semilogy(train_losses)\nax1.set_xlabel('Epoch')\nax1.set_ylabel('MSE')\nax1.set_title('Training loss alone — looks great!')\n\n# Right: both curves together\nax2.semilogy(train_losses, label='Train loss')\nax2.semilogy(val_losses, label='Validation loss', color='orange')\nax2.axvline(best_epoch, color='gray', linestyle='--', alpha=0.5,\n label=f'Best validation (epoch {best_epoch})')\nax2.set_xlabel('Epoch')\nax2.set_ylabel('MSE')\nax2.set_title('Train vs. validation — the full picture')\nax2.legend()\n\nplt.tight_layout()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "3bfe1a00",
"source": "### Pause and discuss\n\nLook at the two panels:\n\n- **Left panel:** If we only tracked training loss, we'd think everything was fine. The loss keeps going down!\n- **Right panel:** The validation loss tells a different story. It decreases at first (the network is learning the real trend), then **starts increasing** (the network is learning the noise).\n\nThe dashed line marks the **best validation epoch** — the point where the network knows the most about the true relationship and the least about the noise. Training beyond that point makes the model *worse* at predicting new data.\n\n**This is exactly what happens in LLM training.** In nanoGPT, `train.py` evaluates the model on a validation set periodically and saves a checkpoint whenever validation loss reaches a new low. If you train too long, the model memorizes the training text rather than learning general language patterns.",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "c51917f2",
"source": "## 6. See the overfitting in the fit itself\n\nLet's compare the **final model** (trained to the end) with the **best model** (saved at the lowest validation loss). We also plot the true NIST curve for reference.",
"metadata": {}
},
{
"cell_type": "code",
"id": "3f830870",
"source": "# Predictions from the final model (overtrained)\nT_fine = torch.linspace(0, 1, 200).reshape(-1, 1)\nmodel.eval()\nwith torch.no_grad():\n Cp_final_norm = model(T_fine)\nT_fine_K = T_fine.numpy() * (T_max - T_min) + T_min\nCp_final = denormalize_Cp(Cp_final_norm.numpy())\n\n# Predictions from the best model (early stopping)\nmodel.load_state_dict(best_state)\nmodel.eval()\nwith torch.no_grad():\n Cp_best_norm = model(T_fine)\nCp_best = denormalize_Cp(Cp_best_norm.numpy())\n\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n\n# Left: final model\nax1.plot(T_train, Cp_train, 'ko', markersize=5, label='Train (noisy)')\nax1.plot(T_val, Cp_val, 'bs', markersize=6, label='Validation (noisy)')\nax1.plot(T_raw, Cp_raw, 'g--', linewidth=1.5, alpha=0.7, label='True (NIST)')\nax1.plot(T_fine_K, Cp_final, 'r-', linewidth=2, label=f'Final model (epoch {epochs-1})')\nax1.set_xlabel('Temperature (K)')\nax1.set_ylabel('$C_p$ (kJ/kg/K)')\nax1.set_title('Final model — trained too long')\nax1.legend(fontsize=8)\n\n# Right: best model\nax2.plot(T_train, Cp_train, 'ko', markersize=5, label='Train (noisy)')\nax2.plot(T_val, Cp_val, 'bs', markersize=6, label='Validation (noisy)')\nax2.plot(T_raw, Cp_raw, 'g--', linewidth=1.5, alpha=0.7, label='True (NIST)')\nax2.plot(T_fine_K, Cp_best, 'r-', linewidth=2, label=f'Best model (epoch {best_epoch})')\nax2.set_xlabel('Temperature (K)')\nax2.set_ylabel('$C_p$ (kJ/kg/K)')\nax2.set_title('Best model — early stopping')\nax2.legend(fontsize=8)\n\nplt.tight_layout()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "2b5c35d0",
"source": "### Pause and discuss\n\nCompare the two panels. The final model (left) tries to pass through every noisy training point, producing wiggles that don't reflect the true physics. The best model (right) produces a smoother curve closer to the true NIST data.\n\n**The network doesn't know what's signal and what's noise.** It just minimizes the loss. If you let it train long enough, it will fit everything — including the noise. The validation set is our only way of detecting this.",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "2bf32bca",
"source": "## 7. How noise level affects overfitting\n\nLet's wrap the whole pipeline into a function and run it for several noise levels to see the effect systematically.",
"metadata": {}
},
{
"cell_type": "code",
"id": "c45e3027",
"source": "def train_noisy(noise_scale, H=10, epochs=10000, seed=42):\n \"\"\"Train on noisy data and return results for plotting.\"\"\"\n rng = np.random.default_rng(seed=seed)\n Cp_noisy = Cp_raw + rng.normal(scale=noise_scale, size=Cp_raw.size)\n\n # Split\n Cp_tr, Cp_va = Cp_noisy[train_mask], Cp_noisy[val_mask]\n T_mn, T_mx = T_raw[train_mask].min(), T_raw[train_mask].max()\n Cp_mn, Cp_mx = Cp_tr.min(), Cp_tr.max()\n\n X_tr = torch.tensor((T_raw[train_mask] - T_mn) / (T_mx - T_mn), dtype=torch.float32).reshape(-1, 1)\n Y_tr = torch.tensor((Cp_tr - Cp_mn) / (Cp_mx - Cp_mn), dtype=torch.float32).reshape(-1, 1)\n X_va = torch.tensor((T_raw[val_mask] - T_mn) / (T_mx - T_mn), dtype=torch.float32).reshape(-1, 1)\n Y_va = torch.tensor((Cp_va - Cp_mn) / (Cp_mx - Cp_mn), dtype=torch.float32).reshape(-1, 1)\n\n mdl = nn.Sequential(nn.Linear(1, H), nn.Tanh(), nn.Linear(H, 1))\n opt = torch.optim.Adam(mdl.parameters(), lr=0.01)\n loss_fn = nn.MSELoss()\n\n t_losses, v_losses = [], []\n best_vl, best_ep, best_st = float('inf'), 0, None\n\n for ep in range(epochs):\n mdl.train()\n pred = mdl(X_tr)\n tl = loss_fn(pred, Y_tr)\n opt.zero_grad(); tl.backward(); opt.step()\n\n mdl.eval()\n with torch.no_grad():\n vl = loss_fn(mdl(X_va), Y_va)\n\n t_losses.append(tl.item())\n v_losses.append(vl.item())\n if vl.item() < best_vl:\n best_vl, best_ep = vl.item(), ep\n best_st = {k: v.clone() for k, v in mdl.state_dict().items()}\n\n # Get best-model predictions\n mdl.load_state_dict(best_st)\n mdl.eval()\n T_f = torch.linspace(0, 1, 200).reshape(-1, 1)\n with torch.no_grad():\n Cp_pred = mdl(T_f).numpy() * (Cp_mx - Cp_mn) + Cp_mn\n T_f_K = T_f.numpy() * (T_mx - T_mn) + T_mn\n\n return dict(noise=noise_scale, t_losses=t_losses, v_losses=v_losses,\n best_epoch=best_ep, T_fine=T_f_K, Cp_pred=Cp_pred,\n Cp_train=Cp_tr, Cp_val=Cp_va)\n\n\n# Run for several noise levels\nnoise_levels = [0.005, 0.02, 0.05, 0.1]\nresults = [train_noisy(ns) for ns in noise_levels]",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"id": "49ff5b58",
"source": "fig, axes = plt.subplots(2, len(noise_levels), figsize=(16, 9))\n\nfor i, r in enumerate(results):\n # Top row: fits\n ax = axes[0, i]\n ax.plot(T_raw[train_mask], r['Cp_train'], 'ko', markersize=4)\n ax.plot(T_raw[val_mask], r['Cp_val'], 'bs', markersize=5)\n ax.plot(T_raw, Cp_raw, 'g--', linewidth=1, alpha=0.5)\n ax.plot(r['T_fine'], r['Cp_pred'], 'r-', linewidth=2)\n ax.set_title(f\"σ = {r['noise']}\")\n ax.set_xlabel('T (K)')\n if i == 0:\n ax.set_ylabel('$C_p$ (kJ/kg/K)')\n\n # Bottom row: loss curves\n ax = axes[1, i]\n ax.semilogy(r['t_losses'], label='Train')\n ax.semilogy(r['v_losses'], label='Val', color='orange')\n ax.axvline(r['best_epoch'], color='gray', linestyle='--', alpha=0.5)\n ax.set_title(f\"Best epoch: {r['best_epoch']}\")\n ax.set_xlabel('Epoch')\n if i == 0:\n ax.set_ylabel('MSE')\n if i == len(noise_levels) - 1:\n ax.legend(fontsize=8)\n\nplt.suptitle('Effect of noise level on overfitting (H = 10)', fontsize=14, y=1.02)\nplt.tight_layout()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "e9c16100",
"source": "### Pause and discuss\n\nNotice the trend across the four columns:\n- **Low noise (σ = 0.005):** The curves barely diverge. Early stopping happens late. The fit is essentially correct.\n- **Moderate noise (σ = 0.02–0.05):** The divergence is clear. Early stopping matters.\n- **High noise (σ = 0.1):** The validation loss diverges quickly and dramatically. The noise is comparable to the signal itself.\n\n**Key insight:** The same model architecture can be fine or catastrophically overfit depending on the quality of the data. This is why data quality matters so much in ML — and why LLM training requires carefully curated datasets, not just more data.",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "5b796024",
"source": "## 8. Does model size make it worse?\n\nWith clean data, we noted that 31 parameters for 35 data points was borderline. With noisy data and a train/validation split, we only have 26 training points. What happens if we increase `H`?",
"metadata": {}
},
{
"cell_type": "code",
"id": "2cd8da66",
"source": "# Compare different model sizes with the same noise level\nhidden_sizes = [5, 10, 25, 50]\nnoise = 0.05 # moderate noise to make the effect visible\n\nresults_H = [train_noisy(noise, H=h) for h in hidden_sizes]\n\nfig, axes = plt.subplots(2, len(hidden_sizes), figsize=(16, 9))\n\nfor i, (h, r) in enumerate(zip(hidden_sizes, results_H)):\n n_params = 2 * h + h + 1\n\n ax = axes[0, i]\n ax.plot(T_raw[train_mask], r['Cp_train'], 'ko', markersize=4)\n ax.plot(T_raw[val_mask], r['Cp_val'], 'bs', markersize=5)\n ax.plot(T_raw, Cp_raw, 'g--', linewidth=1, alpha=0.5)\n ax.plot(r['T_fine'], r['Cp_pred'], 'r-', linewidth=2)\n ax.set_title(f\"H = {h} ({n_params} params)\")\n ax.set_xlabel('T (K)')\n if i == 0:\n ax.set_ylabel('$C_p$ (kJ/kg/K)')\n\n ax = axes[1, i]\n ax.semilogy(r['t_losses'], label='Train')\n ax.semilogy(r['v_losses'], label='Val', color='orange')\n ax.axvline(r['best_epoch'], color='gray', linestyle='--', alpha=0.5)\n ax.set_title(f\"Best epoch: {r['best_epoch']}\")\n ax.set_xlabel('Epoch')\n if i == 0:\n ax.set_ylabel('MSE')\n if i == len(hidden_sizes) - 1:\n ax.legend(fontsize=8)\n\nplt.suptitle(f'Effect of model size on overfitting (σ = {noise})', fontsize=14, y=1.02)\nplt.tight_layout()\nplt.show()\n\nfor h, r in zip(hidden_sizes, results_H):\n print(f\"H = {h:3d} params = {2*h+h+1:4d} best epoch = {r['best_epoch']}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "1fdea24c",
"source": "## 9. Summary: what we learned\n\n| Concept | In this notebook | In LLM training |\n|---------|-----------------|-----------------|\n| **Overfitting** | Network memorizes noisy $C_p$ data points | Model memorizes training text instead of learning language |\n| **Validation loss** | Increases while training loss decreases | Same — the signal that training should stop |\n| **Early stopping** | Save model at best validation epoch | `train.py` saves checkpoint at lowest validation loss |\n| **Model size** | More neurons = faster overfitting | More parameters = needs more (and better) data |\n| **Data quality** | More noise = earlier overfitting | Poor training data = poor model, no matter the size |\n\nThe validation loss is the most important diagnostic in training. It's the only thing that tells you whether the model is learning something general or just memorizing. When you see LLM papers report \"training loss\" and \"validation loss\", this is exactly what they mean.",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "552d4795",
"source": "## 10. Exercises\n\nTry these in new cells below:\n\n1. **Different random seeds.** Change `seed=42` to other values in the `train_noisy` function. Does the best epoch change? Does the overall pattern (validation loss rising) persist?\n\n2. **Regularization by reducing model size.** With `noise_scale = 0.05`, try `H = 3`. This gives only 10 parameters — too few to memorize 26 points. Does it overfit? Is the fit still reasonable?\n\n3. **More data helps.** Instead of holding out every 4th point, try every 8th (fewer validation, more training). Does overfitting get better or worse? Why?\n\n4. **Polynomial comparison.** Fit a high-degree polynomial (degree 10 or 15) to the noisy data using `np.polyfit`. How does it compare to the neural network? Does it also overfit?",
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
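The "save a checkpoint whenever validation loss reaches a new low" pattern discussed in the notebook above can be isolated into a tiny helper. A minimal sketch, not part of the course files — the `EarlyStopping` class and its `patience` parameter are illustrative names, shown with a toy loss sequence instead of a real training run:

```python
# A patience-based early-stopping helper, mirroring the "save the best
# model" logic in the notebook's training loop. Hypothetical helper,
# demonstrated on a toy validation curve.

class EarlyStopping:
    def __init__(self, patience=3):
        self.patience = patience      # epochs to wait after the last improvement
        self.best = float('inf')
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0       # reset the counter on improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy validation curve: improves, then rises (overfitting)
val_curve = [0.9, 0.5, 0.3, 0.2, 0.25, 0.3, 0.4, 0.5]
stopper = EarlyStopping(patience=3)
for epoch, vl in enumerate(val_curve):
    if stopper.step(epoch, vl):
        break

print(f"Stopped at epoch {epoch}, best was epoch {stopper.best_epoch}")
# Stopped at epoch 6, best was epoch 3
```

In a real loop you would also snapshot `model.state_dict()` inside the `if val_loss < self.best` branch, exactly as the notebook does.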
156
06-neural-networks/nn_numpy.py
Normal file
# nn_numpy.py
#
# A neural network with one hidden layer, built from scratch using numpy.
# Fits Cp(T) data for nitrogen gas at 1 bar (NIST WebBook).
#
# This demonstrates the core mechanics of a neural network:
# - Forward pass: input -> hidden layer -> activation -> output
# - Loss calculation (mean squared error)
# - Backpropagation: computing gradients of the loss w.r.t. each weight
# - Gradient descent: updating weights to minimize loss
#
# CHEG 667-013
# E. M. Furst

import numpy as np
import matplotlib.pyplot as plt

# ── Load and prepare data ──────────────────────────────────────

data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T_raw = data[:, 0]   # Temperature (K)
Cp_raw = data[:, 1]  # Heat capacity (kJ/kg/K)

# Normalize inputs and outputs to [0, 1] range.
# Neural networks train better when values are small and centered.
T_min, T_max = T_raw.min(), T_raw.max()
Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()

T = (T_raw - T_min) / (T_max - T_min)      # shape: (N,)
Cp = (Cp_raw - Cp_min) / (Cp_max - Cp_min) # shape: (N,)

# Reshape for matrix operations: each sample is a row
X = T.reshape(-1, 1)   # (N, 1) -- input matrix
Y = Cp.reshape(-1, 1)  # (N, 1) -- target matrix

N = X.shape[0]  # number of data points

# ── Network architecture ───────────────────────────────────────
#
# Input (1) --> Hidden (H neurons, tanh) --> Output (1)
#
# The hidden layer has H neurons. Each neuron computes:
#   z = w * x + b   (weighted sum)
#   a = tanh(z)     (activation -- introduces nonlinearity)
#
# The output layer combines the hidden activations:
#   y_pred = W2 @ a + b2

H = 10  # number of neurons in the hidden layer

# Initialize weights randomly (small values)
# W1: (1, H) -- connects input to each hidden neuron
# b1: (1, H) -- one bias per hidden neuron
# W2: (H, 1) -- connects hidden neurons to output
# b2: (1, 1) -- output bias
np.random.seed(42)
W1 = np.random.randn(1, H) * 0.5
b1 = np.zeros((1, H))
W2 = np.random.randn(H, 1) * 0.5
b2 = np.zeros((1, 1))

# ── Training parameters ───────────────────────────────────────

learning_rate = 0.01
epochs = 5000
log_interval = 500

# ── Training loop ─────────────────────────────────────────────

losses = []

for epoch in range(epochs):

    # ── Forward pass ──────────────────────────────────────────
    # Step 1: hidden layer pre-activation
    Z1 = X @ W1 + b1  # (N, H)

    # Step 2: hidden layer activation (tanh)
    A1 = np.tanh(Z1)  # (N, H)

    # Step 3: output layer (linear -- no activation)
    Y_pred = A1 @ W2 + b2  # (N, 1)

    # ── Loss ──────────────────────────────────────────────────
    # Mean squared error
    error = Y_pred - Y  # (N, 1)
    loss = np.mean(error ** 2)
    losses.append(loss)

    # ── Backpropagation ───────────────────────────────────────
    # Compute gradients by applying the chain rule, working
    # backward from the loss to each weight.

    # Gradient of loss w.r.t. output
    dL_dYpred = 2 * error / N  # (N, 1)

    # Gradients for output layer weights
    dL_dW2 = A1.T @ dL_dYpred                          # (H, 1)
    dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)  # (1, 1)

    # Gradient flowing back through the hidden layer
    dL_dA1 = dL_dYpred @ W2.T  # (N, H)

    # Derivative of tanh: d/dz tanh(z) = 1 - tanh(z)^2
    dL_dZ1 = dL_dA1 * (1 - A1 ** 2)  # (N, H)

    # Gradients for hidden layer weights
    dL_dW1 = X.T @ dL_dZ1                            # (1, H)
    dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)   # (1, H)

    # ── Gradient descent ──────────────────────────────────────
    # Update each weight in the direction that reduces the loss
    W2 -= learning_rate * dL_dW2
    b2 -= learning_rate * dL_db2
    W1 -= learning_rate * dL_dW1
    b1 -= learning_rate * dL_db1

    if epoch % log_interval == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d}  Loss: {loss:.6f}")

# ── Results ────────────────────────────────────────────────────

# Predict on a fine grid for smooth plotting
T_fine = np.linspace(0, 1, 200).reshape(-1, 1)
A1_fine = np.tanh(T_fine @ W1 + b1)
Cp_pred_norm = A1_fine @ W2 + b2

# Convert back to physical units
T_fine_K = T_fine * (T_max - T_min) + T_min
Cp_pred = Cp_pred_norm * (Cp_max - Cp_min) + Cp_min

# ── Plot ───────────────────────────────────────────────────────

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: fit
ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')
ax1.plot(T_fine_K, Cp_pred, 'r-', linewidth=2, label=f'NN ({H} neurons)')
ax1.set_xlabel('Temperature (K)')
ax1.set_ylabel('$C_p$ (kJ/kg/K)')
ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')
ax1.legend()

# Right: training loss
ax2.semilogy(losses)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Mean Squared Error')
ax2.set_title('Training Loss')

plt.tight_layout()
plt.savefig('nn_fit.png', dpi=150)
plt.show()

print(f"\nFinal loss: {losses[-1]:.6f}")
print(f"Network: 1 input -> {H} hidden (tanh) -> 1 output")
print(f"Total parameters: {W1.size + b1.size + W2.size + b2.size}")
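A useful sanity check on hand-coded backpropagation like the loop above is a finite-difference gradient check: nudge each weight by a small epsilon and compare the numerical slope of the loss against the analytic gradient. This sketch is not part of the course files; it reuses the same architecture and chain rule as nn_numpy.py, but on synthetic data with a smaller hidden layer so the check runs fast:

```python
# Finite-difference check of the analytic backprop gradients.
# Same network and MSE loss as nn_numpy.py; synthetic data, H = 4.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 1))   # 20 synthetic inputs in [0, 1)
Y = np.sin(3 * X)         # arbitrary smooth target
N, H = X.shape[0], 4

W1 = rng.standard_normal((1, H)) * 0.5
b1 = np.zeros((1, H))
W2 = rng.standard_normal((H, 1)) * 0.5
b2 = np.zeros((1, 1))

def loss_fn(W1, b1, W2, b2):
    A1 = np.tanh(X @ W1 + b1)
    return np.mean(((A1 @ W2 + b2) - Y) ** 2)

# Analytic gradients (identical chain rule to the training loop)
A1 = np.tanh(X @ W1 + b1)
error = (A1 @ W2 + b2) - Y
dL_dYpred = 2 * error / N
dL_dW2 = A1.T @ dL_dYpred
dL_dZ1 = (dL_dYpred @ W2.T) * (1 - A1 ** 2)
dL_dW1 = X.T @ dL_dZ1

# Central finite differences on every entry of W1 and W2
eps = 1e-6
for W, dW in [(W1, dL_dW1), (W2, dL_dW2)]:
    num = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps
        up = loss_fn(W1, b1, W2, b2)
        W[idx] -= 2 * eps
        down = loss_fn(W1, b1, W2, b2)
        W[idx] += eps                       # restore the weight
        num[idx] = (up - down) / (2 * eps)  # numerical slope
    assert np.allclose(num, dW, atol=1e-7), "backprop gradient mismatch"

print("gradient check passed")
```

If the hand-derived `dL_dW1`/`dL_dW2` formulas had a sign or transpose error, the assertion would fail immediately, which makes this a quick test to run after any change to the backprop code.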
99
06-neural-networks/nn_torch.py
Normal file
# nn_torch.py
#
# The same neural network as nn_numpy.py, but using PyTorch.
# Compare this to the numpy version to see what the framework handles for you:
# - Automatic differentiation (no manual backprop)
# - Built-in optimizers (Adam instead of hand-coded gradient descent)
# - GPU support (if available)
#
# CHEG 667-013
# E. M. Furst

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# ── Load and prepare data ──────────────────────────────────────

data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T_raw = data[:, 0]
Cp_raw = data[:, 1]

# Normalize to [0, 1]
T_min, T_max = T_raw.min(), T_raw.max()
Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()

X = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)
Y = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)

# ── Define the network ─────────────────────────────────────────
#
# nn.Sequential stacks layers in order. Compare this to nanoGPT's
# model.py, which uses the same PyTorch building blocks (nn.Linear,
# activation functions) but with many more layers.

H = 10  # hidden neurons

model = nn.Sequential(
    nn.Linear(1, H),  # input -> hidden (W1, b1)
    nn.Tanh(),        # activation
    nn.Linear(H, 1),  # hidden -> output (W2, b2)
)

print(f"Model:\n{model}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}\n")

# ── Training ───────────────────────────────────────────────────

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

epochs = 5000
log_interval = 500
losses = []

for epoch in range(epochs):
    # Forward pass -- PyTorch tracks operations for automatic differentiation
    Y_pred = model(X)
    loss = loss_fn(Y_pred, Y)
    losses.append(loss.item())

    # Backward pass -- PyTorch computes all gradients automatically
    optimizer.zero_grad()  # reset gradients from previous step
    loss.backward()        # compute gradients via automatic differentiation
    optimizer.step()       # update weights (Adam optimizer)

    if epoch % log_interval == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d}  Loss: {loss.item():.6f}")

# ── Results ────────────────────────────────────────────────────

# Predict on a fine grid
T_fine = torch.linspace(0, 1, 200).reshape(-1, 1)
with torch.no_grad():  # no gradient tracking needed for inference
    Cp_pred_norm = model(T_fine)

# Convert back to physical units
T_fine_K = T_fine.numpy() * (T_max - T_min) + T_min
Cp_pred = Cp_pred_norm.numpy() * (Cp_max - Cp_min) + Cp_min

# ── Plot ───────────────────────────────────────────────────────

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')
ax1.plot(T_fine_K, Cp_pred, 'r-', linewidth=2, label=f'NN ({H} neurons)')
ax1.set_xlabel('Temperature (K)')
ax1.set_ylabel('$C_p$ (kJ/kg/K)')
ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar — PyTorch')
ax1.legend()

ax2.semilogy(losses)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Mean Squared Error')
ax2.set_title('Training Loss')

plt.tight_layout()
plt.savefig('nn_fit_torch.png', dpi=150)
plt.show()
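The file's claim that `loss.backward()` replaces the manual backprop can be verified directly: run autograd on a tiny network, then recompute the gradients with the same chain rule as nn_numpy.py and compare. This check is not part of the course files; the network shape and data here are illustrative, and note that `nn.Linear` stores its weight as (out_features, in_features), the transpose of the numpy convention:

```python
# Check that autograd reproduces the hand-derived gradients on a tiny net.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(8, 1)    # synthetic inputs
Y = torch.sin(3 * X)    # arbitrary smooth target

lin1 = nn.Linear(1, 4)  # W1, b1 (weight stored as (out, in))
lin2 = nn.Linear(4, 1)  # W2, b2

# Forward pass, then autograd backward
A1 = torch.tanh(lin1(X))
loss = nn.MSELoss()(lin2(A1), Y)
loss.backward()

# Same chain rule as nn_numpy.py, transposed for PyTorch's (out, in) layout
with torch.no_grad():
    N = X.shape[0]
    dL_dYpred = 2 * (lin2(A1) - Y) / N                    # d(MSE)/d(prediction)
    dL_dW2 = (A1.T @ dL_dYpred).T                          # should match lin2.weight.grad
    dL_dZ1 = (dL_dYpred @ lin2.weight) * (1 - A1 ** 2)     # back through tanh
    dL_dW1 = (X.T @ dL_dZ1).T                              # should match lin1.weight.grad

print(torch.allclose(lin2.weight.grad, dL_dW2, atol=1e-6))
print(torch.allclose(lin1.weight.grad, dL_dW1, atol=1e-6))
```

Both comparisons should print `True` up to float32 round-off, which is the whole point of automatic differentiation: the framework applies exactly the chain rule we wrote out by hand.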
428
06-neural-networks/nn_workshop.ipynb
Normal file
{
"cells": [
{
"cell_type": "markdown",
"id": "xbsmj1hcj1g",
"metadata": {},
"source": [
"# Building a Neural Network: $C_p(T)$ for Nitrogen\n",
"\n",
"**CHEG 667-013 — LLMs for Engineers**\n",
"\n",
"In this notebook we fit the heat capacity of N₂ gas using three approaches:\n",
"1. A polynomial fit (the classical approach)\n",
"2. A neural network built from scratch in numpy\n",
"3. The same network in PyTorch\n",
"\n",
"This makes the ML concepts behind LLMs — weights, loss, gradient descent, overfitting — concrete and tangible."
]
},
{
"cell_type": "markdown",
"id": "szrl41l3xbq",
"metadata": {},
"source": [
"## 1. Load and plot the data\n",
"\n",
"The data is from the [NIST Chemistry WebBook](https://webbook.nist.gov/): isobaric heat capacity of N₂ at 1 bar, 300–2000 K."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "t4lqkcoeyil",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"data = np.loadtxt(\"data/n2_cp.csv\", delimiter=\",\", skiprows=1)\n",
"T_raw = data[:, 0]   # Temperature (K)\n",
"Cp_raw = data[:, 1]  # Cp (kJ/kg/K)\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6)\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('$C_p(T)$ for N$_2$ at 1 bar — NIST WebBook')\n",
"plt.show()\n",
"\n",
"print(f\"{len(T_raw)} data points, T range: {T_raw.min():.0f} – {T_raw.max():.0f} K\")"
]
},
{
"cell_type": "markdown",
"id": "1jyrgsvp7op",
"metadata": {},
"source": [
"## 2. Polynomial fit (baseline)\n",
"\n",
"Textbooks fit $C_p(T)$ with a polynomial: $C_p = a + bT + cT^2 + dT^3$. This is a **4-parameter** model. Let's fit it with `numpy.polyfit` and see how well it does."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4smvu4z2oro",
"metadata": {},
"outputs": [],
"source": [
"# Fit a cubic polynomial\n",
"coeffs = np.polyfit(T_raw, Cp_raw, 3)\n",
"poly = np.poly1d(coeffs)\n",
"\n",
"T_fine = np.linspace(T_raw.min(), T_raw.max(), 200)\n",
"Cp_poly = poly(T_fine)\n",
"\n",
"# Compute residuals\n",
"Cp_poly_at_data = poly(T_raw)\n",
"mse_poly = np.mean((Cp_poly_at_data - Cp_raw) ** 2)\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"plt.plot(T_fine, Cp_poly, 'b-', linewidth=2, label='Cubic polynomial (4 params)')\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('Polynomial fit')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"print(f\"Polynomial coefficients: {coeffs}\")\n",
"print(f\"MSE: {mse_poly:.8f}\")\n",
"print(\"Parameters: 4\")"
]
},
{
"cell_type": "markdown",
"id": "97y7mrcekji",
"metadata": {},
"source": [
"## 3. Neural network from scratch (numpy)\n",
"\n",
"Now let's build a one-hidden-layer neural network. The architecture:\n",
"\n",
"```\n",
"Input (1: T) -> Hidden (10 neurons, tanh) -> Output (1: Cp)\n",
"```\n",
"\n",
"We need to:\n",
"1. **Normalize** the data to [0, 1] so the network trains efficiently\n",
"2. **Forward pass**: compute predictions from input through each layer\n",
"3. **Loss**: mean squared error between predictions and data\n",
"4. **Backpropagation**: compute gradients of the loss w.r.t. each weight using the chain rule\n",
"5. **Gradient descent**: update weights in the direction that reduces the loss\n",
"\n",
"This is exactly what nanoGPT's `train.py` does — just at a much larger scale."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "365o7bqbwkr",
"metadata": {},
"outputs": [],
"source": [
"# Normalize inputs and outputs to [0, 1]\n",
"T_min, T_max = T_raw.min(), T_raw.max()\n",
"Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()\n",
"\n",
"T = (T_raw - T_min) / (T_max - T_min)\n",
"Cp = (Cp_raw - Cp_min) / (Cp_max - Cp_min)\n",
"\n",
"X = T.reshape(-1, 1)   # (N, 1) input matrix\n",
"Y = Cp.reshape(-1, 1)  # (N, 1) target matrix\n",
"N = X.shape[0]\n",
"\n",
"# Network setup\n",
"H = 10  # hidden neurons\n",
"\n",
"np.random.seed(42)\n",
"W1 = np.random.randn(1, H) * 0.5  # input -> hidden weights\n",
"b1 = np.zeros((1, H))             # hidden biases\n",
"W2 = np.random.randn(H, 1) * 0.5  # hidden -> output weights\n",
"b2 = np.zeros((1, 1))             # output bias\n",
"\n",
"print(f\"Parameters: W1({W1.shape}) + b1({b1.shape}) + W2({W2.shape}) + b2({b2.shape})\")\n",
"print(f\"Total: {W1.size + b1.size + W2.size + b2.size} parameters for {N} data points\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5w1ezs9t2w6",
"metadata": {},
"outputs": [],
"source": [
"# Training loop\n",
"learning_rate = 0.01\n",
"epochs = 5000\n",
"log_interval = 500\n",
"losses_np = []\n",
"\n",
"for epoch in range(epochs):\n",
"    # Forward pass\n",
"    Z1 = X @ W1 + b1       # hidden pre-activation (N, H)\n",
"    A1 = np.tanh(Z1)       # hidden activation (N, H)\n",
"    Y_pred = A1 @ W2 + b2  # output (N, 1)\n",
"\n",
"    # Loss (mean squared error)\n",
"    error = Y_pred - Y\n",
"    loss = np.mean(error ** 2)\n",
"    losses_np.append(loss)\n",
"\n",
"    # Backpropagation (chain rule, working backward)\n",
"    dL_dYpred = 2 * error / N\n",
"    dL_dW2 = A1.T @ dL_dYpred\n",
"    dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)\n",
"    dL_dA1 = dL_dYpred @ W2.T\n",
"    dL_dZ1 = dL_dA1 * (1 - A1 ** 2)  # tanh derivative\n",
"    dL_dW1 = X.T @ dL_dZ1\n",
"    dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)\n",
"\n",
"    # Gradient descent update\n",
"    W2 -= learning_rate * dL_dW2\n",
"    b2 -= learning_rate * dL_db2\n",
"    W1 -= learning_rate * dL_dW1\n",
"    b1 -= learning_rate * dL_db1\n",
"\n",
"    if epoch % log_interval == 0 or epoch == epochs - 1:\n",
"        print(f\"Epoch {epoch:5d}  Loss: {loss:.6f}\")"
]
},
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "onel9r0kjk",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Predict on a fine grid and convert back to physical units\n",
|
||||
"T_fine_norm = np.linspace(0, 1, 200).reshape(-1, 1)\n",
|
||||
"A1_fine = np.tanh(T_fine_norm @ W1 + b1)\n",
|
||||
"Cp_nn_norm = A1_fine @ W2 + b2\n",
|
||||
"Cp_nn = Cp_nn_norm * (Cp_max - Cp_min) + Cp_min\n",
|
||||
"T_fine_K = T_fine_norm * (T_max - T_min) + T_min\n",
|
||||
"\n",
|
||||
"# MSE in original units for comparison with polynomial\n",
|
||||
"Cp_nn_at_data = np.tanh(X @ W1 + b1) @ W2 + b2\n",
|
||||
"Cp_nn_at_data = Cp_nn_at_data * (Cp_max - Cp_min) + Cp_min\n",
|
||||
"mse_nn = np.mean((Cp_nn_at_data.flatten() - Cp_raw) ** 2)\n",
|
||||
"\n",
|
||||
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
|
||||
"\n",
|
||||
"ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
|
||||
"ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params, MSE={mse_poly:.2e})')\n",
|
||||
"ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r-', linewidth=2, label=f'NN numpy (31 params, MSE={mse_nn:.2e})')\n",
|
||||
"ax1.set_xlabel('Temperature (K)')\n",
|
||||
"ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
|
||||
"ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
|
||||
"ax1.legend()\n",
|
||||
"\n",
|
||||
"ax2.semilogy(losses_np)\n",
|
||||
"ax2.set_xlabel('Epoch')\n",
|
||||
"ax2.set_ylabel('MSE (normalized)')\n",
|
||||
"ax2.set_title('Training loss — numpy NN')\n",
|
||||
"\n",
|
||||
"plt.tight_layout()\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ea9z35qm9u8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Neural network in PyTorch\n",
|
||||
"\n",
|
||||
"The same network, but PyTorch handles backpropagation automatically. Compare the training loop above to the one below — `loss.backward()` replaces all of our manual gradient calculations.\n",
|
||||
"\n",
|
||||
"This is the same API used in nanoGPT's `model.py` — `nn.Linear`, activation functions, `optimizer.step()`."
|
||||
]
|
||||
},
|
||||
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3qxnrtyxqgz",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "# Prepare data as PyTorch tensors\n",
    "X_t = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)\n",
    "Y_t = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)\n",
    "\n",
    "# Define the network\n",
    "model = nn.Sequential(\n",
    "    nn.Linear(1, H),  # input -> hidden (W1, b1)\n",
    "    nn.Tanh(),        # activation\n",
    "    nn.Linear(H, 1),  # hidden -> output (W2, b2)\n",
    ")\n",
    "\n",
    "print(model)\n",
    "print(f\"Total parameters: {sum(p.numel() for p in model.parameters())}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ydl3ycnypps",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n",
    "loss_fn = nn.MSELoss()\n",
    "losses_torch = []\n",
    "\n",
    "for epoch in range(epochs):\n",
    "    Y_pred_t = model(X_t)\n",
    "    loss = loss_fn(Y_pred_t, Y_t)\n",
    "    losses_torch.append(loss.item())\n",
    "\n",
    "    optimizer.zero_grad()  # reset gradients\n",
    "    loss.backward()        # automatic differentiation\n",
    "    optimizer.step()       # update weights\n",
    "\n",
    "    if epoch % log_interval == 0 or epoch == epochs - 1:\n",
    "        print(f\"Epoch {epoch:5d}  Loss: {loss.item():.6f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bg0kvnk4ho",
   "metadata": {},
   "source": [
    "## 5. Compare all three approaches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "h2dfstoh8gd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# PyTorch predictions\n",
    "T_fine_t = torch.linspace(0, 1, 200).reshape(-1, 1)\n",
    "with torch.no_grad():\n",
    "    Cp_torch_norm = model(T_fine_t)\n",
    "Cp_torch = Cp_torch_norm.numpy() * (Cp_max - Cp_min) + Cp_min\n",
    "\n",
    "# MSE for PyTorch model\n",
    "with torch.no_grad():\n",
    "    Cp_torch_at_data = model(X_t).numpy() * (Cp_max - Cp_min) + Cp_min\n",
    "mse_torch = np.mean((Cp_torch_at_data.flatten() - Cp_raw) ** 2)\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
    "\n",
    "# Left: all three fits\n",
    "ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
    "ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label='Polynomial (4 params)')\n",
    "ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r--', linewidth=2, label='NN numpy (31 params)')\n",
    "ax1.plot(T_fine_K.flatten(), Cp_torch.flatten(), 'g-', linewidth=2, alpha=0.8, label='NN PyTorch (31 params)')\n",
    "ax1.set_xlabel('Temperature (K)')\n",
    "ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
    "ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
    "ax1.legend()\n",
    "\n",
    "# Right: training loss comparison\n",
    "ax2.semilogy(losses_np, label='numpy (gradient descent)')\n",
    "ax2.semilogy(losses_torch, label='PyTorch (Adam)')\n",
    "ax2.set_xlabel('Epoch')\n",
    "ax2.set_ylabel('MSE (normalized)')\n",
    "ax2.set_title('Training loss comparison')\n",
    "ax2.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"MSE — Polynomial: {mse_poly:.2e} | NN numpy: {mse_nn:.2e} | NN PyTorch: {mse_torch:.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "xyw3sr20brn",
   "metadata": {},
   "source": [
    "## 6. Extrapolation\n",
    "\n",
    "How do the models behave *outside* the training range? This is a key test — and where the differences become stark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fi3iq2sjh6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extrapolate beyond the training range\n",
    "T_extrap = np.linspace(100, 2500, 300)\n",
    "T_extrap_norm = ((T_extrap - T_min) / (T_max - T_min)).reshape(-1, 1)\n",
    "\n",
    "# Polynomial extrapolation\n",
    "Cp_poly_extrap = poly(T_extrap)\n",
    "\n",
    "# Numpy NN extrapolation\n",
    "A1_extrap = np.tanh(T_extrap_norm @ W1 + b1)\n",
    "Cp_nn_extrap = (A1_extrap @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\n",
    "\n",
    "# PyTorch NN extrapolation\n",
    "with torch.no_grad():\n",
    "    Cp_torch_extrap = model(torch.tensor(T_extrap_norm, dtype=torch.float32)).numpy()\n",
    "Cp_torch_extrap = Cp_torch_extrap * (Cp_max - Cp_min) + Cp_min\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
    "plt.plot(T_extrap, Cp_poly_extrap, 'b-', linewidth=2, label='Polynomial')\n",
    "plt.plot(T_extrap, Cp_nn_extrap.flatten(), 'r--', linewidth=2, label='NN numpy')\n",
    "plt.plot(T_extrap, Cp_torch_extrap.flatten(), 'g-', linewidth=2, alpha=0.8, label='NN PyTorch')\n",
    "plt.axvline(T_raw.min(), color='gray', linestyle=':', alpha=0.5, label='Training range')\n",
    "plt.axvline(T_raw.max(), color='gray', linestyle=':', alpha=0.5)\n",
    "plt.xlabel('Temperature (K)')\n",
    "plt.ylabel('$C_p$ (kJ/kg/K)')\n",
    "plt.title('Extrapolation beyond training data')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "yb2s18keiw",
   "metadata": {},
   "source": [
    "## 7. Exercises\n",
    "\n",
    "Try these in new cells below:\n",
    "\n",
    "1. **Change the number of hidden neurons** (`H`). Try 2, 5, 20, 50. How does the fit change? At what point does adding neurons stop helping?\n",
    "\n",
    "2. **Activation functions**: In the PyTorch model, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change?\n",
    "\n",
    "3. **Optimizer comparison**: Replace `Adam` with `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare?\n",
    "\n",
    "4. **Remove normalization**: Use `T_raw` and `Cp_raw` directly (no scaling to [0,1]). What happens? Can you fix it by adjusting the learning rate?\n",
    "\n",
    "5. **Overfitting**: Set `H = 100` and train for 20,000 epochs. Does it fit the training data well? Look at the extrapolation — is it reasonable?\n",
    "\n",
    "6. **Higher-order polynomial**: Try `np.polyfit(T_raw, Cp_raw, 10)`. How does it compare to the cubic? How does it extrapolate?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}