README updates, textbook polynomial cell, self-contained notebook

Same set of changes as che-computing-dev/LLMs:
- 03/04/05 READMEs: uv add workflow, required model caching
- 05-tool-use: add Setup section, requirements.txt
- 06-neural-networks: textbook cubic polynomial comparison cell
- 06-neural-networks: add nn_workshop_colab.ipynb (self-contained, inline data)
- vocab.md: catch up with terms from 02-05

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eric Furst 2026-05-04 10:18:10 -04:00
commit f7d2b48f5a
7 changed files with 534 additions and 23 deletions


@@ -93,6 +93,20 @@
"print(f\"Parameters: 4\")"
]
},
{
"cell_type": "markdown",
"id": "07673988",
"source": "### Comparing to a textbook polynomial\n\nMost thermodynamics textbooks tabulate $C_p$ correlations of the same cubic form, but with **fixed coefficients** drawn from a broader fit and reported in **molar** units, $C_p$ in J/(mol·K). For nitrogen, a typical reference set is:\n\n| Coefficient | Value |\n|-------------|-------|\n| $a$ | 28.883 |\n| $b$ | $-0.157 \\times 10^{-2}$ |\n| $c$ | $0.808 \\times 10^{-5}$ |\n| $d$ | $-2.871 \\times 10^{-9}$ |\n\n**Valid range: 2731800 K.** Outside this range the textbook polynomial is being extrapolated and is not guaranteed by the original fit.\n\nTo compare with our NIST mass-basis data, we divide by the molar mass of N$_2$ (28.014 g/mol) — convenient because J/(mol·K) ÷ g/mol = kJ/(kg·K).",
"metadata": {}
},
{
"cell_type": "code",
"id": "f21c64a3",
"source": "# Reference polynomial from textbook (Cp in J/(mol·K))\n# Form: Cp = a + bT + cT^2 + dT^3\n# Valid range: 273-1800 K\n\na_ref = 28.883\nb_ref = -0.157e-2\nc_ref = 0.808e-5\nd_ref = -2.871e-9\nM_N2 = 28.014 # g/mol\n\nT_REF_MIN, T_REF_MAX = 273.0, 1800.0\n\ndef cp_ref_molar(T):\n \"\"\"Textbook cubic in J/(mol·K).\"\"\"\n return a_ref + b_ref * T + c_ref * T**2 + d_ref * T**3\n\ndef cp_ref(T):\n \"\"\"Same fit converted to kJ/(kg·K) to match the NIST data.\"\"\"\n return cp_ref_molar(T) / M_N2\n\n# Evaluate the textbook polynomial across the full plot range\nCp_ref_full = cp_ref(T_fine)\n\n# MSE on the data points that fall within the textbook's stated valid range\nin_range = (T_raw >= T_REF_MIN) & (T_raw <= T_REF_MAX)\nCp_ref_at_data = cp_ref(T_raw[in_range])\nmse_ref = np.mean((Cp_ref_at_data - Cp_raw[in_range]) ** 2)\n\n# Plot both fits across the full data range; shade the textbook's valid range\nplt.figure(figsize=(8, 5))\nplt.axvspan(T_REF_MIN, T_REF_MAX, alpha=0.07, color='green',\n label='Textbook valid range (2731800 K)')\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nplt.plot(T_fine, Cp_ref_full, 'g--', linewidth=2,\n label=f'Textbook polynomial (4 params, MSE={mse_ref:.2e})')\nplt.plot(T_fine, Cp_poly, 'b-', linewidth=2,\n label=f'numpy.polyfit (4 params, MSE={mse_poly:.2e})')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Two cubic fits: textbook coefficients vs. fit to these 35 points')\nplt.legend(fontsize=9)\nplt.show()\n\nprint(f\"Textbook polynomial MSE (within 273-1800 K, {in_range.sum()} points): {mse_ref:.6e}\")\nprint(f\"numpy.polyfit MSE (all {len(T_raw)} points): {mse_poly:.6e}\")\nprint()\nprint(\"Note: the textbook polynomial is plotted across the full range, but its\")\nprint(\"authors only claim validity from 273 to 1800 K (shaded band).\")\nprint(\"Beyond 1800 K, the curve is an extrapolation — useful as a teaching point\")\nprint(\"about trusting correlations only over their stated range.\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "97y7mrcekji",
@@ -425,4 +439,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,415 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "xbsmj1hcj1g",
"metadata": {},
"source": "# Building a Neural Network: $C_p(T)$ for Nitrogen — self-contained edition\n\n**CHEG 667-013 — LLMs for Engineers**\n\nThis is a **self-contained version** of the notebook. The 35 NIST data points are embedded directly so it runs anywhere without needing the workshop repository. Suitable for Google Colab, JupyterHub, or any environment with `numpy`, `matplotlib`, and `torch`.\n\nIn this notebook we fit the heat capacity of N₂ gas using three approaches:\n1. A polynomial fit (the classical approach)\n2. A neural network built from scratch in numpy\n3. The same network in PyTorch\n\nThis makes the ML concepts behind LLMs — weights, loss, gradient descent, overfitting — concrete and tangible."
},
{
"cell_type": "markdown",
"id": "szrl41l3xbq",
"metadata": {},
"source": [
"## 1. Load and plot the data\n",
"\n",
"The data is from the [NIST Chemistry WebBook](https://webbook.nist.gov/): isobaric heat capacity of N₂ at 1 bar, 3002000 K."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "t4lqkcoeyil",
"metadata": {},
"outputs": [],
"source": "import numpy as np\nimport matplotlib.pyplot as plt\n\n# NIST WebBook: ideal gas Cp of N2 at 1 bar, 300-2000 K\n# https://webbook.nist.gov/\nnist_data = np.array([\n [ 300.0, 1.0413], [ 350.0, 1.0423], [ 400.0, 1.0450], [ 450.0, 1.0497],\n [ 500.0, 1.0564], [ 550.0, 1.0650], [ 600.0, 1.0751], [ 650.0, 1.0863],\n [ 700.0, 1.0981], [ 750.0, 1.1102], [ 800.0, 1.1223], [ 850.0, 1.1342],\n [ 900.0, 1.1457], [ 950.0, 1.1568], [1000.0, 1.1674], [1050.0, 1.1774],\n [1100.0, 1.1868], [1150.0, 1.1957], [1200.0, 1.2040], [1250.0, 1.2118],\n [1300.0, 1.2191], [1350.0, 1.2260], [1400.0, 1.2323], [1450.0, 1.2383],\n [1500.0, 1.2439], [1550.0, 1.2491], [1600.0, 1.2540], [1650.0, 1.2586],\n [1700.0, 1.2630], [1750.0, 1.2670], [1800.0, 1.2708], [1850.0, 1.2744],\n [1900.0, 1.2778], [1950.0, 1.2810], [2000.0, 1.2841],\n])\n\nT_raw = nist_data[:, 0] # Temperature (K)\nCp_raw = nist_data[:, 1] # Cp (kJ/kg/K)\n\nplt.figure(figsize=(8, 5))\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6)\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('$C_p(T)$ for N$_2$ at 1 bar — NIST WebBook')\nplt.show()\n\nprint(f\"{len(T_raw)} data points, T range: {T_raw.min():.0f} {T_raw.max():.0f} K\")"
},
{
"cell_type": "markdown",
"id": "1jyrgsvp7op",
"metadata": {},
"source": [
"## 2. Polynomial fit (baseline)\n",
"\n",
"Textbooks fit $C_p(T)$ with a polynomial: $C_p = a + bT + cT^2 + dT^3$. This is a **4-parameter** model. Let's fit it with `numpy.polyfit` and see how well it does."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4smvu4z2oro",
"metadata": {},
"outputs": [],
"source": [
"# Fit a cubic polynomial\n",
"coeffs = np.polyfit(T_raw, Cp_raw, 3)\n",
"poly = np.poly1d(coeffs)\n",
"\n",
"T_fine = np.linspace(T_raw.min(), T_raw.max(), 200)\n",
"Cp_poly = poly(T_fine)\n",
"\n",
"# Compute residuals\n",
"Cp_poly_at_data = poly(T_raw)\n",
"mse_poly = np.mean((Cp_poly_at_data - Cp_raw) ** 2)\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"plt.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Cubic polynomial (4 params)')\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('Polynomial fit')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"print(f\"Polynomial coefficients: {coeffs}\")\n",
"print(f\"MSE: {mse_poly:.8f}\")\n",
"print(f\"Parameters: 4\")"
]
},
{
"cell_type": "markdown",
"id": "07673988",
"source": "### Comparing to a textbook polynomial\n\nMost thermodynamics textbooks tabulate $C_p$ correlations of the same cubic form, but with **fixed coefficients** drawn from a broader fit and reported in **molar** units, $C_p$ in J/(mol·K). For nitrogen, a typical reference set is:\n\n| Coefficient | Value |\n|-------------|-------|\n| $a$ | 28.883 |\n| $b$ | $-0.157 \\times 10^{-2}$ |\n| $c$ | $0.808 \\times 10^{-5}$ |\n| $d$ | $-2.871 \\times 10^{-9}$ |\n\n**Valid range: 2731800 K.** Outside this range the textbook polynomial is being extrapolated and is not guaranteed by the original fit.\n\nTo compare with our NIST mass-basis data, we divide by the molar mass of N$_2$ (28.014 g/mol) — convenient because J/(mol·K) ÷ g/mol = kJ/(kg·K).",
"metadata": {}
},
{
"cell_type": "code",
"id": "f21c64a3",
"source": "# Reference polynomial from textbook (Cp in J/(mol·K))\n# Form: Cp = a + bT + cT^2 + dT^3\n# Valid range: 273-1800 K\n\na_ref = 28.883\nb_ref = -0.157e-2\nc_ref = 0.808e-5\nd_ref = -2.871e-9\nM_N2 = 28.014 # g/mol\n\nT_REF_MIN, T_REF_MAX = 273.0, 1800.0\n\ndef cp_ref_molar(T):\n \"\"\"Textbook cubic in J/(mol·K).\"\"\"\n return a_ref + b_ref * T + c_ref * T**2 + d_ref * T**3\n\ndef cp_ref(T):\n \"\"\"Same fit converted to kJ/(kg·K) to match the NIST data.\"\"\"\n return cp_ref_molar(T) / M_N2\n\n# Evaluate the textbook polynomial across the full plot range\nCp_ref_full = cp_ref(T_fine)\n\n# MSE on the data points that fall within the textbook's stated valid range\nin_range = (T_raw >= T_REF_MIN) & (T_raw <= T_REF_MAX)\nCp_ref_at_data = cp_ref(T_raw[in_range])\nmse_ref = np.mean((Cp_ref_at_data - Cp_raw[in_range]) ** 2)\n\n# Plot both fits across the full data range; shade the textbook's valid range\nplt.figure(figsize=(8, 5))\nplt.axvspan(T_REF_MIN, T_REF_MAX, alpha=0.07, color='green',\n label='Textbook valid range (2731800 K)')\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nplt.plot(T_fine, Cp_ref_full, 'g--', linewidth=2,\n label=f'Textbook polynomial (4 params, MSE={mse_ref:.2e})')\nplt.plot(T_fine, Cp_poly, 'b-', linewidth=2,\n label=f'numpy.polyfit (4 params, MSE={mse_poly:.2e})')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Two cubic fits: textbook coefficients vs. fit to these 35 points')\nplt.legend(fontsize=9)\nplt.show()\n\nprint(f\"Textbook polynomial MSE (within 273-1800 K, {in_range.sum()} points): {mse_ref:.6e}\")\nprint(f\"numpy.polyfit MSE (all {len(T_raw)} points): {mse_poly:.6e}\")\nprint()\nprint(\"Note: the textbook polynomial is plotted across the full range, but its\")\nprint(\"authors only claim validity from 273 to 1800 K (shaded band).\")\nprint(\"Beyond 1800 K, the curve is an extrapolation — useful as a teaching point\")\nprint(\"about trusting correlations only over their stated range.\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "97y7mrcekji",
"metadata": {},
"source": [
"## 3. Neural network from scratch (numpy)\n",
"\n",
"Now let's build a one-hidden-layer neural network. The architecture:\n",
"\n",
"```\n",
"Input (1: T) -> Hidden (10 neurons, tanh) -> Output (1: Cp)\n",
"```\n",
"\n",
"We need to:\n",
"1. **Normalize** the data to [0, 1] so the network trains efficiently\n",
"2. **Forward pass**: compute predictions from input through each layer\n",
"3. **Loss**: mean squared error between predictions and data\n",
"4. **Backpropagation**: compute gradients of the loss w.r.t. each weight using the chain rule\n",
"5. **Gradient descent**: update weights in the direction that reduces the loss\n",
"\n",
"This is exactly what nanoGPT's `train.py` does — just at a much larger scale."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "365o7bqbwkr",
"metadata": {},
"outputs": [],
"source": [
"# Normalize inputs and outputs to [0, 1]\n",
"T_min, T_max = T_raw.min(), T_raw.max()\n",
"Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()\n",
"\n",
"T = (T_raw - T_min) / (T_max - T_min)\n",
"Cp = (Cp_raw - Cp_min) / (Cp_max - Cp_min)\n",
"\n",
"X = T.reshape(-1, 1) # (N, 1) input matrix\n",
"Y = Cp.reshape(-1, 1) # (N, 1) target matrix\n",
"N = X.shape[0]\n",
"\n",
"# Network setup\n",
"H = 10 # hidden neurons\n",
"\n",
"np.random.seed(42)\n",
"W1 = np.random.randn(1, H) * 0.5 # input -> hidden weights\n",
"b1 = np.zeros((1, H)) # hidden biases\n",
"W2 = np.random.randn(H, 1) * 0.5 # hidden -> output weights\n",
"b2 = np.zeros((1, 1)) # output bias\n",
"\n",
"print(f\"Parameters: W1({W1.shape}) + b1({b1.shape}) + W2({W2.shape}) + b2({b2.shape})\")\n",
"print(f\"Total: {W1.size + b1.size + W2.size + b2.size} parameters for {N} data points\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5w1ezs9t2w6",
"metadata": {},
"outputs": [],
"source": [
"# Training loop\n",
"learning_rate = 0.01\n",
"epochs = 5000\n",
"log_interval = 500\n",
"losses_np = []\n",
"\n",
"for epoch in range(epochs):\n",
" # Forward pass\n",
" Z1 = X @ W1 + b1 # hidden pre-activation (N, H)\n",
" A1 = np.tanh(Z1) # hidden activation (N, H)\n",
" Y_pred = A1 @ W2 + b2 # output (N, 1)\n",
"\n",
" # Loss (mean squared error)\n",
" error = Y_pred - Y\n",
" loss = np.mean(error ** 2)\n",
" losses_np.append(loss)\n",
"\n",
" # Backpropagation (chain rule, working backward)\n",
" dL_dYpred = 2 * error / N\n",
" dL_dW2 = A1.T @ dL_dYpred\n",
" dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)\n",
" dL_dA1 = dL_dYpred @ W2.T\n",
" dL_dZ1 = dL_dA1 * (1 - A1 ** 2) # tanh derivative\n",
" dL_dW1 = X.T @ dL_dZ1\n",
" dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)\n",
"\n",
" # Gradient descent update\n",
" W2 -= learning_rate * dL_dW2\n",
" b2 -= learning_rate * dL_db2\n",
" W1 -= learning_rate * dL_dW1\n",
" b1 -= learning_rate * dL_db1\n",
"\n",
" if epoch % log_interval == 0 or epoch == epochs - 1:\n",
" print(f\"Epoch {epoch:5d} Loss: {loss:.6f}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "onel9r0kjk",
"metadata": {},
"outputs": [],
"source": [
"# Predict on a fine grid and convert back to physical units\n",
"T_fine_norm = np.linspace(0, 1, 200).reshape(-1, 1)\n",
"A1_fine = np.tanh(T_fine_norm @ W1 + b1)\n",
"Cp_nn_norm = A1_fine @ W2 + b2\n",
"Cp_nn = Cp_nn_norm * (Cp_max - Cp_min) + Cp_min\n",
"T_fine_K = T_fine_norm * (T_max - T_min) + T_min\n",
"\n",
"# MSE in original units for comparison with polynomial\n",
"Cp_nn_at_data = np.tanh(X @ W1 + b1) @ W2 + b2\n",
"Cp_nn_at_data = Cp_nn_at_data * (Cp_max - Cp_min) + Cp_min\n",
"mse_nn = np.mean((Cp_nn_at_data.flatten() - Cp_raw) ** 2)\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
"\n",
"ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params, MSE={mse_poly:.2e})')\n",
"ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r-', linewidth=2, label=f'NN numpy (31 params, MSE={mse_nn:.2e})')\n",
"ax1.set_xlabel('Temperature (K)')\n",
"ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
"ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
"ax1.legend()\n",
"\n",
"ax2.semilogy(losses_np)\n",
"ax2.set_xlabel('Epoch')\n",
"ax2.set_ylabel('MSE (normalized)')\n",
"ax2.set_title('Training loss — numpy NN')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "ea9z35qm9u8",
"metadata": {},
"source": [
"## 4. Neural network in PyTorch\n",
"\n",
"The same network, but PyTorch handles backpropagation automatically. Compare the training loop above to the one below — `loss.backward()` replaces all of our manual gradient calculations.\n",
"\n",
"This is the same API used in nanoGPT's `model.py` — `nn.Linear`, activation functions, `optimizer.step()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3qxnrtyxqgz",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"# Prepare data as PyTorch tensors\n",
"X_t = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)\n",
"Y_t = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)\n",
"\n",
"# Define the network\n",
"model = nn.Sequential(\n",
" nn.Linear(1, H), # input -> hidden (W1, b1)\n",
" nn.Tanh(), # activation\n",
" nn.Linear(H, 1), # hidden -> output (W2, b2)\n",
")\n",
"\n",
"print(model)\n",
"print(f\"Total parameters: {sum(p.numel() for p in model.parameters())}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ydl3ycnypps",
"metadata": {},
"outputs": [],
"source": [
"# Train\n",
"optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n",
"loss_fn = nn.MSELoss()\n",
"losses_torch = []\n",
"\n",
"for epoch in range(epochs):\n",
" Y_pred_t = model(X_t)\n",
" loss = loss_fn(Y_pred_t, Y_t)\n",
" losses_torch.append(loss.item())\n",
"\n",
" optimizer.zero_grad() # reset gradients\n",
" loss.backward() # automatic differentiation\n",
" optimizer.step() # update weights\n",
"\n",
" if epoch % log_interval == 0 or epoch == epochs - 1:\n",
" print(f\"Epoch {epoch:5d} Loss: {loss.item():.6f}\")"
]
},
{
"cell_type": "markdown",
"id": "bg0kvnk4ho",
"metadata": {},
"source": [
"## 5. Compare all three approaches"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h2dfstoh8gd",
"metadata": {},
"outputs": [],
"source": [
"# PyTorch predictions\n",
"T_fine_t = torch.linspace(0, 1, 200).reshape(-1, 1)\n",
"with torch.no_grad():\n",
" Cp_torch_norm = model(T_fine_t)\n",
"Cp_torch = Cp_torch_norm.numpy() * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"# MSE for PyTorch model\n",
"with torch.no_grad():\n",
" Cp_torch_at_data = model(X_t).numpy() * (Cp_max - Cp_min) + Cp_min\n",
"mse_torch = np.mean((Cp_torch_at_data.flatten() - Cp_raw) ** 2)\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
"\n",
"# Left: all three fits\n",
"ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params)')\n",
"ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r--', linewidth=2, label=f'NN numpy (31 params)')\n",
"ax1.plot(T_fine_K.flatten(), Cp_torch.flatten(), 'g-', linewidth=2, alpha=0.8, label=f'NN PyTorch (31 params)')\n",
"ax1.set_xlabel('Temperature (K)')\n",
"ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
"ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
"ax1.legend()\n",
"\n",
"# Right: training loss comparison\n",
"ax2.semilogy(losses_np, label='numpy (gradient descent)')\n",
"ax2.semilogy(losses_torch, label='PyTorch (Adam)')\n",
"ax2.set_xlabel('Epoch')\n",
"ax2.set_ylabel('MSE (normalized)')\n",
"ax2.set_title('Training loss comparison')\n",
"ax2.legend()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n",
"\n",
"print(f\"MSE — Polynomial: {mse_poly:.2e} | NN numpy: {mse_nn:.2e} | NN PyTorch: {mse_torch:.2e}\")"
]
},
{
"cell_type": "markdown",
"id": "xyw3sr20brn",
"metadata": {},
"source": [
"## 6. Extrapolation\n",
"\n",
"How do the models behave *outside* the training range? This is a key test — and where the differences become stark."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fi3iq2sjh6",
"metadata": {},
"outputs": [],
"source": [
"# Extrapolate beyond the training range\n",
"T_extrap = np.linspace(100, 2500, 300)\n",
"T_extrap_norm = ((T_extrap - T_min) / (T_max - T_min)).reshape(-1, 1)\n",
"\n",
"# Polynomial extrapolation\n",
"Cp_poly_extrap = poly(T_extrap)\n",
"\n",
"# Numpy NN extrapolation\n",
"A1_extrap = np.tanh(T_extrap_norm @ W1 + b1)\n",
"Cp_nn_extrap = (A1_extrap @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"# PyTorch NN extrapolation\n",
"with torch.no_grad():\n",
" Cp_torch_extrap = model(torch.tensor(T_extrap_norm, dtype=torch.float32)).numpy()\n",
"Cp_torch_extrap = Cp_torch_extrap * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"plt.plot(T_extrap, Cp_poly_extrap, 'b-', linewidth=2, label='Polynomial')\n",
"plt.plot(T_extrap, Cp_nn_extrap.flatten(), 'r--', linewidth=2, label='NN numpy')\n",
"plt.plot(T_extrap, Cp_torch_extrap.flatten(), 'g-', linewidth=2, alpha=0.8, label='NN PyTorch')\n",
"plt.axvline(T_raw.min(), color='gray', linestyle=':', alpha=0.5, label='Training range')\n",
"plt.axvline(T_raw.max(), color='gray', linestyle=':', alpha=0.5)\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('Extrapolation beyond training data')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "yb2s18keiw",
"metadata": {},
"source": [
"## 7. Exercises\n",
"\n",
"Try these in new cells below:\n",
"\n",
"1. **Change the number of hidden neurons** (`H`). Try 2, 5, 20, 50. How does the fit change? At what point does adding neurons stop helping?\n",
"\n",
"2. **Activation functions**: In the PyTorch model, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change?\n",
"\n",
"3. **Optimizer comparison**: Replace `Adam` with `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare?\n",
"\n",
"4. **Remove normalization**: Use `T_raw` and `Cp_raw` directly (no scaling to [0,1]). What happens? Can you fix it by adjusting the learning rate?\n",
"\n",
"5. **Overfitting**: Set `H = 100` and train for 20,000 epochs. Does it fit the training data well? Look at the extrapolation — is it reasonable?\n",
"\n",
"6. **Higher-order polynomial**: Try `np.polyfit(T_raw, Cp_raw, 10)`. How does it compare to the cubic? How does it extrapolate?"
]
}
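,
{
"cell_type": "markdown",
"id": "h1starternote",
"metadata": {},
"source": "### Starter sketch for Exercise 1\n\nA minimal starting point for the first exercise, assuming the numpy-network cells in Section 3 have already been run. It repeats the same training loop for several hidden-layer widths and reports the MSE in physical units. The helper name `fit_numpy_nn` and the particular `H` values are illustrative choices, not part of the workshop code."
},
{
"cell_type": "code",
"execution_count": null,
"id": "h1startercode",
"metadata": {},
"outputs": [],
"source": "# Starter sketch for Exercise 1 (illustrative; assumes X, Y, T_raw, Cp_raw, Cp_min, Cp_max from Section 3)\n# Retrain the one-hidden-layer tanh network for several widths H and compare the MSE.\n\ndef fit_numpy_nn(H, epochs=5000, lr=0.01, seed=42):\n    \"\"\"Train the scratch network with H hidden neurons; return MSE in (kJ/kg/K)^2.\"\"\"\n    rng = np.random.default_rng(seed)\n    W1 = rng.standard_normal((1, H)) * 0.5\n    b1 = np.zeros((1, H))\n    W2 = rng.standard_normal((H, 1)) * 0.5\n    b2 = np.zeros((1, 1))\n    for _ in range(epochs):\n        A1 = np.tanh(X @ W1 + b1)               # forward pass\n        error = (A1 @ W2 + b2) - Y\n        dL_dYpred = 2 * error / len(X)          # backprop, same chain rule as Section 3\n        dL_dW2 = A1.T @ dL_dYpred\n        dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)\n        dL_dZ1 = (dL_dYpred @ W2.T) * (1 - A1 ** 2)\n        dL_dW1 = X.T @ dL_dZ1\n        dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)\n        W1 -= lr * dL_dW1\n        b1 -= lr * dL_db1\n        W2 -= lr * dL_dW2\n        b2 -= lr * dL_db2\n    Cp_fit = (np.tanh(X @ W1 + b1) @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\n    return np.mean((Cp_fit.flatten() - Cp_raw) ** 2)\n\nfor H_try in [2, 5, 10, 20, 50]:\n    print(f\"H = {H_try:3d} -> MSE = {fit_numpy_nn(H_try):.2e}\")"
}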
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}