llm-workshop/06-neural-networks/nn_workshop.ipynb
Eric Furst f7d2b48f5a README updates, textbook polynomial cell, self-contained notebook
Same set of changes as che-computing-dev/LLMs:
- 03/04/05 READMEs: uv add workflow, required model caching
- 05-tool-use: add Setup section, requirements.txt
- 06-neural-networks: textbook cubic polynomial comparison cell
- 06-neural-networks: add nn_workshop_colab.ipynb (self-contained, inline data)
- vocab.md: catch up with terms from 02-05

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 10:18:10 -04:00


{
"cells": [
{
"cell_type": "markdown",
"id": "xbsmj1hcj1g",
"metadata": {},
"source": [
"# Building a Neural Network: $C_p(T)$ for Nitrogen\n",
"\n",
"**CHEG 667-013 — LLMs for Engineers**\n",
"\n",
"In this notebook we fit the heat capacity of N₂ gas using three approaches:\n",
"1. A polynomial fit (the classical approach)\n",
"2. A neural network built from scratch in numpy\n",
"3. The same network in PyTorch\n",
"\n",
"This makes the ML concepts behind LLMs — weights, loss, gradient descent, overfitting — concrete and tangible."
]
},
{
"cell_type": "markdown",
"id": "szrl41l3xbq",
"metadata": {},
"source": [
"## 1. Load and plot the data\n",
"\n",
"The data is from the [NIST Chemistry WebBook](https://webbook.nist.gov/): isobaric heat capacity of N₂ at 1 bar, 3002000 K."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "t4lqkcoeyil",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"data = np.loadtxt(\"data/n2_cp.csv\", delimiter=\",\", skiprows=1)\n",
"T_raw = data[:, 0] # Temperature (K)\n",
"Cp_raw = data[:, 1] # Cp (kJ/kg/K)\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6)\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('$C_p(T)$ for N$_2$ at 1 bar — NIST WebBook')\n",
"plt.show()\n",
"\n",
"print(f\"{len(T_raw)} data points, T range: {T_raw.min():.0f} {T_raw.max():.0f} K\")"
]
},
{
"cell_type": "markdown",
"id": "1jyrgsvp7op",
"metadata": {},
"source": [
"## 2. Polynomial fit (baseline)\n",
"\n",
"Textbooks fit $C_p(T)$ with a polynomial: $C_p = a + bT + cT^2 + dT^3$. This is a **4-parameter** model. Let's fit it with `numpy.polyfit` and see how well it does."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4smvu4z2oro",
"metadata": {},
"outputs": [],
"source": [
"# Fit a cubic polynomial\n",
"coeffs = np.polyfit(T_raw, Cp_raw, 3)\n",
"poly = np.poly1d(coeffs)\n",
"\n",
"T_fine = np.linspace(T_raw.min(), T_raw.max(), 200)\n",
"Cp_poly = poly(T_fine)\n",
"\n",
"# Compute residuals\n",
"Cp_poly_at_data = poly(T_raw)\n",
"mse_poly = np.mean((Cp_poly_at_data - Cp_raw) ** 2)\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"plt.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Cubic polynomial (4 params)')\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('Polynomial fit')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"print(f\"Polynomial coefficients: {coeffs}\")\n",
"print(f\"MSE: {mse_poly:.8f}\")\n",
"print(f\"Parameters: 4\")"
]
},
{
"cell_type": "markdown",
"id": "07673988",
"source": "### Comparing to a textbook polynomial\n\nMost thermodynamics textbooks tabulate $C_p$ correlations of the same cubic form, but with **fixed coefficients** drawn from a broader fit and reported in **molar** units, $C_p$ in J/(mol·K). For nitrogen, a typical reference set is:\n\n| Coefficient | Value |\n|-------------|-------|\n| $a$ | 28.883 |\n| $b$ | $-0.157 \\times 10^{-2}$ |\n| $c$ | $0.808 \\times 10^{-5}$ |\n| $d$ | $-2.871 \\times 10^{-9}$ |\n\n**Valid range: 2731800 K.** Outside this range the textbook polynomial is being extrapolated and is not guaranteed by the original fit.\n\nTo compare with our NIST mass-basis data, we divide by the molar mass of N$_2$ (28.014 g/mol) — convenient because J/(mol·K) ÷ g/mol = kJ/(kg·K).",
"metadata": {}
},
{
"cell_type": "code",
"id": "f21c64a3",
"source": "# Reference polynomial from textbook (Cp in J/(mol·K))\n# Form: Cp = a + bT + cT^2 + dT^3\n# Valid range: 273-1800 K\n\na_ref = 28.883\nb_ref = -0.157e-2\nc_ref = 0.808e-5\nd_ref = -2.871e-9\nM_N2 = 28.014 # g/mol\n\nT_REF_MIN, T_REF_MAX = 273.0, 1800.0\n\ndef cp_ref_molar(T):\n \"\"\"Textbook cubic in J/(mol·K).\"\"\"\n return a_ref + b_ref * T + c_ref * T**2 + d_ref * T**3\n\ndef cp_ref(T):\n \"\"\"Same fit converted to kJ/(kg·K) to match the NIST data.\"\"\"\n return cp_ref_molar(T) / M_N2\n\n# Evaluate the textbook polynomial across the full plot range\nCp_ref_full = cp_ref(T_fine)\n\n# MSE on the data points that fall within the textbook's stated valid range\nin_range = (T_raw >= T_REF_MIN) & (T_raw <= T_REF_MAX)\nCp_ref_at_data = cp_ref(T_raw[in_range])\nmse_ref = np.mean((Cp_ref_at_data - Cp_raw[in_range]) ** 2)\n\n# Plot both fits across the full data range; shade the textbook's valid range\nplt.figure(figsize=(8, 5))\nplt.axvspan(T_REF_MIN, T_REF_MAX, alpha=0.07, color='green',\n label='Textbook valid range (2731800 K)')\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nplt.plot(T_fine, Cp_ref_full, 'g--', linewidth=2,\n label=f'Textbook polynomial (4 params, MSE={mse_ref:.2e})')\nplt.plot(T_fine, Cp_poly, 'b-', linewidth=2,\n label=f'numpy.polyfit (4 params, MSE={mse_poly:.2e})')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Two cubic fits: textbook coefficients vs. 
fit to these 35 points')\nplt.legend(fontsize=9)\nplt.show()\n\nprint(f\"Textbook polynomial MSE (within 273-1800 K, {in_range.sum()} points): {mse_ref:.6e}\")\nprint(f\"numpy.polyfit MSE (all {len(T_raw)} points): {mse_poly:.6e}\")\nprint()\nprint(\"Note: the textbook polynomial is plotted across the full range, but its\")\nprint(\"authors only claim validity from 273 to 1800 K (shaded band).\")\nprint(\"Beyond 1800 K, the curve is an extrapolation — useful as a teaching point\")\nprint(\"about trusting correlations only over their stated range.\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
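{
"cell_type": "markdown",
"id": "cprefcheckmd",
"metadata": {},
"source": "A quick sanity check on the unit conversion: commonly tabulated values put $C_p$ of N$_2$ near 1.04 kJ/(kg·K) at room temperature, so the converted textbook cubic should land close to that at 298 K."
},
{
"cell_type": "code",
"execution_count": null,
"id": "cprefcheckcode",
"metadata": {},
"outputs": [],
"source": "# Spot-check the molar -> mass conversion: expect roughly 1.04 kJ/(kg·K)\n# for N2 near room temperature.\nT_room = 298.15\nprint(f\"cp_ref({T_room} K) = {cp_ref(T_room):.4f} kJ/(kg·K)\")"
},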
{
"cell_type": "markdown",
"id": "97y7mrcekji",
"metadata": {},
"source": [
"## 3. Neural network from scratch (numpy)\n",
"\n",
"Now let's build a one-hidden-layer neural network. The architecture:\n",
"\n",
"```\n",
"Input (1: T) -> Hidden (10 neurons, tanh) -> Output (1: Cp)\n",
"```\n",
"\n",
"We need to:\n",
"1. **Normalize** the data to [0, 1] so the network trains efficiently\n",
"2. **Forward pass**: compute predictions from input through each layer\n",
"3. **Loss**: mean squared error between predictions and data\n",
"4. **Backpropagation**: compute gradients of the loss w.r.t. each weight using the chain rule\n",
"5. **Gradient descent**: update weights in the direction that reduces the loss\n",
"\n",
"This is exactly what nanoGPT's `train.py` does — just at a much larger scale."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "365o7bqbwkr",
"metadata": {},
"outputs": [],
"source": [
"# Normalize inputs and outputs to [0, 1]\n",
"T_min, T_max = T_raw.min(), T_raw.max()\n",
"Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()\n",
"\n",
"T = (T_raw - T_min) / (T_max - T_min)\n",
"Cp = (Cp_raw - Cp_min) / (Cp_max - Cp_min)\n",
"\n",
"X = T.reshape(-1, 1) # (N, 1) input matrix\n",
"Y = Cp.reshape(-1, 1) # (N, 1) target matrix\n",
"N = X.shape[0]\n",
"\n",
"# Network setup\n",
"H = 10 # hidden neurons\n",
"\n",
"np.random.seed(42)\n",
"W1 = np.random.randn(1, H) * 0.5 # input -> hidden weights\n",
"b1 = np.zeros((1, H)) # hidden biases\n",
"W2 = np.random.randn(H, 1) * 0.5 # hidden -> output weights\n",
"b2 = np.zeros((1, 1)) # output bias\n",
"\n",
"print(f\"Parameters: W1({W1.shape}) + b1({b1.shape}) + W2({W2.shape}) + b2({b2.shape})\")\n",
"print(f\"Total: {W1.size + b1.size + W2.size + b2.size} parameters for {N} data points\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5w1ezs9t2w6",
"metadata": {},
"outputs": [],
"source": [
"# Training loop\n",
"learning_rate = 0.01\n",
"epochs = 5000\n",
"log_interval = 500\n",
"losses_np = []\n",
"\n",
"for epoch in range(epochs):\n",
" # Forward pass\n",
" Z1 = X @ W1 + b1 # hidden pre-activation (N, H)\n",
" A1 = np.tanh(Z1) # hidden activation (N, H)\n",
" Y_pred = A1 @ W2 + b2 # output (N, 1)\n",
"\n",
" # Loss (mean squared error)\n",
" error = Y_pred - Y\n",
" loss = np.mean(error ** 2)\n",
" losses_np.append(loss)\n",
"\n",
" # Backpropagation (chain rule, working backward)\n",
" dL_dYpred = 2 * error / N\n",
" dL_dW2 = A1.T @ dL_dYpred\n",
" dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)\n",
" dL_dA1 = dL_dYpred @ W2.T\n",
" dL_dZ1 = dL_dA1 * (1 - A1 ** 2) # tanh derivative\n",
" dL_dW1 = X.T @ dL_dZ1\n",
" dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)\n",
"\n",
" # Gradient descent update\n",
" W2 -= learning_rate * dL_dW2\n",
" b2 -= learning_rate * dL_db2\n",
" W1 -= learning_rate * dL_dW1\n",
" b1 -= learning_rate * dL_db1\n",
"\n",
" if epoch % log_interval == 0 or epoch == epochs - 1:\n",
" print(f\"Epoch {epoch:5d} Loss: {loss:.6f}\")"
]
},
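{
"cell_type": "markdown",
"id": "gradcheckmd",
"metadata": {},
"source": "A standard way to debug a hand-written backward pass is a **finite-difference gradient check**: nudge one weight by a small $\\epsilon$ and compare the numerical slope of the loss against the analytic gradient. The cell below sketches this for a single entry of `W1` at the current (trained) weights; the two numbers should agree to several significant figures."
},
{
"cell_type": "code",
"execution_count": null,
"id": "gradcheckcode",
"metadata": {},
"outputs": [],
"source": "# Finite-difference gradient check: verify the analytic backprop gradient\n# for one weight against a central-difference estimate of dLoss/dW1[0,0].\ndef loss_at(W1_test):\n    A1_t = np.tanh(X @ W1_test + b1)\n    return np.mean((A1_t @ W2 + b2 - Y) ** 2)\n\n# Analytic gradient recomputed at the *current* weights (the loop above\n# updated W1 after its last gradient computation, so recompute here)\nA1_c = np.tanh(X @ W1 + b1)\ndYp = 2 * (A1_c @ W2 + b2 - Y) / N\ndZ1_c = (dYp @ W2.T) * (1 - A1_c ** 2)\nanalytic = (X.T @ dZ1_c)[0, 0]\n\n# Numerical gradient by central difference\neps = 1e-6\nW1_plus, W1_minus = W1.copy(), W1.copy()\nW1_plus[0, 0] += eps\nW1_minus[0, 0] -= eps\nnumeric = (loss_at(W1_plus) - loss_at(W1_minus)) / (2 * eps)\n\nprint(f\"analytic: {analytic:.10f}\")\nprint(f\"numeric:  {numeric:.10f}\")"
},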
{
"cell_type": "code",
"execution_count": null,
"id": "onel9r0kjk",
"metadata": {},
"outputs": [],
"source": [
"# Predict on a fine grid and convert back to physical units\n",
"T_fine_norm = np.linspace(0, 1, 200).reshape(-1, 1)\n",
"A1_fine = np.tanh(T_fine_norm @ W1 + b1)\n",
"Cp_nn_norm = A1_fine @ W2 + b2\n",
"Cp_nn = Cp_nn_norm * (Cp_max - Cp_min) + Cp_min\n",
"T_fine_K = T_fine_norm * (T_max - T_min) + T_min\n",
"\n",
"# MSE in original units for comparison with polynomial\n",
"Cp_nn_at_data = np.tanh(X @ W1 + b1) @ W2 + b2\n",
"Cp_nn_at_data = Cp_nn_at_data * (Cp_max - Cp_min) + Cp_min\n",
"mse_nn = np.mean((Cp_nn_at_data.flatten() - Cp_raw) ** 2)\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
"\n",
"ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params, MSE={mse_poly:.2e})')\n",
"ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r-', linewidth=2, label=f'NN numpy (31 params, MSE={mse_nn:.2e})')\n",
"ax1.set_xlabel('Temperature (K)')\n",
"ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
"ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
"ax1.legend()\n",
"\n",
"ax2.semilogy(losses_np)\n",
"ax2.set_xlabel('Epoch')\n",
"ax2.set_ylabel('MSE (normalized)')\n",
"ax2.set_title('Training loss — numpy NN')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "ea9z35qm9u8",
"metadata": {},
"source": [
"## 4. Neural network in PyTorch\n",
"\n",
"The same network, but PyTorch handles backpropagation automatically. Compare the training loop above to the one below — `loss.backward()` replaces all of our manual gradient calculations.\n",
"\n",
"This is the same API used in nanoGPT's `model.py` — `nn.Linear`, activation functions, `optimizer.step()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3qxnrtyxqgz",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"# Prepare data as PyTorch tensors\n",
"X_t = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)\n",
"Y_t = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)\n",
"\n",
"# Define the network\n",
"model = nn.Sequential(\n",
" nn.Linear(1, H), # input -> hidden (W1, b1)\n",
" nn.Tanh(), # activation\n",
" nn.Linear(H, 1), # hidden -> output (W2, b2)\n",
")\n",
"\n",
"print(model)\n",
"print(f\"Total parameters: {sum(p.numel() for p in model.parameters())}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ydl3ycnypps",
"metadata": {},
"outputs": [],
"source": [
"# Train\n",
"optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n",
"loss_fn = nn.MSELoss()\n",
"losses_torch = []\n",
"\n",
"for epoch in range(epochs):\n",
" Y_pred_t = model(X_t)\n",
" loss = loss_fn(Y_pred_t, Y_t)\n",
" losses_torch.append(loss.item())\n",
"\n",
" optimizer.zero_grad() # reset gradients\n",
" loss.backward() # automatic differentiation\n",
" optimizer.step() # update weights\n",
"\n",
" if epoch % log_interval == 0 or epoch == epochs - 1:\n",
" print(f\"Epoch {epoch:5d} Loss: {loss.item():.6f}\")"
]
},
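{
"cell_type": "markdown",
"id": "autogradinspectmd",
"metadata": {},
"source": "The gradients that `loss.backward()` computes are stored on each parameter's `.grad` attribute: the same quantities we derived by hand in the numpy section. One caveat when comparing the two: `nn.Linear` stores its weight matrix as `(out_features, in_features)`, the transpose of the `W1`/`W2` convention above."
},
{
"cell_type": "code",
"execution_count": null,
"id": "autogradinspectcode",
"metadata": {},
"outputs": [],
"source": "# After loss.backward(), each parameter holds its gradient in .grad.\n# These are the autograd counterparts of dL_dW1, dL_db1, dL_dW2, dL_db2.\nfor name, p in model.named_parameters():\n    print(f\"{name}: shape {tuple(p.shape)}, grad norm {p.grad.norm().item():.3e}\")"
},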
{
"cell_type": "markdown",
"id": "bg0kvnk4ho",
"metadata": {},
"source": [
"## 5. Compare all three approaches"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h2dfstoh8gd",
"metadata": {},
"outputs": [],
"source": [
"# PyTorch predictions\n",
"T_fine_t = torch.linspace(0, 1, 200).reshape(-1, 1)\n",
"with torch.no_grad():\n",
" Cp_torch_norm = model(T_fine_t)\n",
"Cp_torch = Cp_torch_norm.numpy() * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"# MSE for PyTorch model\n",
"with torch.no_grad():\n",
" Cp_torch_at_data = model(X_t).numpy() * (Cp_max - Cp_min) + Cp_min\n",
"mse_torch = np.mean((Cp_torch_at_data.flatten() - Cp_raw) ** 2)\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n",
"\n",
"# Left: all three fits\n",
"ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"ax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params)')\n",
"ax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r--', linewidth=2, label=f'NN numpy (31 params)')\n",
"ax1.plot(T_fine_K.flatten(), Cp_torch.flatten(), 'g-', linewidth=2, alpha=0.8, label=f'NN PyTorch (31 params)')\n",
"ax1.set_xlabel('Temperature (K)')\n",
"ax1.set_ylabel('$C_p$ (kJ/kg/K)')\n",
"ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\n",
"ax1.legend()\n",
"\n",
"# Right: training loss comparison\n",
"ax2.semilogy(losses_np, label='numpy (gradient descent)')\n",
"ax2.semilogy(losses_torch, label='PyTorch (Adam)')\n",
"ax2.set_xlabel('Epoch')\n",
"ax2.set_ylabel('MSE (normalized)')\n",
"ax2.set_title('Training loss comparison')\n",
"ax2.legend()\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n",
"\n",
"print(f\"MSE — Polynomial: {mse_poly:.2e} | NN numpy: {mse_nn:.2e} | NN PyTorch: {mse_torch:.2e}\")"
]
},
{
"cell_type": "markdown",
"id": "xyw3sr20brn",
"metadata": {},
"source": [
"## 6. Extrapolation\n",
"\n",
"How do the models behave *outside* the training range? This is a key test — and where the differences become stark."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fi3iq2sjh6",
"metadata": {},
"outputs": [],
"source": [
"# Extrapolate beyond the training range\n",
"T_extrap = np.linspace(100, 2500, 300)\n",
"T_extrap_norm = ((T_extrap - T_min) / (T_max - T_min)).reshape(-1, 1)\n",
"\n",
"# Polynomial extrapolation\n",
"Cp_poly_extrap = poly(T_extrap)\n",
"\n",
"# Numpy NN extrapolation\n",
"A1_extrap = np.tanh(T_extrap_norm @ W1 + b1)\n",
"Cp_nn_extrap = (A1_extrap @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"# PyTorch NN extrapolation\n",
"with torch.no_grad():\n",
" Cp_torch_extrap = model(torch.tensor(T_extrap_norm, dtype=torch.float32)).numpy()\n",
"Cp_torch_extrap = Cp_torch_extrap * (Cp_max - Cp_min) + Cp_min\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\n",
"plt.plot(T_extrap, Cp_poly_extrap, 'b-', linewidth=2, label='Polynomial')\n",
"plt.plot(T_extrap, Cp_nn_extrap.flatten(), 'r--', linewidth=2, label='NN numpy')\n",
"plt.plot(T_extrap, Cp_torch_extrap.flatten(), 'g-', linewidth=2, alpha=0.8, label='NN PyTorch')\n",
"plt.axvline(T_raw.min(), color='gray', linestyle=':', alpha=0.5, label='Training range')\n",
"plt.axvline(T_raw.max(), color='gray', linestyle=':', alpha=0.5)\n",
"plt.xlabel('Temperature (K)')\n",
"plt.ylabel('$C_p$ (kJ/kg/K)')\n",
"plt.title('Extrapolation beyond training data')\n",
"plt.legend()\n",
"plt.show()"
]
},
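{
"cell_type": "markdown",
"id": "tanhsatmd",
"metadata": {},
"source": "The flat tails of the network fits come from the activation function: far from the training range, every `tanh` unit saturates toward ±1, so the network output approaches a constant. The polynomial, by contrast, grows like $T^3$. The cell below probes the numpy network at normalized temperatures well above the training range (where $T_{norm} \\in [0, 1]$ during training) to show the hidden activations pinning near ±1."
},
{
"cell_type": "code",
"execution_count": null,
"id": "tanhsatcode",
"metadata": {},
"outputs": [],
"source": "# Probe the numpy network far outside the training range.\n# As |z| grows, tanh(z) -> ±1, so the output A1 @ W2 + b2 flattens out.\nT_far = np.array([[2.0], [5.0], [10.0]])  # normalized T, far above training range\nA1_far = np.tanh(T_far @ W1 + b1)\nCp_far = (A1_far @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\nprint(\"hidden activations at T_norm=10:\", np.round(A1_far[-1], 3))\nfor t, cp in zip(T_far.flatten(), Cp_far.flatten()):\n    print(f\"T_norm = {t:4.1f}: Cp -> {cp:.4f} kJ/(kg·K)\")"
},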
{
"cell_type": "markdown",
"id": "yb2s18keiw",
"metadata": {},
"source": [
"## 7. Exercises\n",
"\n",
"Try these in new cells below:\n",
"\n",
"1. **Change the number of hidden neurons** (`H`). Try 2, 5, 20, 50. How does the fit change? At what point does adding neurons stop helping?\n",
"\n",
"2. **Activation functions**: In the PyTorch model, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change?\n",
"\n",
"3. **Optimizer comparison**: Replace `Adam` with `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare?\n",
"\n",
"4. **Remove normalization**: Use `T_raw` and `Cp_raw` directly (no scaling to [0,1]). What happens? Can you fix it by adjusting the learning rate?\n",
"\n",
"5. **Overfitting**: Set `H = 100` and train for 20,000 epochs. Does it fit the training data well? Look at the extrapolation — is it reasonable?\n",
"\n",
"6. **Higher-order polynomial**: Try `np.polyfit(T_raw, Cp_raw, 10)`. How does it compare to the cubic? How does it extrapolate?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}