Initial commit: LLM workshop materials

Five modules covering nanoGPT, Ollama, RAG, semantic search, and neural networks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eric 2026-03-28 07:11:01 -04:00
commit 1604671d36
56 changed files with 5577 additions and 0 deletions

# Large Language Models Part V: Building a Neural Network
**CHEG 667-013 — Chemical Engineering with Computers**
Department of Chemical and Biomolecular Engineering, University of Delaware
---
## Key idea
Build a neural network from scratch to understand the core mechanics behind LLMs.
## Key goals
- See concretely what "weights and biases" are and how they're organized
- Understand the forward pass, loss function, and gradient descent
- Implement backpropagation by hand in numpy
- See how PyTorch automates the same process
- Connect these concepts to what you've already seen in nanoGPT
---
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML, built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting* — often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have millions of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
In this section, we step back from language and build a neural network ourselves — small enough to understand every weight, but powerful enough to learn a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.
Our task: fit the heat capacity $C_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
$$C_p(T) = a + bT + cT^2 + dT^3$$
Can a neural network learn this relationship directly from data?
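Before building the network, it's worth seeing how well the classic cubic does. The sketch below fits nine of the NIST points (values copied from `data/n2_cp.csv`) with `numpy.polyfit`:

```python
import numpy as np

# Nine of the NIST points, copied from data/n2_cp.csv
T = np.array([300.0, 500.0, 700.0, 900.0, 1100.0, 1300.0, 1500.0, 1700.0, 1900.0])
Cp = np.array([1.0413, 1.0564, 1.0981, 1.1457, 1.1868, 1.2191, 1.2439, 1.2630, 1.2778])

coeffs = np.polyfit(T, Cp, 3)                      # [d, c, b, a], highest power first
residual = np.max(np.abs(np.polyval(coeffs, T) - Cp))
print(f"max |fit - data| = {residual:.1e} kJ/kg/K")
```

The cubic is a strong baseline — the question is whether a neural network can do as well without being told the functional form.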
## 1. Setup
Use the virtual environment from Part I — `numpy` and `torch` are already installed. You may need to add `matplotlib`:
```bash
pip install matplotlib
```
## 2. The data
The file `data/n2_cp.csv` contains 35 data points: the isobaric heat capacity of N₂ gas at 1 bar from 300 K to 2000 K, from the NIST WebBook.
```bash
head data/n2_cp.csv
```
```
T_K,Cp_kJ_per_kgK
300.00,1.0413
350.00,1.0423
400.00,1.0450
...
```
The curve is smooth and nonlinear — $C_p$ increases with temperature as molecular vibrational modes become active. This is a good test case: simple enough for a small network, but not a straight line.
## 3. Architecture of a one-hidden-layer network
Our network has three layers:
```
Input (1 neuron: T) → Hidden (10 neurons) → Output (1 neuron: Cp)
```
Here's what happens at each step:
### Forward pass
**Step 1: Hidden layer.** Each of the 10 hidden neurons computes a weighted sum of the input plus a bias, then applies an *activation function*:
$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) introduces **nonlinearity** — without it, stacking layers would just produce another linear function, no matter how many layers we use.
**Step 2: Output layer.** The output is a weighted sum of the hidden activations:
$$\hat{y} = \sum_j W_j \cdot a_j + b_{\text{out}}$$
This is a linear combination — no activation on the output, since we want to predict a continuous value.
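To make the two steps concrete, here is a complete forward pass for a single input, using $H = 3$ hidden neurons and made-up weights (illustrative values only, not trained ones):

```python
import numpy as np

# Illustrative values only -- H = 3 hidden neurons, made-up weights
x = 0.5                            # one normalized temperature
W1 = np.array([0.2, -0.4, 0.7])    # input -> hidden weight, one per neuron
b1 = np.array([0.1, 0.0, -0.2])    # hidden biases
W2 = np.array([0.5, -0.3, 0.8])    # hidden -> output weights
b_out = 0.05                       # output bias

z = W1 * x + b1                    # Step 1: weighted sums z_j
a = np.tanh(z)                     # Step 1: activations a_j = tanh(z_j)
y_hat = np.sum(W2 * a) + b_out     # Step 2: linear combination of activations
print(y_hat)
```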
### Counting parameters
With 10 hidden neurons:
- `W1`: 10 weights (input → hidden)
- `b1`: 10 biases (hidden)
- `W2`: 10 weights (hidden → output)
- `b2`: 1 bias (output)
- **Total: 31 parameters**
That's 31 parameters for 35 data points — almost a 1:1 ratio, which should make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
Compare: the small nanoGPT model from Part I had ~800,000 parameters. GPT-2 has 124 million. The architecture is the same idea — layers of weights and activations — just scaled enormously.
## 4. Training
Training means finding the values of all 31 parameters that make the network's predictions match the data. This requires three things:
### Loss function
We need a number that says "how wrong is the network?" The **mean squared error** (MSE) is a natural choice:
$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
This is the same kind of loss we watched decrease during nanoGPT training in Part I (though nanoGPT uses cross-entropy loss, which is appropriate for classification over a vocabulary).
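As a quick numerical check of the formula: prediction errors of 0.1, −0.1, and 0.2 give $L = (0.01 + 0.01 + 0.04)/3 = 0.02$.

```python
import numpy as np

# Toy check of the MSE formula with three predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_pred - y_true) ** 2)
print(mse)
```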
### Backpropagation
To improve the weights, we need to know how each weight affects the loss. **Backpropagation** computes these gradients by applying the chain rule, working backward from the loss through each layer. For example, the gradient of the loss with respect to an output weight $W_j$ is:
$$\frac{\partial L}{\partial W_j} = \frac{1}{N} \sum_i 2(\hat{y}_i - y_i) \cdot a_{ij}$$
The numpy implementation in `nn_numpy.py` computes every gradient explicitly. This is the part that PyTorch automates.
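A handy way to check hand-coded gradients is to compare them against a finite-difference approximation of the loss. The sketch below does this for a bare linear model (no hidden layer, so the chain rule fits on one line); the data values are made up:

```python
import numpy as np

# Hypothetical toy setup: linear model y = X @ W, MSE loss
np.random.seed(0)
X = np.random.rand(5, 1)
Y = np.random.rand(5, 1)
W = np.array([[0.3]])

def mse_loss(W):
    return np.mean((X @ W - Y) ** 2)

# Analytic gradient from the chain rule (same pattern as dL_dW2 in the text)
grad_analytic = (2.0 / len(X)) * X.T @ (X @ W - Y)

# Central finite-difference approximation
eps = 1e-6
grad_numeric = (mse_loss(W + eps) - mse_loss(W - eps)) / (2 * eps)

print(grad_analytic.item(), grad_numeric)
```

If the two numbers disagree, the backprop code has a bug — the same trick can be used to spot-check individual weights in the full network.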
### Gradient descent
Once we have the gradients, we update each weight:
$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$
where $\eta$ is the **learning rate** — a small number (0.01 in our code) that controls how big each step is. Too large and training oscillates; too small and it's painfully slow.
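You can watch this trade-off on a toy loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum is at $w = 3$:

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, gradient dL/dw = 2(w - 3)
def descend(lr, steps=50, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # w <- w - eta * dL/dw
    return w

print(descend(0.01))  # too small: crawls toward 3
print(descend(0.10))  # converges
print(descend(1.10))  # too large: overshoots and diverges
```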
One full cycle through these steps (forward → loss → backward → update) over the dataset is one **epoch**. We train for 5000 epochs.
In nanoGPT, the training loop in `train.py` does exactly the same thing, but with the AdamW optimizer (a fancier version of gradient descent) and batches of data instead of the full dataset.
## 5. Running the numpy version
```bash
python nn_numpy.py
```
```
Epoch 0 Loss: 0.283941
Epoch 500 Loss: 0.001253
Epoch 1000 Loss: 0.000412
Epoch 1500 Loss: 0.000178
Epoch 2000 Loss: 0.000082
Epoch 2500 Loss: 0.000040
Epoch 3000 Loss: 0.000021
Epoch 3500 Loss: 0.000012
Epoch 4000 Loss: 0.000008
Epoch 4500 Loss: 0.000005
Epoch 4999 Loss: 0.000004
Final loss: 0.000004
Network: 1 input -> 10 hidden (tanh) -> 1 output
Total parameters: 31
```
The script produces a plot (`nn_fit.png`) showing the fit and the training loss curve. You should see the network's prediction closely tracking the NIST data points, and the loss dropping rapidly in the first 1000 epochs before leveling off.
> **Exercise 1:** Read through `nn_numpy.py` carefully. Identify where each of the following happens: (a) forward pass, (b) loss calculation, (c) backpropagation, (d) gradient descent update. Annotate your copy with comments.
> **Exercise 2:** Change the number of hidden neurons `H`. Try 2, 5, 10, 20, 50. How does the fit change? How many parameters does each network have? At what point does adding more neurons stop helping?
## 6. The PyTorch version
Now look at `nn_torch.py`. It does the same thing, but in about half the code:
```bash
python nn_torch.py
```
Compare the two scripts side by side. The key differences:
| | numpy version | PyTorch version |
|---|---|---|
| Define layers | Manual weight matrices | `nn.Linear(1, H)` |
| Forward pass | `X @ W1 + b1`, `np.tanh(...)` | `model(X)` |
| Backprop | Hand-coded chain rule | `loss.backward()` |
| Weight update | `W -= lr * dW` | `optimizer.step()` |
| Lines of code | ~80 | ~40 |
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand — automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
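A minimal illustration of automatic differentiation, independent of our network — PyTorch records the operations applied to `x` and replays them backward:

```python
import torch

# One scalar parameter: y = x^3 + 2x, so dy/dx = 3x^2 + 2, which is 14 at x = 2
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x
y.backward()      # autograd computes dy/dx and stores it in x.grad
print(x.grad)
```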
The `nn.Sequential` definition:
```python
model = nn.Sequential(
    nn.Linear(1, H),  # input -> hidden (W1, b1)
    nn.Tanh(),        # activation
    nn.Linear(H, 1),  # hidden -> output (W2, b2)
)
```
looks simple here, but it's the same API used in nanoGPT's `model.py` — just with more layers, attention mechanisms, and a much larger vocabulary.
> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?
> **Exercise 4:** Replace the Adam optimizer with plain SGD: `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare? Try increasing the learning rate. What happens?
## 7. Normalization
Both scripts normalize the input ($T$) and output ($C_p$) to the range [0, 1] before training. This is important:
- Raw $T$ values range from 300 to 2000, while $C_p$ ranges from 1.04 to 1.28
- With unnormalized data, the gradients for the input weights would be hundreds of times larger than for the output weights
- The network would struggle to learn — or need a much smaller learning rate
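The min–max scaling both scripts use, and its inverse (needed to plot predictions back in physical units), is just:

```python
import numpy as np

T_raw = np.array([300.0, 650.0, 1000.0, 1500.0, 2000.0])  # example temperatures (K)

# Min-max normalization to [0, 1]
T_min, T_max = T_raw.min(), T_raw.max()
T_norm = (T_raw - T_min) / (T_max - T_min)

# Inverse transform -- converts normalized values back to kelvin
T_back = T_norm * (T_max - T_min) + T_min
```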
Try it yourself:
> **Exercise 5:** Comment out the normalization in `nn_numpy.py` (use `T_raw` and `Cp_raw` directly). What happens to the training loss? Can you fix it by changing the learning rate?
## 8. Overfitting
With 31 parameters and 35 data points, our network is close to the edge. What happens with more parameters than data?
> **Exercise 6:** Increase `H` to 100 (giving 301 parameters — nearly 10× the number of data points). Train for 20,000 epochs. Plot the fit. Does it match the training data well? Now generate predictions at $T$ = 275 K and $T$ = 2100 K (outside the training range). Are they reasonable?
This is **overfitting** — the network memorizes the training data but fails to generalize. It's the same concept we discussed in Part I when nanoGPT's validation loss started increasing while the training loss kept decreasing.
In practice, we combat overfitting with:
- More data
- Regularization (dropout — remember this parameter from nanoGPT?)
- Early stopping (stop training when validation loss starts increasing)
- Keeping the model appropriately sized for the data
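Early stopping, for example, is just bookkeeping over the validation loss. A minimal sketch (the `val_losses` values here are hypothetical — in practice they come from evaluating on held-out data each epoch):

```python
# Hypothetical validation losses, one per epoch
def early_stop_epoch(val_losses, patience=3):
    """Epoch to stop at: validation loss hasn't improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, v in enumerate(val_losses):
        if v < best:
            best, best_epoch = v, epoch
        elif epoch - best_epoch >= patience:
            return epoch            # stop; keep the weights from best_epoch
    return len(val_losses) - 1      # never triggered: ran all epochs

val_losses = [0.50, 0.30, 0.20, 0.18, 0.19, 0.21, 0.25, 0.30]
print(early_stop_epoch(val_losses))
```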
## 9. Connecting back to LLMs
Everything you've built here scales up to large language models:
| This tutorial | nanoGPT / LLMs |
|---|---|
| 31 parameters | 800K–70B+ parameters |
| 1 hidden layer | 4–96+ layers |
| tanh activation | GELU activation |
| MSE loss | Cross-entropy loss |
| Plain gradient descent | AdamW optimizer |
| Numpy arrays | PyTorch tensors (on GPU) |
| Fitting $C_p(T)$ | Predicting next tokens |
The fundamental loop — forward pass, compute loss, backpropagate, update weights — is identical. The difference is scale: more layers, more data, more compute, and architectural innovations like self-attention.
## Additional resources and references
### NIST Chemistry WebBook
- https://webbook.nist.gov/ — thermophysical property data used in this tutorial
### PyTorch
- Tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html
- `nn.Module` documentation: https://pytorch.org/docs/stable/nn.html
### Reading
- The "backpropagation" chapter in Goodfellow, Bengio & Courville, *Deep Learning* (2016), freely available at https://www.deeplearningbook.org/
- 3Blue1Brown, *Neural Networks* video series: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi — excellent visual intuition for how neural networks learn

---
**File: `data/n2_cp.csv`**
T_K,Cp_kJ_per_kgK
300.00,1.0413
350.00,1.0423
400.00,1.0450
450.00,1.0497
500.00,1.0564
550.00,1.0650
600.00,1.0751
650.00,1.0863
700.00,1.0981
750.00,1.1102
800.00,1.1223
850.00,1.1342
900.00,1.1457
950.00,1.1568
1000.0,1.1674
1050.0,1.1774
1100.0,1.1868
1150.0,1.1957
1200.0,1.2040
1250.0,1.2118
1300.0,1.2191
1350.0,1.2260
1400.0,1.2323
1450.0,1.2383
1500.0,1.2439
1550.0,1.2491
1600.0,1.2540
1650.0,1.2586
1700.0,1.2630
1750.0,1.2670
1800.0,1.2708
1850.0,1.2744
1900.0,1.2778
1950.0,1.2810
2000.0,1.2841

---
**File: `nn_numpy.py`**
# nn_numpy.py
#
# A neural network with one hidden layer, built from scratch using numpy.
# Fits Cp(T) data for nitrogen gas at 1 bar (NIST WebBook).
#
# This demonstrates the core mechanics of a neural network:
# - Forward pass: input -> hidden layer -> activation -> output
# - Loss calculation (mean squared error)
# - Backpropagation: computing gradients of the loss w.r.t. each weight
# - Gradient descent: updating weights to minimize loss
#
# CHEG 667-013
# E. M. Furst
import numpy as np
import matplotlib.pyplot as plt
# ── Load and prepare data ──────────────────────────────────────
data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T_raw = data[:, 0] # Temperature (K)
Cp_raw = data[:, 1] # Heat capacity (kJ/kg/K)
# Normalize inputs and outputs to [0, 1] range.
# Neural networks train better when values are small and centered.
T_min, T_max = T_raw.min(), T_raw.max()
Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()
T = (T_raw - T_min) / (T_max - T_min) # shape: (N,)
Cp = (Cp_raw - Cp_min) / (Cp_max - Cp_min) # shape: (N,)
# Reshape for matrix operations: each sample is a row
X = T.reshape(-1, 1) # (N, 1) -- input matrix
Y = Cp.reshape(-1, 1) # (N, 1) -- target matrix
N = X.shape[0] # number of data points
# ── Network architecture ───────────────────────────────────────
#
# Input (1) --> Hidden (H neurons, tanh) --> Output (1)
#
# The hidden layer has H neurons. Each neuron computes:
# z = w * x + b (weighted sum)
# a = tanh(z) (activation -- introduces nonlinearity)
#
# The output layer combines the hidden activations:
# y_pred = W2 @ a + b2
H = 10 # number of neurons in the hidden layer
# Initialize weights randomly (small values)
# W1: (1, H) -- connects input to each hidden neuron
# b1: (1, H) -- one bias per hidden neuron
# W2: (H, 1) -- connects hidden neurons to output
# b2: (1, 1) -- output bias
np.random.seed(42)
W1 = np.random.randn(1, H) * 0.5
b1 = np.zeros((1, H))
W2 = np.random.randn(H, 1) * 0.5
b2 = np.zeros((1, 1))
# ── Training parameters ───────────────────────────────────────
learning_rate = 0.01
epochs = 5000
log_interval = 500
# ── Training loop ─────────────────────────────────────────────
losses = []
for epoch in range(epochs):
    # ── Forward pass ──────────────────────────────────────────
    # Step 1: hidden layer pre-activation
    Z1 = X @ W1 + b1                 # (N, H)
    # Step 2: hidden layer activation (tanh)
    A1 = np.tanh(Z1)                 # (N, H)
    # Step 3: output layer (linear -- no activation)
    Y_pred = A1 @ W2 + b2            # (N, 1)

    # ── Loss ──────────────────────────────────────────────────
    # Mean squared error
    error = Y_pred - Y               # (N, 1)
    loss = np.mean(error ** 2)
    losses.append(loss)

    # ── Backpropagation ───────────────────────────────────────
    # Compute gradients by applying the chain rule, working
    # backward from the loss to each weight.

    # Gradient of loss w.r.t. output
    dL_dYpred = 2 * error / N        # (N, 1)

    # Gradients for output layer weights
    dL_dW2 = A1.T @ dL_dYpred                          # (H, 1)
    dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)  # (1, 1)

    # Gradient flowing back through the hidden layer
    dL_dA1 = dL_dYpred @ W2.T        # (N, H)

    # Derivative of tanh: d/dz tanh(z) = 1 - tanh(z)^2
    dL_dZ1 = dL_dA1 * (1 - A1 ** 2)  # (N, H)

    # Gradients for hidden layer weights
    dL_dW1 = X.T @ dL_dZ1                              # (1, H)
    dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)     # (1, H)

    # ── Gradient descent ──────────────────────────────────────
    # Update each weight in the direction that reduces the loss
    W2 -= learning_rate * dL_dW2
    b2 -= learning_rate * dL_db2
    W1 -= learning_rate * dL_dW1
    b1 -= learning_rate * dL_db1

    if epoch % log_interval == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d}  Loss: {loss:.6f}")
# ── Results ────────────────────────────────────────────────────
# Predict on a fine grid for smooth plotting
T_fine = np.linspace(0, 1, 200).reshape(-1, 1)
A1_fine = np.tanh(T_fine @ W1 + b1)
Cp_pred_norm = A1_fine @ W2 + b2
# Convert back to physical units
T_fine_K = T_fine * (T_max - T_min) + T_min
Cp_pred = Cp_pred_norm * (Cp_max - Cp_min) + Cp_min
# ── Plot ───────────────────────────────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Left: fit
ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')
ax1.plot(T_fine_K, Cp_pred, 'r-', linewidth=2, label=f'NN ({H} neurons)')
ax1.set_xlabel('Temperature (K)')
ax1.set_ylabel('$C_p$ (kJ/kg/K)')
ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')
ax1.legend()
# Right: training loss
ax2.semilogy(losses)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Mean Squared Error')
ax2.set_title('Training Loss')
plt.tight_layout()
plt.savefig('nn_fit.png', dpi=150)
plt.show()
print(f"\nFinal loss: {losses[-1]:.6f}")
print(f"Network: {1} input -> {H} hidden (tanh) -> {1} output")
print(f"Total parameters: {W1.size + b1.size + W2.size + b2.size}")

---
**File: `nn_torch.py`**
# nn_torch.py
#
# The same neural network as nn_numpy.py, but using PyTorch.
# Compare this to the numpy version to see what the framework handles for you:
# - Automatic differentiation (no manual backprop)
# - Built-in optimizers (Adam instead of hand-coded gradient descent)
# - GPU support (if available)
#
# CHEG 667-013
# E. M. Furst
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
# ── Load and prepare data ──────────────────────────────────────
data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T_raw = data[:, 0]
Cp_raw = data[:, 1]
# Normalize to [0, 1]
T_min, T_max = T_raw.min(), T_raw.max()
Cp_min, Cp_max = Cp_raw.min(), Cp_raw.max()
X = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)
Y = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)
# ── Define the network ─────────────────────────────────────────
#
# nn.Sequential stacks layers in order. Compare this to nanoGPT's
# model.py, which uses the same PyTorch building blocks (nn.Linear,
# activation functions) but with many more layers.
H = 10 # hidden neurons
model = nn.Sequential(
    nn.Linear(1, H),  # input -> hidden (W1, b1)
    nn.Tanh(),        # activation
    nn.Linear(H, 1),  # hidden -> output (W2, b2)
)
print(f"Model:\n{model}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}\n")
# ── Training ───────────────────────────────────────────────────
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
epochs = 5000
log_interval = 500
losses = []
for epoch in range(epochs):
    # Forward pass -- PyTorch tracks operations for automatic differentiation
    Y_pred = model(X)
    loss = loss_fn(Y_pred, Y)
    losses.append(loss.item())

    # Backward pass -- PyTorch computes all gradients automatically
    optimizer.zero_grad()  # reset gradients from previous step
    loss.backward()        # compute gradients via automatic differentiation
    optimizer.step()       # update weights (Adam optimizer)

    if epoch % log_interval == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch:5d}  Loss: {loss.item():.6f}")
# ── Results ────────────────────────────────────────────────────
# Predict on a fine grid
T_fine = torch.linspace(0, 1, 200).reshape(-1, 1)
with torch.no_grad():  # no gradient tracking needed for inference
    Cp_pred_norm = model(T_fine)
# Convert back to physical units
T_fine_K = T_fine.numpy() * (T_max - T_min) + T_min
Cp_pred = Cp_pred_norm.numpy() * (Cp_max - Cp_min) + Cp_min
# ── Plot ───────────────────────────────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')
ax1.plot(T_fine_K, Cp_pred, 'r-', linewidth=2, label=f'NN ({H} neurons)')
ax1.set_xlabel('Temperature (K)')
ax1.set_ylabel('$C_p$ (kJ/kg/K)')
ax1.set_title('$C_p(T)$ for N$_2$ at 1 bar — PyTorch')
ax1.legend()
ax2.semilogy(losses)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Mean Squared Error')
ax2.set_title('Training Loss')
plt.tight_layout()
plt.savefig('nn_fit_torch.png', dpi=150)
plt.show()

---
**File: companion Jupyter notebook**
{
"cells": [
{
"cell_type": "markdown",
"id": "xbsmj1hcj1g",
"source": "# Building a Neural Network: $C_p(T)$ for Nitrogen\n\n**CHEG 667-013 — LLMs for Engineers**\n\nIn this notebook we fit the heat capacity of N₂ gas using three approaches:\n1. A polynomial fit (the classical approach)\n2. A neural network built from scratch in numpy\n3. The same network in PyTorch\n\nThis makes the ML concepts behind LLMs — weights, loss, gradient descent, overfitting — concrete and tangible.",
"metadata": {}
},
{
"cell_type": "markdown",
"id": "szrl41l3xbq",
"source": "## 1. Load and plot the data\n\nThe data is from the [NIST Chemistry WebBook](https://webbook.nist.gov/): isobaric heat capacity of N₂ at 1 bar, 300–2000 K.",
"metadata": {}
},
{
"cell_type": "code",
"id": "t4lqkcoeyil",
"source": "import numpy as np\nimport matplotlib.pyplot as plt\n\ndata = np.loadtxt(\"data/n2_cp.csv\", delimiter=\",\", skiprows=1)\nT_raw = data[:, 0] # Temperature (K)\nCp_raw = data[:, 1] # Cp (kJ/kg/K)\n\nplt.figure(figsize=(8, 5))\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6)\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('$C_p(T)$ for N$_2$ at 1 bar — NIST WebBook')\nplt.show()\n\nprint(f\"{len(T_raw)} data points, T range: {T_raw.min():.0f}–{T_raw.max():.0f} K\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "1jyrgsvp7op",
"source": "## 2. Polynomial fit (baseline)\n\nTextbooks fit $C_p(T)$ with a polynomial: $C_p = a + bT + cT^2 + dT^3$. This is a **4-parameter** model. Let's fit it with `numpy.polyfit` and see how well it does.",
"metadata": {}
},
{
"cell_type": "code",
"id": "4smvu4z2oro",
"source": "# Fit a cubic polynomial\ncoeffs = np.polyfit(T_raw, Cp_raw, 3)\npoly = np.poly1d(coeffs)\n\nT_fine = np.linspace(T_raw.min(), T_raw.max(), 200)\nCp_poly = poly(T_fine)\n\n# Compute residuals\nCp_poly_at_data = poly(T_raw)\nmse_poly = np.mean((Cp_poly_at_data - Cp_raw) ** 2)\n\nplt.figure(figsize=(8, 5))\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nplt.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Cubic polynomial (4 params)')\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Polynomial fit')\nplt.legend()\nplt.show()\n\nprint(f\"Polynomial coefficients: {coeffs}\")\nprint(f\"MSE: {mse_poly:.8f}\")\nprint(f\"Parameters: 4\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "97y7mrcekji",
"source": "## 3. Neural network from scratch (numpy)\n\nNow let's build a one-hidden-layer neural network. The architecture:\n\n```\nInput (1: T) → Hidden (10 neurons, tanh) → Output (1: Cp)\n```\n\nWe need to:\n1. **Normalize** the data to [0, 1] so the network trains efficiently\n2. **Forward pass**: compute predictions from input through each layer\n3. **Loss**: mean squared error between predictions and data\n4. **Backpropagation**: compute gradients of the loss w.r.t. each weight using the chain rule\n5. **Gradient descent**: update weights in the direction that reduces the loss\n\nThis is exactly what nanoGPT's `train.py` does — just at a much larger scale.",
"metadata": {}
},
{
"cell_type": "code",
"id": "365o7bqbwkr",
"source": "# Normalize inputs and outputs to [0, 1]\nT_min, T_max = T_raw.min(), T_raw.max()\nCp_min, Cp_max = Cp_raw.min(), Cp_raw.max()\n\nT = (T_raw - T_min) / (T_max - T_min)\nCp = (Cp_raw - Cp_min) / (Cp_max - Cp_min)\n\nX = T.reshape(-1, 1) # (N, 1) input matrix\nY = Cp.reshape(-1, 1) # (N, 1) target matrix\nN = X.shape[0]\n\n# Network setup\nH = 10 # hidden neurons\n\nnp.random.seed(42)\nW1 = np.random.randn(1, H) * 0.5 # input → hidden weights\nb1 = np.zeros((1, H)) # hidden biases\nW2 = np.random.randn(H, 1) * 0.5 # hidden → output weights\nb2 = np.zeros((1, 1)) # output bias\n\nprint(f\"Parameters: W1({W1.shape}) + b1({b1.shape}) + W2({W2.shape}) + b2({b2.shape})\")\nprint(f\"Total: {W1.size + b1.size + W2.size + b2.size} parameters for {N} data points\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"id": "5w1ezs9t2w6",
"source": "# Training loop\nlearning_rate = 0.01\nepochs = 5000\nlog_interval = 500\nlosses_np = []\n\nfor epoch in range(epochs):\n # Forward pass\n Z1 = X @ W1 + b1 # hidden pre-activation (N, H)\n A1 = np.tanh(Z1) # hidden activation (N, H)\n Y_pred = A1 @ W2 + b2 # output (N, 1)\n\n # Loss (mean squared error)\n error = Y_pred - Y\n loss = np.mean(error ** 2)\n losses_np.append(loss)\n\n # Backpropagation (chain rule, working backward)\n dL_dYpred = 2 * error / N\n dL_dW2 = A1.T @ dL_dYpred\n dL_db2 = np.sum(dL_dYpred, axis=0, keepdims=True)\n dL_dA1 = dL_dYpred @ W2.T\n dL_dZ1 = dL_dA1 * (1 - A1 ** 2) # tanh derivative\n dL_dW1 = X.T @ dL_dZ1\n dL_db1 = np.sum(dL_dZ1, axis=0, keepdims=True)\n\n # Gradient descent update\n W2 -= learning_rate * dL_dW2\n b2 -= learning_rate * dL_db2\n W1 -= learning_rate * dL_dW1\n b1 -= learning_rate * dL_db1\n\n if epoch % log_interval == 0 or epoch == epochs - 1:\n print(f\"Epoch {epoch:5d} Loss: {loss:.6f}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"id": "onel9r0kjk",
"source": "# Predict on a fine grid and convert back to physical units\nT_fine_norm = np.linspace(0, 1, 200).reshape(-1, 1)\nA1_fine = np.tanh(T_fine_norm @ W1 + b1)\nCp_nn_norm = A1_fine @ W2 + b2\nCp_nn = Cp_nn_norm * (Cp_max - Cp_min) + Cp_min\nT_fine_K = T_fine_norm * (T_max - T_min) + T_min\n\n# MSE in original units for comparison with polynomial\nCp_nn_at_data = np.tanh(X @ W1 + b1) @ W2 + b2\nCp_nn_at_data = Cp_nn_at_data * (Cp_max - Cp_min) + Cp_min\nmse_nn = np.mean((Cp_nn_at_data.flatten() - Cp_raw) ** 2)\n\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n\nax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params, MSE={mse_poly:.2e})')\nax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r-', linewidth=2, label=f'NN numpy (31 params, MSE={mse_nn:.2e})')\nax1.set_xlabel('Temperature (K)')\nax1.set_ylabel('$C_p$ (kJ/kg/K)')\nax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\nax1.legend()\n\nax2.semilogy(losses_np)\nax2.set_xlabel('Epoch')\nax2.set_ylabel('MSE (normalized)')\nax2.set_title('Training loss — numpy NN')\n\nplt.tight_layout()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "ea9z35qm9u8",
"source": "## 4. Neural network in PyTorch\n\nThe same network, but PyTorch handles backpropagation automatically. Compare the training loop above to the one below — `loss.backward()` replaces all of our manual gradient calculations.\n\nThis is the same API used in nanoGPT's `model.py` — `nn.Linear`, activation functions, `optimizer.step()`.",
"metadata": {}
},
{
"cell_type": "code",
"id": "3qxnrtyxqgz",
"source": "import torch\nimport torch.nn as nn\n\n# Prepare data as PyTorch tensors\nX_t = torch.tensor((T_raw - T_min) / (T_max - T_min), dtype=torch.float32).reshape(-1, 1)\nY_t = torch.tensor((Cp_raw - Cp_min) / (Cp_max - Cp_min), dtype=torch.float32).reshape(-1, 1)\n\n# Define the network\nmodel = nn.Sequential(\n nn.Linear(1, H), # input → hidden (W1, b1)\n nn.Tanh(), # activation\n nn.Linear(H, 1), # hidden → output (W2, b2)\n)\n\nprint(model)\nprint(f\"Total parameters: {sum(p.numel() for p in model.parameters())}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"id": "ydl3ycnypps",
"source": "# Train\noptimizer = torch.optim.Adam(model.parameters(), lr=0.01)\nloss_fn = nn.MSELoss()\nlosses_torch = []\n\nfor epoch in range(epochs):\n Y_pred_t = model(X_t)\n loss = loss_fn(Y_pred_t, Y_t)\n losses_torch.append(loss.item())\n\n optimizer.zero_grad() # reset gradients\n loss.backward() # automatic differentiation\n optimizer.step() # update weights\n\n if epoch % log_interval == 0 or epoch == epochs - 1:\n print(f\"Epoch {epoch:5d} Loss: {loss.item():.6f}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "bg0kvnk4ho",
"source": "## 5. Compare all three approaches",
"metadata": {}
},
{
"cell_type": "code",
"id": "h2dfstoh8gd",
"source": "# PyTorch predictions\nT_fine_t = torch.linspace(0, 1, 200).reshape(-1, 1)\nwith torch.no_grad():\n Cp_torch_norm = model(T_fine_t)\nCp_torch = Cp_torch_norm.numpy() * (Cp_max - Cp_min) + Cp_min\n\n# MSE for PyTorch model\nwith torch.no_grad():\n Cp_torch_at_data = model(X_t).numpy() * (Cp_max - Cp_min) + Cp_min\nmse_torch = np.mean((Cp_torch_at_data.flatten() - Cp_raw) ** 2)\n\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))\n\n# Left: all three fits\nax1.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nax1.plot(T_fine, Cp_poly, 'b-', linewidth=2, label=f'Polynomial (4 params)')\nax1.plot(T_fine_K.flatten(), Cp_nn.flatten(), 'r--', linewidth=2, label=f'NN numpy (31 params)')\nax1.plot(T_fine_K.flatten(), Cp_torch.flatten(), 'g-', linewidth=2, alpha=0.8, label=f'NN PyTorch (31 params)')\nax1.set_xlabel('Temperature (K)')\nax1.set_ylabel('$C_p$ (kJ/kg/K)')\nax1.set_title('$C_p(T)$ for N$_2$ at 1 bar')\nax1.legend()\n\n# Right: training loss comparison\nax2.semilogy(losses_np, label='numpy (gradient descent)')\nax2.semilogy(losses_torch, label='PyTorch (Adam)')\nax2.set_xlabel('Epoch')\nax2.set_ylabel('MSE (normalized)')\nax2.set_title('Training loss comparison')\nax2.legend()\n\nplt.tight_layout()\nplt.show()\n\nprint(f\"MSE — Polynomial: {mse_poly:.2e} | NN numpy: {mse_nn:.2e} | NN PyTorch: {mse_torch:.2e}\")",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "xyw3sr20brn",
"source": "## 6. Extrapolation\n\nHow do the models behave *outside* the training range? This is a key test — and where the differences become stark.",
"metadata": {}
},
{
"cell_type": "code",
"id": "fi3iq2sjh6",
"source": "# Extrapolate beyond the training range\nT_extrap = np.linspace(100, 2500, 300)\nT_extrap_norm = ((T_extrap - T_min) / (T_max - T_min)).reshape(-1, 1)\n\n# Polynomial extrapolation\nCp_poly_extrap = poly(T_extrap)\n\n# Numpy NN extrapolation\nA1_extrap = np.tanh(T_extrap_norm @ W1 + b1)\nCp_nn_extrap = (A1_extrap @ W2 + b2) * (Cp_max - Cp_min) + Cp_min\n\n# PyTorch NN extrapolation\nwith torch.no_grad():\n Cp_torch_extrap = model(torch.tensor(T_extrap_norm, dtype=torch.float32)).numpy()\nCp_torch_extrap = Cp_torch_extrap * (Cp_max - Cp_min) + Cp_min\n\nplt.figure(figsize=(10, 6))\nplt.plot(T_raw, Cp_raw, 'ko', markersize=6, label='NIST data')\nplt.plot(T_extrap, Cp_poly_extrap, 'b-', linewidth=2, label='Polynomial')\nplt.plot(T_extrap, Cp_nn_extrap.flatten(), 'r--', linewidth=2, label='NN numpy')\nplt.plot(T_extrap, Cp_torch_extrap.flatten(), 'g-', linewidth=2, alpha=0.8, label='NN PyTorch')\nplt.axvline(T_raw.min(), color='gray', linestyle=':', alpha=0.5, label='Training range')\nplt.axvline(T_raw.max(), color='gray', linestyle=':', alpha=0.5)\nplt.xlabel('Temperature (K)')\nplt.ylabel('$C_p$ (kJ/kg/K)')\nplt.title('Extrapolation beyond training data')\nplt.legend()\nplt.show()",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "yb2s18keiw",
"source": "## 7. Exercises\n\nTry these in new cells below:\n\n1. **Change the number of hidden neurons** (`H`). Try 2, 5, 20, 50. How does the fit change? At what point does adding neurons stop helping?\n\n2. **Activation functions**: In the PyTorch model, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change?\n\n3. **Optimizer comparison**: Replace `Adam` with `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare?\n\n4. **Remove normalization**: Use `T_raw` and `Cp_raw` directly (no scaling to [0,1]). What happens? Can you fix it by adjusting the learning rate?\n\n5. **Overfitting**: Set `H = 100` and train for 20,000 epochs. Does it fit the training data well? Look at the extrapolation — is it reasonable?\n\n6. **Higher-order polynomial**: Try `np.polyfit(T_raw, Cp_raw, 10)`. How does it compare to the cubic? How does it extrapolate?",
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}