# Large Language Models Part V: Building a Neural Network
**CHEG 667-013 — Chemical Engineering with Computers**
Department of Chemical and Biomolecular Engineering, University of Delaware

---
## Key idea
Build a neural network from scratch to understand the core mechanics behind LLMs.
## Key goals
- See concretely what "weights and biases" are and how they're organized
- Understand the forward pass, loss function, and gradient descent
- Implement backpropagation by hand in numpy
- See how PyTorch automates the same process
- Connect these concepts to what you've already seen in nanoGPT
---
Everything we've done in this workshop is **machine learning** (ML) — the practice of training models to learn patterns from data rather than programming rules by hand. LLMs are one (very large) example of ML, built on neural networks. Throughout this workshop, we've used ML terms like *model weights*, *training loss*, *gradient descent*, and *overfitting* — often without defining them precisely. In Part I, we watched nanoGPT's training loss decrease over 2000 iterations. In Part II, we saw that models have many millions (even billions) of parameters. In Parts III and IV, we used embedding models that map text into vectors — another ML technique.
In this section, we step back from language and build a neural network ourselves — small enough to understand every weight, but powerful enough to learn a real physical relationship. The goal is to make the ML concepts behind LLMs concrete.
Our task: fit the ideal gas heat capacity $C^*_p(T)$ of nitrogen gas using data from the [NIST Chemistry WebBook](https://webbook.nist.gov/). This is a function that chemical engineers know well. Textbooks like *Chemical, Biochemical, and Engineering Thermodynamics* (a UD favorite) typically fit it with a polynomial:
$$C^*_p(T) = a + bT + cT^2 + dT^3$$
Can a neural network learn this relationship directly from data?
## 1. Setup
All dependencies (`numpy`, `torch`, `matplotlib`) are installed by `uv sync`. (See the main [README](../README.md).)
## 2. The data
The file `data/n2_cp.csv` contains 35 data points: the isobaric heat capacity of N₂ gas at 1 bar from 300 K to 2000 K, from the NIST WebBook.
```bash
head data/n2_cp.csv
```
```
T_K,Cp_kJ_per_kgK
300.00,1.0413
350.00,1.0423
400.00,1.0450
...
```
The curve is smooth and nonlinear: $C_p$ increases with temperature as molecular vibrational modes become active. This makes it a good test case, simple enough for a small network but not a straight line.
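To explore the data yourself, here's a minimal sketch that loads the CSV with numpy and tries the textbook cubic fit from the introduction (`np.polyfit` does the least-squares fit; nothing here depends on the workshop scripts):
```python
import numpy as np

# Load the NIST data: two columns, T in K and Cp in kJ/(kg K)
data = np.loadtxt("data/n2_cp.csv", delimiter=",", skiprows=1)
T, Cp = data[:, 0], data[:, 1]

# Textbook baseline: least-squares cubic Cp = a + bT + cT^2 + dT^3
coeffs = np.polyfit(T, Cp, deg=3)  # coefficients, highest power first
print(np.polyval(coeffs, 1000.0))  # predicted Cp at 1000 K
```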
## 3. Architecture of a one-hidden-layer network
Our network has one hidden layer between a single input and a single output:
```
Input (1 neuron: T) -> Hidden (10 neurons) -> Output (1 neuron: Cp)
```
Here's what happens at each step:
### Forward pass
**Step 1: Hidden layer.** Each of the 10 hidden neurons computes a weighted sum of the input plus a bias, then applies an *activation function*:
$$z_j = w_j \cdot x + b_j \qquad a_j = \tanh(z_j)$$
where $w_j$ and $b_j$ are the weight and bias for neuron $j$. The activation function (here, `tanh`) introduces **nonlinearity**. Without it, stacking layers would just produce another linear function, no matter how many layers we use.
**Step 2: Output layer.** The output is a weighted sum of the hidden activations:
$$\hat{y} = \sum_j W_j \cdot a_j + b_{\text{out}}$$
This is a linear combination. There is no activation on the output, since we want to predict a continuous value.
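Here's what the whole forward pass looks like in numpy, as a sketch with hypothetical variable names (the real code is in `nn_numpy.py`); `X` stands in for the (N, 1) column of normalized temperatures:
```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                             # number of hidden neurons
W1 = rng.normal(0.0, 0.5, (1, H))  # input -> hidden weights (w_j)
b1 = np.zeros(H)                   # hidden biases (b_j)
W2 = rng.normal(0.0, 0.5, (H, 1))  # hidden -> output weights (W_j)
b2 = np.zeros(1)                   # output bias (b_out)

X = np.linspace(0.0, 1.0, 35).reshape(-1, 1)  # stand-in for normalized T

Z = X @ W1 + b1      # (N, H): weighted sums z_j
A = np.tanh(Z)       # (N, H): activations a_j
y_hat = A @ W2 + b2  # (N, 1): linear output, the predicted Cp
```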
### Counting parameters
With 10 hidden neurons:
- `W1`: 10 weights (input -> hidden)
- `b1`: 10 biases (hidden)
- `W2`: 10 weights (hidden -> output)
- `b2`: 1 bias (output)
- **Total: 31 parameters**
That's 31 parameters for 35 data points, or almost a 1:1 ratio, which might make you nervous about overfitting. In general, a model with as many parameters as data points can memorize instead of learning. We get away with it here because (a) the $C_p(T)$ data is very smooth with no noise, and (b) the `tanh` activation constrains each neuron to a smooth S-curve, so the network can't wiggle wildly between points the way a high-degree polynomial could. We'll revisit this in the overfitting section below.
Compare: the small nanoGPT model from Part I had ~800,000 parameters, and GPT-2 has 124 million. The architecture is the same idea, layers of weights and activations, just scaled enormously in order to "fit" language.
## 4. Training
Training means finding the values of all 31 parameters that make the network's predictions match the data. This requires three things:
### Loss function
We need a number that says "how wrong is the network?" for a given set of parameters. The **mean squared error** (MSE) is a natural choice here:
$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
This is the same kind of loss we watched decrease during nanoGPT training in Part I (though nanoGPT uses cross-entropy loss, which is appropriate for classification over a vocabulary).
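In numpy this is a single line (continuing the hypothetical names from the forward-pass sketch above, with `y` as the column of measured $C_p$ values):
```python
loss = np.mean((y_hat - y) ** 2)  # average squared error over all N points
```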
### Backpropagation
To improve the weights, we need to know how each weight affects the loss. **Backpropagation** computes these gradients by applying the chain rule, working backward from the loss through each layer. For example, the gradient of the loss with respect to an output weight $W_j$ is:
$$\frac{\partial L}{\partial W_j} = \frac{1}{N} \sum_i 2(\hat{y}_i - y_i) \cdot a_{ij}$$
The numpy implementation in `nn_numpy.py` computes every gradient explicitly. This is the part that PyTorch automates.
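As a sketch of what those explicit gradients look like (continuing the hypothetical names from the snippets above; `nn_numpy.py` may organize this differently):
```python
N = len(y)
dy_hat = 2.0 * (y_hat - y) / N  # dL/d(y_hat), from the MSE
dW2 = A.T @ dy_hat              # gradient for hidden -> output weights
db2 = dy_hat.sum(axis=0)        # gradient for the output bias
dA = dy_hat @ W2.T              # chain rule back through the output layer
dZ = dA * (1.0 - A**2)          # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dZ                  # gradient for input -> hidden weights
db1 = dZ.sum(axis=0)            # gradient for the hidden biases
```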
It's worth noting that the backward pass accounts for roughly two-thirds of the compute in each training step, since it costs about twice as much as the forward pass. This ratio holds fairly consistently from small networks up through large transformers, and it is one reason inference (forward pass only) is so much cheaper than training.
### Gradient descent
Once we have the gradients, we update each weight:
$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$
where $\eta$ is the **learning rate** — a small number (0.01 in our code) that controls how big each step is. Too large and training oscillates; too small and it's painfully slow.
One full cycle of these steps (forward -> loss -> backward -> update) over the whole dataset is one **epoch**. We train for 5000 epochs.
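Put together, the whole numpy training loop is just these pieces repeated (a sketch with the same hypothetical names as above; see `nn_numpy.py` for the real loop):
```python
lr = 0.01                                # learning rate (eta)
for epoch in range(5000):
    Z = X @ W1 + b1                      # forward pass
    A = np.tanh(Z)
    y_hat = A @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)     # loss
    dy_hat = 2.0 * (y_hat - y) / len(y)  # backward pass (as above)
    dZ = (dy_hat @ W2.T) * (1.0 - A**2)
    W2 -= lr * (A.T @ dy_hat)            # gradient descent updates
    b2 -= lr * dy_hat.sum(axis=0)
    W1 -= lr * (X.T @ dZ)
    b1 -= lr * dZ.sum(axis=0)
```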
In nanoGPT, the training loop in `train.py` does exactly the same thing, but with the AdamW optimizer (a fancier version of gradient descent) and batches of data instead of the full dataset.
## 5. Running the numpy version
```bash
python nn_numpy.py
```
```
Epoch 0 Loss: 0.283941
Epoch 500 Loss: 0.001253
Epoch 1000 Loss: 0.000412
Epoch 1500 Loss: 0.000178
Epoch 2000 Loss: 0.000082
Epoch 2500 Loss: 0.000040
Epoch 3000 Loss: 0.000021
Epoch 3500 Loss: 0.000012
Epoch 4000 Loss: 0.000008
Epoch 4500 Loss: 0.000005
Epoch 4999 Loss: 0.000004
Final loss: 0.000004
Network: 1 input -> 10 hidden (tanh) -> 1 output
Total parameters: 31
```
The script produces a plot (`nn_fit.png`) showing the fit and the training loss curve. You should see the network's prediction closely tracking the NIST data points, and the loss dropping rapidly in the first 1000 epochs before leveling off.
> **Exercise 1:** Read through `nn_numpy.py` carefully. Identify where each of the following happens: (a) forward pass, (b) loss calculation, (c) backpropagation, (d) gradient descent update. Annotate your copy with comments.
> **Exercise 2:** Change the number of hidden neurons `H`. Try 2, 5, 10, 20, 50. How does the fit change? How many parameters does each network have? At what point does adding more neurons stop helping?
## 6. The PyTorch version
Now look at `nn_torch.py`. It does the same thing, but in about half the code:
```bash
python nn_torch.py
```
Compare the two scripts side by side. The key differences:
| | numpy version | PyTorch version |
|---|---|---|
| Define layers | Manual weight matrices | `nn.Linear(1, H)` |
| Forward pass | `X @ W1 + b1`, `np.tanh(...)` | `model(X)` |
| Backprop | Hand-coded chain rule | `loss.backward()` |
| Weight update | `W -= lr * dW` | `optimizer.step()` |
| Lines of code | ~80 | ~40 |
PyTorch's `loss.backward()` computes all the gradients we wrote out by hand, automatically. This is called **automatic differentiation**. It's what makes training networks with millions of parameters feasible.
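You can see automatic differentiation at work on a toy example (a standalone sketch, independent of the scripts):
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2         # forward pass: y = x^2
y.backward()     # backward pass: compute dy/dx via the chain rule
print(x.grad)    # tensor(6.) -- the analytical derivative 2x at x = 3
```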
The `nn.Sequential` definition:
```python
model = nn.Sequential(
nn.Linear(1, H), # input -> hidden (W1, b1)
nn.Tanh(), # activation
nn.Linear(H, 1), # hidden -> output (W2, b2)
)
```
looks simple here, but it uses the same PyTorch building blocks as nanoGPT's `model.py` (`nn.Linear` layers and activation functions), just with more layers, attention mechanisms, and a much larger vocabulary.
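The training loop around that model follows the standard PyTorch pattern, roughly like this sketch (hyperparameters and tensor names are illustrative; `X` and `y` are assumed to be (N, 1) float tensors of the normalized data):
```python
import torch
import torch.nn as nn

H = 10
model = nn.Sequential(nn.Linear(1, H), nn.Tanh(), nn.Linear(H, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# X, y: (N, 1) float tensors of normalized T and Cp, prepared earlier
for epoch in range(5000):
    y_hat = model(X)          # forward pass
    loss = loss_fn(y_hat, y)  # MSE loss
    optimizer.zero_grad()     # clear gradients from the previous step
    loss.backward()           # automatic differentiation: all 31 gradients
    optimizer.step()          # gradient descent update
```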
> **Exercise 3:** In the PyTorch version, replace `nn.Tanh()` with `nn.ReLU()` or `nn.Sigmoid()`. How does the fit change? Why might different activation functions work better for different problems?
> **Exercise 4:** Replace the Adam optimizer with plain SGD: `torch.optim.SGD(model.parameters(), lr=0.01)`. How does training speed compare? Try increasing the learning rate. What happens?
## 7. Normalization
Both scripts normalize the input ($T$) and output ($C_p$) to the range [0, 1] before training. This is important:
- Raw $T$ values range from 300 to 2000, while $C_p$ ranges from 1.04 to 1.28
- With unnormalized data, the gradients for the input weights would be hundreds of times larger than for the output weights
- The network would struggle to learn — or need a much smaller learning rate
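A minimal min-max scaling sketch (the scripts may implement it slightly differently; `T_raw` and `Cp_raw` are the unscaled columns, and `y_hat` a normalized prediction):
```python
# Scale both variables to [0, 1] before training
T = (T_raw - T_raw.min()) / (T_raw.max() - T_raw.min())
Cp = (Cp_raw - Cp_raw.min()) / (Cp_raw.max() - Cp_raw.min())

# To report predictions, undo the output scaling back to kJ/(kg K)
Cp_pred = y_hat * (Cp_raw.max() - Cp_raw.min()) + Cp_raw.min()
```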
Try it yourself:
> **Exercise 5:** Comment out the normalization in `nn_numpy.py` (use `T_raw` and `Cp_raw` directly). What happens to the training loss? Can you fix it by changing the learning rate?
## 8. Overfitting
With 31 parameters and 35 data points, our network is close to the edge. What happens with more parameters than data?
> **Exercise 6:** Increase `H` to 100 (giving 301 parameters — nearly 10× the number of data points). Train for 20,000 epochs. Plot the fit. Does it match the training data well? Now generate predictions at $T$ = 275 K and $T$ = 2100 K (outside the training range). Are they reasonable?
This is **overfitting** — the network memorizes the training data but fails to generalize. It's the same concept we discussed in Part I when nanoGPT's validation loss started increasing while the training loss kept decreasing.
In practice, we combat overfitting with:
- More data
- Regularization (dropout — remember this parameter from nanoGPT?)
- Early stopping (stop training when validation loss starts increasing; see the sketch after this list)
- Keeping the model appropriately sized for the data
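As an example of the third option, early stopping might look like this sketch, reusing `model`, `loss_fn`, and `optimizer` from the PyTorch loop above (the train/validation split tensors are hypothetical; neither script makes one):
```python
# X_train/y_train and X_val/y_val: hypothetical train/validation splits
best_val, patience, wait = float("inf"), 200, 0
for epoch in range(20000):
    loss = loss_fn(model(X_train), y_train)  # train step on the training split
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                    # evaluate on held-out points
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, wait = val_loss, 0         # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:                 # no improvement for `patience` epochs
            break                            # stop before the network overfits
```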
## 9. Connecting back to LLMs
Everything you've built in this section scales up to large language models:
| This tutorial | nanoGPT / LLMs |
|---|---|
| 31 parameters | 800K to 70B+ parameters |
| 1 hidden layer | 4 to 96+ layers |
| tanh activation | GELU activation |
| MSE loss | Cross-entropy loss |
| Plain gradient descent | AdamW optimizer |
| Numpy arrays | PyTorch tensors (on GPU) |
| Fitting $C_p(T)$ | Predicting next tokens |
The fundamental loop — forward pass, compute loss, backpropagate, update weights — is identical. The difference is scale: more layers, more data, more compute, and architectural innovations like self-attention.
## Additional resources and references
### NIST Chemistry WebBook
- https://webbook.nist.gov/ — thermophysical property data used in this tutorial
### PyTorch
- Tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html
- `nn.Module` documentation: https://pytorch.org/docs/stable/nn.html
### Reading
- The "backpropagation" chapter in Goodfellow, Bengio & Courville, *Deep Learning* (2016), freely available at https://www.deeplearningbook.org/
- 3Blue1Brown, *Neural Networks* video series: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi — excellent visual intuition for how neural networks learn