Sync changes from che-computing
- Fix checkpoint directory name in 01-nanogpt
- Add generative text references (OUTPUT, Love Letters)
- Add PYTORCH.md troubleshooting (MPS, CUDA, WSL)
- Minor spacing fix in 02-ollama

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 564e75b824
commit 794cdaea0d
3 changed files with 119 additions and 4 deletions
01-nanogpt/PYTORCH.md (new file, 111 lines)

@@ -0,0 +1,111 @@
# PyTorch Troubleshooting

## Which device should I use?

PyTorch can run on CPU, NVIDIA GPU (CUDA), or Apple GPU (MPS). Use this to check what's available on your machine:

```bash
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('MPS:', torch.backends.mps.is_available())"
```

Then use the appropriate `--device` flag when running nanoGPT:

| Hardware | Flag |
|----------|------|
| No GPU / any machine | `--device=cpu` |
| Apple Silicon (M1/M2/M3/M4) | `--device=mps` |
| NVIDIA GPU | `--device=cuda` |

CPU works everywhere but is the slowest. For the exercises in this course, CPU is fine.
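If you would rather not remember the flag, the availability checks above can be collapsed into a tiny helper. This is a sketch, not part of nanoGPT; the `pick_device` name is my own:

```python
import torch

def pick_device() -> str:
    """Pick the best available device, preferring CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    # The mps backend only exists on recent PyTorch builds, so guard the lookup.
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```
You could then pass the result along as `--device=$(python pick_device.py)` or use it directly in your own scripts.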

## Apple Silicon (macOS)

The default PyTorch installed by `uv add torch` includes MPS (Metal Performance Shaders) support out of the box. No special installation is needed.

To use it with nanoGPT:

```bash
python train.py config/train_shakespeare_char.py --device=mps --compile=False
python sample.py --out_dir=out-shakespeare-char --device=mps --compile=False
```

Note: `--compile=False` is required on MPS because `torch.compile` does not support the MPS backend.
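To confirm a tensor really lands on the GPU, you can allocate one on MPS and print its device. A small sketch (falls back to CPU on machines without MPS, so it is safe to run anywhere):

```python
import torch

# Prefer MPS when available (Apple Silicon), otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(3, 3, device=device)
print(x.device)
```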

## Windows with NVIDIA GPU (WSL)

If you have a Windows laptop with an NVIDIA GPU (common on gaming laptops with RTX 3060, 4060, etc.), you can use it through WSL. The key requirement is that the NVIDIA drivers are installed on the **Windows side**. WSL automatically bridges to the Windows GPU driver, so you do not need to install CUDA or NVIDIA drivers inside WSL itself.

Check that WSL can see your GPU:

```bash
nvidia-smi
```

If this works and shows your GPU, follow the NVIDIA GPU instructions below to install PyTorch with CUDA support.

If `nvidia-smi` is not found, make sure you have the latest NVIDIA drivers installed in Windows (download from https://www.nvidia.com/Download/index.aspx). After installing or updating the driver, restart WSL (`wsl --shutdown` from PowerShell, then reopen Ubuntu).

If you have an Intel or AMD integrated GPU without an NVIDIA card, use `--device=cpu`. CPU mode works fine for all the exercises in this course.
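The two outcomes above can be folded into one guard. A sketch (the `--query-gpu`/`--format` options are standard `nvidia-smi` flags, but the messages are my own):

```shell
#!/bin/sh
# Report the GPU if the Windows driver bridge is visible inside WSL,
# otherwise suggest CPU mode.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
    echo "nvidia-smi not found: use --device=cpu (or update the Windows NVIDIA driver)"
fi
```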

## NVIDIA GPU (Linux / WSL)

### Problem: "NVIDIA driver is too old" or CUDA not found

If you see an error like:

```
RuntimeError: The NVIDIA driver on your system is too old (found version 12020)
```

or if this check fails:

```bash
python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"
# If cuda is None and is_available() is False, you have the CPU-only build
```

The issue is that `uv add torch` installs the **CPU-only** PyTorch wheel by default. It does not include CUDA support.

### Fix: reinstall PyTorch with CUDA

First, check your NVIDIA driver version:

```bash
nvidia-smi
```

Look for the "CUDA Version" in the top right of the output. Then install the matching PyTorch CUDA wheels. For most systems with CUDA 12.x drivers:

```bash
uv pip uninstall torch torchvision torchaudio
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

For older drivers with CUDA 11.8:

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Then verify:

```bash
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
```

### Notes

- The CUDA 12.1 wheels are compatible with CUDA 12.x drivers (e.g., 12.2, 12.4).
- `uv pip install` operates on the active virtual environment only and does not affect other environments.
- See https://pytorch.org/get-started/locally/ for the full compatibility matrix.

## General tips

- Always check your device before starting a long training run.
- If something is not working, run the diagnostic check at the top of this file first.
- The `--compile=False` flag is needed on CPU and MPS. It is optional (but harmless) on CUDA.
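One way to act on the first tip: run a tiny tensor operation on the device you intend to train on before committing to a long run. A sketch of mine, not part of nanoGPT:

```python
import torch

def device_smoke_test(device: str) -> bool:
    """Run a tiny matmul on `device`; True means the device is usable."""
    try:
        x = torch.ones((4, 4), device=device)
        y = x @ x  # each entry of a 4x4 all-ones matmul is 4.0
        return bool((y == 4.0).all())
    except RuntimeError:
        return False

print(device_smoke_test("cpu"))
```
CPU should always pass; a `False` on `"cuda"` or `"mps"` usually means the wrong wheel or a driver problem, covered in the sections above.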
@@ -195,7 +195,7 @@ Every 250th iteration, the training script does a validation step. If the valida
 ```
 step 250: train loss 2.4293, val loss 2.4447
-saving checkpoint to out-shakespeare-char-cpu
+saving checkpoint to out-shakespeare-char
 ...
 ```
@@ -205,7 +205,7 @@ When we train nanoGPT, it starts with randomly assigned weights and biases. This
 > **Exercise 3:** As the model trains, it reports the training and validation losses. In a Jupyter notebook, plot these values with the number of iterations. *Hint:* To capture the output when you perform a training run, you could run the process in the background while redirecting its output to a file: `python train.py config/train_shakespeare_char.py [options] > output.txt &`. (Remember, the ampersand at the end runs the process in the background.) You can still monitor the run by typing `tail -f output.txt`. This command will "follow" the end of the file as it is written.

-After the training finishes, we should have the model in `/out-shakespeare-char-cpu`:
+After the training finishes, we should have the model in `/out-shakespeare-char`:

 ```
 $ ls -l
@@ -221,7 +221,7 @@ In this case, the model is about 9.3 MB. That's not great! Our *training* text w
 The script `sample.py` runs inference on the model we just trained. We're using the CPU here, too.

 ```bash
-python sample.py --out_dir=out-shakespeare-char-cpu --device=cpu
+python sample.py --out_dir=out-shakespeare-char --device=cpu
 ```

 After a short time, the model will begin generating text.
@@ -376,3 +376,7 @@ These books are informative and accessible resources for understanding the under
 Including the sections:
 - Attention and LLMs - https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html
 - Softmax - https://d2l.ai/chapter_linear-classification/softmax-regression.html
+
+If generating text with computers tickles your fancy, I recommend checking out the book *OUTPUT: An Anthology of Computer-Generated Text* by Lillian-Yvonne Bertram and Nick Montfort. It is a timely book covering a wide range of texts, "from research systems, natural-language generation products and services, and artistic and literary programs." (Bertram, Lillian-Yvonne, and Nick Montfort, editors. Output: An Anthology of Computer-Generated Text, 1953–2023. The MIT Press, 2024.)
+
+While it still feels novel to many of us, interest in machine-generated or "generative" text dates almost to the beginning of the modern computer era. Many experiments, spanning a context from AI research to artistic and literary practices, have been shared over the intervening decades. Christopher Strachey's program, often referred to as *Love Letters*, was written in 1952 for the Manchester Mark I computer. It is considered by many to be the first example of generative computer literature. In 2009, David Link ran Strachey's original code on an emulated Mark I, and Nick Montfort, professor of digital media at MIT, coded a modern recreation of it in 2014. The text output follows the pattern "you are my [adjective] [noun]. my [adjective] [noun] [adverb] [verbs] your [adjective] [noun]," signed by "M.U.C." for the Manchester University Computer. With the vocabulary in the program, there are over 300 billion possible combinations.