Cleanup edits to module 01 and 05 walkthroughs.
2026-04-02 12:55:14 -04:00
commit 896570f71c
parent e10e411e41
2 changed files with 17 additions and 19 deletions
--- a/01-nanogpt/README.md
+++ b/01-nanogpt/README.md
@@ -85,7 +85,7 @@ drwxr-xr-x  5 furst  staff     160 Apr 17 12:44 data/

 Here's a quick run-down on some of the files and directories:

-- `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into smaller units that a machine learning model can process.)
+- `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into units that a machine learning model can process.)
 - `/config` — scripts to train or finetune the model, depending on the tokenization method used.
 - `train.py` — a Python script that trains the model. This will build the weights and biases of the transformer.
 - `sample.py` — a Python script that runs inference on the model. This is a "prompt" script that will cause the model to begin generating text.
@@ -124,7 +124,7 @@ total 6576
 -rw-r--r--  1 furst  staff   223080 Apr 17 14:54 val.bin
 ```

-The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format — it can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
+The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format. It can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.

 > **Exercise 1:** The `prepare.py` script downloads and tokenizes a version of *Tiny Shakespeare*. How big is the text file? Use the command `wc` to find the number of lines, words, and characters. Examine the text with the command `less`.