Cleanup edits to module 01 and 05 walkthroughs.
Eric 2026-04-02 12:55:14 -04:00
commit 896570f71c
2 changed files with 17 additions and 19 deletions

@@ -85,7 +85,7 @@ drwxr-xr-x 5 furst staff 160 Apr 17 12:44 data/
Here's a quick run-down on some of the files and directories:
-- `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into smaller units that a machine learning model can process.)
+- `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into units that a machine learning model can process.)
- `/config` — scripts to train or finetune the model, depending on the tokenization method used.
- `train.py` — a Python script that trains the model. This will build the weights and biases of the transformer.
- `sample.py` — a Python script that runs inference on the model. This is a "prompt" script that will cause the model to begin generating text.
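
To make the idea of character-level tokenization concrete, here is a minimal sketch. It is not the repository's `prepare.py`; the sample string and the names `stoi`, `itos`, `encode`, and `decode` are assumptions chosen for illustration. It builds a vocabulary from the distinct characters in a text and maps each character to an integer id.

```python
# Minimal sketch of character-level tokenization (illustrative only;
# the sample text and all names here are assumptions, not the repo's code).
text = "To be, or not to be"

# The vocabulary is every distinct character in the text, sorted so the
# id assignment is reproducible from run to run.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)
print(len(chars), "distinct characters in the vocabulary")
print(ids[:5], "->", repr(decode(ids[:5])))
```

Because every character gets its own id, the vocabulary stays tiny compared with a BPE vocabulary, at the cost of longer token sequences for the same text.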
@@ -124,7 +124,7 @@ total 6576
-rw-r--r-- 1 furst staff 223080 Apr 17 14:54 val.bin
```
-The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format — it can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
+The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format. It can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
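
The sketch below shows one way such files could be written and read back using only the standard library. It makes several assumptions: the metadata dictionary and token ids are made up, the files are written to a temporary directory rather than `/data/shakespeare_char`, and token ids are stored as unsigned 16-bit integers, which this walkthrough does not state. Treat it as an illustration of the pickle and binary round trip, not as the script's actual implementation.

```python
import pickle
import tempfile
from array import array
from pathlib import Path

# Hypothetical metadata and token ids, invented for this example.
meta = {"vocab_size": 3, "itos": {0: "a", 1: "b", 2: "c"}}
token_ids = [0, 1, 2, 1, 0]

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.pkl"
    bin_path = Path(tmp) / "train.bin"

    # Serialize the Python object with pickle.
    with open(meta_path, "wb") as f:
        pickle.dump(meta, f)

    # Write token ids as raw unsigned shorts (assumed storage format).
    with open(bin_path, "wb") as f:
        array("H", token_ids).tofile(f)

    # Only unpickle files you created yourself or otherwise trust:
    # pickle.load can run arbitrary code embedded in the file.
    with open(meta_path, "rb") as f:
        loaded_meta = pickle.load(f)

    # Read the token ids back from the binary file.
    loaded_ids = array("H")
    with open(bin_path, "rb") as f:
        loaded_ids.fromfile(f, len(token_ids))

    bin_size = bin_path.stat().st_size

print(loaded_meta["vocab_size"], list(loaded_ids), bin_size, "bytes")
```

The binary file holds nothing but the integers themselves, so its size is the number of tokens times the size of one item, while the pickle file can hold structured objects such as dictionaries.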
> **Exercise 1:** The `prepare.py` script downloads and tokenizes a version of *Tiny Shakespeare*. How big is the text file? Use the command `wc` to find the number of lines, words, and characters. Examine the text with the command `less`.