Cleanup edits to module 01 and 05 walkthroughs.
This commit is contained in:
parent
e10e411e41
commit
896570f71c
2 changed files with 17 additions and 19 deletions
@@ -85,7 +85,7 @@ drwxr-xr-x 5 furst staff 160 Apr 17 12:44 data/
Here's a quick run-down on some of the files and directories:
- - `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into smaller units that a machine learning model can process.)
+ - `/data` — contains three datasets for training the nanoGPT. Two of these (`/data/openwebtext` and `/data/shakespeare`) encode the training datasets into the GPT-2 tokens (byte pair encoding, or BPE). We will focus on the third, `/data/shakespeare_char`, which will generate a character-level tokenization of the text. (Tokenization is the process of breaking down text into units that a machine learning model can process.)
- `/config` — scripts to train or finetune the model, depending on the tokenization method used.
- `train.py` — a Python script that trains the model. This will build the weights and biases of the transformer.
- `sample.py` — a Python script that runs inference on the model. This is a "prompt" script that will cause the model to begin generating text.
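Character-level tokenization, as used by `/data/shakespeare_char`, can be sketched in a few lines. This is an illustrative snippet, not the repository's actual code; the names `stoi`, `itos`, `encode`, and `decode` are invented for the example:

```python
# Build a character-level vocabulary: every distinct character becomes a token.
text = "To be, or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char

def encode(s):
    return [stoi[c] for c in s]                # text -> list of token ids

def decode(ids):
    return "".join(itos[i] for i in ids)       # token ids -> text

ids = encode("to be")
assert decode(ids) == "to be"                  # round-trips exactly
```

GPT-2's BPE tokenizer works on multi-character units instead, which gives a much larger vocabulary but shorter token sequences.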
@@ -124,7 +124,7 @@ total 6576
-rw-r--r-- 1 furst staff 223080 Apr 17 14:54 val.bin
```
- The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format — it can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
+ The script downloads `input.txt` and tokenizes the text. It splits the tokenized text into two binary files: `train.bin` and `val.bin`. These are the training and validation datasets. `meta.pkl` is a Python pickle file that contains information about the model size and parameters. Pickle is Python's built-in serialization format. It can store arbitrary Python objects as binary files, which makes it convenient *but also a security concern* since loading an untrusted pickle can execute arbitrary code.
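The outputs described above can be mimicked with the standard library. This is a minimal stand-in for what `prepare.py` does, not its actual code; the 90/10 split ratio, the 16-bit token ids, and the `vocab_size` value are assumptions for illustration:

```python
import array
import pickle

# Write stand-in token ids to train.bin / val.bin, plus metadata to meta.pkl.
ids = array.array("H", range(100))     # "H" = unsigned 16-bit ints (fake token ids)
n = int(0.9 * len(ids))                # assume 90% of the tokens go to training
with open("train.bin", "wb") as f:
    f.write(ids[:n].tobytes())         # training split
with open("val.bin", "wb") as f:
    f.write(ids[n:].tobytes())         # validation split

meta = {"vocab_size": 65}              # illustrative tokenizer metadata
with open("meta.pkl", "wb") as f:
    pickle.dump(meta, f)               # serialize the dict to a pickle file
with open("meta.pkl", "rb") as f:
    restored = pickle.load(f)          # only ever load pickles you trust
assert restored == meta
```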
> **Exercise 1:** The `prepare.py` script downloads and tokenizes a version of *Tiny Shakespeare*. How big is the text file? Use the command `wc` to find the number of lines, words, and characters. Examine the text with the command `less`.
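For the exercise, the commands look like this, demonstrated on a small stand-in file; substitute the path to the downloaded `input.txt` in your own checkout:

```shell
# wc reports the line, word, and byte counts for a file.
printf 'To be, or not to be\nthat is the question\n' > sample.txt
wc sample.txt        # lines, words, bytes, filename
wc -l sample.txt     # just the line count
# To page through the file interactively, use: less sample.txt
```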