Remove all non-English texts and notices (rasbt#304)

* Remove all non-English texts and the Project Gutenberg notices

1. About 18 GB of text remains after filtering with `is_english`.
2. The Project Gutenberg notices are removed using gutenberg's `strip_headers` (a rough sketch of both steps follows below).
3. After re-running `get_data.py`, all data appears to end up under the `gutenberg/data/.mirror` folder.

* Some improvements

* Update README

---------

Co-authored-by: rasbt <[email protected]>
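The two preprocessing steps described in the commit message can be illustrated with a rough sketch. This is not the code from the PR: the ASCII-ratio heuristic for `is_english` and the import of `strip_headers` from the `gutenberg` Python package are assumptions, and the repository may implement both differently (for example, by using the `strip_headers` helper bundled with the cloned gutenberg scraper code).

```python
# Rough sketch (assumed, not the exact PR code): keep only files that are
# primarily English and strip the Project Gutenberg boilerplate notices.
from pathlib import Path

# Assumes the `gutenberg` Python package; the PR may instead use the
# strip_headers helper shipped with the cloned gutenberg scraper code.
from gutenberg.cleanup import strip_headers


def is_english(text, threshold=0.9):
    # Assumed heuristic: treat a file as English if most characters are ASCII.
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return ascii_chars / max(len(text), 1) > threshold


def clean_file(path):
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    if not is_english(text):
        print(f"Skipping {path} as it does not contain primarily English text.")
        return None
    # Remove the Project Gutenberg license notice at the top and bottom of each file.
    return strip_headers(text)
```

Applied to all raw files, a filter along these lines produces the "Skipping ..." messages shown in the README diff below.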
ch05/03_bonus_pretraining_on_gutenberg/README.md (+8 −1)
@@ -82,11 +82,18 @@ Next, run the `prepare_dataset.py` script, which concatenates the (as of this wr
 
 ```bash
 python prepare_dataset.py \
-  --data_dir gutenberg/data \
+  --data_dir gutenberg/data/raw \
   --max_size_mb 500 \
   --output_dir gutenberg_preprocessed
 ```
 
+```
+...
+Skipping gutenberg/data/raw/PG29836_raw.txt as it does not contain primarily English text. Skipping gutenberg/data/raw/PG16527_raw.txt as it does not contain primarily English text. 100%|██████████████████████████████████████████████████████████| 57250/57250 [25:04<00:00, 38.05it/s]
+42 file(s) saved in /Users/sebastian/Developer/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed
+```
+
+
 > [!TIP]
 > Note that the produced files are stored in plaintext format and are not pre-tokenized for simplicity. However, you may want to update the codes to store the dataset in a pre-tokenized form to save computation time if you are planning to use the dataset more often or train for multiple epochs. See the *Design Decisions and Improvements* at the bottom of this page for more information.
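Regarding the tip about pre-tokenizing: one possible way to cache the tokenized data, sketched here as an illustration rather than as part of this PR, is to encode each preprocessed file once with the GPT-2 BPE tokenizer from `tiktoken` and store the token IDs as a NumPy array (the file names below are placeholders):

```python
# Illustrative pre-tokenization sketch (not part of this PR): encode a
# preprocessed text file once and cache the token IDs on disk.
import numpy as np
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Placeholder file name; use one of the files in gutenberg_preprocessed/.
with open("gutenberg_preprocessed/combined_1.txt", "r", encoding="utf-8") as f:
    text = f.read()

token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

# GPT-2's vocabulary (50,257 tokens) fits into uint16, which keeps the cache small.
np.save("gutenberg_preprocessed/combined_1_tokens.npy", np.array(token_ids, dtype=np.uint16))

# Training code can then load (or memory-map) the cached IDs instead of re-tokenizing:
# token_ids = np.load("gutenberg_preprocessed/combined_1_tokens.npy")
```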