
Commit 6df65c1

jlamypoirier, tscholak, and oleksost authored
Dataset from file (#146)
Co-authored-by: Torsten Scholak <[email protected]> Co-authored-by: Oleksiy Ostapenko <[email protected]>
1 parent 23006dc commit 6df65c1

File tree

16 files changed: +648, -101 lines


docs/quick-start.md

Lines changed: 49 additions & 34 deletions
@@ -224,7 +224,8 @@ Choose based on your goals for this tutorial.
 
 For this tutorial, we'll use text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
-Create a configuration file for the dataset preparation. Copy the following content:
+Create a configuration file for the dataset preparation.
+Save the following as `./fast-llm-tutorial/prepare-config.yaml`:
 
 === "Small"
 
@@ -242,10 +243,15 @@ Create a configuration file for the dataset preparation. Copy the following cont
 
 tokenizer:
   path: fast-llm-tutorial/pretrained-model
+
+splits: # (3)!
+  training: 0.9
+  validation: 0.1
 ```
 
 1. Processing speed scales linearly with the number of CPUs.
 2. This small dataset restricts to the first 10K records of the OpenWebText dataset to speed up the process. If you want to use the full dataset, replace with `openwebtext`.
+3. 90% train, 10% validation. These settings need to be adjusted based on the size of your dataset.
 
 === "Big"
 
@@ -263,11 +269,14 @@ Create a configuration file for the dataset preparation. Copy the following cont
 
 tokenizer:
   path: fast-llm-tutorial/pretrained-model
+
+splits: # (2)!
+  training: 0.99
+  validation: 0.01
 ```
 
 1. Processing speed scales linearly with the number of CPUs.
-
-Save it as `./fast-llm-tutorial/prepare-config.yaml`.
+2. 99% train, 1% validation. These settings need to be adjusted based on the size of your dataset.
 
 Fast-LLM ships with a `prepare` command that will download and preprocess the dataset for you.
 
@@ -498,33 +507,36 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   sequence_length: 1024
   batch_size: 480 # (5)!
 data:
-  format: file
-  path: fast-llm-tutorial/dataset/fast_llm_dataset.json # (6)!
-  split: [9, 1, 0] # (7)!
+  datasets:
+    Training:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (6)!
+    Validation:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (6)!
 optimizer:
   learning_rate:
     base: 6.0e-04
 pretrained:
-  format: llama # (8)!
+  format: llama # (7)!
   path: fast-llm-tutorial/pretrained-model
-  model_weights: no # (9)!
+  model_weights: no # (8)!
 model:
   base_model:
     transformer:
-      use_flash_attention: yes # (10)!
+      use_flash_attention: yes # (9)!
   distributed:
-    training_dtype: bf16 # (11)!
+    training_dtype: bf16 # (10)!
 run:
   experiment_dir: fast-llm-tutorial/experiment
 ```
 
 1. For the small run, we'll stop after 100 iterations.
 2. The trained model will be saved in `Transformers` Llama format to `fast-llm-tutorial/experiment/export/llama/100` at the end of the small run. You can also save as a `Fast-LLM` checkpoint by setting the `format` to `fast_llm`.
 3. Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `null` with your own W&B entity name. If you don't want to use W&B, just ignore this section.
-3. Adjust the number of sequences per GPU based on GPU memory. For SmolLM2-135M at 1024 sequenced length and a 80GB GPU, a `micro_batch_size` of 60 should work well.
-4. Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
-5. Location of the dataset metadata file generated in Step 4.
-6. 90% train, 10% validation, 0% test. These settings need to be adjusted based on the size of your dataset.
+4. Adjust the number of sequences per GPU based on GPU memory. For SmolLM2-135M at a 1024 sequence length and an 80GB GPU, a `micro_batch_size` of 60 should work well.
+5. Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
+6. Location of the dataset metadata files generated in Step 4.
 7. Format of the pretrained model. Since SmolLM is a Llama model, we set this to `llama`.
 8. We'll train SmolLM2-135M from scratch. You can set to `yes` to continue training from a checkpoint (if you put one in the model directory).
 9. By default, Fast-LLM uses FlashAttention for faster training. If you're using Volta GPUs, set this to `no`.
@@ -556,32 +568,36 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   sequence_length: 4096
   batch_size: 512 # (5)!
 data:
-  format: file
-  path: fast-llm-tutorial/dataset/fast_llm_dataset.json # (6)!
-  split: [99, 1, 0] # (7)!
-optimizer: # (8)!
+  datasets:
+    Training:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (6)!
+    Validation:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (6)!
+optimizer: # (7)!
   weight_decay: 0.1
   beta_1: 0.9
   beta_2: 0.95
-  learning_rate: # (9)!
+  learning_rate: # (8)!
     base: 6.0e-04
     minimum: 6.0e-05
     decay_style: cosine
     decay_iterations: 100_000
     warmup_iterations: 2000
 pretrained:
-  format: llama # (10)!
+  format: llama # (9)!
   path: fast-llm-tutorial/pretrained-model
-  model_weights: yes # (11)!
+  model_weights: yes # (10)!
 model:
   base_model:
     transformer:
-      use_flash_attention: yes # (12)!
-      cross_entropy_impl: fused # (13)!
+      use_flash_attention: yes # (11)!
+      cross_entropy_impl: fused # (12)!
   multi_stage:
-    zero_stage: 2 # (14)!
+    zero_stage: 2 # (13)!
   distributed:
-    training_dtype: bf16 # (15)!
+    training_dtype: bf16 # (14)!
 run:
   experiment_dir: fast-llm-tutorial/experiment
 ```
@@ -592,15 +608,14 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
 4. Adjust the number of sequences per GPU based on GPU memory. Considering a 4k token sequence length and 80GB GPUs, a `micro_batch_size` of 1 should work well.
 5. Must be divisible by the number of GPUs and the `micro_batch_size`. At 4k tokens per sequence, 512 corresponds to about 2.1 million tokens per batch.
 6. Location of the dataset metadata file generated in Step 4.
-7. 99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
-8. These are good default optimizer settings for training models.
-9. We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
-10. Format of the pretrained model. Since it's a Llama model, we set this to `llama`.
-11. We want to continue training Llama-3.1-8B from a checkpoint. If you're training from scratch, set this to `no`.
-12. By default, Fast-LLM uses FlashAttention for faster training. If you're using Volta GPUs, set this to `no`.
-13. Configure Fast-LLM to use the fused cross-entropy loss implementation rather than the default Triton implementation for models with a large vocabulary size such as Llama-3.1-8B. This avoids issues with block size limitations in our current Triton code.
-14. We are using ZeRO stage 2 for this tutorial. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
-15. `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, use `fp16` (half-precision floating point) for training instead of `bf16`.
+7. These are good default optimizer settings for training models.
+8. We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
+9. Format of the pretrained model. Since it's a Llama model, we set this to `llama`.
+10. We want to continue training Llama-3.1-8B from a checkpoint. If you're training from scratch, set this to `no`.
+11. By default, Fast-LLM uses FlashAttention for faster training. If you're using Volta GPUs, set this to `no`.
+12. Configure Fast-LLM to use the fused cross-entropy loss implementation rather than the default Triton implementation for models with a large vocabulary size such as Llama-3.1-8B. This avoids issues with block size limitations in our current Triton code.
+13. We are using ZeRO stage 2 for this tutorial. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+14. `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, use `fp16` (half-precision floating point) for training instead of `bf16`.
 
 ## 🔑 (Optional) Step 6: Add Your Weights & Biases API Key
606621

docs/recipes/data-configuration.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
---
title: Configuring Data for Training
---

In this section, we show how to configure datasets through a series of examples.

We already saw an example dataset configuration in the [quick-start guide](../quick-start.md), where we prepared a simple dataset, split it into training and validation sub-datasets, and used these to train a small model. This was done by:

1. Defining a dataset preparation configuration.
2. Running `fast-llm prepare` with said configuration. This generated some binary files along with two fast-llm configuration files, `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` and `fast-llm-tutorial/dataset/fast_llm_config_validation.yaml`.
3. Defining a fast-llm data configuration that uses those datasets:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
        Validation:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
    ```

4. Running `fast-llm train` with said configuration.

In this section, we are interested in generalizing step 3. For more details on steps 1 and 2, please refer to the quick-start guide or [this example](data-configuration.md).

## Example 1: Blending multiple datasets

In this example, we have three datasets and want to sample from each of them during training with probabilities 0.70, 0.25 and 0.05. For this, we use the `blended` type which takes other datasets as arguments:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
        - type: file
          path: path/to/dataset_2.yaml
      weights: [0.70, 0.25, 0.05]
```

!!! note "Dataset wrappers"
    The `blended` dataset wrapper is one example of the many dataset wrappers available in fast-llm. Such wrappers may be nested (almost) arbitrarily to generate the dataset scheme that fits your needs. Fast-LLM will use the `type` argument to dynamically select the appropriate configuration class(es). With some effort you can even create your own wrapper!

## Example 2: Configure shuffling

In this example, we have a large dataset that comes pre-shuffled, so shuffling is unnecessary for the first epoch.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/dataset.yaml
  sampling:
    shuffle: skip_first_epoch
```

## Example 3: Disable shuffling for validation

In this example, we want to disable shuffling entirely, but only for the validation dataset. We can do this with the `sampled` dataset wrapper:

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        shuffle: disabled
```

!!! note "More about sampling configuration"
    Sampling parameters may be defined globally through the data configuration (example 2), through dataset wrapper(s) (examples 3, 4), or both (example 5). When a dataset's sampling is configured with both methods (or through multiple nested wrappers), the (innermost) wrapper overrides the data configuration (or the next-to-innermost wrapper) for the explicitly defined fields (and only those).
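
To make the note above concrete, here is a minimal sketch (paths are placeholders) in which shuffling is configured globally under `data.sampling` while a `sampled` wrapper overrides only the seed. Per the precedence rule described in the note, the wrapped dataset should still pick up `shuffle: skip_first_epoch` from the global configuration, since the wrapper does not set that field:

```yaml
data:
  datasets:
    Training:
      type: sampled
      dataset:
        type: file
        path: path/to/training_dataset.yaml
      sampling:
        seed: 1234  # Explicitly set here, so it overrides the global default seed.
        # `shuffle` is not set here, so the global `data.sampling.shuffle` applies.
  sampling:
    shuffle: skip_first_epoch
```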

## Example 4: Set sampling seed for individual datasets

In this example, we have a blend of datasets as in example 1, but we wish to set the seed for each dataset individually for reproducibility reasons. For this, we use the `seed` field of the `sampled` wrapper's `sampling` section:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_0.yaml
          sampling:
            seed: 1234
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_1.yaml
          sampling:
            seed: 2345
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_2.yaml
          sampling:
            seed: 3456
      weights: [0.70, 0.25, 0.05]
```

!!! note "Default seed"
    In the absence of an explicit seed, Fast-LLM uses a default seed (`data.sampling`'s default) instead, and uses seed shifts to ensure different seeds for each phase and for the various blended datasets.
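
If you want every dataset to derive its seed from one base value rather than the built-in default, that base seed can in principle be set once at the global level. The following is a minimal sketch of that idea; it assumes `data.sampling` accepts a `seed` field alongside `shuffle` (the note above implies this, but check the configuration reference for your Fast-LLM version):

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: file
      path: path/to/validation_dataset.yaml
  sampling:
    seed: 4567  # Assumed global base seed; per-phase and per-blend shifts would be derived from it.
```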

## Example 5: Advanced scenario

In this example, we combine everything we learned so far to create a complex scenario, where:

* The training dataset is a blend of two datasets, one of them being itself a blend of three datasets.
* All datasets except for one come pre-shuffled, so we can skip shuffling for the first epoch.
* We want to set the seed explicitly for the validation and innermost blended datasets, but keep the default seed for the others.

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: blended
            datasets:
              - type: file
                # Seed = 1234
                path: path/to/dataset_0.yaml
              - type: file
                # Seed = 1234 + blend_shift, shuffle = skip_first_epoch
                path: path/to/dataset_1.yaml
              - type: sampled
                dataset:
                  type: file
                  # Seed = 1234 + 2 * blend_shift, shuffle = epoch
                  path: path/to/dataset_2.yaml
                sampling:
                  # Shuffle each epoch independently (default shuffling)
                  shuffle: epoch
          sampling:
            seed: 1234
        - type: file
          # Seed = default + train_shift + 2 * blend_shift, shuffle = skip_first_epoch
          path: path/to/dataset_3.yaml
      weights: [0.70, 0.25, 0.05]
    Validation:
      type: sampled
      dataset:
        type: file
        # Seed = 2345, shuffle = skip_first_epoch
        path: path/to/validation_dataset.yaml
      sampling:
        seed: 2345
  sampling:
    shuffle: skip_first_epoch
```

!!! note "Configure from file"
    If a dataset configuration is especially complex and makes the overall data configuration excessively big, or is reused across many experiments, you may want to save it to a YAML file and refer to it in the config using a `file` dataset. This can be used to reduce the present example to:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: path/to/training_dataset_config.yaml
        Validation:
          type: file
          path: path/to/validation_dataset_config.yaml
      sampling:
        shuffle: skip_first_epoch
    ```

    In fact, all the `file` datasets we've been using so far are of this format, and consist of more elementary `memmap` datasets, optionally wrapped with `blended` and/or `slice` wrappers.
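
As a rough illustration of that last point, one of the generated dataset configuration files (for example, `fast-llm-tutorial/dataset/fast_llm_config_training.yaml`) might look something like the sketch below. This is not the literal output of `fast-llm prepare`: the `blended`, `slice`, and `memmap` type names come from the note above, but the shard paths, the `begin`/`end` fields, and the weights are assumptions made only to show the general shape of such a file:

```yaml
# Hypothetical sketch of a generated dataset config; field names are partly assumed.
type: blended
datasets:
  - type: slice
    dataset:
      type: memmap
      path: shard_0_0  # Assumed shard name produced by `fast-llm prepare`.
    begin: 0.0         # Assumed field: start of the training portion of this shard.
    end: 0.9           # Assumed field: end of the training portion of this shard.
  - type: slice
    dataset:
      type: memmap
      path: shard_0_1  # Assumed shard name.
    begin: 0.0
    end: 0.9
weights: [0.5, 0.5]    # Assumed: blending weights proportional to shard sizes.
```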

fast_llm/data/dataset/config.py

Lines changed: 1 addition & 3 deletions
@@ -216,9 +216,6 @@ def build_and_sample(
         from fast_llm.data.dataset.blended import BlendedDataset
 
         # Build and sample the datasets.
-        # TODO: Vary the seed?
-        # Add 5 times the standard deviation (of a binomial distribution)
-        # so the probability of sampling more than this amount during blending is negligible.
 
         sampled_datasets = [
             dataset.build_and_sample(
@@ -230,6 +227,7 @@ def build_and_sample(
                     if self.legacy
                     else math.ceil(weight * sampling.num_samples) + 1
                 ),
+                # TODO: Seed may not be unique for nested blended datasets.
                 config=sampling.config.to_copy({"seed": sampling.config.seed + i * (0 if self.legacy else 697)}),
             ),
         )
