
Commit a7d55d0

Authored by jlamypoirier, tscholak, and oleksost

Dataset configuration examples (#156)

Co-authored-by: Torsten Scholak <[email protected]>
Co-authored-by: Oleksiy Ostapenko <[email protected]>
1 parent 9395a0c commit a7d55d0

File tree

3 files changed: +188 −4 lines

Diff for: docs/recipes/data-configuration.md

+185
@@ -0,0 +1,185 @@
---
title: Configuring Data for Training
---

In this section we show how to configure datasets through a series of examples.

We already saw an example dataset configuration in the [quick-start guide](../quick-start.md), where we prepared a simple dataset, split it into training and validation sub-datasets, and used these to train a small model. This was done by:

1. Defining a dataset preparation configuration.
2. Running `fast-llm prepare` with said configuration. This generated some binary files along with two Fast-LLM configuration files, `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` and `fast-llm-tutorial/dataset/fast_llm_config_validation.yaml`.
3. Defining a Fast-LLM data configuration that uses those datasets:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
        Validation:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
    ```

4. Running `fast-llm train` with said configuration.

In this section we are interested in generalizing step 3. For more details on steps 1 and 2, please refer to the quick-start guide or the [dataset preparation recipe](data-preparation.md).
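
For orientation, every `data` section in the examples below is a fragment of the full training configuration passed to `fast-llm train`, not a standalone file. A minimal, hedged sketch of that layout (the other top-level sections are only hinted at in comments and follow the quick-start guide):

```yaml
# Sketch only: other top-level sections (model, training, optimizer, ...) are elided.
data:
  datasets:
    Training:
      type: file
      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
    Validation:
      type: file
      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
```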

## Example 1: Blending multiple datasets

In this example, we have three datasets and want to sample from each of them during training with probabilities 0.70, 0.25 and 0.05. For this, we use the `blended` type which takes other datasets as arguments:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
        - type: file
          path: path/to/dataset_2.yaml
      weights: [0.70, 0.25, 0.05]
```

!!! note "Dataset wrappers"
    The `blended` dataset wrapper is one example of the many dataset wrappers available in Fast-LLM. Such wrappers may be nested (almost) arbitrarily to generate the dataset scheme that fits your needs. Fast-LLM will use the `type` argument to dynamically select the appropriate configuration class(es). With some effort you can even create your own wrapper!

## Example 2: Configure shuffling

In this example, we have a large dataset that comes pre-shuffled, so shuffling is unnecessary for the first epoch.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/dataset.yaml
  sampling:
    shuffle: skip_first_epoch
```

## Example 3: Disable shuffling for validation

In this example, we want to disable shuffling entirely, but only for the validation dataset. We can do this with the `sampled` dataset wrapper:

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        shuffle: disabled
```

!!! note "More about sampling configuration"
    Sampling parameters may be defined globally through the data configuration (example 2), through dataset wrapper(s) (examples 3, 4), or both (example 5). When a dataset's sampling is configured through both methods (or through multiple nested wrappers), the innermost wrapper overrides the data configuration (or the next-to-innermost wrapper), but only for the fields it defines explicitly.
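
To make the precedence concrete, here is a hedged sketch (paths and the seed value are illustrative): the global `data.sampling` section sets a seed and shuffling for every dataset, while the `sampled` wrapper around the validation dataset overrides only `shuffle`, so the validation dataset still inherits the global seed.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        # Overrides only this field; the seed still comes from `data.sampling` below.
        shuffle: disabled
  sampling:
    seed: 7890
    shuffle: skip_first_epoch
```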

## Example 4: Set sampling seed for individual datasets

In this example, we have a blend of datasets as in example 1, but we wish to set the seed for each dataset individually for reproducibility reasons. For this, we wrap each dataset with the `sampled` wrapper and set the `seed` field of its `sampling` section:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_0.yaml
          sampling:
            seed: 1234
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_1.yaml
          sampling:
            seed: 2345
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_2.yaml
          sampling:
            seed: 3456
      weights: [0.70, 0.25, 0.05]
```

!!! note "Default seed"
    In the absence of an explicit seed, Fast-LLM uses a default seed (the default of `data.sampling`) instead, and applies seed shifts to ensure different seeds for each phase and for the various blended datasets.
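
If you want a different default seed without wrapping every dataset, the note above implies it can be set once at the `data.sampling` level; a hedged sketch (the seed value and paths are illustrative), with blended datasets still receiving their per-dataset shifts on top of this default:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
      weights: [0.80, 0.20]
  sampling:
    # Assumed: this replaces the built-in default seed for all datasets.
    seed: 4321
```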

## Example 5: Advanced scenario

In this example, we combine everything we learned so far to create a complex scenario, where:

* The training dataset is a blend of two datasets, one of them being itself a blend of three datasets.
* All datasets except for one come pre-shuffled, so we can skip shuffling for the first epoch.
* We want to set the seed explicitly for the validation and innermost blended datasets, but keep the default seed for the others.

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: blended
            datasets:
              - type: file
                # Seed = 1234
                path: path/to/dataset_0.yaml
              - type: file
                # Seed = 1234 + blend_shift, shuffle = skip_first_epoch
                path: path/to/dataset_1.yaml
              - type: sampled
                dataset:
                  type: file
                  # Seed = 1234 + 2 * blend_shift, shuffle = epoch
                  path: path/to/dataset_2.yaml
                sampling:
                  # Shuffle each epoch independently (default shuffling)
                  shuffle: epoch
          sampling:
            seed: 1234
        - type: file
          # Seed = default + train_shift + 2 * blend_shift, shuffle = skip_first_epoch
          path: path/to/dataset_3.yaml
      weights: [0.75, 0.25]
    Validation:
      type: sampled
      dataset:
        type: file
        # Seed = 2345, shuffle = skip_first_epoch
        path: path/to/validation_dataset.yaml
      sampling:
        seed: 2345
  sampling:
    shuffle: skip_first_epoch
```

!!! note "Configure from file"
    If a dataset configuration is especially complex and makes the training configuration excessively big, or is reused across many experiments, you may want to save it to a separate yaml file and refer to it in the config using a `file` dataset. This can be used to reduce the present example to:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: path/to/training_dataset_config.yaml
        Validation:
          type: file
          path: path/to/validation_dataset_config.yaml
      sampling:
        shuffle: skip_first_epoch
    ```

    In fact, all the elementary `file` datasets we've been using so far are of this format, and consist of more elementary `memmap` datasets, optionally wrapped with `blended` and/or `slice` wrappers.
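
To make that last point concrete, here is a hedged sketch of what a generated file such as `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` might look like, assuming the prepared data was split into a training slice of a single `memmap` shard. The field names (`dataset`, `begin`, `end`) and the shard path are illustrative assumptions, not copied from an actual generated file:

```yaml
# Illustrative sketch only: a `slice` wrapper around an elementary `memmap` dataset.
type: slice
dataset:
  type: memmap
  path: fast-llm-tutorial/dataset/shard_0_0  # assumed shard name
begin: 0.0   # assumed: keep the first 99% of documents for training
end: 0.99
```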

Diff for: fast_llm/data/dataset/config.py

+1 −3

@@ -216,9 +216,6 @@ def build_and_sample(
         from fast_llm.data.dataset.blended import BlendedDataset

         # Build and sample the datasets.
-        # TODO: Vary the seed?
-        # Add 5 times the standard deviation (of a binomial distribution)
-        # so the probability of sampling more than this amount during blending is negligible.

         sampled_datasets = [
             dataset.build_and_sample(
@@ -230,6 +227,7 @@ def build_and_sample(
                     if self.legacy
                     else math.ceil(weight * sampling.num_samples) + 1
                 ),
+                # TODO: Seed may not be unique for nested blended datasets.
                 config=sampling.config.to_copy({"seed": sampling.config.seed + i * (0 if self.legacy else 697)}),
             ),
         )

Diff for: mkdocs.yaml

+2-1
Original file line numberDiff line numberDiff line change
@@ -167,7 +167,8 @@ nav:
167167
- StarCoder 2: success-stories/starcoder-2.md
168168
- License: license.md
169169
- Recipes:
170-
- Data Preparation: recipes/data-preparation.md
170+
- Prepare a dataset: recipes/data-preparation.md
171+
- Configure a dataset: recipes/data-configuration.md
171172
- Train Llama 8B from scratch: recipes/train-llama-8b.md
172173
- Continue training Llama 8B: recipes/continue-training-llama-8b.md
173174
- Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
