# Dataset configuration examples #156

Merged (2 commits, Feb 26, 2025)
185 changes: 185 additions & 0 deletions docs/recipes/data-configuration.md
---
title: Configuring Data for Training
---

In this section, we show how to configure datasets through a series of examples.

We already saw an example dataset configuration in the [quick-start guide](../quick-start.md), where we prepared a simple dataset and split it into training and validation sub-datasets, and used these to train a small model. This was done by:

1. Defining a dataset preparation configuration (sketched after this list).
2. Running `fast-llm prepare` with said configuration. This generated some binary files along with two Fast-LLM configuration files, `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` and `fast-llm-tutorial/dataset/fast_llm_config_validation.yaml`.
3. Defining a Fast-LLM data configuration that uses those datasets:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
        Validation:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
    ```

4. Running `fast-llm train` with said configuration.
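
A minimal sketch of the step-1 preparation configuration, loosely based on the quick-start guide. The field names and values here (`output_path`, `dataset.path`, `tokenizer.path`, `splits`) are assumptions for illustration; refer to the quick-start guide for the exact schema:

```yaml
# Hypothetical prepare config; field names are assumptions, see the quick-start guide.
output_path: fast-llm-tutorial/dataset  # where the binary files and yaml configs are written

dataset:
  path: HuggingFaceFW/fineweb  # illustrative source dataset

tokenizer:
  path: fast-llm-tutorial/pretrained-model  # illustrative tokenizer location

# Assumed to produce fast_llm_config_training.yaml and fast_llm_config_validation.yaml.
splits:
  training: 0.9
  validation: 0.1
```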

In this section, we are interested in generalizing step 3. For more details on steps 1 and 2, please refer to the quick-start guide or [this example](data-preparation.md).

## Example 1: Blending multiple datasets

In this example, we have three datasets and want to sample from each of them during training with probabilities 0.70, 0.25 and 0.05. For this, we use the `blended` type, which takes other datasets as arguments:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
        - type: file
          path: path/to/dataset_2.yaml
      weights: [0.70, 0.25, 0.05]
```

Review discussion on this example:

> **Contributor:** Shouldn't it be `path/to/dataset_0/fast_llm_config_training.yaml`? This way it's clear that this should point to the YAML files created in the previous prepare step.
>
> **Author:** Seems like it would be a bit too verbose?

> **Contributor (@oleksost, Feb 20, 2025):** Question: in the past we were creating data mixes as a separate YAML file using fml-ops functionality (e.g. `/mnt/datasets/dolmafw70_fw30_merged.json`). Can a dataset with type `file` still point to one of those JSON files?
>
> **Author:** No, the JSON files are deprecated and only available with legacy dataset configs.

!!! note "Dataset wrappers"
    The `blended` dataset wrapper is one example of the many dataset wrappers available in Fast-LLM. Such wrappers may be nested (almost) arbitrarily to generate the dataset scheme that fits your needs, as example 5 below demonstrates. Fast-LLM uses the `type` argument to dynamically select the appropriate configuration class(es). With some effort you can even create your own wrapper!

## Example 2: Configure shuffling

In this example, we have a large dataset that comes pre-shuffled, so shuffling is unnecessary for the first epoch.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/dataset.yaml
  sampling:
    shuffle: skip_first_epoch
```

> **Contributor:** Same as before; for clarity it may be better to use `path/to/dataset/fast_llm_config_training.yaml`?

## Example 3: Disable shuffling for validation

In this example, we want to disable shuffling entirely, but only for the validation dataset. We can do this with the `sampled` dataset wrapper:

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        shuffle: disabled
```

!!! note "More about sampling configuration"
    Sampling parameters may be defined globally through the data configuration (example 2), through dataset wrapper(s) (examples 3, 4), or both (example 5). When a dataset's sampling is configured with both methods (or with multiple nested wrappers), the (innermost) wrapper overrides the data configuration (or the next-to-innermost wrapper), but only for the explicitly defined fields.
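
    As an illustrative sketch of this precedence (the path is hypothetical): the global configuration requests `skip_first_epoch`, the outer wrapper sets a seed, and the innermost wrapper's `shuffle` wins for the wrapped dataset, while the outer seed still applies because the inner wrapper does not set one:

    ```yaml
    data:
      datasets:
        Training:
          type: sampled
          dataset:
            type: sampled
            dataset:
              type: file
              path: path/to/dataset.yaml  # hypothetical path
            sampling:
              shuffle: epoch  # innermost wrapper: overrides shuffle for this dataset
          sampling:
            seed: 4321  # outer wrapper: still applies, since nothing inside overrides it
      sampling:
        shuffle: skip_first_epoch  # global default, overridden by the innermost wrapper
    ```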

## Example 4: Set sampling seed for individual datasets

In this example, we have a blend of datasets as in example 1, but we wish to set the seed for each dataset individually for reproducibility reasons. For this, we set the `seed` field in each `sampled` wrapper's `sampling` section:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_0.yaml
          sampling:
            seed: 1234
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_1.yaml
          sampling:
            seed: 2345
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_2.yaml
          sampling:
            seed: 3456
      weights: [0.70, 0.25, 0.05]
```

!!! note "Default seed"
    In the absence of an explicit seed, Fast-LLM uses a default seed (`data.sampling`'s default) instead, and applies seed shifts to ensure different seeds for each phase and for the various blended datasets.
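
    For instance, the default seed can be changed for the whole run at the data level. The sketch below assumes only that `data.sampling` accepts a `seed` field, as described above; the shift amounts are internal details:

    ```yaml
    data:
      sampling:
        # Global default seed; each phase and each blended sub-dataset
        # derives a distinct seed from this via fixed shifts.
        seed: 1234
      datasets:
        Training:
          type: file
          path: path/to/dataset.yaml  # hypothetical path
    ```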

## Example 5: Advanced scenario

In this example, we combine everything we learned so far to create a complex scenario, where:

* The training dataset is a blend of two datasets, one of them being itself a blend of three datasets.
* All datasets except for one come pre-shuffled, so we can skip shuffling for the first epoch.
* We want to set the seed explicitly for the validation and innermost blended datasets, but keep the default seed for the others.

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: blended
            datasets:
              - type: file
                # Seed = 1234
                path: path/to/dataset_0.yaml
              - type: file
                # Seed = 1234 + blend_shift, shuffle = skip_first_epoch
                path: path/to/dataset_1.yaml
              - type: sampled
                dataset:
                  type: file
                  # Seed = 1234 + 2 * blend_shift, shuffle = epoch
                  path: path/to/dataset_2.yaml
                sampling:
                  # Shuffle each epoch independently (default shuffling)
                  shuffle: epoch
          sampling:
            seed: 1234
        - type: file
          # Seed = default + train_shift + 2 * blend_shift, shuffle = skip_first_epoch
          path: path/to/dataset_3.yaml
      weights: [0.70, 0.25, 0.05]
    Validation:
      type: sampled
      dataset:
        type: file
        # Seed = 2345, shuffle = skip_first_epoch
        path: path/to/validation_dataset.yaml
      sampling:
        seed: 2345
  sampling:
    shuffle: skip_first_epoch
```

!!! note "Configure from file"
If a dataset configuration is especially complex and makes the dataset configuration excessively big, or is reused across many experiments, you may want to save it to a yaml file and refer to it un the config using a `file` dataset. This can be used to reduce the present example to
```yaml
data:
datasets:
Training:
type: file
path: path/to/training_dataset_config.yaml
Validation:
type: file
path: path/to/validation_dataset_config.yaml
sampling:
shuffle: skip_first_epoch
```
In fact, all the elementary datasets from file we've been using so far are of this format, and consist of more elementary `memmap` datasets optionally wrapped with `blended` and/or `slice` wrappers.
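
    As a sketch of what such a generated file might contain (the `slice` field names `begin`/`end` and the shard path are assumptions for illustration):

    ```yaml
    # Hypothetical contents of fast_llm_config_training.yaml:
    # a slice of a single memmap dataset.
    type: slice
    dataset:
      type: memmap
      path: shard_0_0  # hypothetical binary shard produced by `fast-llm prepare`
    begin: 0.0  # assumed fractional bounds for the slice wrapper
    end: 0.9
    ```
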
4 changes: 1 addition & 3 deletions fast_llm/data/dataset/config.py

```diff
@@ -216,9 +216,6 @@ def build_and_sample(
         from fast_llm.data.dataset.blended import BlendedDataset

         # Build and sample the datasets.
-        # TODO: Vary the seed?
-        # Add 5 times the standard deviation (of a binomial distribution)
-        # so the probability of sampling more than this amount during blending is negligible.

         sampled_datasets = [
             dataset.build_and_sample(
@@ -230,6 +227,7 @@ def build_and_sample(
                     if self.legacy
                     else math.ceil(weight * sampling.num_samples) + 1
                 ),
+                # TODO: Seed may not be unique for nested blended datasets.
                 config=sampling.config.to_copy({"seed": sampling.config.seed + i * (0 if self.legacy else 697)}),
             ),
         )
```
3 changes: 2 additions & 1 deletion mkdocs.yaml

```diff
@@ -167,7 +167,8 @@ nav:
       - StarCoder 2: success-stories/starcoder-2.md
   - License: license.md
   - Recipes:
-      - Data Preparation: recipes/data-preparation.md
+      - Prepare a dataset: recipes/data-preparation.md
+      - Configure a dataset: recipes/data-configuration.md
       - Train Llama 8B from scratch: recipes/train-llama-8b.md
       - Continue training Llama 8B: recipes/continue-training-llama-8b.md
       - Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
```