
Commit a7d55d0

Authored by jlamypoirier, tscholak, and oleksost

Dataset configuration examples (#156)

Co-authored-by: Torsten Scholak <[email protected]>
Co-authored-by: Oleksiy Ostapenko <[email protected]>
1 parent 9395a0c commit a7d55d0

File tree

3 files changed: +188 −4 lines

Diff for: docs/recipes/data-configuration.md

+185
@@ -0,0 +1,185 @@
---
title: Configuring Data for Training
---

In this section we show how to configure datasets through a series of examples.

We already saw an example dataset configuration in the [quick-start guide](../quick-start.md), where we prepared a simple dataset, split it into training and validation sub-datasets, and used these to train a small model. This was done by:

1. Defining a dataset preparation configuration.
2. Running `fast-llm prepare` with said configuration. This generated some binary files along with two Fast-LLM configuration files, `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` and `fast-llm-tutorial/dataset/fast_llm_config_validation.yaml`.
3. Defining a Fast-LLM data configuration that uses those datasets:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
        Validation:
          type: file
          path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
    ```

4. Running `fast-llm train` with said configuration.

In this section we are interested in generalizing step 3. For more details on steps 1 and 2, please refer to the quick-start guide or the [dataset preparation recipe](data-preparation.md).
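
For orientation, every `data` section in the examples below is a fragment of the full training configuration passed to `fast-llm train`, not a standalone file. A minimal, hedged sketch of that layout (the other top-level sections are only hinted at in comments and follow the quick-start guide):

```yaml
# Sketch only: other top-level sections (model, training, optimizer, ...) are elided.
data:
  datasets:
    Training:
      type: file
      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
    Validation:
      type: file
      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
```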

## Example 1: Blending multiple datasets

In this example, we have three datasets and want to sample from each of them during training with probabilities 0.70, 0.25 and 0.05. For this, we use the `blended` type which takes other datasets as arguments:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
        - type: file
          path: path/to/dataset_2.yaml
      weights: [0.70, 0.25, 0.05]
```

!!! note "Dataset wrappers"
    The `blended` dataset wrapper is one example of the many dataset wrappers available in Fast-LLM. Such wrappers may be nested (almost) arbitrarily to generate the dataset scheme that fits your needs. Fast-LLM will use the `type` argument to dynamically select the appropriate configuration class(es). With some effort you can even create your own wrapper!

## Example 2: Configure shuffling

In this example, we have a large dataset that comes pre-shuffled, so shuffling is unnecessary for the first epoch.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/dataset.yaml
  sampling:
    shuffle: skip_first_epoch
```

## Example 3: Disable shuffling for validation

In this example, we want to disable shuffling entirely, but only for the validation dataset. We can do this with the `sampled` dataset wrapper:

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        shuffle: disabled
```

!!! note "More about sampling configuration"
    Sampling parameters may be defined globally through the data configuration (example 2), through dataset wrapper(s) (examples 3, 4), or both (example 5). When a dataset's sampling is configured through both methods (or through multiple nested wrappers), the innermost wrapper overrides the data configuration (or the next-to-innermost wrapper), but only for the fields it defines explicitly.
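
To make the precedence concrete, here is a hedged sketch (paths and the seed value are illustrative): the global `data.sampling` section sets a seed and shuffling for every dataset, while the `sampled` wrapper around the validation dataset overrides only `shuffle`, so the validation dataset still inherits the global seed.

```yaml
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset.yaml
    Validation:
      type: sampled
      dataset:
        type: file
        path: path/to/validation_dataset.yaml
      sampling:
        # Overrides only this field; the seed still comes from `data.sampling` below.
        shuffle: disabled
  sampling:
    seed: 7890
    shuffle: skip_first_epoch
```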

## Example 4: Set sampling seed for individual datasets

In this example, we have a blend of datasets as in example 1, but we wish to set the seed for each dataset individually for reproducibility reasons. For this, we wrap each dataset with the `sampled` wrapper and set the `seed` field of its `sampling` section:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_0.yaml
          sampling:
            seed: 1234
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_1.yaml
          sampling:
            seed: 2345
        - type: sampled
          dataset:
            type: file
            path: path/to/dataset_2.yaml
          sampling:
            seed: 3456
      weights: [0.70, 0.25, 0.05]
```

!!! note "Default seed"
    In the absence of an explicit seed, Fast-LLM uses a default seed (the default of `data.sampling`) instead, and applies seed shifts to ensure different seeds for each phase and for the various blended datasets.
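
If you want a different default seed without wrapping every dataset, the note above implies it can be set once at the `data.sampling` level; a hedged sketch (the seed value and paths are illustrative), with blended datasets still receiving their per-dataset shifts on top of this default:

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: file
          path: path/to/dataset_0.yaml
        - type: file
          path: path/to/dataset_1.yaml
      weights: [0.80, 0.20]
  sampling:
    # Assumed: this replaces the built-in default seed for all datasets.
    seed: 4321
```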

## Example 5: Advanced scenario

In this example, we combine everything we learned so far to create a complex scenario, where:

* The training dataset is a blend of two datasets, one of them being itself a blend of three datasets.
* All datasets except for one come pre-shuffled, so we can skip shuffling for the first epoch.
* We want to set the seed explicitly for the validation and innermost blended datasets, but keep the default seed for the others.

```yaml
data:
  datasets:
    Training:
      type: blended
      datasets:
        - type: sampled
          dataset:
            type: blended
            datasets:
              - type: file
                # Seed = 1234
                path: path/to/dataset_0.yaml
              - type: file
                # Seed = 1234 + blend_shift, shuffle = skip_first_epoch
                path: path/to/dataset_1.yaml
              - type: sampled
                dataset:
                  type: file
                  # Seed = 1234 + 2 * blend_shift, shuffle = epoch
                  path: path/to/dataset_2.yaml
                sampling:
                  # Shuffle each epoch independently (default shuffling)
                  shuffle: epoch
          sampling:
            seed: 1234
        - type: file
          # Seed = default + train_shift + 2 * blend_shift, shuffle = skip_first_epoch
          path: path/to/dataset_3.yaml
      weights: [0.75, 0.25]
    Validation:
      type: sampled
      dataset:
        type: file
        # Seed = 2345, shuffle = skip_first_epoch
        path: path/to/validation_dataset.yaml
      sampling:
        seed: 2345
  sampling:
    shuffle: skip_first_epoch
```

!!! note "Configure from file"
    If a dataset configuration is especially complex and makes the training configuration excessively big, or is reused across many experiments, you may want to save it to a separate yaml file and refer to it in the config using a `file` dataset. This can be used to reduce the present example to:

    ```yaml
    data:
      datasets:
        Training:
          type: file
          path: path/to/training_dataset_config.yaml
        Validation:
          type: file
          path: path/to/validation_dataset_config.yaml
      sampling:
        shuffle: skip_first_epoch
    ```

    In fact, all the elementary `file` datasets we've been using so far are of this format, and consist of more elementary `memmap` datasets, optionally wrapped with `blended` and/or `slice` wrappers.
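
To make that last point concrete, here is a hedged sketch of what a generated file such as `fast-llm-tutorial/dataset/fast_llm_config_training.yaml` might look like, assuming the prepared data was split into a training slice of a single `memmap` shard. The field names (`dataset`, `begin`, `end`) and the shard path are illustrative assumptions, not copied from an actual generated file:

```yaml
# Illustrative sketch only: a `slice` wrapper around an elementary `memmap` dataset.
type: slice
dataset:
  type: memmap
  path: fast-llm-tutorial/dataset/shard_0_0  # assumed shard name
begin: 0.0   # assumed: keep the first 99% of documents for training
end: 0.99
```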

Diff for: fast_llm/data/dataset/config.py

+1 −3

@@ -216,9 +216,6 @@ def build_and_sample(
         from fast_llm.data.dataset.blended import BlendedDataset

         # Build and sample the datasets.
-        # TODO: Vary the seed?
-        # Add 5 times the standard deviation (of a binomial distribution)
-        # so the probability of sampling more than this amount during blending is negligible.

         sampled_datasets = [
             dataset.build_and_sample(
@@ -230,6 +227,7 @@ def build_and_sample(
                     if self.legacy
                     else math.ceil(weight * sampling.num_samples) + 1
                 ),
+                # TODO: Seed may not be unique for nested blended datasets.
                 config=sampling.config.to_copy({"seed": sampling.config.seed + i * (0 if self.legacy else 697)}),
             ),
         )

Diff for: mkdocs.yaml

+2-1
Original file line numberDiff line numberDiff line change
@@ -167,7 +167,8 @@ nav:
167167
- StarCoder 2: success-stories/starcoder-2.md
168168
- License: license.md
169169
- Recipes:
170-
- Data Preparation: recipes/data-preparation.md
170+
- Prepare a dataset: recipes/data-preparation.md
171+
- Configure a dataset: recipes/data-configuration.md
171172
- Train Llama 8B from scratch: recipes/train-llama-8b.md
172173
- Continue training Llama 8B: recipes/continue-training-llama-8b.md
173174
- Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
