Add support for seed in `DataCollatorForLanguageModeling` #36497

capemox · 2025-03-02T16:12:33Z

What does this PR do?

This PR adds support for setting a seed in the `DataCollatorForLanguageModeling` class, which makes mask generation for masked language modeling (MLM) reproducible. The underlying issue (#36357) was approved by @Rocketknight1.

Currently, reproducibility can be ensured by calling transformers.set_seed(). However, that function seeds the global RNGs for PyTorch, NumPy, etc. As a result, setting a global seed can affect other pseudo-random functions outside the scope of the collator, such as model parameter initialization. Conversely, changes to the script outside the collator can affect the masking.
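
To illustrate the coupling described above, here is a minimal sketch that uses Python's standard `random` module as a stand-in for the PyTorch/NumPy global RNGs. The helper `draw_mask` and the constant `MLM_PROBABILITY` are hypothetical names chosen for this example and are not part of transformers.

```python
import random

MLM_PROBABILITY = 0.15  # assumed masking rate, for illustration only


def draw_mask(length):
    """Draw a boolean mask from the *global* RNG."""
    return [random.random() < MLM_PROBABILITY for _ in range(length)]


# Run 1: seed globally, then mask.
random.seed(0)
mask_first_run = draw_mask(1000)

# Run 2: same seed, but an unrelated call consumes one value from the
# global stream before masking (think of extra parameter initialization).
random.seed(0)
_unrelated = random.random()
mask_second_run = draw_mask(1000)

# Same seed, yet the two masks differ because the stream was shifted.
print(mask_first_run == mask_second_run)
```

The seed is identical in both runs; only the number of draws made before masking changed, which is exactly the kind of unrelated script change that a global seed cannot guard against.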

Instead, it is preferable to create generator objects that can be passed to the functions that need them, which is also considered good practice. This PR:

- Allows users to pass a seed parameter to DataCollatorForLanguageModeling.
- Instantiates a generator object from the seed, depending on the return_tensors parameter.
- Uses the generator object for the pseudo-random functions within the class.
- Falls back to the default implementation when the user does not pass a seed, for backwards compatibility.

The generator object is scoped to the collator class, so it does not affect pseudo-random functions outside the class, and vice versa.
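
The scoping can be sketched as follows, again with the standard `random` module standing in for the framework-specific generators. `SeededMasker` is a hypothetical toy class written for this example; it is not the collator and does not mirror its real signature.

```python
import random


class SeededMasker:
    """Toy stand-in for a collator that owns its own generator."""

    def __init__(self, mlm_probability=0.15, seed=None):
        self.mlm_probability = mlm_probability
        # With a seed, use a private generator; without one, fall back
        # to the module-level (global) RNG, as before.
        self.rng = random.Random(seed) if seed is not None else random

    def mask(self, length):
        return [self.rng.random() < self.mlm_probability for _ in range(length)]


# The private generator is unaffected by global RNG activity.
first = SeededMasker(seed=42).mask(100)
random.seed(123)
random.random()
second = SeededMasker(seed=42).mask(100)
print(first == second)

# And masking with a private generator does not advance the global RNG.
random.seed(7)
expected_next = random.random()
random.seed(7)
SeededMasker(seed=42).mask(100)
actual_next = random.random()
print(expected_next == actual_next)
```

Both comparisons hold because a `random.Random` instance keeps its own state, separate from the module-level functions.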

One important factor to consider is the use of multiple workers for the collator function, as PyTorch's DataLoader does. PyTorch's documentation covers this case: each worker is given a different seed, computed as shared_seed + worker_id. This is necessary because the seed would otherwise be cloned into every worker, so each worker would mask its inputs in exactly the same manner, which is undesirable. A critical feature of PyTorch's DataLoader is that the worker's id can be accessed from within the worker, which is what makes setting the worker seed possible. Because of this constraint, this PR only supports multi-worker scenarios with PyTorch's DataLoader. If a seed is set and the code detects that the collator is running in a multiprocessing scenario where the worker information is unavailable, an error is raised.

The algorithm for creating the generator object is:

import multiprocessing

import torch

# Use multiprocessing to get the current process's name.
current_process = multiprocessing.current_process().name
if current_process == "MainProcess":
    # Collator is running in the main Python process.
    # No issues here.
    generator = get_generator(seed)
else:
    # Multi-process scenario. Raise an error if
    # we cannot access PyTorch's worker info.
    worker_info = torch.utils.data.get_worker_info()
    if worker_info is None:
        raise ValueError

    # Offset the seed by the worker id so that workers mask differently.
    generator = get_generator(seed + worker_info.id)
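
The branching above can be exercised without a DataLoader by passing the process name and worker id in as plain arguments. The sketch below does that, with the standard `random` module standing in for the real generator. `make_generator` and its parameters are hypothetical names for this example; in a real run the process name would come from `multiprocessing.current_process().name` and the worker id from `torch.utils.data.get_worker_info()`.

```python
import random


def make_generator(seed, process_name, worker_id=None):
    """Pick a generator seed based on where the collator is running."""
    if process_name == "MainProcess":
        # Single-process case: use the seed unchanged.
        return random.Random(seed)
    if worker_id is None:
        # In a worker process but with no worker id: refuse to guess.
        raise ValueError("seed is set, but no worker id is available")
    # Worker case: offset the seed so each worker gets its own stream.
    return random.Random(seed + worker_id)


main_draw = make_generator(42, "MainProcess").random()
worker0_draw = make_generator(42, "Worker-1", worker_id=0).random()
worker1_draw = make_generator(42, "Worker-2", worker_id=1).random()

# Worker 0 uses seed + 0, so it matches the main-process stream;
# worker 1 uses seed + 1 and therefore draws different values.
print(main_draw == worker0_draw, worker0_draw != worker1_draw)
```

Calling `make_generator(42, "Worker-1")` without a worker id raises `ValueError`, mirroring the error case described above.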

Tests verifying this behaviour have been added to tests/trainer/test_data_collator.py.

These changes were developed on Python 3.12.8. Dependencies were installed with pip install -e ".[dev]", along with:

torch==2.6.0
tensorflow==2.18.0
tf-keras==2.18.0

Fixes #36357

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 should be the right person to review this.

github-actions · 2025-03-02T16:12:45Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

capemox · 2025-03-03T08:14:43Z

@Rocketknight1 this won't pass a few data collator tests, but I submitted a PR to fix these (#36457)

Add support for seed in DataCollatorForLanguageModeling. Also wro…

f5fb8dc

…te tests for verifying behaviour.

github-actions bot marked this pull request as draft March 2, 2025 16:12

capemox marked this pull request as ready for review March 2, 2025 16:33
