
Add support for seed in DataCollatorForLanguageModeling #36497

Open · wants to merge 1 commit into base: main

Conversation

capemox

@capemox capemox commented Mar 2, 2025

What does this PR do?

This PR adds support for setting a seed in the class DataCollatorForLanguageModeling. This helps with reproducibility when generating masks for masked language modeling (MLM). The feature was approved by @Rocketknight1 in #36357.

Currently, reproducibility can be achieved with transformers.set_seed(). However, that function seeds the global RNGs for PyTorch, NumPy, etc., so setting it can impact pseudo-random operations outside the scope of the collator, such as model parameter initialization. Conversely, changes in the script outside the collator can alter the masking.
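The distinction can be demonstrated with plain NumPy (the collator also supports NumPy return tensors); this is a minimal sketch independent of transformers:

```python
import numpy as np

# A dedicated Generator isolates the collator's randomness from the
# rest of the script: unrelated draws elsewhere do not shift its stream.
rng = np.random.default_rng(42)
first = rng.random(3)

np.random.random(100)  # unrelated randomness elsewhere in the script

rng = np.random.default_rng(42)
second = rng.random(3)
assert np.array_equal(first, second)  # the scoped stream is reproducible

# With only the global seed, the same draw depends on everything
# executed in between.
np.random.seed(42)
a = np.random.random(3)
np.random.seed(42)
np.random.random(100)  # intervening consumption of the global stream
b = np.random.random(3)
assert not np.array_equal(a, b)
```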

Instead, it is preferable to create generator objects that can be passed to individual functions; this is also considered good practice. This PR:

  • Allows users to pass a seed parameter to DataCollatorForLanguageModeling
  • Instantiates a generator object with the seed depending on the return_tensors parameter
  • Uses the generator object for pseudo-random operations within the class
  • Falls back to the default implementation when no seed is passed, preserving backwards compatibility

The generator object is scoped to the collator class, so it won't affect pseudo-random functions outside the class and vice-versa.
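As a sketch of the intended behavior (the helper name and signature below are illustrative, not the PR's actual implementation): drawing the MLM mask from a scoped torch.Generator makes it reproducible without touching global state.

```python
import torch

def mlm_mask(shape, mlm_probability=0.15, generator=None):
    # Draw a Bernoulli mask from the scoped generator, mirroring how
    # DataCollatorForLanguageModeling selects tokens to mask.
    # Hypothetical helper for illustration only.
    probs = torch.full(shape, mlm_probability)
    return torch.bernoulli(probs, generator=generator).bool()

# Two collators seeded identically produce identical masks,
# regardless of the global RNG state.
g1 = torch.Generator().manual_seed(0)
g2 = torch.Generator().manual_seed(0)
torch.manual_seed(999)  # global state differs; the scoped draw does not care
mask1 = mlm_mask((2, 8), generator=g1)
mask2 = mlm_mask((2, 8), generator=g2)
assert torch.equal(mask1, mask2)
```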

One important factor to consider is running the collator with multiple workers, as PyTorch's DataLoader does. PyTorch's documentation covers this: each worker should get a distinct seed, shared_seed + worker_id, because otherwise the workers' RNG state would be cloned and every worker would mask the input in exactly the same way, which is undesirable. Crucially, PyTorch's DataLoader lets code running inside a worker query that worker's id (needed to set the per-worker seed). Because of this constraint, this PR only supports multi-worker scenarios with PyTorch's DataLoader. With a seed set, if the code detects that the collator is running in a multi-processing scenario and the worker information is unavailable, an error is thrown.

The algorithm for creating the generator object is:

# use multiprocessing to get current process' name
if current_process == "MainProcess":
    # Collator running in the main Python process
    # No issues here
    generator = get_generator(seed)
else:
    # Multiprocess scenario. Throw an error if
    # we cannot access PyTorch's worker info.
    worker_info = torch.utils.data.get_worker_info()
    if worker_info is None:
        raise ValueError
    
    generator = get_generator(seed + worker_info.id)
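A runnable version of that logic might look as follows; create_generator, the PyTorch-only branch, and the error message are illustrative, not the PR's exact code:

```python
import multiprocessing

import torch
import torch.utils.data

def create_generator(seed: int) -> torch.Generator:
    # Decide the effective seed based on where the collator is running.
    if multiprocessing.current_process().name == "MainProcess":
        # Main Python process: use the seed as-is.
        effective_seed = seed
    else:
        # Worker process: a per-worker seed requires PyTorch's worker
        # info; other multiprocessing frameworks are not supported.
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            raise ValueError(
                "A seed was set, but worker info is unavailable; "
                "multi-worker use requires PyTorch's DataLoader."
            )
        effective_seed = seed + worker_info.id
    return torch.Generator().manual_seed(effective_seed)

gen = create_generator(123)  # in the main process, seeds with 123
```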

Tests have also been written to verify this behaviour in tests/trainer/test_data_collator.py.

These changes were developed on Python 3.12.8. Dependencies were installed with pip install -e ".[dev]", along with:

  • torch==2.6.0
  • tensorflow==2.18.0
  • tf-keras==2.18.0

Fixes #36357

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 should be the right person to review this.

@github-actions github-actions bot marked this pull request as draft March 2, 2025 16:12

github-actions bot commented Mar 2, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@capemox capemox marked this pull request as ready for review March 2, 2025 16:33
@capemox
Author

capemox commented Mar 3, 2025

@Rocketknight1 this won't pass a few data collator tests, but I submitted a PR to fix these (#36457)

Successfully merging this pull request may close these issues.

Allow setting a seed for DataCollatorForLanguageModeling