
Multi-Dataset Validation (LM-Loss/Perplexity) #178

Merged: jlamypoirier merged 22 commits into main from denis/multi_validation on Mar 28, 2025

Conversation

@bigximik (Contributor) commented Mar 7, 2025

✨ Description

Closes #65

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Added support for multiple validation datasets, with loss and the other existing metrics tracked per dataset. This introduces a breaking change: the training.validation config field is now a dictionary mapping validation dataset names to their application parameters (see the sketch below).
  2. Updated the existing tests to accommodate the new config format.
  3. Updated and extended the documentation to describe the new config format.
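For illustration, a minimal before/after sketch of the breaking change (the flat pre-PR validation layout is an assumption; the interval and iterations values are placeholders taken from the example further down):

training:
  validation:  # before (assumed): a single validation run, configured directly
    interval: 2
    iterations: 1

training:
  validation:  # after: a dictionary of validation dataset names and their parameters
    validation_dataset_name1:
      interval: 2
      iterations: 1
    validation_dataset_name2:
      interval: 2
      iterations: 1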

βœ… Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • πŸ“œ I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • πŸŽ‰ The functionality is complete, and I have tested the changes.
  • πŸ“ I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • πŸ§ͺ I have added or updated tests to cover my changes.
  • βœ”οΈ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • πŸ‹οΈ I have tested the changes on realistic training workloads, if applicable.

@bigximik (Contributor, Author) commented Mar 7, 2025

@jlamypoirier What do you think of this approach?

@jlamypoirier (Collaborator) left a comment

@bigximik That's really close to the approach I had in mind. I would suggest moving some of the changes from the data side to the trainer, though; it would simplify things, and we'll want it anyway in the next step of #65.

@bigximik (Contributor, Author) commented Mar 19, 2025

I have implemented it so that the config will look something like this:

data:
  datasets:
    Training: # training dataset, hardcoded name
      type: memmap
      path: some_path1
    Test:  # test dataset, hardcoded name
      type: memmap
      path: some_path2
    validation_dataset_name1:  # validation dataset, any name
      type: memmap
      path: some_path3
    validation_dataset_name2:  # validation dataset, any name
      type: memmap
      path: some_path4

training:
  training_iters: 2
  test_iters: 2
  validation:
    validation_dataset_name1:
      interval: 2
      iterations: 1
    validation_dataset_name2:
      interval: 2
      iterations: 1

@jlamypoirier, have I got it right?

@bigximik (Contributor, Author) commented:

Maybe add training_dataset_name and test_dataset_name to TrainingConfig, with default values Training and Test, so people can override them if they like?

@jlamypoirier (Collaborator) commented:

> I have implemented it so that the config will look something like this: [...]
> @jlamypoirier, have I got it right?

Yes.

> Maybe add training_dataset_name and test_dataset_name to TrainingConfig, with default values Training and Test, so people can override them if they like?

I don't think it's worth it, but feel free to fix the capitalization if you find a way to do it without breaking backward compatibility.

@bigximik (Contributor, Author) commented Mar 24, 2025

In wandb, metrics for the different validation datasets look like this:

[Screenshot: wandb charts showing per-dataset validation metrics]

@bigximik bigximik marked this pull request as ready for review March 24, 2025 12:38
@bigximik bigximik requested a review from jlamypoirier March 24, 2025 12:38
@bigximik (Contributor, Author) commented:

I have changed validation to evaluations and ValidationConfig to EvaluationConfig, but kept the phase as Validation. So, in the documentation, these are still referred to as validation datasets: they must be defined in data.datasets, while their usage is specified in training.evaluations (see the sketch below).
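Under the new naming, the earlier example would look something like this (a sketch derived from the rename described above; the dataset names remain placeholders):

data:
  datasets:
    Training:  # training dataset, hardcoded name
      type: memmap
      path: some_path1
    validation_dataset_name1:  # validation dataset, defined in data.datasets
      type: memmap
      path: some_path3

training:
  training_iters: 2
  evaluations:  # formerly training.validation
    validation_dataset_name1:
      interval: 2
      iterations: 1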

@jlamypoirier (Collaborator) left a comment

Almost ready to merge; some minor issues and suggestions.

@tscholak (Collaborator) commented:

Guys, can we get to a conclusion here, please? I'd like us to merge this by tomorrow end of day at the latest. Thanks!

@jlamypoirier (Collaborator) left a comment

@bigximik I've adjusted the code; ready to merge if it works for you.

@tscholak (Collaborator) left a comment

Thank you both, looks good to me!

@bigximik (Contributor, Author) commented Mar 28, 2025

I have updated the instruction fine-tuning documentation that came in from the merge, and created a separate issue (#213) for moving the check of the used dataset definitions into _validate in TrainerConfig.

@jlamypoirier jlamypoirier merged commit 21182c2 into main Mar 28, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the denis/multi_validation branch March 28, 2025 21:03
Successfully merging this pull request may close these issues: Multi-Dataset Validation with LM-Loss (#65)