Support easy concatenation of datasets #201

Open
1 of 4 tasks
tscholak opened this issue Mar 24, 2025 · 0 comments
Labels
enhancement (New feature or request), need update

Comments

@tscholak
Collaborator

🎯 Goal (What & Why)

To streamline experiments, we need an easy way to concatenate datasets, that is, to sample from two or more datasets with frequencies chosen so that the result is equivalent to sampling from the concatenation of those datasets. Users should not have to specify the frequencies themselves; Fast-LLM should compute them automatically.
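
As a rough illustration of the frequencies in question: sampling uniformly from a concatenation picks dataset i with probability n_i / (n_1 + ... + n_k), where n_i is that dataset's size. The sketch below is not Fast-LLM code. The function name `concatenation_weights`, the dataset names, and the sizes are hypothetical, and whether size should be counted in documents or tokens is an assumption this issue leaves open.

```python
from fractions import Fraction


def concatenation_weights(sizes):
    """Return sampling probabilities proportional to dataset sizes.

    `sizes` maps a dataset name to its size. The unit (documents or
    tokens) is an assumption; it only has to be the same for all datasets.
    """
    if not sizes:
        raise ValueError("need at least one dataset")
    if any(n < 0 for n in sizes.values()):
        raise ValueError("sizes must be non-negative")
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("all datasets are empty")
    # Exact fractions avoid rounding drift when there are many datasets.
    return {name: Fraction(n, total) for name, n in sizes.items()}


# Hypothetical dataset names and sizes, for illustration only.
sizes = {"dataset_a": 3_000, "dataset_b": 1_000}
weights = concatenation_weights(sizes)
```

With these made-up sizes, `dataset_a` would be sampled three times as often as `dataset_b`, which matches drawing uniformly from the 4,000 combined items.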

🚀 Execution Plan

Step 1: What is the smallest working version?

Support this only for shallow datasets, i.e. datasets that consist solely of a collection of bin/idx files.

Step 2: What additional optimizations are possible (but optional)?

Support this for any hierarchical definition of datasets.
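
One way to think about the hierarchical case is to flatten the nested definition down to its leaf datasets and weight each leaf by its share of the total size. The sketch below assumes a hierarchy can be represented as nested dictionaries whose leaves are integer sizes; that representation, the function names, and the example tree are all hypothetical and do not reflect Fast-LLM's actual dataset configuration format.

```python
def leaf_sizes(node, prefix=""):
    """Flatten a nested dataset definition into {path: size}.

    A node is either an int (the size of a leaf dataset) or a dict
    mapping child names to nodes. This layout is an assumption.
    """
    if isinstance(node, int):
        return {prefix: node}
    flat = {}
    for name, child in node.items():
        path = f"{prefix}/{name}" if prefix else name
        flat.update(leaf_sizes(child, path))
    return flat


def hierarchical_weights(tree):
    """Weight every leaf by its share of the total size of all leaves."""
    flat = leaf_sizes(tree)
    total = sum(flat.values())
    if total <= 0:
        raise ValueError("the hierarchy contains no data")
    return {path: n / total for path, n in flat.items()}


# Hypothetical hierarchy: group and dataset names are made up.
tree = {"web": {"en": 600, "fr": 200}, "code": 200}
weights = hierarchical_weights(tree)
```

Because the weights are computed over leaves, the grouping itself does not change the result: nesting `en` and `fr` under `web` gives the same frequencies as listing all three datasets side by side.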

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
@tscholak added the enhancement and need update labels on Mar 24, 2025