🎯 Goal (What & Why)
To streamline experiments, we need an easy way to concatenate datasets. Here, concatenation means sampling from two or more datasets with frequencies chosen so that the result is equivalent to sampling from the concatenation of those datasets. Users should not have to specify these frequencies themselves; Fast-LLM should compute them automatically.
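As a rough illustration of the automatic computation: sampling uniformly from a concatenation is equivalent to picking dataset i with probability n_i / (n_1 + ... + n_k), where n_i is that dataset's size. The sketch below assumes size is measured in samples (it could equally be documents or tokens), and the function name `concatenation_probabilities` is hypothetical, not an existing Fast-LLM API.

```python
from typing import Sequence


def concatenation_probabilities(sizes: Sequence[int]) -> list[float]:
    """Return per-dataset sampling probabilities proportional to dataset size.

    `sizes` holds the number of samples in each dataset (an assumption of
    this sketch; the real unit could be documents or tokens).
    """
    if not sizes:
        raise ValueError("At least one dataset is required.")
    if any(size < 0 for size in sizes):
        raise ValueError("Dataset sizes must be non-negative.")
    total = sum(sizes)
    if total == 0:
        raise ValueError("Cannot sample from datasets that are all empty.")
    # Each dataset is drawn with probability equal to its share of the total.
    return [size / total for size in sizes]


if __name__ == "__main__":
    # Three datasets with 100, 300 and 600 samples.
    print(concatenation_probabilities([100, 300, 600]))
```

With these probabilities, a dataset ten times larger than another is drawn ten times as often, which is the behavior a user would get from physically concatenating the files.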
🚀 Execution Plan
Step 1: What is the smallest working version?
Support this only for shallow datasets, that is, datasets that consist solely of a collection of bin/idx files.
Step 2: What additional optimizations are possible (but optional)?
Support this for any hierarchical definition of datasets.
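One possible way to think about the hierarchical case is to flatten the tree: collect the size of every leaf dataset, then normalize by the overall total. The sketch below is only an illustration under stated assumptions. The classes `LeafDataset` and `ConcatenatedDataset` and the function `leaf_probabilities` are hypothetical and do not reflect Fast-LLM's actual dataset configuration, and sizes are assumed to be sample counts.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class LeafDataset:
    """Hypothetical leaf, e.g. one bin/idx pair with a known sample count."""
    name: str
    num_samples: int


@dataclass
class ConcatenatedDataset:
    """Hypothetical inner node whose children are leaves or nested nodes."""
    children: list


Node = Union[LeafDataset, ConcatenatedDataset]


def collect_leaf_sizes(node: Node, sizes: Optional[dict] = None) -> dict:
    """Walk the tree and accumulate the sample count of every leaf by name."""
    if sizes is None:
        sizes = {}
    if isinstance(node, LeafDataset):
        sizes[node.name] = sizes.get(node.name, 0) + node.num_samples
    else:
        for child in node.children:
            collect_leaf_sizes(child, sizes)
    return sizes


def leaf_probabilities(node: Node) -> dict:
    """Return the probability of drawing from each leaf, proportional to size."""
    sizes = collect_leaf_sizes(node)
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("The dataset tree contains no samples.")
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    tree = ConcatenatedDataset(
        children=[
            LeafDataset("a", 100),
            ConcatenatedDataset(
                children=[LeafDataset("b", 300), LeafDataset("c", 600)]
            ),
        ]
    )
    print(leaf_probabilities(tree))
```

Because the probabilities depend only on leaf sizes, the nesting depth does not change the result: the tree above yields the same distribution as a flat concatenation of `a`, `b` and `c`.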
📌 Acceptance Criteria (Must-Haves for Completion)
The feature must be functional and tested.
The implementation must be documented in practical terms.
The PR must include a performance/impact summary.
No refactors unless directly necessary for feature completion.
🛠️ Project Management
Assign the project to the Fast-LLM project.
Set the Estimate field (in days) in the GitHub project.
Use the Size field to categorize the PR size (Small/Medium/Large).
Assign an owner when opening the issue.