Support easy concatenation of datasets #201

tscholak · 2025-03-24T14:14:44Z

🎯 Goal (What & Why)

To streamline experiments, we need an easy way to concatenate datasets. Here, concatenation means sampling from two or more datasets with frequencies chosen so that the result is equivalent to sampling from the concatenation of those datasets. Users should not have to specify the frequencies themselves; Fast-LLM should compute them automatically.
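As a rough illustration of the intended behaviour (not Fast-LLM's actual API), sampling uniformly from a concatenation is equivalent to picking each dataset with probability proportional to its size. The sketch below assumes each dataset's size is already known, measured in whatever unit sampling operates on (documents or tokens; the issue does not specify). The function name `concatenation_weights` and the dataset names are hypothetical.

```python
def concatenation_weights(sizes: dict[str, int]) -> dict[str, float]:
    """Return per-dataset sampling probabilities proportional to dataset size.

    `sizes` maps a (hypothetical) dataset name to its size. The unit of size
    is an assumption: it should match the unit that sampling operates on.
    """
    if any(size < 0 for size in sizes.values()):
        raise ValueError("dataset sizes must be non-negative")
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("at least one dataset must be non-empty")
    # Each dataset's share of the concatenation is its sampling probability.
    return {name: size / total for name, size in sizes.items()}


# Example: dataset "a" is three times the size of "b",
# so it should be sampled three times as often.
weights = concatenation_weights({"a": 300, "b": 100})
```

With weights computed this way, the user only lists the datasets; no manual frequencies are needed.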

🚀 Execution Plan

Step 1: What is the smallest working version?

Support this only for shallow datasets, i.e. datasets that consist solely of a collection of bin/idx files.

Step 2: What additional optimizations are possible (but optional)?

Support this for any hierarchical definition of datasets.
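One possible way to extend this to hierarchical definitions is to flatten the hierarchy to its leaves and weight each leaf by its share of the total size. The sketch below models a hierarchy as a nested dictionary whose leaves are integer sizes; this representation and the names `leaf_sizes` and `hierarchical_weights` are assumptions made for illustration, not Fast-LLM's configuration format.

```python
from collections.abc import Iterator


def leaf_sizes(node, prefix: str = "") -> Iterator[tuple[str, int]]:
    """Yield (path, size) for every leaf of a nested dict of dataset sizes."""
    if isinstance(node, int):
        # A leaf: an individual dataset with a known size.
        yield prefix, node
        return
    for name, child in node.items():
        path = f"{prefix}/{name}" if prefix else name
        yield from leaf_sizes(child, path)


def hierarchical_weights(tree: dict) -> dict[str, float]:
    """Return per-leaf sampling probabilities equivalent to full concatenation."""
    sizes = dict(leaf_sizes(tree))
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("at least one dataset must be non-empty")
    return {path: size / total for path, size in sizes.items()}


# Example: a nested group "web" next to a flat dataset "code".
weights = hierarchical_weights({"web": {"en": 600, "fr": 200}, "code": 200})
```

Flattening ignores the grouping on purpose: under plain concatenation, the nesting should not change how often any individual dataset is sampled.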

📌 Acceptance Criteria (Must-Haves for Completion)

The feature must be functional and tested.
The implementation must be documented in practical terms.
The PR must include a performance/impact summary.
No refactors unless directly necessary for feature completion.

🛠️ Project Management

Assign the project to the Fast-LLM project.
Set the Estimate field (in days) in the GitHub project.
Use the Size field to categorize the PR size (Small/Medium/Large).
Assign an owner when opening the issue.


tscholak added the "enhancement" (New feature or request) and "need update" labels on Mar 24, 2025
