
Dataset from file #146

Merged · 11 commits into main from dataset_from_file on Mar 4, 2025
Conversation

@jlamypoirier (Collaborator) commented on Feb 15, 2025

✨ Description

Fixes #136

Replaces both the json and concatenated memmap datasets with a generic (fast-llm) config read from a file. The dataset preparator now generates such a config file: a blended dataset with sampling probabilities proportional to file sizes, same as the json dataset (a sketch of the generated file follows these notes). Notes:

  • Dropped the json format from the preparator, but existing json datasets still work.
  • Deprecated the concatenated memmap format; keeping it for now since @oleksost is using it, but we should try to remove it soon.
  • Added a way to split the dataset in the preparator. This was necessary to allow train/valid/test splits (as is done in the tests).
  • Updated the tutorial to use the new format.
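
For reference, a minimal sketch of what the generated fast_llm_config.yaml might look like, inferred from the preparator code quoted later in this thread; the shard prefixes and token counts are illustrative assumptions, not real output:

```yaml
# Hypothetical example of a generated fast_llm_config.yaml.
# Shard prefixes and token counts below are made up for illustration.
type: blended
datasets:
  - type: memmap
    path: shard_0_0
  - type: memmap
    path: shard_1_0
weights:  # blending probabilities are proportional to these token counts
  - 1000000
  - 3000000
```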

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@tscholak (Collaborator) left a comment

Hi @jlamypoirier, thanks.
I'd like for @RaymondLi0 and @oleksost to have a look at these changes before we merge them.
Please also make sure the tests pass.

Comment on lines 269 to 274
```python
# Build the blended dataset config, weighting each dataset by its token
# count, and write it to the output directory.
dataset_config = {
    "type": "blended",
    "datasets": [dataset_dict for dataset_dict in dataset_dicts],
    "weights": [dataset_dict["num_tokens"] for dataset_dict in dataset_dicts],
}
yaml.safe_dump(dataset_config, (self._config.output_path / "fast_llm_config.yaml").open("w"))
```
Review comment: this is the same code as above.

@tscholak (Collaborator)

> I dropped the concatenated memmap format, assuming no one is using it yet.

In fact @oleksost is using it, so we can't just remove it.

@jlamypoirier (Collaborator, Author)

@tscholak @RaymondLi0 @oleksost PR is ready, can you please review?

@oleksost (Contributor)

Regarding deprecating concatenated_memmap, I had a question here: #156 (comment)

@tscholak mentioned this pull request on Feb 26, 2025.

@jlamypoirier (Collaborator, Author)

> Which dataset wrappers are currently available? Here we only document blended, file, and sampled. Are concatenated_memmap (to be deprecated) and memmap still available? Would it make sense to document each of the available wrappers?

They are still available, but not expected to be used directly by normal users. memmap should be accessed through the yaml file, and concatenated_memmap is to be removed once we stop using it.

> I think what's missing here is an example that shows how to turn a bunch of idx and bin files into a yaml or json dataset config file.

The yaml file is not meant to be created manually; it is created during dataset preparation, so that case is already covered. That leaves already-prepared datasets that don't have a yaml file: we can either re-prepare those, or make a quick script that creates the config for an already prepared dataset (a rough sketch follows below). What do you think?
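
For the quick-script option, a rough sketch of what it might look like, assuming the blended config layout from this PR and that per-shard token counts can be read from the existing fast_llm_dataset.json; the "datasets", "prefix", and "num_tokens" keys are assumptions about that file's layout, not a confirmed schema:

```python
# Hypothetical sketch: generate fast_llm_config.yaml for an already-prepared
# dataset directory. The fast_llm_dataset.json keys ("datasets", "prefix",
# "num_tokens") are assumptions, not the confirmed Fast-LLM schema.
import json
import pathlib

import yaml


def write_config(directory: str) -> None:
    path = pathlib.Path(directory)
    metadata = json.loads((path / "fast_llm_dataset.json").read_text())
    dataset_config = {
        "type": "blended",
        # One memmap entry per shard prefix (e.g. "shard_0_0" for
        # shard_0_0.bin / shard_0_0.idx).
        "datasets": [
            {"type": "memmap", "path": entry["prefix"]}
            for entry in metadata["datasets"]
        ],
        # Blend with probabilities proportional to token counts, matching
        # what the preparator now writes.
        "weights": [entry["num_tokens"] for entry in metadata["datasets"]],
    }
    with (path / "fast_llm_config.yaml").open("w") as f:
        yaml.safe_dump(dataset_config, f)


if __name__ == "__main__":
    write_config("path/to/directory")
```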

> If I have a directory with a bunch of shards (shard_*.bin and shard_*.idx files) and the corresponding fast_llm_dataset.json (<- generated by the prepare command) file in it, is it equivalent to use:
>
> ```yaml
> type: file
> path: path/to/directory/fast_llm_dataset.json
> ```
>
> and
>
> ```yaml
> type: memmap
> path: path/to/directory/shard
> ```
>
> and
>
> ```yaml
> type: concatenated_memmap
> path: path/to/directory/
> ```

These are all different. file uses the yaml file created in the prepare step; it won't work with older json files. Both file and legacy (json) are a probabilistic sampling of all shards. memmap is only a single shard (and not to be used in general), and concatenated_memmap is a single dataset made of all shards (and is deprecated).
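
In other words, for a freshly prepared dataset the recommended reference is the generated yaml, not the legacy json. A minimal sketch, assuming the fast_llm_config.yaml written by the preparator sits next to the shards:

```yaml
type: file
path: path/to/directory/fast_llm_config.yaml
```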

@jlamypoirier (Collaborator, Author)

@tscholak @oleksost can we merge this?

@tscholak (Collaborator) left a comment

LGTM

@jlamypoirier merged commit 6df65c1 into main on Mar 4, 2025 · 2 checks passed
@jlamypoirier deleted the dataset_from_file branch on March 4, 2025 at 21:36
Development

Successfully merging this pull request may close these issues:

[feat/bug] Concatenated dataset takes too much resources (#136)