Dataset configuration examples #156

Merged on Feb 26, 2025 (2 commits)

jlamypoirier
Collaborator

✨ Description

Fix: #150

Add some examples to demonstrate the dataset mechanism

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

type: blended
datasets:
  - type: file
    path: path/to/dataset_0.yaml
Contributor

Shouldn't it be path/to/dataset_0/fast_llm_config_training.yaml? This way it's clear that this should point to the YAML files created in the previous prepare step.

Collaborator Author

Seems like it would be a bit too verbose?

datasets:
  Training:
    type: file
    path: path/to/dataset.yaml
Contributor

Same as before: for clarity, it may be better to use path/to/dataset/fast_llm_config_training.yaml?

- type: file
  path: path/to/dataset_1.yaml
- type: file
  path: path/to/dataset_2.yaml
Contributor

oleksost commented Feb 20, 2025

Question: In the past, we created data mixes as a separate file using fml-ops functionality (e.g. /mnt/datasets/dolmafw70_fw30_merged.json). Can a dataset with type file still point to one of these JSON files?

Collaborator Author

No, the JSON files are deprecated and only available with legacy dataset configs.

Contributor

oleksost commented Feb 20, 2025

Which dataset wrappers are currently available? Here we only document blended, file and sampled. Are concatenated_memmap (to be deprecated) and memmap still available? Would it make sense to document each of the available wrappers?

Contributor

oleksost commented Feb 20, 2025

If I have a directory containing a bunch of shards (shard_*.bin and shard_*.idx files) and the corresponding fast_llm_dataset.json file generated by the prepare command, are the following equivalent?

type: file
path: path/to/directory/fast_llm_dataset.json

and

type: memmap
path: path/to/directory/shard

and

type: concatenated_memmap
path: path/to/directory/

oleksost mentioned this pull request on Feb 20, 2025
Collaborator

tscholak commented Feb 24, 2025

I think what's missing here is an example that shows how to turn a bunch of idx and bin files into a yaml or json dataset config file.
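
A minimal sketch of what such an example could look like, not the project's actual tooling. It assumes each shard is a `<prefix>.bin` file with a matching `<prefix>.idx` file, and that a `memmap` entry takes the shared prefix (without extension) as its `path`, as in the `path/to/directory/shard` question above. The function name `build_blended_config` is hypothetical, and whether `blended` needs further fields (weights, for instance) is not settled in this thread.

```python
from pathlib import Path


def build_blended_config(directory: str, pattern: str = "shard_*.bin") -> str:
    """Return YAML text for a blended dataset over the shards in `directory`.

    Assumptions (not confirmed by this thread): every shard is a
    `<prefix>.bin` file with a matching `<prefix>.idx`, and a `memmap`
    entry takes the shared prefix, without extension, as its `path`.
    """
    prefixes = []
    for bin_file in sorted(Path(directory).glob(pattern)):
        # Keep only shards that also have an index file next to them.
        if bin_file.with_suffix(".idx").exists():
            prefixes.append(bin_file.with_suffix(""))
    if not prefixes:
        raise FileNotFoundError(f"no .bin/.idx shard pairs found in {directory}")

    # Emit plain YAML by hand; this assumes the paths contain no characters
    # that would need quoting in YAML (such as ': ' or ' #').
    lines = ["type: blended", "datasets:"]
    for prefix in prefixes:
        lines.append("  - type: memmap")
        lines.append(f"    path: {prefix.as_posix()}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to, say, path/to/dataset.yaml would give a file that a `type: file` entry could then point to, assuming the schema above matches what the loader expects.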

Collaborator

Let's merge this into #146 and then push it over the finish line.

tscholak merged commit a7d55d0 into dataset_from_file on Feb 26, 2025
tscholak deleted the dataset_examples branch on February 26, 2025, 13:02
3 participants