Dataset configuration examples #156
type: blended
datasets:
  - type: file
    path: path/to/dataset_0.yaml
Shouldn't it be path/to/dataset_0/fast_llm_config_training.yaml? That way it's clear that this should point to the YAML files created in the previous prepare step.
Seems like it would be a bit too verbose?
datasets:
  Training:
    type: file
    path: path/to/dataset.yaml
Same as before: for clarity, it may be better to use path/to/dataset/fast_llm_config_training.yaml?
- type: file
  path: path/to/dataset_1.yaml
- type: file
  path: path/to/dataset_2.yaml
Question: In the past we were creating data mixes as a separate yaml file using fml-ops functionality (e.g. /mnt/datasets/dolmafw70_fw30_merged.json). Can a dataset with type file still point to one of these json files?
No, the json files are deprecated and are only available with legacy dataset configs.
Which dataset wrappers are currently available? Here we only document
If I have a directory with a bunch of shards (
and
and
|
Co-authored-by: Oleksiy Ostapenko <[email protected]>
I think what's missing here is an example that shows how to turn a bunch of idx and bin files into a yaml or json dataset config file.
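One way the request above could be approached is sketched below. This is not taken from the PR: the helper names (`find_shard_prefixes`, `build_blended_config`, `to_yaml`) are hypothetical, and the per-shard entry type `memmap` is an assumption, since this thread only shows the `blended` and `file` types. The sketch pairs each `.idx` file with its `.bin` file and emits a `blended` config in the same shape as the fragments under review. A real config may need further fields that are not shown in this thread.

```python
from pathlib import Path


def find_shard_prefixes(directory):
    """Return sorted path prefixes that have BOTH a .idx and a .bin file."""
    root = Path(directory)
    idx_prefixes = {p.with_suffix("") for p in root.glob("*.idx")}
    bin_prefixes = {p.with_suffix("") for p in root.glob("*.bin")}
    # Keep only complete pairs; a lone .idx or .bin is skipped.
    return sorted(str(p) for p in idx_prefixes & bin_prefixes)


def build_blended_config(prefixes, shard_type="memmap"):
    """Build a blended config dict with one entry per shard prefix.

    NOTE: shard_type="memmap" is an assumption, not confirmed by the thread.
    """
    return {
        "type": "blended",
        "datasets": [{"type": shard_type, "path": prefix} for prefix in prefixes],
    }


def to_yaml(config):
    """Render the config as YAML text without third-party dependencies.

    Assumes plain paths that need no YAML quoting (no colons, '#', etc.).
    """
    lines = [f"type: {config['type']}", "datasets:"]
    for entry in config["datasets"]:
        lines.append(f"  - type: {entry['type']}")
        lines.append(f"    path: {entry['path']}")
    return "\n".join(lines) + "\n"


def write_config(directory, output_path):
    """Scan `directory` for shard pairs and write the config to `output_path`."""
    config = build_blended_config(find_shard_prefixes(directory))
    Path(output_path).write_text(to_yaml(config))
    return config
```

Calling `write_config("path/to/shards", "path/to/dataset.yaml")` would produce a file that a `type: file` entry could then point to, assuming the generated entries match what the loader expects.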
Let's merge this into #146 and then push it over the finish line.
✨ Description
Fix: #150
Add some examples to demonstrate the dataset mechanism
Type of change
Select all that apply: