Dataset from file #146
Hi @jlamypoirier, thanks.
I'd like for @RaymondLi0 and @oleksost to have a look at these changes before we merge them.
Please also make sure the tests pass.
dataset_config = {
    "type": "blended",
    "datasets": [dataset_dict for dataset_dict in dataset_dicts],
    "weights": [dataset_dict["num_tokens"] for dataset_dict in dataset_dicts],
}
yaml.safe_dump(dataset_config, (self._config.output_path / "fast_llm_config.yaml").open("w"))
this is the same code as above
In fact @oleksost is using it, so we can't just remove it.
@tscholak @RaymondLi0 @oleksost PR is ready, can you please review?
Regarding deprecating
Co-authored-by: Torsten Scholak <[email protected]>
Co-authored-by: Oleksiy Ostapenko <[email protected]>
They are still available but not expected to be used directly by normal users.
The YAML file is not meant to be created manually; it is created during dataset preparation, so that case is already covered. What remains is the case of already prepared datasets that don't have a YAML file: we can either re-prepare those or write a quick script that creates it for an already prepared dataset. What do you think?
These are all different.
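A quick script along those lines might look roughly like the sketch below. It is only an illustration, not the code from this PR: the helper names (`build_blended_config`, `sampling_probabilities`, `write_config`) are hypothetical, and the per-shard entries (a `type`, a `path`, and a `num_tokens` count) are assumed rather than taken from the actual prepared-dataset layout. The config shape and the `fast_llm_config.yaml` file name mirror the snippet under review.

```python
from pathlib import Path


def build_blended_config(dataset_dicts):
    """Build a blended-dataset config weighted by each entry's token count."""
    if not dataset_dicts:
        raise ValueError("need at least one dataset entry")
    return {
        "type": "blended",
        "datasets": list(dataset_dicts),
        "weights": [d["num_tokens"] for d in dataset_dicts],
    }


def sampling_probabilities(config):
    """Normalize raw weights into probabilities (for inspection only)."""
    total = sum(config["weights"])
    return [w / total for w in config["weights"]]


def write_config(config, output_dir):
    """Write the config as YAML, closing the file handle explicitly."""
    # PyYAML is imported here so the helpers above work without it installed.
    import yaml

    path = Path(output_dir) / "fast_llm_config.yaml"
    with path.open("w") as f:
        yaml.safe_dump(config, f)
    return path


if __name__ == "__main__":
    # Hypothetical entries; real ones would come from the prepared dataset.
    shards = [
        {"type": "memmap", "path": "shard_0", "num_tokens": 3000},
        {"type": "memmap", "path": "shard_1", "num_tokens": 1000},
    ]
    config = build_blended_config(shards)
    print(sampling_probabilities(config))
```

Writing through a `with` block instead of a bare `.open("w")` makes sure the file is flushed and closed before anything else reads it.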
LGTM
✨ Description
Fix: #136
Replace both the JSON and concatenated memmap datasets with a generic (fast-llm) config loaded from a file. The dataset preparator now generates such a config file: a blended dataset with probabilities proportional to file sizes (same as the JSON dataset). Note:
🔍 Type of change
Select all that apply: