-
Notifications
You must be signed in to change notification settings - Fork 44
Feat/add preprocess dataset #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/add preprocess dataset #162
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Added some comments towards potential issues that we should address as well as some NITs.
For more general changes, can we add in doc statements for any public functions containing a quick writeup on the functionality and adding in :param, :return, and :raises? For anything that is an entrypoint, we should also add in a simple example of usage. And finally, for the tests, can you mark each test with a decorator of @pytest.mark.smoke, @pytest.mark.sanity, or @pytest.mark.regression based on the frequency they should run with?
…o fead/add-preprocess-dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomerG711 changes are looking good. I left two minor NITs that aren't blockers for me to land and added a code suggestion for simplification to avoid running tokenizer.encode twice. Also added a comment towards what we had talked about to get the tokenizer with padding to converge faster. That can also be a follow up fix if you'd like to land this.
Added a new command - preprocess. This command allows the users to preprocess dataset (from HF, file etc.), and limit the prompts the specific token sizes distribution. The generated dataset is saved to a local file and optionally to HF, and can later be used by GuideLLM benchmark. Solves #106