Feat/add preprocess dataset #162


Merged: 24 commits into neuralmagic:main on Jun 5, 2025

Conversation

@TomerG711 (Collaborator) commented May 15, 2025

Added a new command: preprocess. It lets users preprocess a dataset (from HF, a local file, etc.) and constrain the prompts to a specific token-size distribution. The generated dataset is saved to a local file, optionally pushed to HF, and can later be used by the GuideLLM benchmark. Solves #106
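To make the behavior concrete, here is a minimal sketch of the kind of preprocessing such a command performs; this is not the PR's actual implementation, and the model, dataset, column name, and distribution parameters below are all placeholder assumptions.

```python
# Minimal sketch only -- not the PR's implementation. Model, dataset,
# column name, and distribution parameters are placeholder assumptions.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
dataset = load_dataset("openai/gsm8k", "main", split="train")  # placeholder dataset
rng = np.random.default_rng(seed=0)

def trim_to_target(text: str, target_tokens: int) -> str:
    """Encode once, truncate to the sampled token budget, decode back to text."""
    tokens = tokenizer.encode(text)
    return tokenizer.decode(tokens[:target_tokens], skip_special_tokens=True)

# Draw a per-prompt token budget from a normal distribution, then trim each prompt.
targets = rng.normal(loc=128, scale=32, size=len(dataset)).astype(int).clip(min=1)
processed = [trim_to_target(row["question"], int(t)) for row, t in zip(dataset, targets)]
```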

@TomerG711 TomerG711 changed the title Fead/add preprocess dataset [Issue-106] Feat/add preprocess dataset [Issue-106] May 15, 2025
@TomerG711 TomerG711 changed the title Feat/add preprocess dataset [Issue-106] Feat/add preprocess dataset May 18, 2025
@TomerG711 TomerG711 marked this pull request as ready for review May 28, 2025 08:07
@markurtz (Member) left a comment

Overall this looks good. I added some comments on potential issues we should address, as well as some nits.

For more general changes, can we add docstrings for all public functions with a quick write-up of the functionality plus :param, :return, and :raises? For anything that is an entrypoint, we should also add a simple usage example. Finally, for the tests, can you mark each test with a decorator of @pytest.mark.smoke, @pytest.mark.sanity, or @pytest.mark.regression based on how frequently it should run?
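For reference, a minimal illustration of the requested conventions; the function and test below are hypothetical, not code from this PR.

```python
# Hypothetical example of the requested docstring style and pytest markers.
import random

import pytest

def sample_prompt_length(mean: float, stdev: float) -> int:
    """Sample a target prompt length in tokens from a normal distribution.

    :param mean: Mean of the token-length distribution.
    :param stdev: Standard deviation of the token-length distribution.
    :return: A positive integer token length (minimum 1).
    :raises ValueError: If stdev is negative.
    """
    if stdev < 0:
        raise ValueError("stdev must be non-negative")
    return max(1, round(random.gauss(mean, stdev)))

@pytest.mark.smoke
def test_sample_prompt_length_is_positive():
    assert sample_prompt_length(128, 32) >= 1
```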

markurtz previously approved these changes Jun 4, 2025

@markurtz (Member) left a comment

@TomerG711 the changes are looking good. I left two minor nits that aren't blockers for landing, and added a code suggestion to simplify the logic and avoid running tokenizer.encode twice. I also added a comment on what we discussed about using the tokenizer with padding to converge faster; that can be a follow-up fix if you'd like to land this first.
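The simplification amounts to encoding each prompt once and reusing the token list, along the lines of this hypothetical before/after sketch (the variable names and model are illustrative, not the PR's code).

```python
# Illustrative sketch of the suggested simplification -- names are hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
prompt = "An example prompt that may exceed the sampled token budget."
target_len = 8  # illustrative per-prompt token budget

# Before: the prompt is encoded twice -- once to check the length, once to truncate.
#   if len(tokenizer.encode(prompt)) > target_len:
#       prompt = tokenizer.decode(tokenizer.encode(prompt)[:target_len])

# After: encode once and reuse the token list.
tokens = tokenizer.encode(prompt)
if len(tokens) > target_len:
    prompt = tokenizer.decode(tokens[:target_len], skip_special_tokens=True)
```

The padding remark presumably refers to filling prompts that are shorter than the target with the tokenizer's pad token, so the target length is reached in a single step rather than through repeated re-encoding.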

@markurtz merged commit db7b534 into neuralmagic:main on Jun 5, 2025
12 of 13 checks passed