Feat/add preprocess dataset #162


Merged: 24 commits into neuralmagic:main on Jun 5, 2025

Conversation

@TomerG711 (Collaborator) commented May 15, 2025

Added a new command: preprocess. It lets users preprocess a dataset (from HF, a local file, etc.) and constrain the prompts to a specific token-size distribution. The generated dataset is saved to a local file, optionally pushed to HF, and can later be used by the GuideLLM benchmark. Solves #106
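To make the behavior concrete, here is a minimal sketch of the kind of preprocessing such a command performs; this is not the PR's actual implementation, and the model, dataset, column name, and distribution parameters below are all placeholder assumptions.

```python
# Minimal sketch only -- not the PR's implementation. Model, dataset,
# column name, and distribution parameters are placeholder assumptions.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
dataset = load_dataset("openai/gsm8k", "main", split="train")  # placeholder dataset
rng = np.random.default_rng(seed=0)

def trim_to_target(text: str, target_tokens: int) -> str:
    """Encode once, truncate to the sampled token budget, decode back to text."""
    tokens = tokenizer.encode(text)
    return tokenizer.decode(tokens[:target_tokens], skip_special_tokens=True)

# Draw a per-prompt token budget from a normal distribution, then trim each prompt.
targets = rng.normal(loc=128, scale=32, size=len(dataset)).astype(int).clip(min=1)
processed = [trim_to_target(row["question"], int(t)) for row, t in zip(dataset, targets)]
```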

@TomerG711 TomerG711 changed the title Fead/add preprocess dataset [Issue-106] Feat/add preprocess dataset [Issue-106] May 15, 2025
@TomerG711 TomerG711 changed the title Feat/add preprocess dataset [Issue-106] Feat/add preprocess dataset May 18, 2025
@TomerG711 TomerG711 marked this pull request as ready for review May 28, 2025 08:07
@markurtz (Member) left a comment

Overall this looks good. I added some comments on potential issues we should address, as well as some nits.

For more general changes, can we add docstrings for all public functions with a quick write-up of the functionality plus :param, :return, and :raises? For anything that is an entrypoint, we should also add a simple usage example. Finally, for the tests, can you mark each test with a decorator of @pytest.mark.smoke, @pytest.mark.sanity, or @pytest.mark.regression based on how frequently it should run?
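For reference, a minimal illustration of the requested conventions; the function and test below are hypothetical, not code from this PR.

```python
# Hypothetical example of the requested docstring style and pytest markers.
import random

import pytest

def sample_prompt_length(mean: float, stdev: float) -> int:
    """Sample a target prompt length in tokens from a normal distribution.

    :param mean: Mean of the token-length distribution.
    :param stdev: Standard deviation of the token-length distribution.
    :return: A positive integer token length (minimum 1).
    :raises ValueError: If stdev is negative.
    """
    if stdev < 0:
        raise ValueError("stdev must be non-negative")
    return max(1, round(random.gauss(mean, stdev)))

@pytest.mark.smoke
def test_sample_prompt_length_is_positive():
    assert sample_prompt_length(128, 32) >= 1
```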

markurtz previously approved these changes Jun 4, 2025

@markurtz (Member) left a comment

@TomerG711 the changes are looking good. I left two minor nits that aren't blockers for landing, and added a code suggestion to simplify the logic and avoid running tokenizer.encode twice. I also added a comment on what we discussed about using the tokenizer with padding to converge faster; that can be a follow-up fix if you'd like to land this first.
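The simplification amounts to encoding each prompt once and reusing the token list, along the lines of this hypothetical before/after sketch (the variable names and model are illustrative, not the PR's code).

```python
# Illustrative sketch of the suggested simplification -- names are hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
prompt = "An example prompt that may exceed the sampled token budget."
target_len = 8  # illustrative per-prompt token budget

# Before: the prompt is encoded twice -- once to check the length, once to truncate.
#   if len(tokenizer.encode(prompt)) > target_len:
#       prompt = tokenizer.decode(tokenizer.encode(prompt)[:target_len])

# After: encode once and reuse the token list.
tokens = tokenizer.encode(prompt)
if len(tokens) > target_len:
    prompt = tokenizer.decode(tokens[:target_len], skip_special_tokens=True)
```

The padding remark presumably refers to filling prompts that are shorter than the target with the tokenizer's pad token, so the target length is reached in a single step rather than through repeated re-encoding.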

@markurtz merged commit db7b534 into neuralmagic:main on Jun 5, 2025
12 of 13 checks passed