Reenable and improve preprocess dataset #472
Conversation
Also includes some new behavior
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
sjmonson
left a comment
I think the reuse of SyntheticTextDatasetConfig is not a great idea. prefix_buckets are defined differently here. For synthetic data, a prefix bucket of `prefix_tokens=10, prefix_count=1` means you get one identical prefix for the entire dataset. As implemented here, `prefix_tokens=10, prefix_count=1` only ensures that every row has a prefix of length 10; it does not guarantee any shared prefix between rows.
Rather than reuse SyntheticTextDatasetConfig, I think the best option is to create a new config format that is similar only where it makes sense. For example:
```
prompt_tokens:
prompt_tokens_stdev:
prompt_tokens_min:
prompt_tokens_max:
output_tokens:
output_tokens_stdev:
output_tokens_min:
output_tokens_max:
prefix_tokens:
prefix_tokens_stdev:
prefix_tokens_min:
prefix_tokens_max:
```
Treat prefix the same as prompt and output.
Use a separate config for preprocess's config, but have it inherit several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Moved short prompt strategies to a static class
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
I moved it to its own class in a way that retains a single source of truth, so we can use the same documentation. I simplified it to only offer the option of trimming prefixes. I decided randomized size sampling wouldn't make sense, because that's not typically how samples work in real scenarios.
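A trim-only strategy along these lines could look like the sketch below (a hypothetical helper under assumed semantics: only the prefix shrinks to fit a token budget, the prompt itself is never cut):

```python
def trim_prefix(
    prefix_ids: list[int],
    prompt_ids: list[int],
    max_total_tokens: int,
) -> list[int]:
    # Shrink only the prefix so prefix + prompt fits the budget;
    # if the prompt alone exceeds it, the prefix drops to empty.
    budget = max_total_tokens - len(prompt_ids)
    return prefix_ids[: max(budget, 0)]
```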
Use separate class for preprocess config
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
sjmonson
left a comment
Need to double-check that benchmark run is unaffected but LGTM.
markurtz
left a comment
Overall looks good. My only comment: could we move the bulk of what's currently in entrypoints.py into a separate file or package? That way we can keep that file dedicated to just the intended exposed APIs, while the classes and logic functions are nested under a sub-namespace.
Summary
This PR re-enables, tests, and documents the `preprocess dataset` command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.
Details
The data config now follows the same format as `benchmark run`'s synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM.
Test Plan
Use of AI
## WRITTEN BY AI ##