Reenable and improve preprocess dataset #472
Conversation
Also includes some new behavior
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
sjmonson
left a comment
I think the reuse of SyntheticTextDatasetConfig is not a great idea. prefix_buckets are defined differently here. For synthetic data, a prefix bucket of `prefix_tokens=10, prefix_count=1` means you get one identical prefix for the entire dataset. As implemented here, `prefix_tokens=10, prefix_count=1` only ensures that every row has a prefix of length 10; it does not guarantee any shared prefix between rows.
Rather than reuse SyntheticTextDatasetConfig, I think the best option is to create a new config format that is similar only where it makes sense. For example:
```
prompt_tokens:
prompt_tokens_stdev:
prompt_tokens_min:
prompt_tokens_max:
output_tokens:
output_tokens_stdev:
output_tokens_min:
output_tokens_max:
prefix_tokens:
prefix_tokens_stdev:
prefix_tokens_min:
prefix_tokens_max:
```
Treat prefix the same as prompt and output.
Use a separate config for preprocess's config, but have it inherit several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Moved short prompt strategies to a static class
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
I moved it to its own class in a way that retains a single source of truth, so we can use the same documentation. I simplified it to only offer the option of trimming prefixes. I decided randomized size sampling wouldn't make sense, because that's not typically how samples work in real scenarios.
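A trim-only strategy along these lines could look like the sketch below (a hypothetical helper under assumed semantics: only the prefix shrinks to fit a token budget, the prompt itself is never cut):

```python
def trim_prefix(
    prefix_ids: list[int],
    prompt_ids: list[int],
    max_total_tokens: int,
) -> list[int]:
    # Shrink only the prefix so prefix + prompt fits the budget;
    # if the prompt alone exceeds it, the prefix drops to empty.
    budget = max_total_tokens - len(prompt_ids)
    return prefix_ids[: max(budget, 0)]
```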
Use separate class for preprocess config
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
sjmonson
left a comment
Need to double-check that benchmark run is unaffected but LGTM.
markurtz
left a comment
Overall looks good. My only comment: could we move the bulk of what's currently in entrypoints.py into a separate file or package? That way we can keep that file dedicated to just the intended exposed APIs, while the classes and logic functions are nested under a sub-namespace.
Summary
This PR re-enables, tests, and documents the `preprocess dataset` command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.
Details
The data config now follows the same format as `benchmark run`'s synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM.
Test Plan
Use of AI
## WRITTEN BY AI ##