Ensure synthetic text datasets remain random across benchmarks#463
Conversation
… run in the same session by enforcing `set_epoch` across the data loader and iterator chain. Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Pull Request Overview
This PR fixes synthetic dataset randomness across benchmark runs by implementing epoch tracking through the data loading chain. The key change is that SyntheticTextGenerator has been refactored into SyntheticTextDataset with a proper _SyntheticTextExamplesIterable that increments random seeds based on iteration count, ensuring different data is generated for each benchmark iteration.
Key changes:
- Introduced epoch tracking in the `DataLoader` and `DatasetsIterator` classes
- Refactored `SyntheticTextGenerator` into `SyntheticTextDataset`, extending `IterableDataset` with a `_SyntheticTextExamplesIterable` for proper epoch/iteration support
- Updated all test references from `SyntheticTextGenerator` to `SyntheticTextDataset`
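The core idea behind the refactor can be sketched as an iterable that derives its random seed from the current epoch. This is a hypothetical minimal sketch, not the actual guidellm implementation: the class name comes from the PR, but the fields, seed arithmetic, and example payloads are assumptions for illustration.

```python
import random
from typing import Iterator


class _SyntheticTextExamplesIterable:
    """Sketch: yields synthetic examples with a seed that advances per
    epoch, so re-iterating for a new benchmark produces new data."""

    def __init__(self, base_seed: int = 42):
        self.base_seed = base_seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Called by the data loader before each benchmark run.
        self.epoch = epoch

    def __iter__(self) -> Iterator[str]:
        # A fresh seed per epoch: random across runs, reproducible per epoch.
        rng = random.Random(self.base_seed + self.epoch)
        for _ in range(3):
            yield f"example-{rng.randint(0, 10_000)}"


it = _SyntheticTextExamplesIterable()
first = list(it)   # epoch 0 data
it.set_epoch(1)
second = list(it)  # different data for the next benchmark
```

Re-setting the epoch back to a previous value reproduces that run's data exactly, which keeps benchmarks comparable while still varying between runs.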
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/guidellm/data/loaders.py | Added epoch tracking to DatasetsIterator and DataLoader, with propagation of epoch to datasets via set_epoch |
| src/guidellm/data/deserializers/synthetic.py | Refactored generator into SyntheticTextDataset with _SyntheticTextExamplesIterable that increments random seed per iteration |
| tests/unit/data/deserializers/test_synthetic.py | Updated all test references from SyntheticTextGenerator to SyntheticTextDataset |
| src/guidellm/data/deserializers/__init__.py | Updated exports to replace SyntheticTextGenerator with SyntheticTextDataset |
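The loader-side change described above — propagating the epoch down to every dataset that supports `set_epoch` — can be sketched roughly as follows. The class names mirror the PR, but everything else (attributes, the epoch-bumping placement, the toy dataset) is a hypothetical illustration, not the real guidellm code.

```python
class DatasetsIterator:
    """Sketch: forwards the epoch to every wrapped dataset that opts in."""

    def __init__(self, datasets):
        self.datasets = datasets

    def set_epoch(self, epoch: int) -> None:
        for ds in self.datasets:
            # Datasets without set_epoch are left alone.
            if hasattr(ds, "set_epoch"):
                ds.set_epoch(epoch)


class DataLoader:
    """Sketch: each new iteration (one per benchmark) bumps the epoch
    first, so downstream datasets reseed instead of replaying data."""

    def __init__(self, iterator: DatasetsIterator):
        self.iterator = iterator
        self.epoch = 0

    def __iter__(self):
        self.iterator.set_epoch(self.epoch)
        self.epoch += 1
        for ds in self.iterator.datasets:
            yield from ds


class EpochAwareDataset:
    """Toy dataset whose output depends on the current epoch."""

    def __init__(self):
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __iter__(self):
        yield from (self.epoch * 10 + i for i in range(3))


loader = DataLoader(DatasetsIterator([EpochAwareDataset()]))
run1 = list(loader)  # [0, 1, 2]
run2 = list(loader)  # [10, 11, 12]
```

This matches the failure mode described in the PR: before the fix, each benchmark created a fresh iterator and replayed epoch-0 data; with the epoch threaded through, consecutive runs see different data.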
Comments suppressed due to low confidence (1)
tests/unit/data/deserializers/test_synthetic.py:314
- The test is accessing the private method `_create_prompt` on `SyntheticTextDataset`, but this method is actually defined in `_SyntheticTextExamplesIterable`, not on `SyntheticTextDataset`. This test will fail because `SyntheticTextDataset` doesn't have a `_create_prompt` method. The method needs to be exposed on `SyntheticTextDataset`, or the test needs to access it via `_ex_iterable`.
result = generator._create_prompt(5, faker, "unique_prefix ")
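The access path Copilot suggests can be illustrated with minimal stand-ins. These classes only mirror the attribute structure described in the comment; the `_create_prompt` body is a simplified placeholder (the real method also takes a `faker` instance, omitted here).

```python
class _SyntheticTextExamplesIterable:
    def _create_prompt(self, n_words: int, prefix: str = "") -> str:
        # Placeholder for the real faker-driven prompt construction.
        return prefix + " ".join(["word"] * n_words)


class SyntheticTextDataset:
    def __init__(self):
        # The dataset holds the iterable that owns _create_prompt.
        self._ex_iterable = _SyntheticTextExamplesIterable()


generator = SyntheticTextDataset()
# Fixed access path: reach the method through the inner iterable.
result = generator._ex_iterable._create_prompt(5, "unique_prefix ")
```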
Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
sjmonson
left a comment
Minor clarification question but otherwise LGTM. Tested working.
Summary
Fix the synthetic dataset so it preserves randomness across benchmarks run in the same session by enforcing `set_epoch` across the data loader and iterator chain. The repeated data occurred because a new iterator was created from the `DataLoader` for each benchmark. The fix preserves epoch information across datasets so they can make use of it if needed; in the case of synthetic text generation, the epoch is used to increment the random seed.
Details
Test Plan
Automation tests
Related Issues
Use of AI
## WRITTEN BY AI ##