Skip to content

Fix strategy initialization deadlock#528

Merged
sjmonson merged 2 commits intomainfrom
fix/req_start_delay
Jan 14, 2026
Merged

Fix strategy initialization deadlock#528
sjmonson merged 2 commits intomainfrom
fix/req_start_delay

Conversation

@sjmonson
Copy link
Collaborator

@sjmonson sjmonson commented Jan 13, 2026

Summary

Fixed an issue where applying the initial start time to a Stratagy could results in a deadlock. Also ensure that the recorded benchmark start_time is relative to the first request and not the scheduling start time.

Details

At the start of a benchmark the main thread will set the start time in shared memory. Each worker will also attempt to poll the start time. This behavior can cause transient deadlocks that prevent the benchmark from progressing to sending requests. Additionally the recorded "start_time" used for constraint calculations and stored as the benchmark start time was relative to the scheduler start time and not the first request time. This PR implements separate fixes for each issue.

Test Plan

The following patch is needed to observe the issue however it can be very difficult to reproduce depending on the hardware:

diff --git a/src/guidellm/scheduler/worker_group.py b/src/guidellm/scheduler/worker_group.py
index a2ee246f..4af4f710 100644
--- a/src/guidellm/scheduler/worker_group.py
+++ b/src/guidellm/scheduler/worker_group.py
@@ -22,7 +22,7 @@ from multiprocessing.process import BaseProcess
 from multiprocessing.synchronize import Barrier, Event
 from typing import Generic, NamedTuple
 
-from guidellm.logger import logger
+from loguru import logger
 from guidellm.scheduler.constraints import Constraint, RequestsExhaustedConstraint
 from guidellm.scheduler.schemas import (
     BackendInterface,
@@ -278,7 +278,11 @@ class WorkerProcessGroup(Generic[RequestT, ResponseT]):
         ):
             raise RuntimeError("create_processes() must be called before start()")
 
+        logger.critical(f"Starting in {start_time - time.time():.2f} seconds...")
+        initial_time = time.time()
         self.strategy.init_processes_start(start_time=start_time)
+        init_time = time.time()
+        logger.critical(f"Initialized strategy in {init_time - initial_time:.2f} seconds.")
         stop_send_requests_event = threading.Event()
         send_requests_stopped_event = threading.Event()
         self.state = WorkerGroupState[RequestT, ResponseT](
@@ -300,6 +304,8 @@ class WorkerProcessGroup(Generic[RequestT, ResponseT]):
             send_stop_criteria=[stop_send_requests_event],
             receive_stop_criteria=[self.shutdown_event],
         )
+        logger.critical(f"Started messaging in {time.time() - init_time:.2f} seconds.")
+        logger.critical(f"Finished startup in {time.time() - initial_time:.2f} seconds.")
 
         if (wait_time := start_time - time.time()) > 0:
             await asyncio.sleep(wait_time)

The following test would reliably reproduce the issue in at least one concurrency level on H200 hardware:

guidellm benchmark run \
            --target="${GUIDELLM_TARGET}" \
            --rate-type=concurrent \
            --data=prompt_tokens=1000,output_tokens=1000 \
            --max-seconds=600 \
            --rate=1,50,100,200,300,500,650 \
            --disable-progress \
            --output-path="output.json" 2>&1 | tee "output.log"

Example in output.log where strategy sync hit issue:

26-01-13 21:16:33|CRITICAL         |guidellm.scheduler.worker_group:start:281 - Starting in 1.00 seconds...
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:285 - Initialized strategy in 200.15 seconds.
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:307 - Started messaging in 0.00 seconds.
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:308 - Finished startup in 200.15 seconds.

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

@sjmonson sjmonson requested review from jaredoconnell and markurtz and removed request for jaredoconnell January 14, 2026 15:29
@sjmonson sjmonson self-assigned this Jan 14, 2026
@sjmonson sjmonson added this to the v0.5.2 milestone Jan 14, 2026
@sjmonson sjmonson marked this pull request as ready for review January 14, 2026 16:04
jaredoconnell
jaredoconnell previously approved these changes Jan 14, 2026
Copy link
Collaborator

@jaredoconnell jaredoconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an event looks to be an improvement to the design. I just have one comment.

Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
@sjmonson sjmonson force-pushed the fix/req_start_delay branch from e094303 to 9c43821 Compare January 14, 2026 20:07
Copy link
Collaborator

@jaredoconnell jaredoconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I tested it with a basic setup and checked the relevant timing values in the output.

@sjmonson sjmonson merged commit e58d49e into main Jan 14, 2026
16 checks passed
@sjmonson sjmonson deleted the fix/req_start_delay branch January 14, 2026 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Some benchmark runs have a long delay before starting

2 participants