Fix strategy initialization deadlock by sjmonson · Pull Request #528 · vllm-project/guidellm

sjmonson · 2026-01-13T20:44:30Z

Summary

Fixed an issue where applying the initial start time to a Stratagy could results in a deadlock. Also ensure that the recorded benchmark start_time is relative to the first request and not the scheduling start time.

Details

At the start of a benchmark the main thread will set the start time in shared memory. Each worker will also attempt to poll the start time. This behavior can cause transient deadlocks that prevent the benchmark from progressing to sending requests. Additionally the recorded "start_time" used for constraint calculations and stored as the benchmark start time was relative to the scheduler start time and not the first request time. This PR implements separate fixes for each issue.

Test Plan

The following patch is needed to observe the issue however it can be very difficult to reproduce depending on the hardware:

diff --git a/src/guidellm/scheduler/worker_group.py b/src/guidellm/scheduler/worker_group.py
index a2ee246f..4af4f710 100644
--- a/src/guidellm/scheduler/worker_group.py
+++ b/src/guidellm/scheduler/worker_group.py
@@ -22,7 +22,7 @@ from multiprocessing.process import BaseProcess
 from multiprocessing.synchronize import Barrier, Event
 from typing import Generic, NamedTuple
 
-from guidellm.logger import logger
+from loguru import logger
 from guidellm.scheduler.constraints import Constraint, RequestsExhaustedConstraint
 from guidellm.scheduler.schemas import (
     BackendInterface,
@@ -278,7 +278,11 @@ class WorkerProcessGroup(Generic[RequestT, ResponseT]):
         ):
             raise RuntimeError("create_processes() must be called before start()")
 
+        logger.critical(f"Starting in {start_time - time.time():.2f} seconds...")
+        initial_time = time.time()
         self.strategy.init_processes_start(start_time=start_time)
+        init_time = time.time()
+        logger.critical(f"Initialized strategy in {init_time - initial_time:.2f} seconds.")
         stop_send_requests_event = threading.Event()
         send_requests_stopped_event = threading.Event()
         self.state = WorkerGroupState[RequestT, ResponseT](
@@ -300,6 +304,8 @@ class WorkerProcessGroup(Generic[RequestT, ResponseT]):
             send_stop_criteria=[stop_send_requests_event],
             receive_stop_criteria=[self.shutdown_event],
         )
+        logger.critical(f"Started messaging in {time.time() - init_time:.2f} seconds.")
+        logger.critical(f"Finished startup in {time.time() - initial_time:.2f} seconds.")
 
         if (wait_time := start_time - time.time()) > 0:
             await asyncio.sleep(wait_time)

The following test would reliably reproduce the issue in at least one concurrency level on H200 hardware:

guidellm benchmark run \
            --target="${GUIDELLM_TARGET}" \
            --rate-type=concurrent \
            --data=prompt_tokens=1000,output_tokens=1000 \
            --max-seconds=600 \
            --rate=1,50,100,200,300,500,650 \
            --disable-progress \
            --output-path="output.json" 2>&1 | tee "output.log"

Example in output.log where strategy sync hit issue:

26-01-13 21:16:33|CRITICAL         |guidellm.scheduler.worker_group:start:281 - Starting in 1.00 seconds...
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:285 - Initialized strategy in 200.15 seconds.
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:307 - Started messaging in 0.00 seconds.
26-01-13 21:19:53|CRITICAL         |guidellm.scheduler.worker_group:start:308 - Finished startup in 200.15 seconds.

Related Issues

Resolves Some benchmark runs have a long delay before starting #526

"I certify that all code in this PR is my own, except as noted below."

Use of AI

Includes AI-assisted code completion
Includes code generated by an AI application
Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

jaredoconnell

Using an event looks to be an improvement to the design. I just have one comment.

src/guidellm/scheduler/strategies.py

Signed-off-by: Samuel Monson <smonson@redhat.com>

jaredoconnell

Looks good to me. I tested it with a basic setup and checked the relevant timing values in the output.

sjmonson requested review from jaredoconnell and markurtz and removed request for jaredoconnell January 14, 2026 15:29

sjmonson self-assigned this Jan 14, 2026

sjmonson added this to the v0.5.2 milestone Jan 14, 2026

sjmonson marked this pull request as ready for review January 14, 2026 16:04

jaredoconnell previously approved these changes Jan 14, 2026

View reviewed changes

jaredoconnell reviewed Jan 14, 2026

View reviewed changes

src/guidellm/scheduler/strategies.py Outdated Show resolved Hide resolved

sjmonson dismissed jaredoconnell’s stale review via e094303 January 14, 2026 20:06

sjmonson force-pushed the fix/req_start_delay branch from 44293ab to e094303 Compare January 14, 2026 20:06

sjmonson added 2 commits January 14, 2026 15:07

Remove global spinlock in Strategies

a770697

Signed-off-by: Samuel Monson <smonson@redhat.com>

Keep global count of the earliest start and latest end requests

9c43821

Signed-off-by: Samuel Monson <smonson@redhat.com>

sjmonson force-pushed the fix/req_start_delay branch from e094303 to 9c43821 Compare January 14, 2026 20:07

jaredoconnell approved these changes Jan 14, 2026

View reviewed changes

sjmonson merged commit e58d49e into main Jan 14, 2026
16 checks passed

sjmonson deleted the fix/req_start_delay branch January 14, 2026 21:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix strategy initialization deadlock#528

Fix strategy initialization deadlock#528
sjmonson merged 2 commits intomainfrom
fix/req_start_delay

sjmonson commented Jan 13, 2026 •

edited

Loading

Uh oh!

jaredoconnell left a comment

Uh oh!

Uh oh!

jaredoconnell left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sjmonson commented Jan 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Test Plan

Related Issues

Use of AI

Uh oh!

jaredoconnell left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jaredoconnell left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sjmonson commented Jan 13, 2026 •

edited

Loading