Tags: vllm-project/guidellm
Added rampup to constant rate type (#549)

## Summary

Adds an optional linear ramp-up to the constant rate profile.

## Test Plan

The simplest test is to run a short constant-rate benchmark at 4 requests per second with a long ramp-up and observe that the rate ramps up as expected. New unit tests are also included.

## Related Issues

- Fulfills part of the goals of #428

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [x] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)
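A linear ramp-up to a constant target rate can be sketched as below. This is a minimal illustration, not guidellm's implementation; `target_rate` and `rampup_duration` are hypothetical parameter names.

```python
def rate_at(elapsed: float, target_rate: float, rampup_duration: float) -> float:
    """Current request rate, ramping linearly from 0 to target_rate.

    After rampup_duration seconds (or if no ramp-up is configured),
    the rate holds constant at target_rate.
    """
    if rampup_duration <= 0 or elapsed >= rampup_duration:
        return target_rate
    return target_rate * (elapsed / rampup_duration)
```

For example, with a target of 4 requests per second and a 30-second ramp-up, the rate halfway through the ramp is `rate_at(15.0, 4.0, 30.0)`, i.e. 2.0 requests per second.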
OpenAI API-Key Support (#535)

## Summary

A basic set of changes to add the API key as a bearer token to all relevant requests.

## Details

- The user passes the API key as an argument to the backend.
- OpenAI's protocol specifies that it be sent as a bearer token.
- Headers are merged, because individual requests can carry their own headers.
- This PR also removes dead code that was unused since the refactor.
- The API key is excluded from the info data structure for security purposes. Let me know if some of that info belongs there, such as a boolean indicating whether an API key is provided, or a cryptographic hash of the token; otherwise I think it's good as-is.

## Test Plan

Run a vLLM server with the option `--api-key <your API key>`. Without the options documented in this PR, guidellm would usually fail against such a server; with them, it should work.

## Related Issues

- Resolves: #491

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [x] Includes AI-assisted code completion
- [x] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)
Record output_tokens for incomplete requests (#519)

## Summary

Sets `continuous_usage_stats` to get token usage for incomplete requests. If usage is still unavailable, fall back to the iteration count.

## Details

In v0.3.0 and earlier, the number of iterations was used as a proxy for the output token count of incomplete requests that did not return usage metrics. v0.4.0 removed this behavior, which led to large discrepancies in output token counts depending on the percentage of the benchmark consisting of incomplete requests. This PR restores the original behavior of falling back to the number of iterations. Additionally, it sets the `continuous_usage_stats` flag to enable usage metrics on every iteration, when available.

## Test Plan

- Run a long-generation, high-concurrency benchmark using a `max-seconds` constraint. Check that `output_tokens` is greater than 0 for some of the incomplete requests.

## Related Issues

- Resolves #514

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)
Allow unlimited connections per-worker (#488)

## Summary

By default each httpx client supports a maximum of 100 connections ([ref](https://www.python-httpx.org/advanced/resource-limits/)). We want this uncapped, since the connection count is already maintained by a semaphore.

## Test Plan

- See #487

## Related Issues

- Resolves #487

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)
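The pattern of bounding concurrency with a semaphore rather than the HTTP client's pool can be sketched as below. This is a generic illustration, not guidellm's worker code; in httpx itself, the pool cap can be lifted with `httpx.Limits(max_connections=None)`.

```python
import asyncio


async def run_all(request_fns, max_concurrency: int):
    """Run request coroutines with concurrency capped by a semaphore.

    Because the semaphore bounds in-flight requests, the HTTP client's
    own connection limit can safely be removed.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(fn):
        async with sem:
            return await fn()

    return await asyncio.gather(*(bounded(fn) for fn in request_fns))
```

A design note: keeping the cap in one place (the semaphore) avoids two competing limits, where the client pool's default of 100 would silently queue requests behind the scheduler's back.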
Configurable max_tokens/max_completion_tokens key (#399)

## Summary

Makes the `max_tokens` request key configurable through an environment variable, per endpoint type. Defaults to `max_tokens` for the legacy `completions` endpoint and `max_completion_tokens` for `chat/completions`.

## Details

- Adds the `GUIDELLM__OPENAI__MAX_OUTPUT_KEY` config option, a dict mapping route name -> output tokens key. The default is `{"text_completions": "max_tokens", "chat_completions": "max_completion_tokens"}`.

## Related Issues

- Closes #395
- Closes #269
- Related: #210

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)

---------

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
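The route-to-key mapping can be sketched as follows. This is a simplified illustration: guidellm's settings layer handles the env var parsing, and reading it as a JSON dict here is an assumption of this sketch, as is the `output_tokens_key` helper name. The defaults match those stated in the PR.

```python
import json
import os

DEFAULT_OUTPUT_KEYS = {
    "text_completions": "max_tokens",
    "chat_completions": "max_completion_tokens",
}


def output_tokens_key(route: str) -> str:
    """Resolve the output-token request key for a route.

    The GUIDELLM__OPENAI__MAX_OUTPUT_KEY env var (assumed JSON here)
    overrides the defaults per route.
    """
    raw = os.environ.get("GUIDELLM__OPENAI__MAX_OUTPUT_KEY")
    overrides = json.loads(raw) if raw else {}
    return {**DEFAULT_OUTPUT_KEYS, **overrides}[route]
```

With no override set, a chat completion request would use `max_completion_tokens`, while a legacy text completion would use `max_tokens`.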