LLM Load Balancer

A centralized management system for coordinating LLM inference runtimes across multiple machines

English | 日本語

Overview

LLM Load Balancer is a powerful centralized system that provides unified management and a single API endpoint for multiple LLM inference runtimes running across different machines. It features intelligent load balancing, automatic failure detection, real-time monitoring capabilities, and seamless integration for enhanced scalability.

Vision

LLM Load Balancer is designed to serve three primary use cases:

  1. Private LLM Server - For individuals and small teams who want to run their own LLM infrastructure with full control over their data and models
  2. Enterprise Gateway - For organizations requiring centralized management, access control, and monitoring of LLM resources across departments
  3. Cloud Provider Integration - Seamlessly route requests to OpenAI, Google, or Anthropic APIs through the same unified endpoint

Supported Endpoints

llmlb routes requests to any OpenAI-compatible inference server:

| Endpoint Type | Description |
| --- | --- |
| xLLM | Project-native C++ inference engine (GitHub) |
| Ollama | Popular local inference server |
| vLLM | High-throughput serving engine |
| LM Studio | Desktop inference application |
| OpenAI-compatible | Any server implementing the OpenAI API |

For supported model formats and engine details, see xLLM documentation.

Multimodal Support

Beyond text generation, LLM Load Balancer provides OpenAI-compatible APIs for:

  • Text-to-Speech (TTS): /v1/audio/speech - Generate natural speech from text
  • Speech-to-Text (ASR): /v1/audio/transcriptions - Transcribe audio to text
  • Image Generation: /v1/images/generations - Generate images from text prompts

Text generation should use the Responses API (/v1/responses) by default. Chat Completions remains available for compatibility.
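
For example, a minimal TTS call through the load balancer might look like the sketch below (the model name, voice, and output filename are placeholders; use whatever your registered endpoints actually serve):

# Hypothetical model/voice names; requires an API key with openai.inference
curl http://localhost:32768/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{"model": "your-tts-model", "input": "Hello from llmlb", "voice": "default"}' \
  --output speech.wav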

Key Features

  • Unified API Endpoint: Access multiple LLM runtime instances through a single URL
  • Automatic Load Balancing: Latency-based request distribution across available endpoints
  • Endpoint Management: Centralized management of Ollama, vLLM, xLLM and other OpenAI-compatible servers
  • Model Sync: Automatic model discovery via GET /v1/models from registered endpoints
  • Automatic Failure Detection: Detect offline endpoints and exclude them from routing
  • Real-time Monitoring: Comprehensive visualization of endpoint states and performance metrics via web dashboard
  • Request History Tracking: Complete request/response logging with 7-day retention
  • WebUI Management: Manage endpoints, monitoring, and control through browser-based dashboard
  • Cross-Platform Support: Works on Windows 10+, macOS 12+, and Linux
  • Self Update (User-Approved): Detect new GitHub Releases, notify via dashboard/tray, drain in-flight inference, then restart into the new version — with update scheduling, automatic rollback, and download progress tracking
  • GPU-Aware Routing: Intelligent request routing based on GPU capabilities and availability
  • Cloud Model Prefixes: Add an openai:, google:, or anthropic: prefix to the model name to proxy to the corresponding cloud provider while keeping the same OpenAI-compatible endpoint.

Assistant CLI for LLM Assistants

As of February 17, 2026, the former MCP server package (@llmlb/mcp-server) and npm distribution have been removed. Assistant workflows now use built-in CLI commands.

Why llmlb assistant?

| Feature | llmlb assistant | Manual Bash + curl |
| --- | --- | --- |
| Authentication | Auto-injected from env | Manual header management |
| Security | Host whitelist + injection prevention | No built-in protection |
| Sensitive data | Masked in command echo/output | Exposed in shell history |
| API docs | assistant openapi / assistant guide | External reference needed |
| Timeout handling | Built-in per request | Manual implementation |

Core Commands

# execute_curl equivalent
llmlb assistant curl --command "curl http://localhost:32768/v1/models"

# print OpenAPI JSON
llmlb assistant openapi

# print API guide text
llmlb assistant guide --category overview

Skills / Plugins

  • Claude Code plugin metadata: .claude-plugin/marketplace.json
  • Claude plugin entry: .claude-plugin/plugins/llmlb-cli/plugin.json
  • Claude skill mirror: .claude/skills/llmlb-cli-usage/SKILL.md
  • Codex skill: .codex/skills/llmlb-cli-usage/SKILL.md
  • Codex packaged output directory: codex-skills/dist/
python3 .codex/skills/.system/skill-creator/scripts/package_skill.py \
  .codex/skills/llmlb-cli-usage \
  codex-skills/dist

Quick Start

LLM Load Balancer (llmlb)

# Build
cargo build --release -p llmlb

# Run
./target/release/llmlb
# Default: http://0.0.0.0:32768

# Access dashboard
# Open http://localhost:32768/dashboard in browser
# (No internal API token required)

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| LLMLB_HOST | 0.0.0.0 | Bind address |
| LLMLB_PORT | 32768 | Listen port |
| LLMLB_DATABASE_URL | sqlite:~/.llmlb/lb.db | Database URL |
| LLMLB_LOG_LEVEL | info | Log level |
| LLMLB_JWT_SECRET | (auto-generated) | JWT signing secret |
| LLMLB_ADMIN_USERNAME | admin | Initial admin username |
| LLMLB_ADMIN_PASSWORD | (required) | Initial admin password |

Backward compatibility: Legacy env var names are supported but deprecated (see full list below).

System Tray (Windows/macOS only):

On Windows 10+ and macOS 12+, the load balancer displays a system tray icon. Double-click to open the dashboard. Docker/Linux runs as a headless CLI process. Use llmlb serve --no-tray to force headless mode on supported platforms.

Self Update (notification + restart):

llmlb checks GitHub Releases in the background (best-effort, cached up to 24h). When an update is available, it notifies via the dashboard and (Windows/macOS) the tray menu.

When you approve the update ("Restart to update"), llmlb rejects new inference requests (/v1/*) with 503 + Retry-After, waits for in-flight inference requests (including streaming) to finish, then applies the update and restarts. A drain timeout of 300 seconds prevents indefinite waiting. For Windows -setup.exe updates, llmlb runs the installer silently with /VERYSILENT /CLOSEAPPLICATIONS /SUPPRESSMSGBOXES in user context (%LOCALAPPDATA% install), so no UAC prompt is required.

Update scheduling:

  • Immediate: Apply now (default) — drains in-flight requests, then restarts
  • On idle: Waits until no inference requests are in-flight, then applies automatically
  • Scheduled: Specify a date/time; llmlb starts the drain at the scheduled time

Configure via dashboard settings modal or the scheduling API (POST/GET/DELETE /api/system/update/schedule).
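
As a hedged sketch (admin JWT auth is an assumption, not spelled out here), checking and clearing the current schedule could look like:

# Inspect the current update schedule
curl http://localhost:32768/api/system/update/schedule \
  -H "Authorization: Bearer <admin_jwt>"

# Remove a pending schedule
curl -X DELETE http://localhost:32768/api/system/update/schedule \
  -H "Authorization: Bearer <admin_jwt>"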

Rollback:

  • Automatic: After applying an update, llmlb monitors the new process for 30 seconds; if the health check fails, it automatically restores from the .bak backup
  • Manual: Use the dashboard "Rollback" button or POST /api/system/update/rollback when a .bak backup exists
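
A hedged example of the manual path (the auth requirement is an assumption; the route itself is listed above):

curl -X POST http://localhost:32768/api/system/update/rollback \
  -H "Authorization: Bearer <admin_jwt>"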

Download progress: The dashboard shows a real-time progress bar with bytes downloaded and percentage during update asset downloads.

Auto-apply method depends on the platform/install:

  • Portable install: replace the executable in-place when writable
  • macOS .pkg / Windows -setup.exe: run the installer (Windows runs silently without UAC)
  • Linux non-writable installs: auto-apply is not supported; reinstall manually from GitHub Releases

For full details, see CLAUDE.md.

CLI Reference

The CLI provides a small set of management subcommands:

# Start server (tray on Windows/macOS, headless on Linux)
llmlb serve

# Force headless mode on supported platforms
llmlb serve --no-tray

# Show running server status (lockfile-based)
llmlb status
llmlb status --port 32768

# Stop a running server
llmlb stop --port 32768

Day-to-day management is still done via the Dashboard UI (/dashboard) or the HTTP APIs.

xLLM (C++)

The xLLM runtime has moved to a separate repository:

Build/run instructions and environment variables are documented there.

Load Balancing

LLM Load Balancer supports multiple load balancing strategies to optimize request distribution across runtimes.

Strategies

1. Metrics-Based Load Balancing (Recommended)

Selects runtimes based on real-time metrics (CPU usage, memory usage, active requests). This intelligent mode provides optimal performance by dynamically routing requests to the least loaded runtime, ensuring efficient resource utilization.

Configuration:

# Enable metrics-based load balancing
LLMLB_LOAD_BALANCER_MODE=metrics cargo run -p llmlb

Load Score Calculation:

score = cpu_usage + memory_usage + (active_requests × 10)

The runtime with the lowest score is selected. If all runtimes have CPU usage > 80%, the system automatically falls back to round-robin.

Example:

  • Runtime A: CPU 20%, Memory 30%, Active 1 → Score = 60 ✓ Selected
  • Runtime B: CPU 70%, Memory 50%, Active 5 → Score = 170

2. Advanced Load Balancing (Default)

Combines multiple factors including response time, active requests, and CPU usage to provide sophisticated runtime selection with adaptive performance optimization.

Configuration:

# Use default advanced load balancing (or omit LOAD_BALANCER_MODE)
LLMLB_LOAD_BALANCER_MODE=auto cargo run -p llmlb

Health / Metrics

llmlb performs pull-based health checks against registered endpoints. Endpoints do not push heartbeats to the load balancer (there is no POST /api/health).

  • Endpoint status is surfaced in the dashboard and GET /api/endpoints.
  • Prometheus metrics are exported via GET /api/metrics/cloud (JWT admin or API key with metrics.read).
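
For example, scraping the Prometheus export with an API key that has metrics.read:

curl http://localhost:32768/api/metrics/cloud \
  -H "X-API-Key: sk_your_metrics_read_key"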

Architecture

LLM Load Balancer coordinates local llama.cpp runtimes and optionally proxies to cloud LLM providers via model prefixes.

Components

  • LLM Load Balancer (Rust): Receives OpenAI-compatible traffic, chooses a path, and proxies requests. Exposes dashboard, metrics, and admin APIs.
  • Local Runtimes (C++ / llama.cpp): Serve GGUF models; register and send heartbeats to the load balancer.
  • Cloud Proxy: When a model name starts with openai:, google:, or anthropic:, the load balancer forwards to the corresponding cloud API.
  • Storage: SQLite for load balancer metadata; model files live on each runtime.
  • Observability: Prometheus metrics, structured logs, dashboard stats.

System Overview


Draw.io source: docs/diagrams/architecture.drawio (Page: System Overview (README.md))

Request Flow

Client
  │ POST /v1/chat/completions
  ▼
LLM Load Balancer (OpenAI-compatible)
  ├─ Prefix? → Cloud API (OpenAI / Google / Anthropic)
  └─ No prefix → Scheduler → Local Runtime
                       └─ llama.cpp inference → Response

Communication Flow (Proxy Pattern)

LLM Load Balancer uses a Proxy Pattern - clients only need to know the load balancer URL.

Traditional Method (Without LLM Load Balancer)

# Direct access to each runtime API (default: runtime_port=32769)
curl http://machine1:32769/v1/responses -d '...'
curl http://machine2:32769/v1/responses -d '...'
curl http://machine3:32769/v1/responses -d '...'

With LLM Load Balancer (Proxy)

# Unified access to LLM Load Balancer - automatic routing to the optimal runtime
curl http://lb:32768/v1/responses -d '...'
curl http://lb:32768/v1/responses -d '...'
curl http://lb:32768/v1/responses -d '...'

Detailed Request Flow:

  1. Client → LLM Load Balancer

    POST http://lb:32768/v1/responses
    Content-Type: application/json
    
    {"model": "llama2", "input": "Hello!"}
    
  2. LLM Load Balancer Internal Processing

    • Select optimal runtime (Load Balancing)
    • Forward request to selected runtime via HTTP client
  3. LLM Load Balancer → Runtime (Internal Communication)

    POST http://runtime1:32769/v1/responses
    Content-Type: application/json
    
    {"model": "llama2", "input": "Hello!"}
    
  4. Runtime Local Processing

    • Runtime loads model on-demand (from local cache or load-balancer-provided source)
    • Runtime runs llama.cpp inference and returns an OpenAI-compatible response
  5. LLM Load Balancer → Client (Return Response)

{ "id": "resp_123", "object": "response", "output": [ { "type": "message", "role": "assistant", "content": [ { "type": "output_text", "text": "Hello!" } ] } ] }


> **Note**: LLM Load Balancer supports OpenAI-compatible APIs and **recommends** the
> Responses API (`/v1/responses`). Chat Completions remains available for
> compatibility.

**From Client's Perspective**:
- LLM Load Balancer appears as the only OpenAI-compatible API server
- No need to be aware of multiple internal runtimes
- Complete with a single HTTP request

### Model Sync (No Push Distribution)

- The load balancer never pushes models to runtimes.
- Runtimes resolve models on-demand in this order:
  1. Local cache (`LLM_RUNTIME_MODELS_DIR`)
  2. Allowlisted origin download (Hugging Face, etc.; configure via `LLM_RUNTIME_ORIGIN_ALLOWLIST`)
  3. Manifest-based selection from the load balancer (`GET /api/models/registry/:model_name/manifest.json`)
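
As a minimal sketch of step 3 (the model name is a placeholder; an API key with `registry.read`, such as the runtime-scoped key, is assumed):

```bash
# Fetch the manifest (file list) for a registered model
curl "http://localhost:32768/api/models/registry/<model_name>/manifest.json" \
  -H "X-API-Key: sk_your_registry_read_key"
```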

### Scheduling & Health
- Endpoints are registered via `/api/endpoints` (dashboard UI or API). CPU-only endpoints are also supported.
- Health is pull-based: llmlb periodically probes endpoints and uses status/latency for load balancing.
- Dashboard surfaces `*_key_present` flags so operators see which cloud keys are configured.

### Benefits of Proxy Pattern

1. **Unified Endpoint**
- Clients only need to know the load balancer URL
- No need to know each runtime location

2. **Transparent Load Balancing**
- LLM Load Balancer automatically selects the optimal runtime
- Clients benefit from load distribution without awareness

3. **Automatic Retry on Failure**
- If Runtime1 fails → LLM Load Balancer automatically tries Runtime2
- No re-request needed from client

4. **Security**
- Runtime IP addresses not exposed to clients
- Only LLM Load Balancer needs to be publicly accessible

5. **Scalability**
- Adding runtimes automatically increases processing capacity
- No changes needed on client side

## Project Structure

llmlb/
├── llmlb/              # Rust load balancer (HTTP APIs, dashboard, proxy, common types)
├── xllm (external)     # https://github.com/akiojin/xLLM
├── .claude-plugin/     # Claude Code plugin metadata and bundled skills
├── .codex/skills/      # Codex skills (source)
├── codex-skills/dist/  # Packaged Codex .skill artifacts
└── specs/              # Specifications (Spec-Driven Development)


## Dashboard

The dashboard is served by the load balancer at `/dashboard`.
Use it to monitor endpoints, view request history, inspect logs, and manage models.

### Quick usage

1. Start the load balancer:

   ```bash
   cargo run -p llmlb
   ```

2. Open the dashboard in a browser:

   http://localhost:32768/dashboard

Playground routes

  • Endpoint Playground: /dashboard/#playground/:endpointId
    • For direct endpoint verification via POST /api/endpoints/:id/chat/completions (JWT session)
  • LB Playground: /dashboard/#lb-playground
    • For load-balancer routing tests via GET /v1/models and POST /v1/chat/completions (API key)

Endpoint Management

The load balancer centrally manages external inference servers (Ollama, vLLM, xLLM, etc.) as "endpoints".

Supported Endpoints

| Type | Description | Health Check |
| --- | --- | --- |
| xLLM | In-house inference server (llama.cpp/whisper.cpp) | GET /v1/models |
| Ollama | Ollama server | GET /v1/models |
| vLLM | vLLM inference server | GET /v1/models |
| OpenAI-compatible | Other OpenAI-compatible APIs | GET /v1/models |

Registration via Dashboard

  1. Dashboard → Sidebar "Endpoints"
  2. Click "New Endpoint"
  3. Enter name and base URL (e.g., http://192.168.1.100:11434)
  4. "Connection Test" → "Save"

Registration via REST API

# Register endpoint
curl -X POST http://localhost:32768/api/endpoints \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"name": "Ollama Server A", "base_url": "http://192.168.1.100:11434"}'

# List endpoints
curl http://localhost:32768/api/endpoints \
  -H "Authorization: Bearer sk_your_api_key"

# Sync models
curl -X POST http://localhost:32768/api/endpoints/{id}/sync \
  -H "Authorization: Bearer sk_your_api_key"

Status Transitions

  • pending: Just registered (awaiting health check)
  • online: Health check successful
  • offline: Health check failed
  • error: Connection error

For details, see CLAUDE.md.

Hugging Face registration (safetensors / GGUF)

  • Optional env vars: set HF_TOKEN to raise Hugging Face rate limits; set HF_BASE_URL when using a mirror/cache.
  • Web (recommended):
    • Dashboard → Models → Register
    • Choose format: safetensors (native engines) or gguf (llama.cpp fallback).
      • If the repo contains both safetensors and .gguf, format is required.
      • Safetensors text generation is available only when the safetensors.cpp engine is enabled (Metal/CUDA). Use gguf for GGUF-only models.
    • Enter a Hugging Face repo or file URL (e.g. nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16).
    • For format=gguf:
      • Either specify an exact .gguf filename, or choose gguf_policy (quality / memory / speed) to auto-pick from GGUF siblings.
    • For format=safetensors:
      • The HF snapshot must include config.json and tokenizer.json.
      • Sharded weights must include an .index.json.
      • If official GPU artifacts are provided (for example model.metal.bin), they may be used as execution cache when supported. Otherwise, safetensors are used directly.
      • Windows requires CUDA builds (BUILD_WITH_CUDA=ON). DirectML is not supported.
    • LLM Load Balancer stores metadata + manifest only (no binary download).
    • Model IDs are the Hugging Face repo ID (e.g. org/model).
    • /v1/models lists models including queued/caching/error with lifecycle_status + download_progress.
    • Runtimes pull models on-demand via the model registry endpoints:
      • GET /api/models/registry/:model_name/manifest.json
      • GET /api/models/registry/:model_name/files/:file_name
      • (Legacy) GET /api/models/blob/:model_name for single-file GGUF.
  • API:
    • POST /api/models/register with repo and optional filename.
  • /v1/models lists registered models; ready reflects runtime sync status.
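
A hedged example of the registration API above (the repo ID is illustrative; filename is only needed when pinning a specific .gguf file):

curl -X POST http://localhost:32768/api/models/register \
  -H "Content-Type: application/json" \
  -H "X-API-Key: sk_your_models_manage_key" \
  -d '{"repo": "org/model"}'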

Installation

Prerequisites

  • Linux/macOS/Windows x64 (GPU recommended)
  • Rust toolchain (stable) and cargo
  • Docker (optional)
  • CUDA Driver (for NVIDIA GPU) - see CUDA Setup

Pre-built Binaries

Download platform-specific binaries from GitHub Releases.

| Platform | Files |
| --- | --- |
| Linux x86_64 | llmlb-linux-x86_64.tar.gz |
| macOS ARM64 (Apple Silicon) | llmlb-macos-arm64.tar.gz, llmlb-macos-arm64.pkg |
| macOS x86_64 (Intel) | llmlb-macos-x86_64.tar.gz, llmlb-macos-x86_64.pkg |
| Windows x86_64 | llmlb-windows-x86_64.zip, llmlb-windows-x86_64-setup.exe |

macOS Notes

The macOS .pkg installers are not code-signed, so you'll see a security warning on first run.

To install:

  1. Right-click the .pkg file in Finder → Select "Open"
  2. Click "Open" to proceed

Or remove the quarantine attribute via Terminal:

sudo xattr -d com.apple.quarantine llmlb-macos-*.pkg

CUDA Setup (NVIDIA GPU)

For NVIDIA GPU acceleration, you need:

| Component | Build Environment | Runtime Environment |
| --- | --- | --- |
| CUDA Driver | Required | Required |
| CUDA Toolkit | Required (for nvcc) | Not required |

Installing CUDA Driver

The CUDA Driver is typically installed with NVIDIA graphics drivers.

# Verify driver installation
nvidia-smi

If nvidia-smi shows your GPU, the driver is installed.

Installing CUDA Toolkit (Build Environment Only)

Required only for building the runtime with CUDA support (BUILD_WITH_CUDA=ON).

Windows:

  1. Download from CUDA Toolkit Downloads
  2. Select: Windows → x86_64 → 11 → exe (local)
  3. Run the installer (Express installation recommended)
  4. Verify: Open new terminal and run nvcc --version

Linux (Ubuntu/Debian):

# Add NVIDIA package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install CUDA Toolkit
sudo apt install cuda-toolkit-12-4

# Add to PATH (add to ~/.bashrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Verify
nvcc --version

Note: Runtime environments (runtimes running pre-built binaries) only need the CUDA Driver, not the full Toolkit.

1) Build from Rust source (Recommended)

git clone https://github.com/akiojin/llmlb.git
cd llmlb
make quality-checks   # fmt/clippy/test/markdownlint
cargo build -p llmlb --release

Artifact: target/release/llmlb

2) Run with Docker

docker build -t llmlb:latest .
docker run --rm -p 32768:32768 --gpus all \
  -e OPENAI_API_KEY=... \
  llmlb:latest

If not using GPU, remove --gpus all or set CUDA_VISIBLE_DEVICES="".

3) C++ Runtime Build

See https://github.com/akiojin/xLLM for runtime build/run details.

Requirements

  • LLM Load Balancer: Rust toolchain (stable)
  • Runtime: CMake + a C++ toolchain, and a supported GPU (NVIDIA / AMD / Apple Silicon)

Usage

Basic Usage

  1. Start LLM Load Balancer

    ./target/release/llmlb
    # Default: http://0.0.0.0:32768
  2. Start Runtimes on Multiple Machines

    # Machine 1 (replace with your actual API key; scope: runtime)
    LLMLB_URL=http://lb:32768 \
    LLM_RUNTIME_API_KEY=sk_your_runtime_register_key \
    <runtime start command; see the xLLM repository>

    # Machine 2 (replace with your actual API key; scope: runtime)
    LLMLB_URL=http://lb:32768 \
    LLM_RUNTIME_API_KEY=sk_your_runtime_register_key \
    <runtime start command; see the xLLM repository>
  3. Send Inference Requests to LLM Load Balancer (OpenAI-compatible, Responses API recommended)

    curl http://lb:32768/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk_your_api_key" \
      -d '{
        "model": "gpt-oss-20b",
        "input": "Hello!"
      }'

    Image generation example

    curl http://lb:32768/v1/images/generations \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk_your_api_key" \
      -d '{
        "model": "stable-diffusion/v1-5-pruned-emaonly.safetensors",
        "prompt": "A white cat sitting on a windowsill",
        "size": "512x512",
        "n": 1,
        "response_format": "b64_json"
      }'

    Image understanding example

    curl http://lb:32768/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk_your_api_key" \
      -d '{
        "model": "llava-v1.5-7b",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What is in this image?"},
              {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="}}
            ]
          }
        ],
        "max_tokens": 300
      }'
  4. List Registered Endpoints

    curl http://lb:32768/api/endpoints \
      # JWT (admin/viewer):
      -H "Authorization: Bearer <jwt>"
      # or API key (permissions: endpoints.read):
      # -H "X-API-Key: sk_your_endpoints_read_key"

Environment Variables

LLM Load Balancer (llmlb)

| Variable | Default | Description | Legacy / Notes |
| --- | --- | --- | --- |
| LLMLB_HOST | 0.0.0.0 | Bind address | - |
| LLMLB_PORT | 32768 | Listen port | - |
| LLMLB_DATABASE_URL | sqlite:~/.llmlb/lb.db | Database URL | DATABASE_URL |
| LLMLB_DATA_DIR | ~/.llmlb | Base directory for logs, request history, and self-update cache/payload | - |
| LLMLB_JWT_SECRET | (auto-generated) | JWT signing secret | JWT_SECRET |
| LLMLB_ADMIN_USERNAME | admin | Initial admin username | ADMIN_USERNAME |
| LLMLB_ADMIN_PASSWORD | (required, first run) | Initial admin password | ADMIN_PASSWORD |
| LLMLB_LOG_LEVEL | info | Log level (EnvFilter) | LLM_LOG_LEVEL, RUST_LOG |
| LLMLB_LOG_DIR | ~/.llmlb/logs | Log directory | LLM_LOG_DIR (deprecated) |
| LLMLB_LOG_RETENTION_DAYS | 7 | Log retention days | LLM_LOG_RETENTION_DAYS |
| LLMLB_HEALTH_CHECK_INTERVAL | 30 | Endpoint health check interval (seconds) | HEALTH_CHECK_INTERVAL |
| LLMLB_LOAD_BALANCER_MODE | auto | Load balancer mode (auto / metrics) | LOAD_BALANCER_MODE |
| LLMLB_QUEUE_MAX | 100 | Admission queue limit | QUEUE_MAX |
| LLMLB_QUEUE_TIMEOUT_SECS | 60 | Admission queue timeout (seconds) | QUEUE_TIMEOUT_SECS |
| LLMLB_REQUEST_HISTORY_RETENTION_DAYS | 7 | Request history retention days | REQUEST_HISTORY_RETENTION_DAYS |
| LLMLB_REQUEST_HISTORY_CLEANUP_INTERVAL_SECS | 3600 | Request history cleanup interval (seconds) | REQUEST_HISTORY_CLEANUP_INTERVAL_SECS |
| LLMLB_DEFAULT_EMBEDDING_MODEL | nomic-embed-text-v1.5 | Default embedding model | LLM_DEFAULT_EMBEDDING_MODEL |
| LLM_DEFAULT_EMBEDDING_MODEL | nomic-embed-text-v1.5 | Default embedding model | deprecated (use LLMLB_DEFAULT_EMBEDDING_MODEL) |
| REQUEST_HISTORY_RETENTION_DAYS | 7 | Request history retention days | deprecated (use LLMLB_REQUEST_HISTORY_RETENTION_DAYS) |
| REQUEST_HISTORY_CLEANUP_INTERVAL_SECS | 3600 | Request history cleanup interval (seconds) | deprecated (use LLMLB_REQUEST_HISTORY_CLEANUP_INTERVAL_SECS) |

Cloud / external services:

| Variable | Default | Description | Notes |
| --- | --- | --- | --- |
| OPENAI_API_KEY | - | API key for openai: models | required |
| OPENAI_BASE_URL | https://api.openai.com | Override OpenAI base URL | optional |
| GOOGLE_API_KEY | - | API key for google: models | required |
| GOOGLE_API_BASE_URL | https://generativelanguage.googleapis.com/v1beta | Override Google base URL | optional |
| ANTHROPIC_API_KEY | - | API key for anthropic: models | required |
| ANTHROPIC_API_BASE_URL | https://api.anthropic.com | Override Anthropic base URL | optional |
| HF_TOKEN | - | Hugging Face token for model pulls | optional |
| LLMLB_API_KEY | - | API key used by e2e tests/clients | client/test use |

Runtime (llm-runtime)

| Variable | Default | Description | Legacy / Notes |
| --- | --- | --- | --- |
| LLMLB_URL | http://127.0.0.1:32768 | Load balancer URL to register with | - |
| LLM_RUNTIME_API_KEY | - | API key for runtime registration / model registry download | scope: runtime |
| LLM_RUNTIME_PORT | 32769 | Runtime listen port | - |
| LLM_RUNTIME_MODELS_DIR | ~/.llmlb/models | Model storage directory | LLM_MODELS_DIR |
| LLM_RUNTIME_ORIGIN_ALLOWLIST | huggingface.co/*,cdn-lfs.huggingface.co/* | Allowlist for direct origin downloads (comma-separated) | LLM_ORIGIN_ALLOWLIST |
| LLM_RUNTIME_BIND_ADDRESS | 0.0.0.0 | Bind address | LLM_BIND_ADDRESS |
| LLM_RUNTIME_IP | auto-detected | Runtime IP reported to load balancer | - |
| LLM_RUNTIME_HEARTBEAT_SECS | 10 | Heartbeat interval (seconds) | LLM_HEARTBEAT_SECS |
| LLM_RUNTIME_LOG_LEVEL | info | Log level | LLM_LOG_LEVEL, LOG_LEVEL |
| LLM_RUNTIME_LOG_DIR | ~/.llmlb/logs | Log directory | LLM_LOG_DIR |
| LLM_RUNTIME_LOG_RETENTION_DAYS | 7 | Log retention days | LLM_LOG_RETENTION_DAYS |
| LLM_RUNTIME_CONFIG | ~/.llmlb/config.json | Path to runtime config file | - |
| LLM_MODEL_IDLE_TIMEOUT | unset | Seconds before unloading idle models | enabled when set |
| LLM_MAX_LOADED_MODELS | unset | Cap on simultaneously loaded models | enabled when set |
| LLM_MAX_MEMORY_BYTES | unset | Max memory for loaded models | enabled when set |

Backward compatibility: Legacy names are read for fallback but are deprecated—prefer the new names above.

Note: Engine plugins were removed in favor of built-in managers. See <https://github.com/akiojin/xLLM/blob/main/docs/migrations/plugin-to-manager.md>.

Troubleshooting

GPU not found at startup

  • Check: nvidia-smi or CUDA_VISIBLE_DEVICES
  • To run without a GPU, set LLM_ALLOW_NO_GPU=true on the runtime side (disabled by default)
  • If it still fails, check for NVML library presence

Cloud models return 401/400

  • Check if OPENAI_API_KEY / GOOGLE_API_KEY / ANTHROPIC_API_KEY are set on the load balancer side
  • If *_key_present is false in Dashboard /api/dashboard/stats, it's not set
  • Models without prefixes are routed locally, so do not add a prefix if you don't have cloud keys

Port conflict

  • LLM Load Balancer: Change LLMLB_PORT (e.g., LLMLB_PORT=18080)
  • Runtime: Change LLM_RUNTIME_PORT or use --port

SQLite file creation failed

  • Check write permissions for the directory in LLMLB_DATABASE_URL path
  • On Windows, check if the path contains spaces

Dashboard does not appear

  • Clear browser cache
  • Try cargo clean -> cargo run to check if bundled static files are broken
  • Check static delivery settings for /dashboard/* if using a reverse proxy

OpenAI compatible API returns 503 / Model not registered

  • Returns 503 if there are no online endpoints. Wait for endpoint startup/model load or check status at /api/dashboard/endpoints (JWT) or /api/endpoints (JWT/API key).
  • If the specified model does not exist locally, wait for the runtime to auto-pull it

Too many / too few logs

  • Control via LLMLB_LOG_LEVEL or RUST_LOG env var (e.g., LLMLB_LOG_LEVEL=info or RUST_LOG=llmlb=debug)
  • Runtime logs use spdlog. Structured logs can be configured via tracing_subscriber

Development

For detailed development guidelines, testing procedures, and contribution workflow, see CLAUDE.md.

# Full quality gate
make quality-checks

PoCs

  • gpt-oss (auto): make poc-gptoss
  • gpt-oss (macOS / Metal): make poc-gptoss-metal
  • gpt-oss (Linux / CUDA via GGUF, experimental): make poc-gptoss-cuda
    • Logs/workdir are created under tmp/poc-gptoss-cuda/ (lb/runtime logs, request JSON, etc.)

Notes:

  • gpt-oss-20b uses safetensors (index + shards + config/tokenizer) as the source of truth.
  • GPU is required. Supported backends: macOS (Metal) and Windows (CUDA). Linux/CUDA is experimental.

GitHub Issue-first Development

This project manages specifications via GitHub Issues (gwt-spec label). Use /gwt-issue-spec-ops to define requirements and task breakdowns on Issues. Execute tasks following a strict TDD cycle.

See CLAUDE.md for details.

Claude Code Worktree Hooks

This project uses Claude Code PreToolUse Hooks to enforce Worktree environment boundaries and prevent accidental operations that could disrupt the development workflow.

Features:

  • Git Branch Protection: Blocks git checkout, git switch, git worktree commands to prevent branch switching
  • Directory Navigation Control: Blocks cd commands that would move outside the Worktree boundary
  • Smart Allow Lists: Permits read-only operations like git branch --list
  • Fast Execution: Average response time < 50ms (target: < 100ms)

Installation & Configuration:

For detailed setup instructions, see CLAUDE.md.

Running Hook Tests:

# Run all Hook contract tests (13 test cases)
make test-hooks

# Or run manually with Bats
npx bats tests/hooks/test-block-git-branch-ops.bats tests/hooks/test-block-cd-command.bats

# Run performance benchmark
tests/hooks/benchmark-hooks.sh

Automated Testing:

Hook tests are automatically executed in CI/CD:

  • GitHub Actions: .github/workflows/test-hooks.yml (standalone)
  • Quality Checks: .github/workflows/quality-checks.yml (integrated)
  • Makefile: make quality-checks includes test-hooks target

Request History

LLM Load Balancer automatically logs all requests and responses for debugging, auditing, and analysis purposes.

Features

  • Complete Request/Response Logging: Captures full request bodies, response bodies, and metadata
  • Automatic Retention: Keeps history for 7 days with automatic cleanup
  • Web Dashboard: View, filter, and search request history through the web interface
  • Export Capabilities: Export history as CSV
  • Filtering Options: Filter by model, runtime, status, and time range

Accessing Request History

Via Web Dashboard

  1. Open the load balancer dashboard: http://localhost:32768/dashboard
  2. Navigate to the "Request History" section
  3. Use filters to narrow down specific requests
  4. Click on any request to view full details including request/response bodies

Via API

List Request History:

GET /api/dashboard/request-responses?page=1&per_page=50

Get Request Details:

GET /api/dashboard/request-responses/{id}

Export History:

GET /api/dashboard/request-responses/export
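
For example, listing the most recent page with a JWT session (these routes are JWT-only, as noted in the API specification below):

curl "http://localhost:32768/api/dashboard/request-responses?page=1&per_page=50" \
  -H "Authorization: Bearer <jwt>"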

Storage

Request history is stored in SQLite at:

  • Linux/macOS: ~/.llmlb/lb.db
  • Windows: %USERPROFILE%\.llmlb\lb.db

Legacy request_history.json files (if present) are automatically imported on startup and renamed to .migrated.

API Specification

LLM Load Balancer API

Authentication Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| POST | /api/auth/login | User authentication, JWT issuance (sets HttpOnly cookie) | None |
| POST | /api/auth/logout | Logout | JWT (HttpOnly cookie or Authorization header) |
| GET | /api/auth/me | Get authenticated user info | JWT (HttpOnly cookie or Authorization header) |

Note: When using the JWT cookie for mutating dashboard requests, include the CSRF token via X-CSRF-Token header (token is provided in llmlb_csrf cookie). Origin/Referer must match the dashboard origin.
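
A minimal login sketch (the JSON field names are assumptions; the cookie jar is reused for subsequent calls):

curl -X POST http://localhost:32768/api/auth/login \
  -H "Content-Type: application/json" \
  -c cookies.txt \
  -d '{"username": "admin", "password": "your_admin_password"}'

# Reuse the HttpOnly cookie for read-only calls
curl http://localhost:32768/api/auth/me -b cookies.txt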

Roles & API Key Permissions

User roles (JWT):

| Role | Capabilities |
| --- | --- |
| admin | Full access to /api management APIs and all dashboard features |
| viewer | Read-only access to dashboard and endpoint read APIs; cannot access admin management APIs |

API key permissions:

| Permission | Grants |
| --- | --- |
| openai.inference | OpenAI-compatible inference endpoints (POST /v1/* except GET /v1/models*) |
| openai.models.read | Model discovery (GET /v1/models*) |
| endpoints.read | Read-only endpoint APIs (GET /api/endpoints*) |
| endpoints.manage | Endpoint mutations (POST/PUT/DELETE /api/endpoints*, POST /api/endpoints/:id/test, POST /api/endpoints/:id/sync, POST /api/endpoints/:id/download) |
| users.manage | User management (/api/users*) |
| invitations.manage | Invitation management (/api/invitations*) |
| models.manage | Model register/delete (POST /api/models/register, DELETE /api/models/*) |
| registry.read | Model registry and lists (GET /api/models/registry/*, GET /api/models, GET /api/models/hub) |
| logs.read | Endpoint log proxy (GET /api/endpoints/:id/logs) |
| metrics.read | Metrics export (GET /api/metrics/cloud) |

Debug builds accept sk_debug, sk_debug_runtime, sk_debug_api, sk_debug_admin (see docs/authentication.md).

Note: /api/dashboard/* is JWT-only (API keys are rejected). POST /api/me/api-keys permission rules by role:

  • admin: must provide a non-empty permissions array.
  • viewer: must not provide permissions; server assigns fixed OpenAI permissions (openai.inference, openai.models.read).
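
As a sketch of the admin case (the name and permissions field names are assumptions based on the rules above):

curl -X POST http://localhost:32768/api/me/api-keys \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <jwt>" \
  -d '{"name": "metrics-reader", "permissions": ["metrics.read", "endpoints.read"]}'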

User Management Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/users | List users | JWT+Admin or API key (users.manage) |
| POST | /api/users | Create user | JWT+Admin or API key (users.manage) |
| PUT | /api/users/:id | Update user | JWT+Admin or API key (users.manage) |
| DELETE | /api/users/:id | Delete user | JWT+Admin or API key (users.manage) |

API Key Management Endpoints (Self-service)

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/me/api-keys | List own API keys | JWT |
| POST | /api/me/api-keys | Create own API key (admin: explicit permissions, viewer: fixed OpenAI permissions) | JWT |
| PUT | /api/me/api-keys/:id | Update own API key | JWT |
| DELETE | /api/me/api-keys/:id | Delete own API key | JWT |

Invitation Management Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/invitations | List invitations | JWT+Admin or API key (invitations.manage) |
| POST | /api/invitations | Create invitation | JWT+Admin or API key (invitations.manage) |
| DELETE | /api/invitations/:id | Revoke invitation | JWT+Admin or API key (invitations.manage) |

Endpoint Management Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/endpoints | List endpoints | JWT (admin/viewer) or API key (endpoints.read) |
| GET | /api/endpoints/:id | Get endpoint details | JWT (admin/viewer) or API key (endpoints.read) |
| GET | /api/endpoints/:id/models | List endpoint models | JWT (admin/viewer) or API key (endpoints.read) |
| GET | /api/endpoints/:id/models/:model/info | Get endpoint model info | JWT (admin/viewer) or API key (endpoints.read) |
| GET | /api/endpoints/:id/download/progress | Download progress | JWT (admin/viewer) or API key (endpoints.read) |
| POST | /api/endpoints | Register endpoint | JWT+Admin or API key (endpoints.manage) |
| PUT | /api/endpoints/:id | Update endpoint | JWT+Admin or API key (endpoints.manage) |
| DELETE | /api/endpoints/:id | Delete endpoint | JWT+Admin or API key (endpoints.manage) |
| POST | /api/endpoints/:id/test | Connection test | JWT+Admin or API key (endpoints.manage) |
| POST | /api/endpoints/:id/sync | Sync models | JWT+Admin or API key (endpoints.manage) |
| POST | /api/endpoints/:id/download | Download model | JWT+Admin or API key (endpoints.manage) |

OpenAI-Compatible Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| POST | /v1/chat/completions | Chat completions API | API key (openai.inference) |
| POST | /v1/completions | Text completions API | API key (openai.inference) |
| POST | /v1/embeddings | Embeddings API | API key (openai.inference) |
| POST | /v1/responses | Responses API | API key (openai.inference) |
| POST | /v1/audio/transcriptions | Audio transcriptions API | API key (openai.inference) |
| POST | /v1/audio/speech | Audio speech API | API key (openai.inference) |
| POST | /v1/images/generations | Image generations API | API key (openai.inference) |
| POST | /v1/images/edits | Image edits API | API key (openai.inference) |
| POST | /v1/images/variations | Image variations API | API key (openai.inference) |
| GET | /v1/models | List models (Azure-style capabilities) | API key (openai.models.read) |
| GET | /v1/models/:model_id | Get specific model info | API key (openai.models.read) |
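
For example, an embeddings request (the model name is illustrative; the default embedding model is configurable via LLMLB_DEFAULT_EMBEDDING_MODEL):

curl http://localhost:32768/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "LLM Load Balancer"}'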

Model Management Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/models | List registered models | JWT+Admin or API key (registry.read) |
| GET | /api/models/hub | List supported models + status | JWT+Admin or API key (registry.read) |
| POST | /api/models/register | Register model (HF) | JWT+Admin or API key (models.manage) |
| DELETE | /api/models/*model_name | Delete model | JWT+Admin or API key (models.manage) |
| GET | /api/models/registry/:model_name/manifest.json | Get model manifest (file list) | API key (registry.read) |

Dashboard Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/dashboard/endpoints | Endpoint info list | JWT only |
| GET | /api/dashboard/models | Dashboard models list | JWT only |
| GET | /api/dashboard/stats | System statistics | JWT only |
| GET | /api/dashboard/request-history | Request history (legacy) | JWT only |
| GET | /api/dashboard/overview | Dashboard overview | JWT only |
| GET | /api/dashboard/metrics/:runtime_id | Endpoint metrics history | JWT only |
| GET | /api/dashboard/request-responses | Request/response list | JWT only |
| GET | /api/dashboard/request-responses/:id | Request/response details | JWT only |
| GET | /api/dashboard/request-responses/export | Export request/responses | JWT only |
| GET | /api/dashboard/stats/tokens | Token stats | JWT only |
| GET | /api/dashboard/stats/tokens/daily | Daily token stats | JWT only |
| GET | /api/dashboard/stats/tokens/monthly | Monthly token stats | JWT only |
| GET | /api/dashboard/logs/lb | Load balancer logs | JWT only |

Log & Metrics Endpoints

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /api/endpoints/:id/logs | Endpoint logs proxy | JWT+Admin or API key (logs.read) |
| GET | /api/metrics/cloud | Prometheus metrics export | JWT+Admin or API key (metrics.read) |

Playground Proxy

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| POST | /api/endpoints/:id/chat/completions | Proxy to endpoint for dashboard playground | JWT only |

Notes:

  • The proxy above is for endpoint-specific playground (#playground/:endpointId).
  • LB Playground (#lb-playground) uses the standard OpenAI-compatible routes (/v1/models, /v1/chat/completions) with API key auth.

Static Files & Metrics

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | /dashboard | Dashboard UI | None |
| GET | /dashboard/*path | Dashboard static files | None |
| GET | /api/metrics/cloud | Prometheus metrics export | JWT+Admin or API key (metrics.read) |

Runtime API (C++)

OpenAI-Compatible Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Embeddings generation |

Runtime Management Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /startup | Startup status check |
| GET | /metrics | Metrics (JSON format) |
| GET | /metrics/prom | Prometheus metrics |
| GET | /api/logs?tail=200 | Tail runtime logs (JSON) |
| GET | /log/level | Get current log level |
| POST | /log/level | Change log level |
| GET | /internal-error | Intentional error (debug) |

Request/Response Examples

POST /api/endpoints

Register an endpoint.

Headers:

  • JWT (admin): Authorization: Bearer <jwt>
  • or API key (permissions: endpoints.manage): X-API-Key: <api_key>

Request:

{
  "name": "my-vllm",
  "base_url": "http://127.0.0.1:8000",
  "api_key": "sk-optional"
}

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "my-vllm",
  "base_url": "http://127.0.0.1:8000",
  "status": "pending",
  "created_at": "2026-02-09T00:00:00Z"
}

GET /v1/models

List available models with Azure OpenAI-style capabilities.

Response:

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/llama-3.1-8b",
      "object": "model",
      "created": 0,
      "owned_by": "lb",
      "capabilities": {
        "chat_completion": true,
        "completion": true,
        "embeddings": false,
        "fine_tune": false,
        "inference": true,
        "text_to_speech": false,
        "speech_to_text": false,
        "image_generation": false
      },
      "ready": true
    }
  ]
}

Note: capabilities uses Azure OpenAI-style boolean object format. ready is a load balancer extension derived from runtime sync state.

POST /v1/responses

Responses API (recommended).

Request:

{
  "model": "gpt-oss-20b",
  "input": "Hello!"
}

Response:

{
  "id": "resp_123",
  "object": "response",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "Hello! How can I help you?" }
      ]
    }
  ]
}

Compatibility: /v1/chat/completions remains available for legacy clients. Important: LLM Load Balancer only supports OpenAI-compatible response format.

License

MIT License

Contributing

Issues and Pull Requests are welcome.

For detailed development guidelines, see CLAUDE.md.

Cloud model prefixes (OpenAI-compatible API)

  • Supported prefixes: openai:, google:, anthropic: (alias ahtnorpic:)
  • Usage: set model to e.g. openai:gpt-4o, google:gemini-1.5-pro, anthropic:claude-3-opus
  • Environment variables:
    • OPENAI_API_KEY (required), OPENAI_BASE_URL (optional, default https://api.openai.com)
    • GOOGLE_API_KEY (required), GOOGLE_API_BASE_URL (optional, default https://generativelanguage.googleapis.com/v1beta)
    • ANTHROPIC_API_KEY (required), ANTHROPIC_API_BASE_URL (optional, default https://api.anthropic.com)
  • Behavior: prefix is stripped before forwarding; responses remain OpenAI-compatible. Streaming is passthrough as SSE.
  • Metrics: /api/metrics/cloud exports Prometheus text with per-provider counters (cloud_requests_total{provider,status}) and latency histogram (cloud_request_latency_seconds{provider}).
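
For example, routing the same Responses call to OpenAI simply by prefixing the model name (assumes OPENAI_API_KEY is set on the load balancer):

curl http://lb:32768/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{"model": "openai:gpt-4o", "input": "Hello!"}'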
