Nexus

One API endpoint. Any backend. Zero configuration.

Nexus is a distributed LLM model serving orchestrator that unifies heterogeneous inference backends behind a single, intelligent API gateway.

Features

  • 🔍 Auto-Discovery: Automatically finds LLM backends on your network via mDNS
  • 🎯 Intelligent Routing: Routes requests based on model capabilities and load
  • 🔄 Transparent Failover: Automatically retries with fallback backends
  • 🔌 OpenAI-Compatible: Works with any OpenAI API client
  • ⚡ Zero Config: Just run it - works out of the box with Ollama
  • 📊 Structured Logging: Queryable JSON logs for every request with correlation IDs (quickstart)

Supported Backends

| Backend          | Status       | Notes                   |
|------------------|--------------|-------------------------|
| Ollama           | ✅ Supported | Auto-discovery via mDNS |
| LM Studio        | ✅ Supported | OpenAI-compatible API   |
| vLLM             | ✅ Supported | Static configuration    |
| llama.cpp server | ✅ Supported | Static configuration    |
| exo              | ✅ Supported | Auto-discovery via mDNS |
| OpenAI           | ✅ Supported | Cloud fallback          |
| LocalAI          | 🔜 Planned   |                         |

Quick Start

From Source

# Install
cargo install --path .

# Generate a configuration file
nexus config init

# Run with auto-discovery
nexus serve

# Or with a custom config file
nexus serve --config nexus.toml

Docker

# Run with default settings
docker run -d -p 3000:3000 leocamello/nexus

# Run with custom config
docker run -d -p 3000:3000 \
  -v $(pwd)/nexus.toml:/home/nexus/nexus.toml \
  leocamello/nexus serve --config nexus.toml

# Run with host network (for mDNS discovery)
docker run -d --network host leocamello/nexus

From GitHub Releases

Download pre-built binaries from Releases.

CLI Commands

# Start the server
nexus serve [--config FILE] [--port PORT] [--host HOST]

# List backends
nexus backends list [--json] [--status healthy|unhealthy|unknown]

# Add a backend manually (auto-detects type)
nexus backends add http://localhost:11434 [--name NAME] [--type ollama|vllm|llamacpp]

# Remove a backend
nexus backends remove <ID>

# List available models
nexus models [--json] [--backend ID]

# Show system health
nexus health [--json]

# Generate config file
nexus config init [--output FILE] [--force] [--minimal]

# Generate shell completions
nexus completions bash > ~/.bash_completion.d/nexus
nexus completions zsh > ~/.zsh/completions/_nexus
nexus completions fish > ~/.config/fish/completions/nexus.fish

Environment Variables

| Variable           | Description                             | Default    |
|--------------------|-----------------------------------------|------------|
| NEXUS_CONFIG       | Config file path                        | nexus.toml |
| NEXUS_PORT         | Listen port                             | 8000       |
| NEXUS_HOST         | Listen address                          | 0.0.0.0    |
| NEXUS_LOG_LEVEL    | Log level (trace/debug/info/warn/error) | info       |
| NEXUS_LOG_FORMAT   | Log format (pretty/json)                | pretty     |
| NEXUS_DISCOVERY    | Enable mDNS discovery                   | true       |
| NEXUS_HEALTH_CHECK | Enable health checking                  | true       |

Precedence: CLI args > Environment variables > Config file > Defaults

API Usage

Once running, Nexus exposes an OpenAI-compatible API:

# Health check
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Chat completion (non-streaming)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Chat completion (streaming)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Web Dashboard

Nexus includes a web dashboard for real-time monitoring and observability. Access it at http://localhost:8000/ in your browser.

Features:

  • 📊 Real-time backend health monitoring with status indicators
  • 🗺️ Model availability matrix showing which models are available on which backends
  • 📝 Request history with last 100 requests, durations, and error details
  • 🔄 WebSocket-based live updates (with HTTP polling fallback)
  • 📱 Fully responsive - works on desktop, tablet, and mobile
  • 🌙 Dark mode support (system preference)
  • 🚀 Works without JavaScript (graceful degradation with auto-refresh)

The dashboard provides a visual overview of your Nexus cluster, making it easy to monitor backend health, track model availability, and debug request issues in real time.

With Claude Code / Continue.dev

Point your AI coding assistant to http://localhost:8000 as the API endpoint.

With OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
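
Streaming works through the same client; a minimal sketch, assuming the openai 1.x Python package and a backend that supports streamed responses for the chosen model:

stream = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()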

Observability

Nexus exposes metrics for monitoring and debugging:

# Prometheus metrics (for Grafana, Prometheus, etc.)
curl http://localhost:8000/metrics

# JSON stats (for dashboards and debugging)
curl http://localhost:8000/v1/stats | jq

Prometheus metrics include request counters, duration histograms, error rates, backend latency, token usage, and fleet state gauges. Configure your Prometheus scraper to target http://<nexus-host>:8000/metrics.

JSON stats provide an at-a-glance view with uptime, per-backend request counts, latency, and pending request depth.
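
For one-off scripts, both endpoints are plain HTTP and need no client library. A minimal sketch using only the Python standard library; the exact field names of the /v1/stats payload are not assumed here, so the response is simply pretty-printed:

import json
import urllib.request

BASE = "http://localhost:8000"

# Prometheus exposition format - plain text, one metric per line
with urllib.request.urlopen(f"{BASE}/metrics") as resp:
    print(resp.read().decode().splitlines()[:5])  # peek at the first few lines

# JSON stats - schema not assumed, just pretty-printed for inspection
with urllib.request.urlopen(f"{BASE}/v1/stats") as resp:
    print(json.dumps(json.load(resp), indent=2))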

Configuration

# nexus.toml

[server]
host = "0.0.0.0"
port = 8000

[discovery]
enabled = true

[[backends]]
name = "local-ollama"
url = "http://localhost:11434"
type = "ollama"
priority = 1

[[backends]]
name = "gpu-server"
url = "http://192.168.1.100:8000"
type = "vllm"
priority = 2
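
To show how this file reads programmatically, a small sketch using the standard-library tomllib (Python 3.11+; this script is not part of Nexus - the [[backends]] array simply parses as a list of dicts):

import tomllib

with open("nexus.toml", "rb") as f:
    cfg = tomllib.load(f)

# Each [[backends]] table becomes one dict in the list
for b in cfg.get("backends", []):
    print(f'{b["name"]}: {b["url"]} (type={b["type"]}, priority={b["priority"]})')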

Architecture

┌─────────────────────────────────────────────┐
│           Nexus Orchestrator                │
│  - Discovers backends via mDNS              │
│  - Tracks model capabilities                │
│  - Routes to best available backend         │
│  - OpenAI-compatible API                    │
└─────────────────────────────────────────────┘
        │           │           │
        ▼           ▼           ▼
   ┌────────┐  ┌────────┐  ┌────────┐
   │ Ollama │  │  vLLM  │  │  exo   │
   │  7B    │  │  70B   │  │  32B   │
   └────────┘  └────────┘  └────────┘
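
The selection logic itself lives in the Rust orchestrator; the Python sketch below only illustrates the routing idea pictured above (choose a healthy backend that advertises the requested model, preferring lower configured priority and then lower load) and is not the shipped implementation:

from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    models: set[str]
    healthy: bool
    priority: int      # mirrors the priority field in [[backends]]
    pending: int = 0   # in-flight requests, a stand-in for load

def pick_backend(backends: list[Backend], model: str) -> Backend | None:
    # Illustration only: healthy backends serving the model,
    # ordered by configured priority, then by current load
    candidates = [b for b in backends if b.healthy and model in b.models]
    return min(candidates, key=lambda b: (b.priority, b.pending), default=None)

fleet = [
    Backend("local-ollama", {"llama3:8b"}, healthy=True, priority=1),
    Backend("gpu-server", {"llama3:70b"}, healthy=True, priority=2),
]
print(pick_backend(fleet, "llama3:70b").name)  # -> gpu-server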

Development

# Build
cargo build

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- serve

# Check formatting
cargo fmt --check

# Lint
cargo clippy

License

Apache License 2.0 - see LICENSE for details.

Related Projects

  • exo - Distributed AI inference
  • LM Studio - Desktop app for local LLMs
  • Ollama - Easy local LLM serving
  • vLLM - High-throughput LLM serving
  • LiteLLM - Cloud LLM API router
