2 releases (1 stable)
| Version | Released |
|---|---|
| 1.0.0 | Jan 21, 2026 |
| 0.0.1 | Jan 6, 2026 |
#52 in Caching
29,805 downloads per month
Used in 2 crates
180KB, 3.5K SLoC
Shepherd Model Gateway
High-performance model-routing gateway for large-scale LLM deployments. Centralizes worker lifecycle management, balances traffic across HTTP/gRPC/OpenAI-compatible backends, and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows.
Why SMG?
| 🚀 Maximize GPU Utilization | Cache-aware routing understands your inference engine's KV cache state—whether vLLM, SGLang, or TensorRT-LLM—to reuse prefixes and reduce redundant computation. |
| 🔌 One API, Any Backend | Route to self-hosted models (vLLM, SGLang, TensorRT-LLM) or cloud providers (OpenAI, Anthropic, Gemini, Bedrock, and more) through a single unified endpoint. |
| ⚡ Built for Speed | Native Rust with gRPC pipelines, sub-millisecond routing decisions, and zero-copy tokenization. Circuit breakers and automatic failover keep things running. |
| 🔒 Enterprise Control | Multi-tenant rate limiting with OIDC, WebAssembly plugins for custom logic, and a privacy boundary that keeps conversation history within your infrastructure. |
| 📊 Full Observability | 40+ Prometheus metrics, OpenTelemetry tracing, and structured JSON logs with request correlation—know exactly what's happening at every layer. |
API Coverage: OpenAI Chat/Completions/Embeddings, Responses API for agents, Anthropic Messages, and MCP tool execution.
Quick Start
Install — pick your preferred method:
# Docker
docker pull lightseekorg/smg:latest
# Python
pip install smg
# Rust
cargo install smg
Run — point SMG at your inference workers:
# Single worker
smg --worker-urls http://localhost:8000
# Multiple workers with cache-aware routing
smg --worker-urls http://gpu1:8000 http://gpu2:8000 --policy cache_aware
# With high availability mesh
smg --worker-urls http://gpu1:8000 --ha-mesh --seeds 10.0.0.2:30001,10.0.0.3:30001
Use — send requests to the gateway:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'
That's it. SMG is now load-balancing requests across your workers.
Supported Backends
| Self-Hosted | Cloud Providers |
|---|---|
| vLLM | OpenAI |
| SGLang | Anthropic |
| TensorRT-LLM | Google Gemini |
| Ollama | AWS Bedrock |
| Any OpenAI-compatible server | Azure OpenAI |
Features
| Feature | Description |
|---|---|
| 8 Routing Policies | cache_aware, round_robin, power_of_two, consistent_hashing, prefix_hash, manual, random, bucket |
| gRPC Pipeline | Native gRPC with streaming, reasoning extraction, and tool call parsing |
| MCP Integration | Connect external tool servers via Model Context Protocol |
| High Availability | Mesh networking with SWIM protocol for multi-node deployments |
| Chat History | Pluggable storage: PostgreSQL, Oracle, Redis, or in-memory |
| WASM Plugins | Extend with custom WebAssembly logic |
| Resilience | Circuit breakers, retries with backoff, rate limiting |
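As an illustration of one policy named in the table, here is a minimal sketch of the general power-of-two-choices technique that `power_of_two` refers to. It is not SMG's implementation: the `Worker` struct, its `in_flight` load counter, the `XorShift64` helper, and `pick_power_of_two` are all hypothetical names chosen for this example, and the sketch assumes load is measured as in-flight requests.

```rust
/// Hypothetical worker record; SMG's real worker type is not shown in this README.
#[derive(Debug, Clone)]
pub struct Worker {
    pub url: String,
    /// Assumed load signal: number of requests currently in flight.
    pub in_flight: usize,
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // A zero state would stay zero forever, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    /// Pseudo-random index in `0..bound`. `bound` must be non-zero.
    pub fn next_below(&mut self, bound: usize) -> usize {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        (x % bound as u64) as usize
    }
}

/// Power-of-two-choices: sample two distinct workers and keep the less loaded one.
/// Returns the index of the chosen worker, or `None` if there are no workers.
pub fn pick_power_of_two(workers: &[Worker], rng: &mut XorShift64) -> Option<usize> {
    match workers.len() {
        0 => None,
        1 => Some(0),
        n => {
            let a = rng.next_below(n);
            // Draw from the remaining n - 1 slots and skip over `a` so that b != a.
            let mut b = rng.next_below(n - 1);
            if b >= a {
                b += 1;
            }
            Some(if workers[b].in_flight < workers[a].in_flight { b } else { a })
        }
    }
}
```

Comparing only two sampled workers keeps each decision constant-time while still steering traffic away from the busiest workers, which is the usual motivation for this policy over plain `random`.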
Documentation
| Getting Started | Installation and first steps |
| Architecture | How SMG works |
| Configuration | CLI reference and options |
| API Reference | OpenAI-compatible endpoints |
| Kubernetes Setup | In-cluster discovery and production setup |
Contributing
We welcome contributions! See Contributing Guide for details.
lib.rs:
Radix tree implementations for prefix matching and cache-aware routing.
This module provides radix tree data structures that mirror SGLang's scheduler memory management patterns. Two implementations are available:
- StringTree: Character-based tree for HTTP router (text input)
- TokenTree: Token-based tree for gRPC router (pre-tokenized input)
Both implementations support:
- Multi-tenant prefix tracking with LRU eviction
- Concurrent access via DashMap and RwLock
- Efficient prefix matching with match counts
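To make the idea of per-tenant prefix matching with match counts concrete, here is a minimal single-threaded sketch. It is not this crate's API: `PrefixTree`, `insert`, `match_len`, and `best_tenant` are hypothetical names, tenants are assumed to be arbitrary string identifiers, and the sketch uses an uncompressed per-character trie with a plain `HashMap` instead of a compressed radix tree with DashMap and RwLock. LRU eviction is left out.

```rust
use std::collections::{HashMap, HashSet};

/// One node per character. A real radix tree would compress chains of
/// single-child nodes into one edge; this sketch skips that for brevity.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    /// Tenants that have inserted a string passing through this node.
    tenants: HashSet<String>,
}

/// Hypothetical stand-in for a character-based prefix tree.
#[derive(Default)]
pub struct PrefixTree {
    root: Node,
}

impl PrefixTree {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record that `tenant` holds `text`, tagging every node on the path.
    pub fn insert(&mut self, text: &str, tenant: &str) {
        let mut node = &mut self.root;
        for ch in text.chars() {
            node = node.children.entry(ch).or_default();
            node.tenants.insert(tenant.to_string());
        }
    }

    /// Number of leading characters of `text` that are stored for `tenant`.
    pub fn match_len(&self, text: &str, tenant: &str) -> usize {
        let mut node = &self.root;
        let mut matched = 0;
        for ch in text.chars() {
            match node.children.get(&ch) {
                Some(child) if child.tenants.contains(tenant) => {
                    node = child;
                    matched += 1;
                }
                _ => break,
            }
        }
        matched
    }

    /// Tenant with the longest stored prefix of `text`, with its match count.
    /// Returns `None` when no tenant matches even the first character.
    pub fn best_tenant<'a>(&self, text: &str, tenants: &[&'a str]) -> Option<(&'a str, usize)> {
        tenants
            .iter()
            .map(|t| (*t, self.match_len(text, t)))
            .filter(|(_, n)| *n > 0)
            .max_by_key(|(_, n)| *n)
    }
}
```

A cache-aware router could, under these assumptions, treat each worker as a tenant and send a request to whichever worker `best_tenant` reports, since a longer match count suggests more of the prompt prefix is already cached there.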
Dependencies
~6–8.5MB, ~74K SLoC