# RMLX
Rust ML runtime for Apple Silicon — zero-copy Metal GPU pipeline with RDMA distributed inference
E2E 32-layer: Prefill 1.1–1.6× · Decode ~2.1× vs MLX · 24.05T FP16 GEMM · TP=2 5.0× · EP 34×
Install · Quickstart · Performance · Features · Architecture · Roadmap · Docs
🇰🇷 Korean documentation: docs/README_ko.md
## 🧠 What is RMLX?
RMLX reimplements Apple's MLX Metal GPU pipeline entirely in Rust, built on objc2-metal / objc2 / block2 / objc2-foundation. The framework is organized into 7 crates spanning GPU compute, memory allocation, neural network layers, and RDMA-based distributed inference.
End-to-end on a 32-layer Qwen 7B-style model (M3 Ultra, f16, random weights): prefill 1.1–1.6× faster, decode ~2.1× faster (~44 tok/s vs 21 tok/s) than MLX compiled. Kernel-level: FP16 GEMM 24.05 TFLOPS (MLX parity), QMM Q4 17.43 TFLOPS (+28% vs MLX). Distributed: TP=2 5.0× faster decode than MLX JACCL, EP 34× vs MLX end-to-end. Phase 9 adds 7 model architectures and generate() API; Phase 10 delivers 20 kernel/infra optimizations.
Single Rust binary. Zero-copy unified memory. No Python runtime, no framework overhead.
## 📦 Install

Prerequisites: macOS 14+ on Apple Silicon (M1 or later). See full prerequisites.

```sh
git clone https://github.com/0xDaizz/RMLX.git
cd RMLX
cargo build --workspace
```
## 🚀 Quickstart

### Build & Test

```sh
cargo build --workspace   # Build all 7 crates
cargo test --workspace    # Run 1,298 tests
```

### Benchmark

```sh
cargo bench -p rmlx-nn --bench pipeline_bench
```

### Distributed RDMA (2-node)

```sh
# One-time install
cargo install --path crates/rmlx-cli

# Auto-detect TB5 topology, assign IPs, configure interfaces
rmlx config --hosts node1,node2 --auto-setup --output rmlx-hosts.json --verbose

# Launch distributed job
rmlx launch --backend rdma --hostfile rmlx-hosts.json -- ibv_devices
```
`--auto-setup` discovers Thunderbolt connections via `system_profiler`, assigns point-to-point IPs, and configures RDMA interfaces automatically.
## 📊 Performance

### ⚡ E2E Model Performance (PRIMARY) — 32-layer Qwen 7B-style, M3 Ultra, f16

All E2E numbers measured on a 32-layer model with random weights (Qwen 7B-style config), M3 Ultra, f16.
Prefill (32 layers):
| seq_len | RMLX compiled (μs) | MLX compiled (μs) | RMLX/MLX |
|---|---|---|---|
| 32 | 46,319 | 74,907 | 1.62× |
| 128 | 116,647 | 122,254 | 1.05× |
| 512 | 356,021 | 446,571 | 1.25× |
| 1024 | 673,126 | 762,791 | 1.13× |
RMLX Prefill TFLOPS:
| seq_len | TFLOPS |
|---|---|
| 32 | 9.49 |
| 128 | 15.08 |
| 512 | 19.76 |
| 1024 | 20.90 |
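The TFLOPS figures above follow from the standard dense-transformer prefill estimate of roughly 2 FLOPs per parameter per token. As a sanity check, here is that arithmetic in a minimal sketch; the ~6.87B effective parameter count is inferred from the table itself, not a figure RMLX publishes.

```rust
// Sanity-check sketch of the prefill TFLOPS table above.
// Assumes FLOPs ≈ 2 · params · tokens (dense transformer); the 6.87e9
// parameter count is inferred from the table, not an RMLX-published number.
fn prefill_tflops(params: f64, seq_len: f64, latency_us: f64) -> f64 {
    2.0 * params * seq_len / (latency_us * 1e-6) / 1e12
}

fn main() {
    let params = 6.87e9; // assumed effective parameter count
    // (seq_len, measured prefill latency in μs) pairs from the table
    for &(s, t) in &[(128.0, 116_647.0), (512.0, 356_021.0), (1024.0, 673_126.0)] {
        println!("seq={s:>5}: {:.2} TFLOPS", prefill_tflops(params, s, t));
    }
}
```

With these assumptions the computed values land within a few hundredths of the table's 15.08 / 19.76 / 20.90 TFLOPS, which is why the throughput climbs with sequence length: the same weights are amortized over more tokens.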
Decode (32 layers, M=1, kv=128 tokens):
| Metric | RMLX (est. from 60L) | MLX compiled | Ratio |
|---|---|---|---|
| Latency | ~22,500 μs (704 μs/layer × 32) | 47,253 μs | ~2.1× |
| tok/s | ~44 | 21.2 | ~2.1× |
Note: RMLX decode is estimated from 60-layer pipeline_bench (704 μs/layer × 32 = 22,528 μs). A direct 32-layer E2E decode bench should be added for exact comparison.
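The estimate in the note is straightforward arithmetic; this small sketch reproduces it from the figures quoted above:

```rust
// Reproduces the decode estimate above: 704 μs/layer from the 60-layer
// pipeline_bench, scaled to 32 layers, compared against MLX's measured
// 47,253 μs decode step.
fn main() {
    let total_us = 704.0 * 32.0;        // 22,528 μs estimated RMLX decode step
    let rmlx_tok_s = 1.0e6 / total_us;  // ≈ 44 tok/s
    let mlx_tok_s = 1.0e6 / 47_253.0;   // ≈ 21.2 tok/s
    println!(
        "RMLX ≈ {rmlx_tok_s:.1} tok/s, MLX ≈ {mlx_tok_s:.1} tok/s, ratio ≈ {:.1}×",
        rmlx_tok_s / mlx_tok_s
    );
}
```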
### ⚡ Kernel-Level Benchmarks

All kernel-level numbers measured on Apple Silicon (M3 Ultra), single transformer layer, Qwen 3.5 MoE A22B config, f16 unless noted.

#### Decode — 110 ms → 703 μs (156× internal speedup)
| Stage | Latency | vs Naive | Key Technique |
|---|---|---|---|
| Naive (per-op sync) | 110 ms | 1× | 65 CBs, GPU idle between dispatches |
| ExecGraph | 2.8 ms | 39× | CB batching 65 → 5 |
| 9-Dispatch + PSO Cache | 1,081 μs | 102× | Single-CB decode, kernel optimization |
| 7-Dispatch Fusion (60L) | 703 μs | 156× | fused_rms_gemv + fused_swiglu_down |
Bandwidth efficiency reaches 73.6% at the 60-layer pipeline floor.
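Decode is memory-bound, so a bandwidth-efficiency figure like the 73.6% above is derived as achieved bytes/s divided by the chip's peak bytes/s. A minimal sketch of that calculation; the byte count and bandwidth in `main` are illustrative placeholders, not RMLX's measured inputs:

```rust
// Memory-bound decode bookkeeping: efficiency = achieved bytes/s ÷ peak bytes/s.
// The numeric inputs in main are illustrative placeholders only.
fn bw_efficiency(bytes_streamed: f64, latency_us: f64, peak_gb_per_s: f64) -> f64 {
    (bytes_streamed / (latency_us * 1e-6)) / (peak_gb_per_s * 1e9)
}

fn main() {
    // e.g. a layer streaming 420 MB of weights in 703 μs on an 819 GB/s part
    let eff = bw_efficiency(4.2e8, 703.0, 819.0);
    println!("efficiency ≈ {:.1}%", eff * 100.0);
}
```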
#### Prefill GEMM — 24.05 TFLOPS
| Metric | RMLX | MLX | Delta |
|---|---|---|---|
| FP16 GEMM (M=512) | 24.05T | 24T | Pipe parity |
| FP16 GEMM peak | 46.3T | ~23T | 2× MLX |
| Small-M dispatch (M=32–256) | 1.24–2.83× | 1× | RMLX faster |
| QMM Q4 (M=512) | 17.43T | 13.6T | +28% |
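For reference, GEMM throughput figures like these come from the 2·M·N·K FLOP count of an (M×K)·(K×N) matrix multiply. A sketch of that bookkeeping; the shape and latency in `main` are arbitrary examples, not the benchmark's actual dimensions:

```rust
// TFLOPS bookkeeping for a GEMM: C[M×N] = A[M×K] · B[K×N] costs 2·M·N·K FLOPs.
fn gemm_tflops(m: f64, n: f64, k: f64, latency_us: f64) -> f64 {
    2.0 * m * n * k / (latency_us * 1e-6) / 1e12
}

fn main() {
    // Arbitrary example shape and timing, not the benchmark's actual N/K.
    println!("{:.2} TFLOPS", gemm_tflops(512.0, 4096.0, 4096.0, 1_000.0));
}
```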
### ⚡ Distributed (2-node TB5 RDMA)

#### TP Decode — 5.0× vs MLX at TP=2
| Config | RMLX | MLX | Speedup |
|---|---|---|---|
| TP=2 Split-CB | 375.5 μs | 1,880 μs (JACCL) | 5.0× |
| Per-op → Split-CB | 18,193 μs → 375.5 μs | — | 48× (architecture) |
| RDMA allreduce (1×) | 14.6 μs | 87 μs (JACCL 2×) | 6.0× |
Split-CB TP batches an entire layer into 2 command buffers with inter-CB allreduce, eliminating per-op GPU sync overhead.
#### TP Prefill — Split-CB
| Config | M=128 | M=512 |
|---|---|---|
| RMLX TP=1 | 5,033.7 μs | 12,723.2 μs |
| RMLX TP=2 Split-CB | 3,017.2 μs | 6,979.3 μs |
| MLX TP=1 | 3,890 μs | 11,795 μs |
| MLX TP=2 JACCL | 3,700 μs | 8,049 μs |
#### Expert Parallelism — 30–178× vs MLX
| Config | RMLX | MLX (mx.compile) | Speedup |
|---|---|---|---|
| Single expert FFN (M=1..512) | 42–54 μs | 1,338–9,609 μs | 30–178× |
| MoE grouped seq=4 (8 experts) | 359 μs | — | 68× vs pre-pooling |
| MoE grouped seq=32 | 665 μs | — | — |
| MoE grouped seq=128 | 1,658 μs | — | — |
Buffer-pooled `grouped_forward` reduces 32 allocations to 4 bulk allocations (14 ms → 359 μs, 39×). `commandBufferWithUnretainedReferences` removes CB retain/release overhead.
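The core idea behind that allocation win is a buffer pool: reuse previously allocated storage instead of allocating per expert per call. This is a minimal CPU-side sketch of the pattern only; RMLX's actual pool manages `MTLBuffer`s through its BFC allocator, not `Vec<u8>`:

```rust
// Minimal sketch of the buffer-pooling pattern behind grouped_forward
// (illustrative only; RMLX pools Metal buffers, not Vec<u8>).
struct BufferPool {
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: Vec::new() }
    }

    /// Hand out a zeroed buffer of `len` bytes, reusing a freed one if possible.
    fn acquire(&mut self, len: usize) -> Vec<u8> {
        match self.free.pop() {
            Some(mut b) => {
                b.clear();
                b.resize(len, 0); // no realloc if the old capacity suffices
                b
            }
            None => vec![0; len],
        }
    }

    /// Return a buffer to the pool for later reuse.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let buf = pool.acquire(1024);
    pool.release(buf);
    let reused = pool.acquire(512); // reuses the earlier allocation's capacity
    println!("reused buffer len = {}", reused.len());
}
```

The payoff is the same in either setting: allocation cost is paid once up front, and the steady-state hot path only recycles.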
#### EP-2 End-to-End (4 experts/rank + RDMA)
| Seq Length | RMLX EP-2 | MLX EP-2 (JACCL) | Speedup |
|---|---|---|---|
| seq=2 (grouped) | 200.8 μs | 6,895 μs | 34× |
| seq=32 (grouped) | 549.0 μs | — | — |
| seq=64 | 672 μs | — | — |
EP-2 grouped seq=2: was 212.0 μs baseline, now 200.8 μs (-5.3%). EP-2 grouped seq=32: was 578.8 μs, now 549.0 μs (-5.1%).
#### RDMA vs JACCL Transport
| Payload | RMLX RDMA | MLX JACCL | Speedup |
|---|---|---|---|
| 28 KB | 12 μs | 79 μs | 6.6× |
| 896 KB | 97 μs | 308 μs | 3.2× |
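Those payload/latency pairs convert directly into effective bandwidth by straight division (taking KB as 1024 bytes):

```rust
// Effective transport bandwidth implied by the table above.
fn gb_per_s(payload_bytes: f64, latency_us: f64) -> f64 {
    payload_bytes / (latency_us * 1e-6) / 1e9
}

fn main() {
    let payload = 896.0 * 1024.0; // 896 KB payload row
    println!("RMLX RDMA: {:.2} GB/s", gb_per_s(payload, 97.0));  // ≈ 9.46 GB/s
    println!("MLX JACCL: {:.2} GB/s", gb_per_s(payload, 308.0)); // ≈ 2.98 GB/s
}
```

The ratio of the two effective bandwidths reproduces the table's 3.2× speedup for the large payload.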
## ✨ Features

### RMLX vs MLX vs CUDA
| Feature | RMLX | MLX | CUDA |
|---|---|---|---|
| Unified memory (zero-copy) | ✅ | ✅ | ❌ |
| 7-dispatch fused decode | ✅ | ❌ | ❌ |
| Single-CB prefill pipeline | ✅ | ❌ | ❌ |
| Expert parallelism (MoE, 30–178×) | ✅ | ❌ | ⚠️ |
| Zero-copy RDMA | ✅ | ❌ | ❌ |
| Flash Attention 2 | ✅ | ✅ | ✅ |
| MLA (DeepSeek-V3) | ✅ | ❌ | ⚠️ |
| GGUF model loading | ✅ | ✅ | ✅ |
| Quantized inference (Q4/Q8) | ✅ | ✅ | ✅ |
| Single Rust binary | ✅ | ❌ | ❌ |
| Metal 4 support (macOS 26+) | ✅ | ❌ | ❌ |
## 🔧 Key Capabilities

### 32+ GPU Ops
- Flash Attention 2 Metal kernel (tiled online softmax, D up to 256)
- SIMD group MMA matmul, BM=8 GEMV with dynamic tile selection
- Batched SDPA decode with slab KV cache
- FP8 (E4M3/E5M2), AWQ/GPTQ INT4, K-quant (Q2K–Q6K)
- Fused kernels: SiLU-mul, RMSNorm+residual, GEMV+bias, GEMM+residual epilogue
- GEMM: MLX-architecture kernel (BK=16, 2 SG, serpentine MMA)
- QMM MMA Q4/Q8, QMV qdot pattern — no CPU fallback
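To make the fused RMSNorm+residual pattern concrete, here is a CPU reference of the math that such a kernel fuses into one pass; this is a sketch of the standard add-then-normalize semantics, not RMLX's actual Metal shader:

```rust
// CPU reference for the fused RMSNorm+residual pattern listed above.
// Assumed semantics (standard in LLaMA-style stacks): add the residual,
// then normalize by RMS and scale by the learned weight, all in one pass.
fn rmsnorm_residual(x: &mut [f32], residual: &[f32], weight: &[f32], eps: f32) {
    // Fuse step 1: residual add in place.
    for (xi, &r) in x.iter_mut().zip(residual) {
        *xi += r;
    }
    // Fuse step 2: compute mean square and the reciprocal RMS.
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (ms + eps).sqrt();
    // Fuse step 3: normalize and apply the per-channel weight.
    for (xi, &w) in x.iter_mut().zip(weight) {
        *xi *= inv * w;
    }
}

fn main() {
    let mut x = vec![3.0_f32, 0.0];
    rmsnorm_residual(&mut x, &[0.0, 4.0], &[1.0, 1.0], 0.0);
    println!("{x:?}");
}
```

Fusing these three loops into one kernel is what removes the intermediate global-memory round trips that separate dispatches would pay.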
### Infrastructure
- ExecGraph: command buffer batching (65 CB → 5)
- CachedDecode: pre-resolved PSOs, zero per-token allocation
- Metal: `objc2-metal 0.3` with ComputePass zero-cost abstraction, ChipTuning (M1–M4), DiskPipelineCache
- Allocator: zero-copy (posix_memalign + MTLBuffer), BFC, residency manager
- RDMA: ibverbs FFI, TB5 multi-port, ring/allreduce/allgather collectives
- Distributed: TP with Split-CB, expert parallelism (3-zone auto), tree allreduce, topology-aware CLI
### Neural Network Layers
- Models: Qwen2, Qwen2.5-MoE, LLaMA3, GLM-4, MiniMax-01, DeepSeek-V3, Mixtral, Kimi K2.5
- Attention: Multi-Head, GQA, MLA, Sliding Window
- KV cache: static, rotating, paged (vLLM-style), quantized, slab decode
- Quantization: QuantizedLinear, AWQ, GPTQ GPU kernels, K-Quant GPU (Q2K–Q6K)
- Formats: FP16, Q4/Q8 Affine, AWQ INT4, GPTQ INT4, K-Quant (Q2–Q6), FP8 (E4M3/E5M2)
- Loading: GGUF v2/v3 with tensor mapping
- Generation: `generate()` API with temperature, top-k, top-p sampling
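The temperature / top-k / top-p filtering behind a sampler like `generate()` can be sketched as follows; this illustrates the standard algorithm (softmax with temperature, keep the k most probable tokens, then the smallest nucleus reaching mass p, renormalize), not RMLX's actual implementation:

```rust
// Sketch of temperature + top-k + top-p (nucleus) logit filtering, the
// standard algorithm behind samplers like the generate() API above.
// Returns the surviving (token_index, renormalized_probability) pairs.
fn filter_logits(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    // Softmax with temperature (max-subtracted for numerical stability).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();

    // Top-k: sort descending by probability and keep the k best.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);

    // Top-p: keep the smallest prefix whose cumulative mass reaches p.
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (i, p) in probs {
        kept.push((i, p));
        cum += p;
        if cum >= top_p {
            break;
        }
    }

    // Renormalize the survivors so they form a distribution again.
    let mass: f32 = kept.iter().map(|&(_, p)| p).sum();
    kept.iter().map(|&(i, p)| (i, p / mass)).collect()
}

fn main() {
    let kept = filter_logits(&[2.0, 1.0, 0.0, -1.0], 1.0, 3, 0.9);
    println!("{kept:?}");
}
```

A sampler would then draw a token index from the returned distribution; greedy decoding is the degenerate case of picking the first entry.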
## 🏗️ Architecture

```mermaid
graph TD
  CLI[🖥️ rmlx-cli] --> NN[🧠 rmlx-nn]
  NN --> CORE[⚙️ rmlx-core]
  CORE --> METAL[🔩 rmlx-metal]
  CORE --> ALLOC[📦 rmlx-alloc]
  DIST[🌐 rmlx-distributed] --> CORE
  DIST --> RDMA[🔗 rmlx-rdma]
  ALLOC --> RDMA
  METAL -.-> ALLOC
```
| Crate | Role |
|---|---|
| rmlx-cli | Launch, config, topology discovery |
| rmlx-nn | Models, attention, MoE, KV cache, GGUF loader |
| rmlx-core | 32+ op modules, Array/DType, autodiff |
| rmlx-metal | Device, ExecGraph, ChipTuning, pipeline cache, ComputePass (objc2-metal), Metal 4 |
| rmlx-alloc | Zero-copy allocator, BFC, residency manager |
| rmlx-distributed | Expert parallelism, allreduce, topology, TP |
| rmlx-rdma | ibverbs FFI, TB5 multi-port, collectives |
## 🗺️ Roadmap
| Era | Phases | Key Result | Status |
|---|---|---|---|
| Foundation | Phase 0 → 7C | Core framework, Metal bindings, ExecGraph, RDMA infra | ✅ Complete |
| Decode Optimization | KO → Phase 11 | E2E ~44 tok/s (~2.1× MLX), 703 μs/layer (156×) | ✅ Complete |
| GEMM & Prefill | Phase A → D | 24.05 TFLOPS, MLX parity, single-CB prefill | ✅ Complete |
| Quantized Kernels | Phase F → J | QMM Q4 17.43T (+28% vs MLX), QMV near-parity | ✅ Complete |
| Distributed RDMA | EP-1 → 6, RDMA-7 | TP=2 5.0× vs MLX, Split-CB, 14 μs allreduce | ✅ Complete |
| EP + Buffer Pooling | Phase 8 | EP 30–178× vs MLX, MoE 68× improvement, EP-2 e2e 54× | ✅ Complete |
| Model Coverage + API | Phase 9 | 7 architectures, generate() API, AWQ/GPTQ/K-Quant GPU kernels | ✅ Complete |
| Final Optimization | Phase 10 | 20 kernel/infra opts (14 merged), decode 175.6 μs, EP-2 200.8 μs | ✅ Complete |
| Next | KO-2, KO-3, EP-7 | Multi-token decode, speculative decoding, EP scaling | 🔜 Planned |
See full roadmap and benchmark report for details.
## 📚 Docs
| Document | Description |
|---|---|
| Architecture Overview | System design and crate responsibilities |
| GPU Pipeline | Metal compute pipeline internals |
| Implementation Roadmap | Full phase-by-phase history |
| RMLX vs MLX vs CUDA | Detailed framework comparison |
| Getting Started | Prerequisites and setup guide |
## 🙏 Acknowledgments
- MLX by Apple — the Metal GPU compute framework that RMLX reimplements in Rust
- mlx-lm by Apple — LLM inference patterns and Metal kernel references
- vllm-mlx — distributed inference architecture and RDMA transport patterns
## 📄 License
MIT — see LICENSE.