RMLX

Rust ML runtime for Apple Silicon — zero-copy Metal GPU pipeline with RDMA distributed inference

E2E 32-layer: Prefill 1.1–1.6× · Decode ~2.1× vs MLX · 24.05T FP16 GEMM · TP=2 5.0× · EP 34×

CI · Tests · License: MIT · Rust 1.80+ · macOS Apple Silicon

Install · Quickstart · Performance · Features · Architecture · Roadmap · Docs


🇰🇷 Korean documentation: docs/README_ko.md

🧠 What is RMLX?

RMLX reimplements Apple's MLX Metal GPU pipeline entirely in Rust, built on objc2-metal / objc2 / block2 / objc2-foundation. The framework is organized into 7 crates spanning GPU compute, memory allocation, neural network layers, and RDMA-based distributed inference.

End-to-end on a 32-layer Qwen 7B-style model (M3 Ultra, f16, random weights): prefill 1.1–1.6× faster, decode ~2.1× faster (~44 tok/s vs 21 tok/s) than MLX compiled. Kernel-level: FP16 GEMM 24.05 TFLOPS (MLX parity), QMM Q4 17.43 TFLOPS (+28% vs MLX). Distributed: TP=2 5.0× faster decode than MLX JACCL, EP 34× vs MLX end-to-end. Phase 9 adds 7 model architectures and generate() API; Phase 10 delivers 20 kernel/infra optimizations.

Single Rust binary. Zero-copy unified memory. No Python runtime, no framework overhead.
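The zero-copy path hinges on page-aligned host allocations that Metal can wrap in a no-copy MTLBuffer. A minimal sketch of just the alignment step in plain Rust, assuming Apple Silicon's 16 KiB page size (the actual rmlx-alloc API and the Metal wrapping are omitted):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Apple Silicon page size; no-copy MTLBuffer wrapping requires
/// page-aligned, page-multiple allocations.
const PAGE_SIZE: usize = 16 * 1024;

/// Allocate a zeroed, page-aligned region of at least `len` bytes.
/// Returns the pointer and the rounded-up length.
fn alloc_page_aligned(len: usize) -> (*mut u8, usize) {
    let rounded = len.div_ceil(PAGE_SIZE) * PAGE_SIZE;
    let layout = Layout::from_size_align(rounded, PAGE_SIZE).unwrap();
    // Safety: layout has non-zero size after rounding.
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    (ptr, rounded)
}

fn main() {
    let (ptr, len) = alloc_page_aligned(100_000);
    assert_eq!(ptr as usize % PAGE_SIZE, 0); // page-aligned
    assert_eq!(len, 114_688);                // 7 pages of 16 KiB
    println!("aligned={} len={}", ptr as usize % PAGE_SIZE == 0, len);
    let layout = Layout::from_size_align(len, PAGE_SIZE).unwrap();
    unsafe { dealloc(ptr, layout) };
}
```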

📦 Install

Prerequisites: macOS 14+ on Apple Silicon (M1 or later). See full prerequisites.

```sh
git clone https://github.com/0xDaizz/RMLX.git
cd RMLX
cargo build --workspace
```

🚀 Quickstart

Build & Test

```sh
cargo build --workspace           # Build all 7 crates
cargo test  --workspace           # Run 1,298 tests
```

Benchmark

```sh
cargo bench -p rmlx-nn --bench pipeline_bench
```

Distributed RDMA (2-node)

```sh
# One-time install
cargo install --path crates/rmlx-cli

# Auto-detect TB5 topology, assign IPs, configure interfaces
rmlx config --hosts node1,node2 --auto-setup --output rmlx-hosts.json --verbose

# Launch distributed job
rmlx launch --backend rdma --hostfile rmlx-hosts.json -- ibv_devices
```

--auto-setup discovers Thunderbolt connections via system_profiler, assigns point-to-point IPs, and configures RDMA interfaces automatically.
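The hostfile format is not shown in this README; a hypothetical rmlx-hosts.json for the two nodes above might look like the following (every field name here is illustrative, not the actual schema that --auto-setup emits):

```json
{
  "backend": "rdma",
  "hosts": [
    { "name": "node1", "rank": 0, "rdma_ip": "10.55.0.1" },
    { "name": "node2", "rank": 1, "rdma_ip": "10.55.0.2" }
  ]
}
```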

📊 Performance

⚡ E2E Model Performance (PRIMARY) — 32-layer Qwen 7B-style, M3 Ultra, f16

All E2E numbers measured on 32-layer model with random weights (Qwen 7B-style config), M3 Ultra, f16.

Prefill (32 layers):

| seq_len | RMLX compiled (μs) | MLX compiled (μs) | RMLX/MLX |
|---|---|---|---|
| 32 | 46,319 | 74,907 | 1.62× |
| 128 | 116,647 | 122,254 | 1.05× |
| 512 | 356,021 | 446,571 | 1.25× |
| 1024 | 673,126 | 762,791 | 1.13× |

RMLX Prefill TFLOPS:

| seq_len | TFLOPS |
|---|---|
| 32 | 9.49 |
| 128 | 15.08 |
| 512 | 19.76 |
| 1024 | 20.90 |

Decode (32 layers, M=1, kv=128 tokens):

| | RMLX (est. from 60L) | MLX compiled | Ratio |
|---|---|---|---|
| Latency | ~22,500 μs (704 μs/layer × 32) | 47,253 μs | ~2.1× |
| tok/s | ~44 | 21.2 | ~2.1× |

Note: RMLX decode is estimated from 60-layer pipeline_bench (704 μs/layer × 32 = 22,528 μs). A direct 32-layer E2E decode bench should be added for exact comparison.
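The estimate is straightforward arithmetic over the measured per-layer figure; as a sanity check:

```rust
/// Scale the measured per-layer decode latency to an N-layer model and
/// derive tokens/sec. (Sanity-check arithmetic for the table above.)
fn decode_estimate(per_layer_us: f64, layers: f64) -> (f64, f64) {
    let latency_us = per_layer_us * layers;
    (latency_us, 1.0e6 / latency_us) // (total latency, tok/s at batch 1)
}

fn main() {
    // 704 μs/layer measured on the 60-layer pipeline_bench, scaled to 32 layers.
    let (latency, tok_s) = decode_estimate(704.0, 32.0);
    assert_eq!(latency, 22_528.0);
    println!("latency={latency} tok/s={tok_s:.1} vs MLX={:.2}x", 47_253.0 / latency);
}
```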


⚡ Kernel-Level Benchmarks

All kernel-level numbers measured on Apple Silicon (M3 Ultra), single transformer layer, Qwen 3.5 MoE A22B config, f16 unless noted.

Decode — 110 ms → 703 μs (156× internal speedup)

| Stage | Latency | vs Naive | Key Technique |
|---|---|---|---|
| Naive (per-op sync) | 110 ms | 1× | 65 CBs, GPU idle between dispatches |
| ExecGraph | 2.8 ms | 39× | CB batching 65 → 5 |
| 9-Dispatch + PSO Cache | 1,081 μs | 102× | Single-CB decode, kernel optimization |
| 7-Dispatch Fusion (60L) | 703 μs | 156× | fused_rms_gemv + fused_swiglu_down |

Bandwidth efficiency is 73.6% at the 60-layer pipeline floor.
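The batching idea behind the ExecGraph row can be sketched without Metal at all: group dispatches into command buffers and commit only at explicit sync points instead of after every op. (A conceptual model, not the real ExecGraph API; the real implementation encodes objc2-metal compute commands.)

```rust
/// A GPU op in the sketch: either a kernel dispatch or a sync point
/// where results must be visible to the CPU.
enum Op {
    Dispatch(&'static str),
    Sync,
}

/// Group dispatches into command buffers, closing a buffer only at a
/// sync point. Fewer buffers means less GPU idle time between commits.
fn batch_into_command_buffers(ops: &[Op]) -> Vec<Vec<&'static str>> {
    let mut cbs = Vec::new();
    let mut current = Vec::new();
    for op in ops {
        match op {
            Op::Dispatch(name) => current.push(*name),
            Op::Sync => {
                if !current.is_empty() {
                    cbs.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if !current.is_empty() {
        cbs.push(current);
    }
    cbs
}

fn main() {
    // Per-op sync: one command buffer per dispatch (the naive pattern).
    let naive = [Op::Dispatch("rmsnorm"), Op::Sync, Op::Dispatch("qkv"), Op::Sync];
    assert_eq!(batch_into_command_buffers(&naive).len(), 2);

    // Batched: one sync at the end, so the whole layer shares one CB.
    let batched = [Op::Dispatch("rmsnorm"), Op::Dispatch("qkv"), Op::Dispatch("sdpa"), Op::Sync];
    assert_eq!(batch_into_command_buffers(&batched).len(), 1);
    println!("ok");
}
```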

Prefill GEMM — 24.05 TFLOPS

| Metric | RMLX | MLX | Delta |
|---|---|---|---|
| FP16 GEMM (M=512) | 24.05T | 24T | Parity |
| FP16 GEMM peak | 46.3T | ~23T | 2× MLX |
| Small-M dispatch (M=32–256) | | | 1.24–2.83× RMLX faster |
| QMM Q4 (M=512) | 17.43T | 13.6T | +28% |

⚡ Distributed (2-node TB5 RDMA)

TP Decode — 5.0× vs MLX at TP=2

| Config | RMLX | MLX | Speedup |
|---|---|---|---|
| TP=2 Split-CB | 375.5 μs | 1,880 μs (JACCL) | 5.0× |
| Per-op → Split-CB | 18,193 μs → 375.5 μs | | 48× (architecture) |
| RDMA allreduce (1×) | 14.6 μs | 87 μs (JACCL 2×) | 6.0× |

Split-CB TP batches an entire layer into 2 command buffers with inter-CB allreduce, eliminating per-op GPU sync overhead.
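The inter-CB allreduce can be illustrated with a toy 2-rank exchange; this sketch uses threads and channels in place of RDMA, and none of these names come from rmlx-distributed:

```rust
use std::sync::mpsc;
use std::thread;

/// Toy 2-rank allreduce: each rank sends its partial result to the peer
/// and sums locally, so both ranks end up with the full reduction. In
/// RMLX the exchange is an RDMA transfer between the two Split-CB
/// command buffers; here it is just a channel send.
fn allreduce_2rank(a: Vec<f32>, b: Vec<f32>) -> (Vec<f32>, Vec<f32>) {
    let (tx01, rx01) = mpsc::channel::<Vec<f32>>(); // rank 0 -> rank 1
    let (tx10, rx10) = mpsc::channel::<Vec<f32>>(); // rank 1 -> rank 0

    let rank0 = thread::spawn(move || {
        tx01.send(a.clone()).unwrap();
        let peer = rx10.recv().unwrap();
        a.iter().zip(&peer).map(|(x, y)| x + y).collect::<Vec<f32>>()
    });
    let rank1 = thread::spawn(move || {
        tx10.send(b.clone()).unwrap();
        let peer = rx01.recv().unwrap();
        b.iter().zip(&peer).map(|(x, y)| x + y).collect::<Vec<f32>>()
    });
    (rank0.join().unwrap(), rank1.join().unwrap())
}

fn main() {
    // Each rank holds a partial matmul output from its TP shard.
    let (r0, r1) = allreduce_2rank(vec![1.0, 2.0], vec![10.0, 20.0]);
    assert_eq!(r0, vec![11.0, 22.0]); // both ranks hold the full sum
    assert_eq!(r0, r1);
    println!("{r0:?}");
}
```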

TP Prefill — Split-CB

| Config | M=128 | M=512 |
|---|---|---|
| RMLX TP=1 | 5,033.7 μs | 12,723.2 μs |
| RMLX TP=2 Split-CB | 3,017.2 μs | 6,979.3 μs |
| MLX TP=1 | 3,890 μs | 11,795 μs |
| MLX TP=2 JACCL | 3,700 μs | 8,049 μs |

Expert Parallelism — 30–178× vs MLX

| Config | RMLX | MLX (mx.compile) | Speedup |
|---|---|---|---|
| Single expert FFN (M=1..512) | 42–54 μs | 1,338–9,609 μs | 30–178× |
| MoE grouped seq=4 (8 experts) | 359 μs | | 68× vs pre-pooling |
| MoE grouped seq=32 | 665 μs | | |
| MoE grouped seq=128 | 1,658 μs | | |

Buffer-pooled grouped_forward: 32 allocations → 4 bulk allocations (14 ms → 359 μs, 39×). commandBufferWithUnretainedReferences removes CB retain/release overhead.
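The pooling idea reduces to bump-suballocating many tensors out of one bulk slab; a plain-Rust sketch (illustrative, not the rmlx-alloc API):

```rust
/// One bulk slab that hands out aligned sub-ranges, so N logical expert
/// buffers cost one allocator round-trip instead of N.
struct Slab {
    buf: Vec<u8>,
    offset: usize,
}

impl Slab {
    fn new(capacity: usize) -> Self {
        Slab { buf: vec![0; capacity], offset: 0 }
    }

    /// Bump-allocate `len` bytes, aligned to 256 bytes (a typical GPU
    /// buffer-offset alignment). Returns (offset, len) into the slab.
    fn suballoc(&mut self, len: usize) -> Option<(usize, usize)> {
        let start = (self.offset + 255) & !255;
        if start + len > self.buf.len() {
            return None;
        }
        self.offset = start + len;
        Some((start, len))
    }
}

fn main() {
    // 8 experts x 4 tensors = 32 logical buffers from ONE bulk allocation.
    let mut slab = Slab::new(1 << 20);
    let ranges: Vec<_> = (0..32).map(|_| slab.suballoc(4096).unwrap()).collect();
    assert_eq!(ranges.len(), 32);
    assert!(ranges.iter().all(|(off, _)| off % 256 == 0));
    println!("suballocs={} used={}", ranges.len(), slab.offset);
}
```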

EP-2 End-to-End (4 experts/rank + RDMA)

| Seq Length | RMLX EP-2 | MLX EP-2 (JACCL) | Speedup |
|---|---|---|---|
| seq=2 (grouped) | 200.8 μs | 6,895 μs | 34× |
| seq=32 (grouped) | 549.0 μs | | |
| seq=64 | 672 μs | | |

EP-2 grouped seq=2: was 212.0 μs baseline, now 200.8 μs (-5.3%). EP-2 grouped seq=32: was 578.8 μs, now 549.0 μs (-5.1%).

RDMA vs JACCL Transport

| Payload | RMLX RDMA | MLX JACCL | Speedup |
|---|---|---|---|
| 28 KB | 12 μs | 79 μs | 6.6× |
| 896 KB | 97 μs | 308 μs | 3.2× |

✨ Features

RMLX vs MLX vs CUDA

Feature RMLX MLX CUDA
Unified memory (zero-copy)
7-dispatch fused decode
Single-CB prefill pipeline
Expert parallelism (MoE, 30–178×) ⚠️
Zero-copy RDMA
Flash Attention 2
MLA (DeepSeek-V3) ⚠️
GGUF model loading
Quantized inference (Q4/Q8)
Single Rust binary
Metal 4 support (macOS 26+)

🔧 Key Capabilities

32+ GPU Ops
  • Flash Attention 2 Metal kernel (tiled online softmax, D up to 256)
  • SIMD group MMA matmul, BM=8 GEMV with dynamic tile selection
  • Batched SDPA decode with slab KV cache
  • FP8 (E4M3/E5M2), AWQ/GPTQ INT4, K-quant (Q2K–Q6K)
  • Fused kernels: SiLU-mul, RMSNorm+residual, GEMV+bias, GEMM+residual epilogue
  • GEMM: MLX-architecture kernel (BK=16, 2 SG, serpentine MMA)
  • QMM MMA Q4/Q8, QMV qdot pattern — no CPU fallback
Infrastructure
  • ExecGraph: command buffer batching (65 CB → 5)
  • CachedDecode: pre-resolved PSOs, zero per-token allocation
  • Metal: objc2-metal 0.3 with ComputePass zero-cost abstraction, ChipTuning (M1–M4), DiskPipelineCache
  • Allocator: zero-copy (posix_memalign + MTLBuffer), BFC, residency manager
  • RDMA: ibverbs FFI, TB5 multi-port, ring/allreduce/allgather collectives
  • Distributed: TP with Split-CB, expert parallelism (3-zone auto), tree allreduce, topology-aware CLI
Neural Network Layers
  • Models: Qwen2, Qwen2.5-MoE, LLaMA3, GLM-4, MiniMax-01, DeepSeek-V3, Mixtral, Kimi K2.5
  • Attention: Multi-Head, GQA, MLA, Sliding Window
  • KV cache: static, rotating, paged (vLLM-style), quantized, slab decode
  • Quantization: QuantizedLinear, AWQ, GPTQ GPU kernels, K-Quant GPU (Q2K–Q6K)
  • Formats: FP16, Q4/Q8 Affine, AWQ INT4, GPTQ INT4, K-Quant (Q2–Q6), FP8 (E4M3/E5M2)
  • Loading: GGUF v2/v3 with tensor mapping
  • Generation: generate() API with temperature, top-k, top-p sampling
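The sampling options named above follow the standard temperature / top-k / top-p recipe; a self-contained sketch of the filtering step (a hypothetical helper, not the rmlx-nn generate() signature):

```rust
/// Standard temperature + top-k + top-p (nucleus) filtering over logits.
/// Returns the surviving token indices with renormalized probabilities,
/// ready to be sampled from. (Illustrative; not the rmlx-nn API.)
fn filter_logits(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    // Softmax with temperature (max-subtracted for numerical stability).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();

    // Top-k: keep only the k most probable tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);

    // Top-p: keep the smallest prefix whose cumulative mass reaches p.
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (i, p) in probs {
        kept.push((i, p));
        cum += p;
        if cum >= top_p {
            break;
        }
    }

    // Renormalize the survivors so they form a distribution again.
    let total: f32 = kept.iter().map(|&(_, p)| p).sum();
    kept.into_iter().map(|(i, p)| (i, p / total)).collect()
}

fn main() {
    let kept = filter_logits(&[2.0, 1.0, 0.1, -1.0], 1.0, 3, 0.9);
    assert_eq!(kept[0].0, 0); // the most probable token always survives
    let s: f32 = kept.iter().map(|&(_, p)| p).sum();
    assert!((s - 1.0).abs() < 1e-5); // renormalized mass sums to 1
    println!("kept={}", kept.len());
}
```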

🏗️ Architecture

```mermaid
graph TD
    CLI[🖥️ rmlx-cli] --> NN[🧠 rmlx-nn]
    NN --> CORE[⚙️ rmlx-core]
    CORE --> METAL[🔩 rmlx-metal]
    CORE --> ALLOC[📦 rmlx-alloc]
    DIST[🌐 rmlx-distributed] --> CORE
    DIST --> RDMA[🔗 rmlx-rdma]
    ALLOC --> RDMA
    METAL -.-> ALLOC
```
| Crate | Role |
|---|---|
| rmlx-cli | Launch, config, topology discovery |
| rmlx-nn | Models, attention, MoE, KV cache, GGUF loader |
| rmlx-core | 32+ op modules, Array/DType, autodiff |
| rmlx-metal | Device, ExecGraph, ChipTuning, pipeline cache, ComputePass (objc2-metal), Metal 4 |
| rmlx-alloc | Zero-copy allocator, BFC, residency manager |
| rmlx-distributed | Expert parallelism, allreduce, topology, TP |
| rmlx-rdma | ibverbs FFI, TB5 multi-port, collectives |
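The crate graph above maps onto a standard Cargo workspace; a sketch of the root manifest, assuming the crates/ directory layout used by the install command (member list inferred from the crate table, not copied from the repository):

```toml
# Workspace root Cargo.toml (illustrative layout, not the actual manifest)
[workspace]
resolver = "2"
members = [
    "crates/rmlx-core",
    "crates/rmlx-metal",
    "crates/rmlx-alloc",
    "crates/rmlx-nn",
    "crates/rmlx-distributed",
    "crates/rmlx-rdma",
    "crates/rmlx-cli",
]
```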

🗺️ Roadmap

| Era | Phases | Key Result | Status |
|---|---|---|---|
| Foundation | Phase 0 → 7C | Core framework, Metal bindings, ExecGraph, RDMA infra | ✅ Complete |
| Decode Optimization | KO → Phase 11 | E2E ~44 tok/s (~2.1× MLX), 703 μs/layer (156×) | ✅ Complete |
| GEMM & Prefill | Phase A → D | 24.05 TFLOPS, MLX parity, single-CB prefill | ✅ Complete |
| Quantized Kernels | Phase F → J | QMM Q4 17.43T (+28% vs MLX), QMV near-parity | ✅ Complete |
| Distributed RDMA | EP-1 → 6, RDMA-7 | TP=2 5.0× vs MLX, Split-CB, 14 μs allreduce | ✅ Complete |
| EP + Buffer Pooling | Phase 8 | EP 30–178× vs MLX, MoE 68× improvement, EP-2 e2e 54× | ✅ Complete |
| Model Coverage + API | Phase 9 | 7 architectures, generate() API, AWQ/GPTQ/K-Quant GPU kernels | ✅ Complete |
| Final Optimization | Phase 10 | 20 kernel/infra opts (14 merged), decode 175.6 μs, EP-2 200.8 μs | ✅ Complete |
| Next | KO-2, KO-3, EP-7 | Multi-token decode, speculative decoding, EP scaling | 🔜 Planned |

See full roadmap and benchmark report for details.

📚 Docs

| Document | Description |
|---|---|
| Architecture Overview | System design and crate responsibilities |
| GPU Pipeline | Metal compute pipeline internals |
| Implementation Roadmap | Full phase-by-phase history |
| RMLX vs MLX vs CUDA | Detailed framework comparison |
| Getting Started | Prerequisites and setup guide |

🙏 Acknowledgments

  • MLX by Apple — the Metal GPU compute framework that RMLX reimplements in Rust
  • mlx-lm by Apple — LLM inference patterns and Metal kernel references
  • vllm-mlx — distributed inference architecture and RDMA transport patterns

📄 License

MIT — see LICENSE.
