# RMLX
Rust ML runtime for Apple Silicon — zero-copy Metal GPU pipeline with RDMA distributed inference
E2E 32-layer: Prefill 1.1–1.6× · Decode ~2.1× vs MLX · 24.05T FP16 GEMM · TP=2 5.0× · EP 34×
Install · Quickstart · Performance · Features · Architecture · Roadmap · Docs
🇰🇷 Korean documentation: docs/README_ko.md
## 🧠 What is RMLX?
RMLX reimplements Apple's MLX Metal GPU pipeline entirely in Rust, built on objc2-metal / objc2 / block2 / objc2-foundation. The framework is organized into 7 crates spanning GPU compute, memory allocation, neural network layers, and RDMA-based distributed inference.
End-to-end on a 32-layer Qwen 7B-style model (M3 Ultra, f16, random weights): prefill 1.1–1.6× faster, decode ~2.1× faster (~44 tok/s vs 21 tok/s) than MLX compiled. Kernel-level: FP16 GEMM 24.05 TFLOPS (MLX parity), QMM Q4 17.43 TFLOPS (+28% vs MLX). Distributed: TP=2 5.0× faster decode than MLX JACCL, EP 34× vs MLX end-to-end. Phase 9 adds 7 model architectures and generate() API; Phase 10 delivers 20 kernel/infra optimizations.
Single Rust binary. Zero-copy unified memory. No Python runtime, no framework overhead.
## 📦 Install

Prerequisites: macOS 14+ on Apple Silicon (M1 or later). See full prerequisites.

```sh
git clone https://github.com/0xDaizz/RMLX.git
cd RMLX
cargo build --workspace
```
## 🚀 Quickstart

### Build & Test

```sh
cargo build --workspace   # Build all 7 crates
cargo test --workspace    # Run 1,298 tests
```

### Benchmark

```sh
cargo bench -p rmlx-nn --bench pipeline_bench
```

### Distributed RDMA (2-node)

```sh
# One-time install
cargo install --path crates/rmlx-cli

# Auto-detect TB5 topology, assign IPs, configure interfaces
rmlx config --hosts node1,node2 --auto-setup --output rmlx-hosts.json --verbose

# Launch distributed job
rmlx launch --backend rdma --hostfile rmlx-hosts.json -- ibv_devices
```
`--auto-setup` discovers Thunderbolt connections via `system_profiler`, assigns point-to-point IPs, and configures RDMA interfaces automatically.
## 📊 Performance

### ⚡ E2E Model Performance (PRIMARY) — 32-layer Qwen 7B-style, M3 Ultra, f16

All E2E numbers measured on a 32-layer model with random weights (Qwen 7B-style config), M3 Ultra, f16.
Prefill (32 layers):
| seq_len | RMLX compiled (μs) | MLX compiled (μs) | RMLX/MLX |
|---|---|---|---|
| 32 | 46,319 | 74,907 | 1.62× |
| 128 | 116,647 | 122,254 | 1.05× |
| 512 | 356,021 | 446,571 | 1.25× |
| 1024 | 673,126 | 762,791 | 1.13× |
RMLX Prefill TFLOPS:
| seq_len | TFLOPS |
|---|---|
| 32 | 9.49 |
| 128 | 15.08 |
| 512 | 19.76 |
| 1024 | 20.90 |
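The TFLOPS figures above follow from the standard dense-transformer prefill estimate of roughly 2 FLOPs per parameter per token. As a sanity check, here is that arithmetic in a minimal sketch; the ~6.87B effective parameter count is inferred from the table itself, not a figure RMLX publishes.

```rust
// Sanity-check sketch of the prefill TFLOPS table above.
// Assumes FLOPs ≈ 2 · params · tokens (dense transformer); the 6.87e9
// parameter count is inferred from the table, not an RMLX-published number.
fn prefill_tflops(params: f64, seq_len: f64, latency_us: f64) -> f64 {
    2.0 * params * seq_len / (latency_us * 1e-6) / 1e12
}

fn main() {
    let params = 6.87e9; // assumed effective parameter count
    // (seq_len, measured prefill latency in μs) pairs from the table
    for &(s, t) in &[(128.0, 116_647.0), (512.0, 356_021.0), (1024.0, 673_126.0)] {
        println!("seq={s:>5}: {:.2} TFLOPS", prefill_tflops(params, s, t));
    }
}
```

With these assumptions the computed values land within a few hundredths of the table's 15.08 / 19.76 / 20.90 TFLOPS, which is why the throughput climbs with sequence length: the same weights are amortized over more tokens.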
Decode (32 layers, M=1, kv=128 tokens):
| Metric | RMLX (est. from 60L) | MLX compiled | Ratio |
|---|---|---|---|
| Latency | ~22,500 μs (704 μs/layer × 32) | 47,253 μs | ~2.1× |
| tok/s | ~44 | 21.2 | ~2.1× |
Note: RMLX decode is estimated from 60-layer pipeline_bench (704 μs/layer × 32 = 22,528 μs). A direct 32-layer E2E decode bench should be added for exact comparison.
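The estimate in the note is straightforward arithmetic; this small sketch reproduces it from the figures quoted above:

```rust
// Reproduces the decode estimate above: 704 μs/layer from the 60-layer
// pipeline_bench, scaled to 32 layers, compared against MLX's measured
// 47,253 μs decode step.
fn main() {
    let total_us = 704.0 * 32.0;        // 22,528 μs estimated RMLX decode step
    let rmlx_tok_s = 1.0e6 / total_us;  // ≈ 44 tok/s
    let mlx_tok_s = 1.0e6 / 47_253.0;   // ≈ 21.2 tok/s
    println!(
        "RMLX ≈ {rmlx_tok_s:.1} tok/s, MLX ≈ {mlx_tok_s:.1} tok/s, ratio ≈ {:.1}×",
        rmlx_tok_s / mlx_tok_s
    );
}
```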
### ⚡ Kernel-Level Benchmarks

All kernel-level numbers measured on Apple Silicon (M3 Ultra), single transformer layer, Qwen 3.5 MoE A22B config, f16 unless noted.

#### Decode — 110 ms → 703 μs (156× internal speedup)
| Stage | Latency | vs Naive | Key Technique |
|---|---|---|---|
| Naive (per-op sync) | 110 ms | 1× | 65 CBs, GPU idle between dispatches |
| ExecGraph | 2.8 ms | 39× | CB batching 65 → 5 |
| 9-Dispatch + PSO Cache | 1,081 μs | 102× | Single-CB decode, kernel optimization |
| 7-Dispatch Fusion (60L) | 703 μs | 156× | fused_rms_gemv + fused_swiglu_down |
Bandwidth efficiency reaches 73.6% at the 60-layer pipeline floor.
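Decode is memory-bound, so a bandwidth-efficiency figure like the 73.6% above is derived as achieved bytes/s divided by the chip's peak bytes/s. A minimal sketch of that calculation; the byte count and bandwidth in `main` are illustrative placeholders, not RMLX's measured inputs:

```rust
// Memory-bound decode bookkeeping: efficiency = achieved bytes/s ÷ peak bytes/s.
// The numeric inputs in main are illustrative placeholders only.
fn bw_efficiency(bytes_streamed: f64, latency_us: f64, peak_gb_per_s: f64) -> f64 {
    (bytes_streamed / (latency_us * 1e-6)) / (peak_gb_per_s * 1e9)
}

fn main() {
    // e.g. a layer streaming 420 MB of weights in 703 μs on an 819 GB/s part
    let eff = bw_efficiency(4.2e8, 703.0, 819.0);
    println!("efficiency ≈ {:.1}%", eff * 100.0);
}
```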
#### Prefill GEMM — 24.05 TFLOPS
| Metric | RMLX | MLX | Delta |
|---|---|---|---|
| FP16 GEMM (M=512) | 24.05T | 24T | Pipe parity |
| FP16 GEMM peak | 46.3T | ~23T | 2× MLX |
| Small-M dispatch (M=32–256) | 1.24–2.83× | 1× | RMLX faster |
| QMM Q4 (M=512) | 17.43T | 13.6T | +28% |
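For reference, GEMM throughput figures like these come from the 2·M·N·K FLOP count of an (M×K)·(K×N) matrix multiply. A sketch of that bookkeeping; the shape and latency in `main` are arbitrary examples, not the benchmark's actual dimensions:

```rust
// TFLOPS bookkeeping for a GEMM: C[M×N] = A[M×K] · B[K×N] costs 2·M·N·K FLOPs.
fn gemm_tflops(m: f64, n: f64, k: f64, latency_us: f64) -> f64 {
    2.0 * m * n * k / (latency_us * 1e-6) / 1e12
}

fn main() {
    // Arbitrary example shape and timing, not the benchmark's actual N/K.
    println!("{:.2} TFLOPS", gemm_tflops(512.0, 4096.0, 4096.0, 1_000.0));
}
```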
### ⚡ Distributed (2-node TB5 RDMA)

#### TP Decode — 5.0× vs MLX at TP=2
| Config | RMLX | MLX | Speedup |
|---|---|---|---|
| TP=2 Split-CB | 375.5 μs | 1,880 μs (JACCL) | 5.0× |
| Per-op → Split-CB | 18,193 μs → 375.5 μs | — | 48× (architecture) |
| RDMA allreduce (1×) | 14.6 μs | 87 μs (JACCL 2×) | 6.0× |
Split-CB TP batches an entire layer into 2 command buffers with inter-CB allreduce, eliminating per-op GPU sync overhead.
#### TP Prefill — Split-CB
| Config | M=128 | M=512 |
|---|---|---|
| RMLX TP=1 | 5,033.7 μs | 12,723.2 μs |
| RMLX TP=2 Split-CB | 3,017.2 μs | 6,979.3 μs |
| MLX TP=1 | 3,890 μs | 11,795 μs |
| MLX TP=2 JACCL | 3,700 μs | 8,049 μs |
#### Expert Parallelism — 30–178× vs MLX
| Config | RMLX | MLX (mx.compile) | Speedup |
|---|---|---|---|
| Single expert FFN (M=1..512) | 42–54 μs | 1,338–9,609 μs | 30–178× |
| MoE grouped seq=4 (8 experts) | 359 μs | — | 68× vs pre-pooling |
| MoE grouped seq=32 | 665 μs | — | — |
| MoE grouped seq=128 | 1,658 μs | — | — |
Buffer-pooled `grouped_forward` reduces 32 allocations to 4 bulk allocations (14 ms → 359 μs, 39×). `commandBufferWithUnretainedReferences` removes CB retain/release overhead.
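The core idea behind that allocation win is a buffer pool: reuse previously allocated storage instead of allocating per expert per call. This is a minimal CPU-side sketch of the pattern only; RMLX's actual pool manages `MTLBuffer`s through its BFC allocator, not `Vec<u8>`:

```rust
// Minimal sketch of the buffer-pooling pattern behind grouped_forward
// (illustrative only; RMLX pools Metal buffers, not Vec<u8>).
struct BufferPool {
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: Vec::new() }
    }

    /// Hand out a zeroed buffer of `len` bytes, reusing a freed one if possible.
    fn acquire(&mut self, len: usize) -> Vec<u8> {
        match self.free.pop() {
            Some(mut b) => {
                b.clear();
                b.resize(len, 0); // no realloc if the old capacity suffices
                b
            }
            None => vec![0; len],
        }
    }

    /// Return a buffer to the pool for later reuse.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let buf = pool.acquire(1024);
    pool.release(buf);
    let reused = pool.acquire(512); // reuses the earlier allocation's capacity
    println!("reused buffer len = {}", reused.len());
}
```

The payoff is the same in either setting: allocation cost is paid once up front, and the steady-state hot path only recycles.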
#### EP-2 End-to-End (4 experts/rank + RDMA)
| Seq Length | RMLX EP-2 | MLX EP-2 (JACCL) | Speedup |
|---|---|---|---|
| seq=2 (grouped) | 200.8 μs | 6,895 μs | 34× |
| seq=32 (grouped) | 549.0 μs | — | — |
| seq=64 | 672 μs | — | — |
EP-2 grouped seq=2: was 212.0 μs baseline, now 200.8 μs (-5.3%). EP-2 grouped seq=32: was 578.8 μs, now 549.0 μs (-5.1%).
#### RDMA vs JACCL Transport
| Payload | RMLX RDMA | MLX JACCL | Speedup |
|---|---|---|---|
| 28 KB | 12 μs | 79 μs | 6.6× |
| 896 KB | 97 μs | 308 μs | 3.2× |
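Those payload/latency pairs convert directly into effective bandwidth by straight division (taking KB as 1024 bytes):

```rust
// Effective transport bandwidth implied by the table above.
fn gb_per_s(payload_bytes: f64, latency_us: f64) -> f64 {
    payload_bytes / (latency_us * 1e-6) / 1e9
}

fn main() {
    let payload = 896.0 * 1024.0; // 896 KB payload row
    println!("RMLX RDMA: {:.2} GB/s", gb_per_s(payload, 97.0));  // ≈ 9.46 GB/s
    println!("MLX JACCL: {:.2} GB/s", gb_per_s(payload, 308.0)); // ≈ 2.98 GB/s
}
```

The ratio of the two effective bandwidths reproduces the table's 3.2× speedup for the large payload.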
## ✨ Features

### RMLX vs MLX vs CUDA
| Feature | RMLX | MLX | CUDA |
|---|---|---|---|
| Unified memory (zero-copy) | ✅ | ✅ | ❌ |
| 7-dispatch fused decode | ✅ | ❌ | ❌ |
| Single-CB prefill pipeline | ✅ | ❌ | ❌ |
| Expert parallelism (MoE, 30–178×) | ✅ | ❌ | ⚠️ |
| Zero-copy RDMA | ✅ | ❌ | ❌ |
| Flash Attention 2 | ✅ | ✅ | ✅ |
| MLA (DeepSeek-V3) | ✅ | ❌ | ⚠️ |
| GGUF model loading | ✅ | ✅ | ✅ |
| Quantized inference (Q4/Q8) | ✅ | ✅ | ✅ |
| Single Rust binary | ✅ | ❌ | ❌ |
| Metal 4 support (macOS 26+) | ✅ | ❌ | ❌ |
## 🔧 Key Capabilities

### 32+ GPU Ops
- Flash Attention 2 Metal kernel (tiled online softmax, D up to 256)
- SIMD group MMA matmul, BM=8 GEMV with dynamic tile selection
- Batched SDPA decode with slab KV cache
- FP8 (E4M3/E5M2), AWQ/GPTQ INT4, K-quant (Q2K–Q6K)
- Fused kernels: SiLU-mul, RMSNorm+residual, GEMV+bias, GEMM+residual epilogue
- GEMM: MLX-architecture kernel (BK=16, 2 SG, serpentine MMA)
- QMM MMA Q4/Q8, QMV qdot pattern — no CPU fallback
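To make the fused RMSNorm+residual pattern concrete, here is a CPU reference of the math that such a kernel fuses into one pass; this is a sketch of the standard add-then-normalize semantics, not RMLX's actual Metal shader:

```rust
// CPU reference for the fused RMSNorm+residual pattern listed above.
// Assumed semantics (standard in LLaMA-style stacks): add the residual,
// then normalize by RMS and scale by the learned weight, all in one pass.
fn rmsnorm_residual(x: &mut [f32], residual: &[f32], weight: &[f32], eps: f32) {
    // Fuse step 1: residual add in place.
    for (xi, &r) in x.iter_mut().zip(residual) {
        *xi += r;
    }
    // Fuse step 2: compute mean square and the reciprocal RMS.
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (ms + eps).sqrt();
    // Fuse step 3: normalize and apply the per-channel weight.
    for (xi, &w) in x.iter_mut().zip(weight) {
        *xi *= inv * w;
    }
}

fn main() {
    let mut x = vec![3.0_f32, 0.0];
    rmsnorm_residual(&mut x, &[0.0, 4.0], &[1.0, 1.0], 0.0);
    println!("{x:?}");
}
```

Fusing these three loops into one kernel is what removes the intermediate global-memory round trips that separate dispatches would pay.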
### Infrastructure
- ExecGraph: command buffer batching (65 CB → 5)
- CachedDecode: pre-resolved PSOs, zero per-token allocation
- Metal: `objc2-metal 0.3` with ComputePass zero-cost abstraction, ChipTuning (M1–M4), DiskPipelineCache
- Allocator: zero-copy (posix_memalign + MTLBuffer), BFC, residency manager
- RDMA: ibverbs FFI, TB5 multi-port, ring/allreduce/allgather collectives
- Distributed: TP with Split-CB, expert parallelism (3-zone auto), tree allreduce, topology-aware CLI
### Neural Network Layers
- Models: Qwen2, Qwen2.5-MoE, LLaMA3, GLM-4, MiniMax-01, DeepSeek-V3, Mixtral, Kimi K2.5
- Attention: Multi-Head, GQA, MLA, Sliding Window
- KV cache: static, rotating, paged (vLLM-style), quantized, slab decode
- Quantization: QuantizedLinear, AWQ, GPTQ GPU kernels, K-Quant GPU (Q2K–Q6K)
- Formats: FP16, Q4/Q8 Affine, AWQ INT4, GPTQ INT4, K-Quant (Q2–Q6), FP8 (E4M3/E5M2)
- Loading: GGUF v2/v3 with tensor mapping
- Generation: `generate()` API with temperature, top-k, top-p sampling
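The temperature / top-k / top-p filtering behind a sampler like `generate()` can be sketched as follows; this illustrates the standard algorithm (softmax with temperature, keep the k most probable tokens, then the smallest nucleus reaching mass p, renormalize), not RMLX's actual implementation:

```rust
// Sketch of temperature + top-k + top-p (nucleus) logit filtering, the
// standard algorithm behind samplers like the generate() API above.
// Returns the surviving (token_index, renormalized_probability) pairs.
fn filter_logits(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    // Softmax with temperature (max-subtracted for numerical stability).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();

    // Top-k: sort descending by probability and keep the k best.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);

    // Top-p: keep the smallest prefix whose cumulative mass reaches p.
    let mut cum = 0.0;
    let mut kept = Vec::new();
    for (i, p) in probs {
        kept.push((i, p));
        cum += p;
        if cum >= top_p {
            break;
        }
    }

    // Renormalize the survivors so they form a distribution again.
    let mass: f32 = kept.iter().map(|&(_, p)| p).sum();
    kept.iter().map(|&(i, p)| (i, p / mass)).collect()
}

fn main() {
    let kept = filter_logits(&[2.0, 1.0, 0.0, -1.0], 1.0, 3, 0.9);
    println!("{kept:?}");
}
```

A sampler would then draw a token index from the returned distribution; greedy decoding is the degenerate case of picking the first entry.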
## 🏗️ Architecture

```mermaid
graph TD
  CLI[🖥️ rmlx-cli] --> NN[🧠 rmlx-nn]
  NN --> CORE[⚙️ rmlx-core]
  CORE --> METAL[🔩 rmlx-metal]
  CORE --> ALLOC[📦 rmlx-alloc]
  DIST[🌐 rmlx-distributed] --> CORE
  DIST --> RDMA[🔗 rmlx-rdma]
  ALLOC --> RDMA
  METAL -.-> ALLOC
```
| Crate | Role |
|---|---|
| rmlx-cli | Launch, config, topology discovery |
| rmlx-nn | Models, attention, MoE, KV cache, GGUF loader |
| rmlx-core | 32+ op modules, Array/DType, autodiff |
| rmlx-metal | Device, ExecGraph, ChipTuning, pipeline cache, ComputePass (objc2-metal), Metal 4 |
| rmlx-alloc | Zero-copy allocator, BFC, residency manager |
| rmlx-distributed | Expert parallelism, allreduce, topology, TP |
| rmlx-rdma | ibverbs FFI, TB5 multi-port, collectives |
## 🗺️ Roadmap
| Era | Phases | Key Result | Status |
|---|---|---|---|
| Foundation | Phase 0 → 7C | Core framework, Metal bindings, ExecGraph, RDMA infra | ✅ Complete |
| Decode Optimization | KO → Phase 11 | E2E ~44 tok/s (~2.1× MLX), 703 μs/layer (156×) | ✅ Complete |
| GEMM & Prefill | Phase A → D | 24.05 TFLOPS, MLX parity, single-CB prefill | ✅ Complete |
| Quantized Kernels | Phase F → J | QMM Q4 17.43T (+28% vs MLX), QMV near-parity | ✅ Complete |
| Distributed RDMA | EP-1 → 6, RDMA-7 | TP=2 5.0× vs MLX, Split-CB, 14 μs allreduce | ✅ Complete |
| EP + Buffer Pooling | Phase 8 | EP 30–178× vs MLX, MoE 68× improvement, EP-2 e2e 54× | ✅ Complete |
| Model Coverage + API | Phase 9 | 7 architectures, generate() API, AWQ/GPTQ/K-Quant GPU kernels | ✅ Complete |
| Final Optimization | Phase 10 | 20 kernel/infra opts (14 merged), decode 175.6 μs, EP-2 200.8 μs | ✅ Complete |
| Next | KO-2, KO-3, EP-7 | Multi-token decode, speculative decoding, EP scaling | 🔜 Planned |
See full roadmap and benchmark report for details.
## 📚 Docs
| Document | Description |
|---|---|
| Architecture Overview | System design and crate responsibilities |
| GPU Pipeline | Metal compute pipeline internals |
| Implementation Roadmap | Full phase-by-phase history |
| RMLX vs MLX vs CUDA | Detailed framework comparison |
| Getting Started | Prerequisites and setup guide |
## 🙏 Acknowledgments
- MLX by Apple — the Metal GPU compute framework that RMLX reimplements in Rust
- mlx-lm by Apple — LLM inference patterns and Metal kernel references
- vllm-mlx — distributed inference architecture and RDMA transport patterns
## 📄 License
MIT — see LICENSE.