FerroMPI
Safe, generic Rust bindings for MPI 4.x with persistent collectives support.
FerroMPI provides safe, generic Rust bindings to MPI through a thin C wrapper layer, enabling access to MPI 4.0+ features like persistent collectives that are not available in other Rust MPI bindings. All communication operations are generic over MpiDatatype, supporting f32, f64, i32, i64, u8, u32, and u64.
Features
- MPI 4.0+ support: Persistent collectives, large-count operations
- Lightweight: Minimal C wrapper (~2400 lines), focused API
- Safe: Rust-idiomatic API with proper error handling and RAII
- Flexible: Works with MPICH, OpenMPI, Intel MPI, and Cray MPI
- Fast: Zero-cost abstractions, direct FFI calls
- Generic: Type-safe API for all supported MPI datatypes
- Thread-safe: `Communicator` is `Send + Sync` for hybrid MPI+threads programs
- Shared memory: RMA windows with RAII lock guards (feature: `rma`)
- SLURM integration: Job topology helpers (feature: `numa`)
Why FerroMPI?
| Feature | FerroMPI | rsmpi |
|---|---|---|
| MPI Version | 4.1 | 3.1 |
| Persistent Collectives | ✅ | ❌ |
| Large Count (>2³¹) | ✅ | ❌ |
| Generic API | ✅ | ✅ |
| Shared Memory Windows | ✅ | ❌ |
| Thread Safety | `Send + Sync` | `!Send` |
| API Style | Minimal, focused | Comprehensive |
| C Wrapper | ~2400 lines | None (direct bindings) |
FerroMPI is ideal for:
- Iterative algorithms benefiting from persistent collectives (10-30% speedup)
- Applications with large data transfers (>2GB)
- Hybrid MPI+threads programs (OpenMP, Rayon, `std::thread`)
- Intra-node shared memory communication
- Users who want a simple, focused MPI API
Supported Types
All communication operations are generic over MpiDatatype:
| Rust Type | MPI Equivalent |
|---|---|
| `f32` | `MPI_FLOAT` |
| `f64` | `MPI_DOUBLE` |
| `i32` | `MPI_INT32_T` |
| `i64` | `MPI_INT64_T` |
| `u8` | `MPI_UINT8_T` |
| `u32` | `MPI_UINT32_T` |
| `u64` | `MPI_UINT64_T` |
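
Because every collective is bounded by this trait, you can write helpers that work for any supported element type. A minimal sketch (the `Default + Clone` bounds are only needed to build the receive buffer; the `allreduce` call mirrors the slice-based signature used in the examples below):

```rust
use ferrompi::{Communicator, MpiDatatype, ReduceOp};

/// Element-wise sum across all ranks, for any supported MPI datatype.
fn sum_across_ranks<T: MpiDatatype + Default + Clone>(
    comm: &Communicator,
    send: &[T],
) -> ferrompi::Result<Vec<T>> {
    // The receive buffer must match the send buffer's length.
    let mut recv = vec![T::default(); send.len()];
    comm.allreduce(send, &mut recv, ReduceOp::Sum)?;
    Ok(recv)
}
```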
Feature Flags
| Feature | Description | Dependencies |
|---|---|---|
| `rma` | RMA shared memory window operations | – |
| `numa` | NUMA-aware shared memory windows and SLURM helpers | `rma` |
| `debug` | Detailed debug output from the C layer | – |
Enable features in your `Cargo.toml`:

```toml
[dependencies]
ferrompi = { version = "0.2", features = ["rma"] }
```
Quick Start
Installation
Add to your `Cargo.toml`:

```toml
[dependencies]
ferrompi = "0.2"
```
Requirements
- Rust 1.74+
- MPICH 4.0+ (recommended) or OpenMPI 5.0+
Ubuntu/Debian:

```bash
sudo apt install mpich libmpich-dev
```

macOS:

```bash
brew install mpich
```
Hello World
```rust
use ferrompi::{Mpi, ReduceOp};

fn main() -> ferrompi::Result<()> {
    let mpi = Mpi::init()?;
    let world = mpi.world();

    let rank = world.rank();
    let size = world.size();
    println!("Hello from rank {} of {}", rank, size);

    // Generic all-reduce: works with any MpiDatatype
    let sum = world.allreduce_scalar(rank as f64, ReduceOp::Sum)?;
    println!("Rank {}: sum = {}", rank, sum);

    Ok(())
}
```

Build and run:

```bash
cargo build --release
mpiexec -n 4 ./target/release/my_program
```
Examples
Blocking Collectives
```rust
use ferrompi::{Mpi, ReduceOp};

let mpi = Mpi::init()?;
let world = mpi.world();

// Broadcast (generic: works with f64, i32, u8, etc.)
let mut data = vec![0.0f64; 100];
if world.rank() == 0 {
    data.fill(42.0);
}
world.broadcast(&mut data, 0)?;

// All-reduce
let send = vec![1.0f64; 100];
let mut recv = vec![0.0f64; 100];
world.allreduce(&send, &mut recv, ReduceOp::Sum)?;

// Gather
let my_data = vec![world.rank() as f64];
let mut gathered = vec![0.0f64; world.size() as usize];
world.gather(&my_data, &mut gathered, 0)?;

// Works with integers too!
let mut int_data = vec![0i32; 100];
world.broadcast(&mut int_data, 0)?;
```
Nonblocking Collectives
```rust
use ferrompi::{Mpi, ReduceOp, Request};

let mpi = Mpi::init()?;
let world = mpi.world();

let send = vec![1.0f64; 1000];
let mut recv = vec![0.0f64; 1000];

// Start nonblocking operation
let request = world.iallreduce(&send, &mut recv, ReduceOp::Sum)?;

// Do other work while communication proceeds...
expensive_computation();

// Wait for completion
request.wait()?;
// recv now contains the result
```
Persistent Collectives (MPI 4.0+)
```rust
use ferrompi::{Mpi, ReduceOp};

let mpi = Mpi::init()?;
let world = mpi.world();

// Buffer used for all iterations
let mut data = vec![0.0f64; 1000];

// Initialize ONCE
let mut persistent = world.bcast_init(&mut data, 0)?;

// Use MANY times: amortizes setup cost!
for iter in 0..10000 {
    if world.rank() == 0 {
        data.fill(iter as f64);
    }
    persistent.start()?;
    persistent.wait()?;
    // data contains the broadcast result on all ranks
}
// Cleanup on drop
```
Point-to-Point Communication
```rust
use ferrompi::Mpi;

let mpi = Mpi::init()?;
let world = mpi.world();

if world.rank() == 0 {
    let data = vec![1.0f64, 2.0, 3.0];
    world.send(&data, 1, 0)?;
} else if world.rank() == 1 {
    let mut buf = vec![0.0f64; 3];
    let (source, tag, count) = world.recv(&mut buf, 0, 0)?;
    println!("Received {:?} from rank {}", buf, source);
}
```
Available Examples
Run examples with `mpiexec`:

```bash
cargo build --release --examples
cargo build --release --examples --features rma

# Core examples
mpiexec -n 4 ./target/release/examples/hello_world
mpiexec -n 4 ./target/release/examples/ring
mpiexec -n 4 ./target/release/examples/allreduce
mpiexec -n 4 ./target/release/examples/nonblocking
mpiexec -n 4 ./target/release/examples/persistent_bcast
mpiexec -n 4 ./target/release/examples/pi_monte_carlo

# Communicator management
mpiexec -n 4 ./target/release/examples/comm_split

# Scan and variable-length collectives
mpiexec -n 4 ./target/release/examples/scan
mpiexec -n 4 ./target/release/examples/gatherv

# Shared memory (requires --features rma)
mpiexec -n 4 ./target/release/examples/shared_memory

# Hybrid MPI+threads
mpiexec -n 2 ./target/release/examples/hybrid_openmp
```
| Example | Description | Feature |
|---|---|---|
| `hello_world` | Basic MPI initialization and rank/size query | – |
| `ring` | Point-to-point ring communication pattern | – |
| `allreduce` | Blocking and nonblocking allreduce | – |
| `nonblocking` | Nonblocking collective operations | – |
| `persistent_bcast` | Persistent broadcast (MPI 4.0+) | – |
| `pi_monte_carlo` | Monte Carlo Pi estimation with reduce | – |
| `comm_split` | Communicator splitting and management | – |
| `scan` | Prefix scan and exclusive scan operations | – |
| `gatherv` | Variable-length gather (gatherv) | – |
| `shared_memory` | Shared memory windows with RAII lock guards | `rma` |
| `hybrid_openmp` | Hybrid MPI + threads with thread-level init | – |
API Reference
Core Types
| Type | Description |
|---|---|
| `Mpi` | MPI environment handle (init/finalize) |
| `Communicator` | MPI communicator wrapper |
| `Request` | Nonblocking operation handle |
| `PersistentRequest` | Persistent operation handle (MPI 4.0+) |
| `MpiDatatype` | Trait for types usable in MPI ops |
| `Status` | Message status (source, tag, count) |
| `Info` | `MPI_Info` object with RAII |
| `SharedWindow<T>` | Shared memory window (feature: `rma`) |
| `LockGuard` | RAII window lock (feature: `rma`) |
| `LockAllGuard` | RAII window lock-all (feature: `rma`) |
Collective Operations
| Operation | Blocking | Nonblocking | Persistent |
|---|---|---|---|
| Broadcast | `broadcast` | `ibroadcast` | `bcast_init` |
| Reduce | `reduce` | `ireduce` | `reduce_init` |
| Allreduce | `allreduce` | `iallreduce` | `allreduce_init` |
| Gather | `gather` | `igather` | `gather_init` |
| Allgather | `allgather` | `iallgather` | `allgather_init` |
| Scatter | `scatter` | `iscatter` | `scatter_init` |
| Alltoall | `alltoall` | `ialltoall` | `alltoall_init` |
| Scan | `scan` | `iscan` | `scan_init` |
| Exscan | `exscan` | `iexscan` | `exscan_init` |
| Reduce-scatter-block | `reduce_scatter_block` | `ireduce_scatter_block` | `reduce_scatter_block_init` |
| Barrier | `barrier` | `ibarrier` | – |
Additional scalar and in-place variants:
| Variant | Description |
|---|---|
| `reduce_scalar` | Reduce a single value (returns scalar on root) |
| `reduce_inplace` | In-place reduce (root's buffer is both send/recv) |
| `allreduce_scalar` | Allreduce a single value (returns scalar) |
| `allreduce_inplace` | In-place allreduce |
| `allreduce_init_inplace` | Persistent in-place allreduce |
| `scan_scalar` | Prefix scan on a single value |
| `exscan_scalar` | Exclusive prefix scan on a single value |
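
The in-place variants avoid a separate receive buffer. A short sketch of `allreduce_inplace`, assuming it takes the buffer and the reduction operation (check the crate docs for the exact signature):

```rust
// In-place allreduce: the same buffer serves as both send and receive side.
// Hypothetical call shape; consult the API docs for the exact signature.
let mut buf = vec![1.0f64; 100];
world.allreduce_inplace(&mut buf, ReduceOp::Sum)?;
// buf now holds the element-wise sum contributed by all ranks.
```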
Variable-length (V-variant) collectives:
| Operation | Blocking | Nonblocking | Persistent |
|---|---|---|---|
| Gatherv | `gatherv` | `igatherv` | `gatherv_init` |
| Scatterv | `scatterv` | `iscatterv` | `scatterv_init` |
| Allgatherv | `allgatherv` | `iallgatherv` | `allgatherv_init` |
| Alltoallv | `alltoallv` | `ialltoallv` | `alltoallv_init` |
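
For the V-variants, the root supplies per-rank counts and displacements. A hypothetical `gatherv` sketch in which rank r contributes r + 1 elements; the argument order shown here is assumed and may differ from the actual API:

```rust
// Hypothetical gatherv sketch (argument order assumed: send, recv, counts, displs, root).
let rank = world.rank();
let size = world.size();
let send = vec![rank as f64; (rank + 1) as usize];

// Per-rank receive counts and the offset of each rank's block in the recv buffer.
let counts: Vec<i32> = (1..=size).collect();
let displs: Vec<i32> = counts
    .iter()
    .scan(0, |off, &c| { let d = *off; *off += c; Some(d) })
    .collect();

let mut recv = vec![0.0f64; counts.iter().sum::<i32>() as usize];
world.gatherv(&send, &mut recv, &counts, &displs, 0)?;
```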
Point-to-Point Operations
| Operation | Description |
|---|---|
| `send` | Blocking send |
| `recv` | Blocking receive (returns source, tag, count) |
| `isend` | Nonblocking send (returns `Request`) |
| `irecv` | Nonblocking receive (returns `Request`) |
| `sendrecv` | Simultaneous send and receive |
| `probe` | Blocking probe (returns `Status`) |
| `iprobe` | Nonblocking probe (returns `Option<Status>`) |
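
The nonblocking variants return the same `Request` handle as the nonblocking collectives. A sketch mirroring the blocking send/recv example above (the destination/source and tag arguments are assumed to line up with `send`/`recv`):

```rust
// Nonblocking point-to-point; argument order assumed to mirror send/recv.
if world.rank() == 0 {
    let data = vec![1.0f64; 1000];
    let req = world.isend(&data, 1, 0)?;    // dest = 1, tag = 0
    // ... overlap local work with the send ...
    req.wait()?;
} else if world.rank() == 1 {
    let mut buf = vec![0.0f64; 1000];
    let req = world.irecv(&mut buf, 0, 0)?; // source = 0, tag = 0
    req.wait()?;
    // buf is valid once wait() returns
}
```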
Reduction Operations
```rust
pub enum ReduceOp {
    Sum,  // MPI_SUM
    Max,  // MPI_MAX
    Min,  // MPI_MIN
    Prod, // MPI_PROD
}
```
Thread Safety
Communicator is Send + Sync, enabling hybrid MPI + threads programs where MPI handles inter-node communication and threads (via std::thread, Rayon, or OpenMP) handle intra-node parallelism.
The thread-safety guarantee depends on the level requested at initialization:
| Thread Level | Who can call MPI | Use case |
|---|---|---|
| `Single` | Main thread only | Pure MPI, no threads |
| `Funneled` | Main thread only | Threads compute, main calls MPI |
| `Serialized` | Any thread | User serializes MPI calls |
| `Multiple` | Any thread | Full concurrent MPI access |
```rust
use ferrompi::{Mpi, ThreadLevel, ReduceOp};

// Request funneled support for hybrid MPI + threads
let mpi = Mpi::init_thread(ThreadLevel::Funneled)?;
assert!(mpi.thread_level() >= ThreadLevel::Funneled);

let world = mpi.world();

// Worker threads compute locally, main thread calls MPI
let local = 42.0_f64;
let global = world.allreduce_scalar(local, ReduceOp::Sum)?;
```
See examples/hybrid_openmp.rs for a complete hybrid MPI + threads pattern.
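
As a concrete illustration of the `Funneled` row above, the following sketch uses only APIs shown in this README: worker threads produce partial results with plain `std::thread`, and the main thread alone performs the MPI reduction.

```rust
use ferrompi::{Mpi, ReduceOp, ThreadLevel};
use std::thread;

fn main() -> ferrompi::Result<()> {
    // Funneled: threads may run concurrently, but only the main thread calls MPI.
    let mpi = Mpi::init_thread(ThreadLevel::Funneled)?;
    let world = mpi.world();

    // Pure computation in worker threads (no MPI calls here).
    let handles: Vec<_> = (0..4)
        .map(|t| thread::spawn(move || (t as f64) * 1.5))
        .collect();
    let local: f64 = handles.into_iter().map(|h| h.join().unwrap()).sum();

    // The main thread funnels the result into a single MPI reduction.
    let global = world.allreduce_scalar(local, ReduceOp::Sum)?;
    println!("rank {}: global sum = {}", world.rank(), global);
    Ok(())
}
```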
SLURM Configuration
The numa feature flag enables the slurm module with helpers for reading SLURM job topology at runtime. These functions return None when not running under SLURM.
```toml
[dependencies]
ferrompi = { version = "0.2", features = ["numa"] }
```
| Function | SLURM Variable | Description |
|---|---|---|
| `is_slurm_job()` | `SLURM_JOB_ID` | Check if running under SLURM |
| `job_id()` | `SLURM_JOB_ID` | Unique job identifier |
| `local_rank()` | `SLURM_LOCALID` | Task ID relative to this node |
| `local_size()` | `SLURM_NTASKS_PER_NODE` | Number of tasks on this node |
| `num_nodes()` | `SLURM_NNODES` | Total number of allocated nodes |
| `cpus_per_task()` | `SLURM_CPUS_PER_TASK` | CPUs allocated per task |
| `node_name()` | `SLURM_NODENAME` | Name of this compute node |
| `node_list()` | `SLURM_NODELIST` | Compact list of allocated nodes |
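
For example, a rank might size its intra-node thread pool from the SLURM allocation. This sketch assumes the helpers live in a `ferrompi::slurm` module as described above, with `is_slurm_job()` returning a `bool` and the other helpers returning `Option` values:

```rust
use ferrompi::slurm;

// Fall back to a sensible default when not launched by SLURM.
let threads = if slurm::is_slurm_job() {
    slurm::cpus_per_task().unwrap_or(1)
} else {
    1
};
println!(
    "node-local rank {:?} of {:?}, using {} threads",
    slurm::local_rank(),
    slurm::local_size(),
    threads
);
```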
Example SLURM batch script for hybrid MPI + threads:

```bash
#!/bin/bash
#SBATCH --ntasks-per-node=4   # MPI ranks per node
#SBATCH --cpus-per-task=8     # threads per rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpu-bind=cores ./target/release/my_program
```
RMA / Shared Memory Windows
The rma feature flag enables SharedWindow<T>, a safe wrapper around MPI_Win_allocate_shared with RAII lifecycle management. Shared memory windows allow processes on the same node to directly access each other's memory without message passing.
```toml
[dependencies]
ferrompi = { version = "0.2", features = ["rma"] }
```
```rust
use ferrompi::{Mpi, SharedWindow, LockType};

let mpi = Mpi::init()?;
let world = mpi.world();
let node = world.split_shared()?;

// Each process allocates 100 f64s in shared memory
let mut win = SharedWindow::<f64>::allocate(&node, 100)?;

// Write to the local portion
{
    let local = win.local_slice_mut();
    for (i, x) in local.iter_mut().enumerate() {
        *x = (node.rank() * 100 + i as i32) as f64;
    }
}

// Fence synchronization: all processes participate
win.fence()?;

// Read from any rank's memory (zero-copy!)
let remote = win.remote_slice(0)?;
println!("Rank 0's first value: {}", remote[0]);
```
Synchronization modes:

- Active target (`fence`): Bulk-synchronous, all processes participate
- Passive target (`lock`/`lock_all`): Fine-grained one-sided access with RAII guards (see the sketch below)
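
A hypothetical passive-target sketch built around the `LockAllGuard` listed in the core types; the method name (`lock_all`) is assumed here, so check the crate documentation for the real API:

```rust
// Passive-target access guarded by RAII (the lock_all method name is assumed).
{
    let _guard = win.lock_all()?; // acquire the window lock across the node
    // ... one-sided reads/writes to the shared window go here ...
}                                 // guard drops here, releasing the lock
```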
See examples/shared_memory.rs for a complete shared memory example.
Running Tests
```bash
# Unit tests (no MPI required)
cargo test
cargo test --features numa

# MPI integration tests (requires mpiexec)
./tests/run_mpi_tests.sh          # Default features
./tests/run_mpi_tests.sh rma      # With RMA/shared memory tests
./tests/run_mpi_tests.sh numa     # With NUMA features (implies rma)
MPI_NP=8 ./tests/run_mpi_tests.sh # Custom process count

# Build and run individual examples
cargo build --release --examples
mpiexec -n 4 ./target/release/examples/hello_world
```
Configuration
Environment Variables
| Variable | Description | Example |
|---|---|---|
| `MPI_PKG_CONFIG` | pkg-config name | `mpich`, `ompi` |
| `MPICC` | MPI compiler wrapper | `/opt/mpich/bin/mpicc` |
| `CRAY_MPICH_DIR` | Cray MPI installation | `/opt/cray/pe/mpich/8.1.25` |
Build Configuration
FerroMPI automatically detects MPI installations via:
- `MPI_PKG_CONFIG` environment variable
- pkg-config (`mpich`, `ompi`, `mpi`)
- `mpicc -show` output
- `CRAY_MPICH_DIR` (for Cray systems)
- Common installation paths
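
For instance, to build against a specific installation by pointing the build at its compiler wrapper (path taken from the table above):

```bash
# Use a particular MPI's compiler wrapper for detection and linking
export MPICC=/opt/mpich/bin/mpicc
cargo build --release
```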
Troubleshooting
"Could not find MPI installation"
```bash
# Check if MPI is installed
which mpiexec
mpiexec --version

# Set pkg-config name explicitly
export MPI_PKG_CONFIG=mpich
cargo build
```
"Persistent collectives not available"
Persistent collectives require MPI 4.0+. Check your MPI version:
```bash
mpiexec --version
# MPICH Version: 4.2.0 ✅
# Open MPI 5.0.0 ✅
# MPICH Version: 3.4.2 ❌ (too old)
```
macOS linking issues
```bash
export DYLD_LIBRARY_PATH=$(brew --prefix mpich)/lib:$DYLD_LIBRARY_PATH
```
Architecture
```text
┌─────────────────────────┐
│    Rust Application     │
├─────────────────────────┤
│  ferrompi (Safe Rust)   │
├─────────────────────────┤
│    ffi.rs (bindings)    │
├─────────────────────────┤
│  ferrompi.c (C layer)   │  ← ~2400 lines
├─────────────────────────┤
│     MPICH / OpenMPI     │
└─────────────────────────┘
```
The C layer provides:
- Handle tables for MPI opaque objects (256 comms, 16384 requests, 256 windows, 64 infos)
- Automatic large-count operation selection
- Thread-safe request management
- Graceful degradation for MPI <4.0
License
Licensed under:
- MIT license (LICENSE)
Contributing
Contributions welcome! Please ensure:
- All examples pass with `mpiexec -n 4`
- New features include tests and documentation
- Code follows Rust style guidelines (`cargo fmt`, `cargo clippy`)
Acknowledgments
FerroMPI was inspired by:
- rsmpi - Comprehensive MPI bindings for Rust
- The MPI Forum for the excellent MPI 4.0 specification