Version 0.1.0, released Jan 31, 2026 (1 unstable release)
RustDESeq2
A pure Rust implementation of the DESeq2 algorithm for differential expression analysis of RNA-seq count data. On every validated dataset it reproduces R DESeq2's significance calls exactly, while running 8-12x faster with ~90x less memory.
Features
- Complete DESeq2 pipeline: size factor estimation → dispersion estimation → GLM fitting → statistical testing
- Statistical tests: Wald test, Likelihood Ratio Test (LRT)
- Three dispersion fit types: parametric (Gamma GLM), local (LOESS), mean
- LFC shrinkage: normal, apeglm, ashr methods
- Outlier handling: Cook's distance detection and replacement
- Multi-factor designs: batch correction with categorical and continuous covariates
- t-distribution p-values: more conservative testing for small samples (--use-t)
- Transformations: VST (Variance Stabilizing Transformation), rlog (Regularized Log)
- Normalization: median-of-ratios, positive counts, iterative methods
- Convenience functions: fpkm, fpm, collapseReplicates
- 100% significance agreement with R DESeq2 across all validated datasets
- 8-12x faster, ~90x less memory than R DESeq2
Installation
```sh
# From crates.io
cargo install rust_deseq2

# From source
git clone https://github.com/necoli1822/rust_deseq2.git
cd rust_deseq2
cargo build --release
# Binary will be at target/release/rust_deseq2
```
Quick Start
```sh
# Basic differential expression analysis
rust_deseq2 run -c counts.csv -m metadata.csv -d condition \
  --numerator treated --denominator control -o results.csv

# With batch correction
rust_deseq2 run -c counts.csv -m metadata.csv -d treatment \
  --covariate batch --numerator drug --denominator placebo

# Likelihood Ratio Test
rust_deseq2 run -c counts.csv -m metadata.csv -d treatment \
  --test lrt --reduced "~1" -o results.csv

# With local dispersion fit and t-distribution
rust_deseq2 run -c counts.csv -m metadata.csv -d condition \
  --numerator treated --denominator control --fit-type local --use-t

# LFC shrinkage (apeglm)
rust_deseq2 run -c counts.csv -m metadata.csv -d condition \
  --numerator treated --denominator control --shrinkage --shrinkage-method apeglm

# Variance Stabilizing Transformation (for PCA, heatmaps)
rust_deseq2 vst -c counts.csv -m metadata.csv -d condition -o vst.tsv --blind

# Regularized log transformation
rust_deseq2 rlog -c counts.csv -m metadata.csv -d condition -o rlog.tsv

# Normalize counts only
rust_deseq2 normalize -c counts.csv -o normalized.tsv
```
CLI Reference
run — Full DESeq2 analysis
| Option | Description | Default |
|---|---|---|
| -c, --counts | Count matrix CSV/TSV file | required |
| -m, --metadata | Sample metadata CSV file | required |
| -d, --design | Design variable (column in metadata) | required |
| --covariate | Categorical covariate for batch correction | none |
| --continuous | Continuous covariate | none |
| --reference | Reference level (format: factor=level) | alphabetical |
| --test | Statistical test: wald or lrt | wald |
| --numerator | Numerator level for Wald contrast | none |
| --denominator | Denominator level for Wald contrast | none |
| --reduced | Reduced model formula for LRT (e.g., ~1) | none |
| --fit-type | Dispersion fit: parametric, local, mean | parametric |
| --use-t | Use t-distribution for Wald test p-values | false |
| --shrinkage | Apply LFC shrinkage | false |
| --shrinkage-method | Shrinkage: normal, apeglm, ashr | normal |
| --replace-outliers | Replace Cook's distance outliers | true |
| -o, --output | Output file path | deseq2_results.csv |
| -a, --alpha | Significance threshold | 0.1 |
| --maxit | Max IRLS iterations (GLM + dispersion) | 100 |
| --beta-tol | Beta convergence tolerance for GLM | 1e-8 |
| --min-disp | Minimum dispersion value | 1e-8 |
| --disp-tol | Dispersion convergence tolerance | 1e-6 |
| --kappa-0 | Initial step size for dispersion line search | 1.0 |
| --outlier-sd | Outlier SD threshold for MAP shrinkage | 2.0 |
| --trim | Trim fraction for outlier replacement | 0.2 |
| --upper-quantile | Upper quantile for beta prior variance | 0.05 |
| -t, --threads | Number of threads (0 = auto) | 0 |
normalize — Normalize counts
| Option | Description | Default |
|---|---|---|
| -c, --counts | Count matrix CSV/TSV file | required |
| -o, --output | Output file path | required |
| -m, --method | Method: ratio, poscounts, iterate | ratio |
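The default ratio method appears to correspond to the median-of-ratios estimator from the DESeq2 paper. The following is a minimal standalone sketch of that estimator, not the crate's actual code: the function name `size_factors_median_of_ratios` and the genes-by-samples `Vec<Vec<f64>>` layout are assumptions made for illustration, and the input is assumed to be rectangular.

```rust
/// Median of a list of finite values; `None` if the list is empty.
fn median(values: &mut Vec<f64>) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    Some(if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    })
}

/// Median-of-ratios size factors (hypothetical helper, not the crate API).
/// `counts` is genes x samples. Returns `None` if the matrix is empty or a
/// sample has no usable gene.
fn size_factors_median_of_ratios(counts: &[Vec<f64>]) -> Option<Vec<f64>> {
    let n_samples = counts.first()?.len();

    // Log geometric mean per gene. Any zero count makes it -inf,
    // which excludes the gene from the reference below.
    let log_geo_means: Vec<f64> = counts
        .iter()
        .map(|row| row.iter().map(|c| c.ln()).sum::<f64>() / row.len() as f64)
        .collect();

    let mut factors = Vec::with_capacity(n_samples);
    for j in 0..n_samples {
        // Log ratio of each usable gene to its geometric mean.
        let mut log_ratios: Vec<f64> = counts
            .iter()
            .zip(&log_geo_means)
            .filter(|(row, lgm)| lgm.is_finite() && row[j] > 0.0)
            .map(|(row, lgm)| row[j].ln() - lgm)
            .collect();
        factors.push(median(&mut log_ratios)?.exp());
    }
    Some(factors)
}
```

If one sample is sequenced exactly twice as deeply as another, the two size factors differ by a factor of two, and dividing each column by its size factor puts the samples on a common scale.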
vst — Variance Stabilizing Transformation
| Option | Description | Default |
|---|---|---|
| -c, --counts | Count matrix CSV/TSV file | required |
| -m, --metadata | Sample metadata CSV file | required |
| -d, --design | Design variable | required |
| -o, --output | Output file path | vst_transformed.tsv |
| --method | Fit method: parametric, mean, local | parametric |
| --blind | Blind to experimental design | false |
rlog — Regularized Log Transformation
| Option | Description | Default |
|---|---|---|
| -c, --counts | Count matrix CSV/TSV file | required |
| -m, --metadata | Sample metadata CSV file | required |
| -d, --design | Design variable | required |
| -o, --output | Output file path | rlog_transformed.tsv |
| --blind | Blind to experimental design | false |
Input File Format
Count matrix (CSV/TSV, auto-detected)
```csv
gene_id,sample1,sample2,sample3,...
GENE001,100,200,150,...
GENE002,50,75,60,...
```
Metadata (CSV)
```csv
sample,condition
sample1,control
sample2,control
sample3,treated
```
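To make the count matrix layout concrete, here is a small sketch of a reader for it using only the standard library. It is not the crate's parser: `CountMatrix` and `parse_counts` are hypothetical names, and the delimiter heuristic (treat the file as TSV if the header contains a tab, otherwise as CSV) is an assumption about how auto-detection might work. Quoted fields are not handled.

```rust
/// Parsed count matrix: one row per gene, one column per sample.
#[derive(Debug)]
struct CountMatrix {
    samples: Vec<String>,
    genes: Vec<String>,
    counts: Vec<Vec<f64>>,
}

/// Parse a count matrix from text (hypothetical helper, not the crate API).
fn parse_counts(text: &str) -> Result<CountMatrix, String> {
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());
    let header = lines.next().ok_or("input has no header line")?;

    // Assumed heuristic: a tab in the header means TSV, otherwise CSV.
    let delim = if header.contains('\t') { '\t' } else { ',' };

    // The first header field labels the gene column; the rest are sample names.
    let samples: Vec<String> = header
        .split(delim)
        .skip(1)
        .map(|s| s.trim().to_string())
        .collect();

    let mut genes = Vec::new();
    let mut counts = Vec::new();
    for (i, line) in lines.enumerate() {
        let mut fields = line.split(delim);
        let gene = fields.next().unwrap_or("").trim().to_string();
        let row: Vec<f64> = fields
            .map(|f| {
                f.trim()
                    .parse::<f64>()
                    .map_err(|e| format!("data row {}: {}", i + 1, e))
            })
            .collect::<Result<_, _>>()?;
        if row.len() != samples.len() {
            return Err(format!(
                "data row {}: expected {} values, found {}",
                i + 1,
                samples.len(),
                row.len()
            ));
        }
        genes.push(gene);
        counts.push(row);
    }
    Ok(CountMatrix { samples, genes, counts })
}
```

Rejecting ragged rows up front keeps the matrix rectangular, which downstream steps such as size factor estimation rely on.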
Benchmark
Environment
| Component | Specification |
|---|---|
| CPU | Intel Core i9-14900 |
| RAM | 32 GB DDR5 |
| OS | Linux 6.6 (WSL2) |
| Rust | rustc 1.92.0 |
| R | 4.3.3 |
| R DESeq2 | 1.42.1 |
All timings are the median of 5 runs on a Linux-native filesystem. R timings are measured internally with proc.time() and exclude package loading overhead.
Execution Time
| Dataset | Genes | Samples | R DESeq2 | RustDESeq2 | Speedup |
|---|---|---|---|---|---|
| A. baumannii GSE151925 | 3,851 | 6 | 0.839 s | 0.090 s | 9.3x |
| M. tuberculosis GSE100097 | 5,220 | 12 | 1.138 s | 0.130 s | 8.8x |
| Salmonella GSE46391 | 4,418 | 6 | 1.167 s | 0.100 s | 11.7x |
Memory Usage
| Dataset | R DESeq2 | RustDESeq2 | Reduction |
|---|---|---|---|
| A. baumannii GSE151925 | 797 MB | 8 MB | ~100x |
| M. tuberculosis GSE100097 | 797 MB | 15 MB | ~53x |
| Salmonella GSE46391 | 797 MB | 10 MB | ~80x |
R memory is dominated by the R runtime, S4 object overhead, and garbage collector. Rust allocates only the data structures needed for computation.
Accuracy
Validation Against R DESeq2
All 8 validated datasets achieve 100% significance agreement — every gene is classified identically (significant or not at padj < 0.05) between R and Rust.
| Dataset | Genes | Pearson r (log2FC) | Max abs. padj diff | Significance Agreement |
|---|---:|---:|---:|---:|
| A. baumannii GSE151925 | 3,851 | 1.00000000 | ~1e-7 | 3,851 / 3,851 (100%) |
| M. tuberculosis GSE100097 | 5,220 | 1.00000000 | ~1e-7 | 5,220 / 5,220 (100%) |
| Salmonella GSE46391 | 4,418 | 1.00000000 | ~1e-7 | 4,418 / 4,418 (100%) |
| P. aeruginosa GSE55197 | 6,014 | 1.00000000 | 5.00e-7 | 6,014 / 6,014 (100%) |
| E. coli GSE220559 | 4,498 | 1.00000000 | 5.01e-7 | 4,498 / 4,498 (100%) |
| S. aureus GSE130777 | 2,648 | 1.00000000 | 3.27e-5 | 2,648 / 2,648 (100%) |
| S. pneumoniae GSE137447 | 2,280 | 0.99999200 | 3.80e-2 | 2,280 / 2,280 (100%) |
| B. subtilis (simulated) | 4,200 | 1.00000000 | 1.21e-5 | 4,200 / 4,200 (100%) |
Across all 33,129 genes tested, there are zero discordant significance calls.
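The agreement metric can be sketched as follows. This is an illustration of the comparison described above, not the validation script used for the table: the function names are hypothetical, and treating a missing padj (NA in R) as "not significant" is an assumption.

```rust
/// A gene is called significant when its adjusted p-value is below `alpha`.
/// Assumption: a missing padj (NA in R) counts as not significant.
fn is_significant(padj: Option<f64>, alpha: f64) -> bool {
    matches!(padj, Some(p) if p < alpha)
}

/// Count genes whose significance call matches between two result sets.
/// Returns (concordant genes, total genes). Hypothetical helper.
fn significance_agreement(
    reference: &[Option<f64>],
    candidate: &[Option<f64>],
    alpha: f64,
) -> (usize, usize) {
    assert_eq!(
        reference.len(),
        candidate.len(),
        "both result sets must cover the same genes"
    );
    let concordant = reference
        .iter()
        .zip(candidate)
        .filter(|(a, b)| is_significant(**a, alpha) == is_significant(**b, alpha))
        .count();
    (concordant, reference.len())
}
```

Under this definition, padj values may differ between implementations without affecting agreement, as long as both fall on the same side of the threshold.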
Why Are There Small Numerical Differences?
The minor differences in log2 fold-change values (typically < 1e-06) are inherent to floating-point arithmetic and do not indicate a bug. They arise because:
- Different linear algebra backends. R DESeq2 calls LAPACK/BLAS routines written in Fortran. RustDESeq2 uses pure Rust implementations (ndarray). Even with identical algorithms, different compilers emit different instruction orderings and SIMD vectorizations.
- Floating-point non-associativity. IEEE 754 arithmetic does not obey the associative law: (a + b) + c may differ from a + (b + c) by the least significant bit. Each matrix multiplication, QR decomposition, or dot product accumulates these rounding differences.
- IRLS convergence paths. DESeq2 fits a negative binomial GLM via Iteratively Reweighted Least Squares. When two implementations reach the convergence threshold at slightly different iteration counts, the final coefficient estimates can diverge at the last few decimal places.
These differences are a fundamental property of numerical computing — even recompiling the same code with a different compiler version or on a different CPU architecture can change the last few bits of a floating-point result.
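The non-associativity point is easy to reproduce. The two helper functions below are illustrative only; they add the same three f64 values with different grouping.

```rust
/// Adds three values, grouping the first two: (a + b) + c.
fn left_grouped(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Adds the same three values, grouping the last two: a + (b + c).
fn right_grouped(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}
```

Calling both with 0.1, 0.2 and 0.3 gives 0.6000000000000001 for the left grouping and 0.6 for the right grouping. A linear algebra backend that reorders a summation can therefore change the last bit of a result even though the algorithm is unchanged.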
R Function Mapping
Key equivalences between R DESeq2 and RustDESeq2:
| R DESeq2 | RustDESeq2 | Description |
|---|---|---|
| DESeq() | run_deseq() | Full pipeline |
| estimateSizeFactors() | estimate_size_factors() | Normalization |
| estimateDispersions() | estimate_dispersions() | Dispersion estimation |
| nbinomWaldTest() | wald_test() | Wald test |
| nbinomLRT() | likelihood_ratio_test() | LRT |
| results() | results() / results_extended() | Extract results |
| resultsNames() | results_names() | Coefficient names |
| lfcShrink(type="normal") | shrink_lfc_normal() | Normal shrinkage |
| lfcShrink(type="apeglm") | shrink_lfc_apeglm() | Apeglm shrinkage |
| lfcShrink(type="ashr") | apply_ashr_shrinkage() | Ashr shrinkage |
| varianceStabilizingTransformation() | vst() | VST |
| rlog() | rlog() | Regularized log |
| fpkm() | fpkm() | FPKM normalization |
| fpm() | fpm() | FPM normalization |
| collapseReplicates() | collapse_replicates() | Collapse technical replicates |
Roadmap: Single-Cell RNA-seq Support
RustDESeq2 currently targets bulk RNA-seq. The following plan covers full single-cell support, implementing all scRNA-seq features from R DESeq2 in pure Rust (no R/glmGamPoi dependency).
Phase 1: Parameter Exposure (Easy)
Already-implemented algorithms that just need CLI/API configurability.
| Feature | R Parameter | Current State | Work Required |
|---|---|---|---|
| Configurable minmu | DESeq(minmu=1e-6) | Hardcoded 0.5 | Propagate through pipeline, add --min-mu CLI flag |
| Disable outlier replacement | minReplicatesForReplace=Inf | Hardcoded 7 | Add --min-replicates CLI flag, support Inf |
| Non-integer counts | skipIntegerMode=TRUE | Already uses f64 | Add optional integer validation toggle |
| --single-cell preset | N/A | N/A | Convenience flag: sets minmu=1e-6, minReplicatesForReplace=Inf, sfType=poscounts, useT=TRUE |
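One way the planned preset could be expressed is as a parameter struct with two constructors. This is a design sketch only: `PipelineParams`, `SizeFactorType` and the method names are hypothetical and do not exist in the crate. The bulk values mirror the current hardcoded state and CLI defaults listed in this README, and the single-cell values mirror the preset row above.

```rust
/// Size factor estimation method (hypothetical enum).
#[derive(Debug, Clone, PartialEq)]
enum SizeFactorType {
    Ratio,
    PosCounts,
    Iterate,
}

/// Parameters that Phase 1 would expose (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
struct PipelineParams {
    /// Lower bound on fitted means.
    min_mu: f64,
    /// Minimum replicates before outliers are replaced.
    /// Stored as f64 so that infinity can disable replacement.
    min_replicates_for_replace: f64,
    sf_type: SizeFactorType,
    use_t: bool,
}

impl PipelineParams {
    /// Current bulk behaviour: minmu 0.5, 7 replicates, ratio, no t-test.
    fn bulk_defaults() -> Self {
        Self {
            min_mu: 0.5,
            min_replicates_for_replace: 7.0,
            sf_type: SizeFactorType::Ratio,
            use_t: false,
        }
    }

    /// Values listed for the planned --single-cell preset.
    fn single_cell() -> Self {
        Self {
            min_mu: 1e-6,
            min_replicates_for_replace: f64::INFINITY,
            sf_type: SizeFactorType::PosCounts,
            use_t: true,
        }
    }

    /// Outliers are replaced only when a group has enough replicates.
    fn replaces_outliers(&self, n_replicates: usize) -> bool {
        n_replicates as f64 >= self.min_replicates_for_replace
    }
}
```

Storing the replicate threshold as a float lets infinity disable replacement without a separate boolean, which matches how the R parameter is described.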
Phase 2: Observation Weights (Medium)
Per-gene-per-sample weight matrix for zero-inflation models (zinbwave integration).
| Feature | R Function | Work Required |
|---|---|---|
| Weight storage | assays(dds)[["weights"]] | Add Option<Array2<f64>> to DESeqDataSet |
| Weighted base means | getBaseMeansAndVariances() | Multiply normalized counts by weights |
| Weighted GLM fitting | C++ fitBeta(weightsSEXP=) | Modify IRLS to use W_ij * w_ij in weight matrix |
| Weighted dispersion estimation | C++ fitDisp(weightsSEXP=) | Pass weights through dispersion IRLS |
| Weight-aware df for useT | rowSums(weights) | df = sum(weights_per_gene) - n_coefs |
| Degenerate design detection | getAndCheckWeights() | Flag genes where weights collapse design rank as allZero |
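The weight-aware degrees of freedom row reduces to a row sum. A minimal sketch, assuming a genes-by-samples weight matrix stored as nested vectors (the roadmap itself proposes Array2<f64>) and a hypothetical function name:

```rust
/// Residual degrees of freedom per gene when observation weights are used:
/// df = sum(weights for that gene) - n_coefs.
/// Hypothetical helper; `weights` is genes x samples.
fn weighted_residual_df(weights: &[Vec<f64>], n_coefs: usize) -> Vec<f64> {
    weights
        .iter()
        .map(|row| row.iter().sum::<f64>() - n_coefs as f64)
        .collect()
}
```

With all weights equal to 1 this is the usual number of samples minus the number of coefficients. A gene whose weights sum to no more than n_coefs gets a non-positive df, so it would need special handling before a t-distribution p-value is computed.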
Phase 3: glmGamPoi Equivalent (Hard)
Pure Rust reimplementation of the glmGamPoi algorithms, optimized for datasets with many cells and sparse counts.
| Feature | R (glmGamPoi) Function | Description |
|---|---|---|
| Overdispersion MLE | overdispersion_mle() | Gene-wise dispersion estimation via closed-form + Newton for NB with minmu=1e-6 |
| Local median trend | loc_median_fit() | Dispersion-mean trend via running median (replaces parametric/local/mean) |
| Quasi-likelihood shrinkage | overdispersion_shrinkage() | Empirical Bayes shrinkage producing qlDisp values (MLE, fit, MAP) |
| GLM fitting | glm_gp() | NB GLM fit using quasi-likelihood framework |
| Quasi-likelihood F-test | test_de() | F-statistic based LRT (replaces chi-squared) with quasi-likelihood df |
| New fitType variant | fitType="glmGamPoi" | Add TrendFitMethod::GlmGamPoi to route through the above |
Implementation Notes
- All features will be implemented in pure Rust with no external R/Python dependencies
- Phase 1 can be released independently as it only requires parameter plumbing
- Phase 2 enables zinbwave-style workflows where weights are pre-computed externally
- Phase 3 is the most complex but provides the largest performance benefit for scRNA-seq
- glmGamPoi and observation weights are mutually exclusive (same as R)
License
MIT
Acknowledgments
This project is an independent Rust reimplementation of the DESeq2 algorithm originally developed by Michael Love, Wolfgang Huber, and Simon Anders.
Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8
The original R implementation is available at https://bioconductor.org/packages/release/bioc/html/DESeq2.html (LGPL >= 3).
The LFC shrinkage methods reference:
- apeglm: Zhu, A., Ibrahim, J.G., Love, M.I. (2019). Heavy-tailed prior distributions for sequence count data. Bioinformatics, 35:2084-2092.
- ashr: Stephens, M. (2017). False discovery rates: a new deal. Biostatistics, 18:275-294.