23 releases
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.5.10 | Feb 1, 2026 |
| 0.5.0 | Jan 18, 2026 |
| 0.4.85 | Jan 9, 2026 |
| 0.4.83 | Nov 28, 2025 |
| 0.3.6 | Sep 11, 2025 |
#373 in Command line utilities
1MB, 20K SLoC
Profile 50GB datasets in seconds on your laptop.
DataProf is built for Data Scientists and Engineers who need to understand their data fast. No more MemoryError when trying to profile a CSV larger than your RAM.
Pandas-Profiling vs DataProf on a 10GB CSV:
| Feature | Pandas-Profiling / YData | DataProf |
|---|---|---|
| Memory Usage | 12GB+ (Crashes) | < 100MB (Streaming) |
| Speed | 15+ minutes | 45 seconds |
| Implementation | Python (Slow) | Rust (Fast) |
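The memory figure in the table rests on streaming: rows are read one at a time and only running aggregates are kept, so memory stays flat regardless of file size. As a rough illustration of that idea (not DataProf's actual implementation, which is written in Rust), a minimal standard-library sketch might look like this. The function name `stream_column_stats` and the returned fields are hypothetical.

```python
import csv
import io
import math

def stream_column_stats(lines, column):
    """Single-pass count/nulls/mean/std for one numeric column.

    Hypothetical helper for illustration only. Memory use is constant
    because only running aggregates are kept (Welford's algorithm).
    """
    reader = csv.DictReader(lines)
    count = nulls = 0
    mean = m2 = 0.0
    for row in reader:
        raw = (row.get(column) or "").strip()
        if not raw:
            nulls += 1
            continue
        try:
            value = float(raw)
        except ValueError:
            nulls += 1  # treat unparseable cells as missing
            continue
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    # Sample standard deviation; defined as 0.0 for fewer than two values
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return {"count": count, "nulls": nulls, "mean": mean, "std": std}

# Works on any iterable of lines, including an open file handle
sample = io.StringIO("id,price\n1,10\n2,20\n3,\n4,30\n")
stats = stream_column_stats(sample, "price")
print(stats)
```

Passing an open file handle instead of the in-memory sample gives the same single-pass behavior on a file of any size.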
Quick Start
Installation
The easiest way to get started is via pip:
pip install dataprof
Python Usage
Forget complex configurations. Just point to your file:
import dataprof
# Analyze a huge file without crashing memory
# Generates a report.html with quality metrics and distributions
dataprof.profile("huge_dataset.csv").save("report.html")
CLI & Rust Usage (Advanced)
If you prefer the command line or are a Rust developer:
# Install via cargo
cargo install dataprof
# Generate report from CLI
dataprof-cli report huge_data.csv -o report.html
More options: dataprof-cli --help | Full CLI Guide
Key Features
- No Size Limits: Profiles files larger than RAM using streaming and memory mapping.
- Blazing Fast: Written in Rust with SIMD acceleration.
- Privacy Guaranteed: Data never leaves your machine.
- Format Support: CSV, Parquet, JSON/L, and Databases (Postgres, MySQL, etc.).
- Smart Detection: Automatically identifies Emails, IPs, IBANs, Credit Cards, and more.
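To give a feel for what pattern detection of this kind involves, here is a small sketch that labels a column as email, IP address, or credit card number. It is an assumption about the general technique (a regex, the standard `ipaddress` module, and a Luhn checksum), not DataProf's detection logic. All function names and the 0.9 threshold are hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Deliberately loose email shape check: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def luhn_valid(number):
    """Luhn checksum over the digits of `number` (13 to 19 digits)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def is_ip(value):
    """True for a valid IPv4 or IPv6 address string."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def detect_kind(value):
    """Classify a single cell value."""
    value = value.strip()
    if EMAIL_RE.match(value):
        return "email"
    if is_ip(value):
        return "ip"
    if re.fullmatch(r"[\d -]+", value) and luhn_valid(value):
        return "credit_card"
    return "unknown"

def infer_column_kind(values, threshold=0.9):
    """Label a column when at least `threshold` of non-empty values agree."""
    kinds = Counter(detect_kind(v) for v in values if v and v.strip())
    if not kinds:
        return "unknown"
    kind, hits = kinds.most_common(1)[0]
    return kind if hits / sum(kinds.values()) >= threshold else "unknown"

print(infer_column_kind(["a@b.co", "c@d.org", "x@y.io", "n/a"], threshold=0.7))
```

Voting across the column rather than trusting any single cell keeps one malformed value from changing the label.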
Beautiful Reports
Interactive Demo
Animated walkthrough of dataprof features and dashboards
Single File Analysis
Interactive dashboards with quality scoring and distributions
Batch Processing Dashboard
Aggregate metrics from hundreds of files in one view
Documentation
Advanced Examples
Batch Processing (Python)
# Process a whole directory of files in parallel
result = dataprof.batch_analyze_directory("/data_folder", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")
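For readers curious about what a parallel directory scan with a files-per-second figure looks like in principle, the sketch below uses a thread pool from the standard library. It does not call DataProf: `count_rows` is a hypothetical stand-in for a real per-file profile, and `batch_profile` is an illustrative name, not part of the library's API.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_rows(path):
    """Stand-in for a real per-file profile: counts lines after the header.

    Naive on purpose: a quoted field containing a newline would be
    counted as an extra row.
    """
    with open(path, newline="") as fh:
        return path, max(sum(1 for _ in fh) - 1, 0)

def batch_profile(directory, recursive=True, workers=4):
    """Profile every CSV under `directory` and report throughput."""
    pattern = "**/*.csv" if recursive else "*.csv"
    files = sorted(Path(directory).glob(pattern))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pairs = list(pool.map(count_rows, files))
    elapsed = time.perf_counter() - start
    rate = len(files) / elapsed if elapsed > 0 else float("inf")
    # Key by path relative to the root so nested files stay distinct
    results = {str(p.relative_to(directory)): n for p, n in pairs}
    return results, rate

# Demo on a throwaway directory with three small files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"part{i}.csv").write_text("a,b\n" + "1,2\n" * (i + 1))
    results, rate = batch_profile(tmp)
    print(f"Processed {len(results)} files at {rate:.1f} files/sec")
```

Threads suit this stand-in because the work is I/O-bound; CPU-heavy profiling in pure Python would need processes instead.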
Database Integration (Python)
import asyncio
import dataprof

async def main():
    # Profile a SQL query directly
    await dataprof.analyze_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM sales_data_2024",
    )
asyncio.run(main())
Rust Library Usage
use dataprof::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let profiler = DataProfiler::auto();
    let report = profiler.analyze_file("dataset.csv")?;
    println!("Quality Score: {}", report.quality_score());
    Ok(())
}
Development
# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release
# Test databases (optional)
docker-compose -f .devcontainer/compose.yml up -d
# Common tasks
cargo test # Run tests
cargo bench # Benchmarks
cargo clippy # Linting
Feature Flags
# Minimal (CSV/JSON only)
cargo build --release
# With Apache Arrow (large files >100MB)
cargo build --release --features arrow
# With Parquet support
cargo build --release --features parquet
# With databases
cargo build --release --features postgres,mysql,sqlite
# Python async support
maturin develop --features python-async,database,postgres
# All features
cargo build --release --all-features
When to use Arrow: Large files (>100MB), many columns (>20), uniform types
When to use Parquet: Analytics, data lakes, Spark/Pandas integration
Documentation
User Guides: CLI Reference | Database Connectors
Contributing
We welcome contributions from everyone, whether you want to:
- Fix a bug
- Add a feature
- Improve documentation
- Report an issue
Quick Start for Contributors
1. Fork & clone:
   git clone https://github.com/YOUR-USERNAME/dataprof.git
   cd dataprof
2. Build & test:
   cargo build
   cargo test
3. Create a feature branch:
   git checkout -b feature/your-feature-name
4. Before submitting a PR:
   cargo fmt --all
   cargo clippy --all --all-targets
   cargo test --all
5. Submit a Pull Request with a clear description
All contributions are welcome. Please read CONTRIBUTING.md for guidelines and our Code of Conduct.
License
Dual-licensed under either:
You may use this project under the terms of either license.
Dependencies
~28–61MB
~1M SLoC