# crush-gpu

GPU-accelerated tile-based compression engine with 32-way parallel decompression via GDeflate.

## Overview
crush-gpu is the GPU acceleration crate for the Crush compression toolkit. It implements a GDeflate-inspired compression format designed for massively parallel GPU decompression.
Key design principles:
- 64 KB independent tiles enable parallel processing and random access
- 32-way sub-stream parallelism matches GPU warp/wavefront width
- Batched GPU dispatch minimizes host-GPU synchronization overhead
- Automatic CPU fallback when no GPU is available or decompression fails
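The first principle can be made concrete with a minimal sketch of splitting input into independently compressible 64 KB tiles. The helper name `split_tiles` is illustrative, not part of the crate's API:

```rust
const TILE_SIZE: usize = 64 * 1024; // 64 KB per tile

/// Split input into fixed-size tiles. Because each tile is compressed
/// independently, tiles can be decompressed in parallel or accessed at
/// random. (Illustrative helper, not the crate API.)
fn split_tiles(data: &[u8]) -> Vec<&[u8]> {
    data.chunks(TILE_SIZE).collect()
}

fn main() {
    let data = vec![0u8; 150 * 1024]; // 150 KB input
    let tiles = split_tiles(&data);
    assert_eq!(tiles.len(), 3);            // ceil(150 / 64) = 3 tiles
    assert_eq!(tiles[2].len(), 22 * 1024); // final partial tile
    println!("{} tiles", tiles.len());
}
```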
## Architecture

```
┌─────────────────────────────────┐
│            engine.rs            │
│    compress() / decompress()    │
└──────┬──────────────┬───────────┘
       │              │
┌──────▼───────┐ ┌────▼───────────┐
│ gdeflate.rs  │ │   backend/     │
│ CPU compress │ │ GPU decompress │
│ CPU fallback │ │  (wgpu/CUDA)   │
└──────────────┘ └────┬───────────┘
                      │
                ┌─────▼──────────┐
                │    shader/     │
                │  WGSL compute  │
                └────────────────┘
```
## Modules

| Module | Purpose |
|---|---|
| `engine` | Top-level compress/decompress API, tile management, random access |
| `backend` | GPU abstraction layer (`ComputeBackend` trait), wgpu implementation |
| `format` | CGPU binary file format (headers, tile index, footer) |
| `gdeflate` | GDeflate compression (CPU) and decompression (CPU fallback) |
| `scorer` | GPU eligibility scoring (file size, entropy, GPU availability) |
| `entropy` | Shannon entropy calculation for compressibility assessment |
| `vectorize` | Heuristic for text-heavy data detection |
| `lz77` | LZ77 match finding for the v1 sub-stream format |
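The `backend` module abstracts GPU dispatch behind a `ComputeBackend` trait. A hedged sketch of what such an abstraction could look like, with a CPU stand-in for the fallback path; the method names here are assumptions, not the real interface:

```rust
/// Hypothetical shape of the GPU abstraction layer; the real
/// ComputeBackend trait in backend/ may differ.
trait ComputeBackend {
    /// Human-readable backend name ("wgpu", "cuda", ...).
    fn name(&self) -> &str;
    /// Decompress a batch of tiles in one GPU submission.
    fn decompress_batch(&self, tiles: &[Vec<u8>]) -> Result<Vec<Vec<u8>>, String>;
}

/// CPU stand-in used when no GPU is available (fallback path).
struct CpuFallback;

impl ComputeBackend for CpuFallback {
    fn name(&self) -> &str {
        "cpu-fallback"
    }
    fn decompress_batch(&self, tiles: &[Vec<u8>]) -> Result<Vec<Vec<u8>>, String> {
        // A real implementation would run the gdeflate CPU decoder here;
        // this stand-in just passes tiles through.
        Ok(tiles.to_vec())
    }
}

fn main() {
    let backend: Box<dyn ComputeBackend> = Box::new(CpuFallback);
    assert_eq!(backend.name(), "cpu-fallback");
    let out = backend.decompress_batch(&[vec![1, 2, 3]]).unwrap();
    assert_eq!(out[0], vec![1, 2, 3]);
}
```

Keeping the trait object-safe lets the engine hold a `Box<dyn ComputeBackend>` and swap wgpu, CUDA, or the CPU fallback at runtime.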
## Performance

### GPU Decompression Throughput (Batched Dispatch)

Test environment:

- GPU: NVIDIA GeForce RTX 3060 Ti (Vulkan)
- Rust: 1.93.1 (stable), release mode
| Corpus | GPU Throughput | CPU Throughput | Winner |
|---|---|---|---|
| log-1MB | 137 MiB/s | 329 MiB/s | CPU (small data) |
| binary-1MB | 86 MiB/s | 126 MiB/s | CPU (small data) |
| mixed-1MB | 87 MiB/s | 181 MiB/s | CPU (small data) |
| mixed-10MB | 344 MiB/s | 186 MiB/s | GPU 1.85x |
GPU decompression outperforms CPU at larger data sizes where the per-tile dispatch overhead is amortized. The crossover point is around 2-4 MB.
### Batched vs Per-Tile Dispatch
Batching multiple tiles into a single GPU submission eliminates per-tile host-GPU synchronization overhead:
| Dispatch Mode | 1 MB (16 tiles) | Improvement |
|---|---|---|
| Per-tile (old) | ~5-12 MiB/s | baseline |
| Batched (current) | 120+ MiB/s | 10-24x |
### Compression Throughput (CPU)
| Corpus | Throughput |
|---|---|
| log-text-1MB | 178 MiB/s |
| binary-1MB | 70 MiB/s |
| mixed-1MB | 99 MiB/s |
| mixed-10MB | 104 MiB/s |
## File Format (CGPU)

```
┌──────────────────────┐
│ GpuFileHeader (64B)  │  Magic "CGPU", version, tile_size, tile_count
├──────────────────────┤
│ Tile 0               │  TileHeader (32B) + compressed payload
│ (128-byte aligned)   │
├──────────────────────┤
│ Tile 1               │
├──────────────────────┤
│ ...                  │
├──────────────────────┤
│ Tile Index           │  O(1) random access to any tile
├──────────────────────┤
│ GpuFileFooter (24B)  │  Index offset, checksum, magic
└──────────────────────┘
```
- Tile size: 64 KB (default), independently decompressible
- Alignment: 128-byte boundaries for GPU memory coalescing
- Checksums: Optional per-tile CRC32 integrity verification
- Random access: Tile index enables O(1) decompression of any tile (~1 ms)
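A hedged Rust sketch of the layout and the random-access lookup described above. Field names, types, and the `tile_offset` helper are assumptions based on the stated sizes, not the crate's actual definitions:

```rust
/// Illustrative header layout; the real GpuFileHeader occupies 64 bytes
/// and its field names may differ.
#[repr(C)]
#[allow(dead_code)]
struct GpuFileHeader {
    magic: [u8; 4], // b"CGPU"
    version: u32,
    tile_size: u32, // 64 KB default
    tile_count: u32,
    // ...reserved/padding up to 64 bytes total
}

/// O(1) random access: tile N's payload offset is a single index
/// lookup, with no scan through earlier tiles.
fn tile_offset(index: &[u64], tile: usize) -> u64 {
    index[tile]
}

fn main() {
    // Hypothetical tile index of absolute, 128-byte-aligned offsets.
    let index: Vec<u64> = vec![128, 33_280, 66_560];
    let off = tile_offset(&index, 2);
    assert_eq!(off, 66_560);
    assert_eq!(off % 128, 0); // alignment for GPU memory coalescing
}
```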
## GDeflate Algorithm
GDeflate distributes DEFLATE across 32 parallel sub-streams:
- LZ77 match finding (greedy, 3-byte hash chains)
- Round-robin distribution of symbols across 32 sub-streams
- Fixed Huffman encoding (BTYPE=01) per sub-stream
- Interleaved serialization for GPU-friendly memory access
On decompression, 32 GPU threads each decode one sub-stream in parallel, then reconstruct the original data.
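The round-robin step can be sketched as follows. This is a simplified model over abstract symbols; the real encoder interleaves Huffman-coded bits, not raw symbols:

```rust
const LANES: usize = 32; // matches GPU warp/wavefront width

/// Round-robin distribution of a symbol stream across 32 sub-streams.
/// Lane t receives symbols t, t + 32, t + 64, ...
fn distribute(symbols: &[u32]) -> Vec<Vec<u32>> {
    let mut streams: Vec<Vec<u32>> = vec![Vec::new(); LANES];
    for (i, &sym) in symbols.iter().enumerate() {
        streams[i % LANES].push(sym);
    }
    streams
}

fn main() {
    let symbols: Vec<u32> = (0..70).collect();
    let streams = distribute(&symbols);
    assert_eq!(streams.len(), 32);
    assert_eq!(streams[0], vec![0, 32, 64]); // lane 0: every 32nd symbol
    assert_eq!(streams[5], vec![5, 37, 69]);
    // On decompression, thread t decodes streams[t]; re-interleaving
    // the lanes reconstructs the original symbol order.
}
```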
Reference: IETF draft draft-uralsky-gdeflate-00
## Usage

### As a Library

```rust
use crush_gpu::{compress, decompress, EngineConfig};
use std::sync::Arc;
use std::sync::atomic::AtomicBool;

// Note: the exact signatures below are reconstructed from context;
// check the crate docs for the authoritative API.
let cancel = Arc::new(AtomicBool::new(false));
let config = EngineConfig::default();

// Compress (always on CPU)
let data = b"Hello, GPU compression!".repeat(1024); // repeat count is arbitrary
let compressed = compress(&data, &config, &cancel).expect("compression failed");

// Decompress (GPU if available, CPU fallback)
let decompressed = decompress(&compressed, &config, &cancel).expect("decompression failed");

assert_eq!(data, decompressed);
```
### Configuration

```rust
use crush_gpu::EngineConfig;

// Config fields were elided here; see EngineConfig's docs for the full
// set. Start from defaults and override as needed:
let config = EngineConfig {
    ..EngineConfig::default()
};
```
### Random Access

```rust
use crush_gpu::{decompress_tile_by_index, load_tile_index, EngineConfig};

// `compressed_data` is a previously produced CGPU archive; the
// argument shapes below are reconstructed, not authoritative.
let config = EngineConfig::default();
let archive: &[u8] = &compressed_data;

// Load tile index (O(1) per tile)
let index = load_tile_index(archive).expect("failed to load tile index");

// Decompress only the tile you need (tile number 3 is illustrative)
let tile_data = decompress_tile_by_index(archive, &index, 3, &config)
    .expect("tile decompression failed");
```
### As a crush-core Plugin

crush-gpu auto-registers as a crush-core plugin via linkme:

```rust
use crush_core::{init_plugins, list_plugins};

// Reconstructed sketch; see crush-core's plugin docs for exact paths.
init_plugins().expect("plugin initialization failed");
for plugin in list_plugins() {
    println!("{}", plugin.name());
}
```
## GPU Backend

### Supported APIs

| API | Platform | Status |
|---|---|---|
| Vulkan 1.2+ | Windows, Linux | Supported |
| Metal 2+ | macOS | Supported |
| DX12 | Windows | Supported |
| CUDA | NVIDIA (optional) | Feature-gated |

### Requirements

- 2 GB+ VRAM (discrete GPU recommended)
- Vulkan, Metal, or DX12 driver
- The CUDA feature requires the `cudarc` dependency
## GPU Eligibility
The scorer module automatically determines whether GPU acceleration benefits a given workload:
| Criterion | Threshold |
|---|---|
| File size | > 100 MB |
| GPU available | Yes |
| Shannon entropy | < 7.5 bits/byte |
All criteria must pass for GPU dispatch. High-entropy (incompressible) data is rejected to avoid wasting GPU resources.
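A hedged sketch of how such a gate might combine the three criteria, including a standard Shannon-entropy calculation. Thresholds come from the table above; the function shapes are illustrative, not the `scorer` module's API:

```rust
/// Shannon entropy in bits per byte (0.0..=8.0). Random data
/// approaches 8.0; repetitive data approaches 0.0.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// All three criteria must pass for GPU dispatch (illustrative gate).
fn gpu_eligible(len: usize, gpu_available: bool, entropy: f64) -> bool {
    len > 100 * 1024 * 1024 && gpu_available && entropy < 7.5
}

fn main() {
    // Highly repetitive data: entropy is 0 bits/byte.
    assert_eq!(shannon_entropy(&[0u8; 1024]), 0.0);
    // A small file fails the size gate even with a GPU present.
    assert!(!gpu_eligible(1 << 20, true, 1.0));
    // A large, compressible file with a GPU passes.
    assert!(gpu_eligible(200 * 1024 * 1024, true, 3.2));
}
```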
## Running Benchmarks

```sh
# Bench target names below are illustrative; check Cargo.toml for the
# actual names.

# Throughput benchmarks (GPU vs CPU decompression)
cargo bench -p crush-gpu --bench throughput

# Compression ratio benchmarks
cargo bench -p crush-gpu --bench ratio

# All crush-gpu tests
cargo test -p crush-gpu
```
## Development

```sh
# Standard cargo invocations; the -p flag assumes a workspace layout.

# Build
cargo build -p crush-gpu

# Test
cargo test -p crush-gpu

# Clippy
cargo clippy -p crush-gpu

# Docs
cargo doc -p crush-gpu --no-deps
```
## License
MIT