
Compression Algorithms

Comparison of compression algorithms supported by Hexz.

Supported Algorithms

Hexz supports two compression algorithms optimized for different use cases.

LZ4

Type: Fast compression/decompression
Use Case: Hot data, frequent random access, local NVMe storage

Characteristics

  • Compression Speed: Very fast
  • Decompression Speed: Very fast
  • Compression Ratio: Moderate (2-3x typical)
  • CPU Usage: Low
  • Memory Usage: Low

When to Use LZ4

  • Random access workloads (ML training with shuffling)
  • VM boot scenarios requiring low latency
  • Local NVMe storage where I/O is fast
  • CPU-constrained environments
  • Real-time data processing

Configuration

# CLI
hexz data pack --disk data/ --output out.hxz --compression lz4
# Python
hexz.open("out.hxz", mode="w", compression="lz4")

Zstandard (Zstd)

Type: High compression ratio
Use Case: Cold storage, S3 streaming, archival

Characteristics

  • Compression Speed: Moderate to slow (depending on level)
  • Decompression Speed: Fast to moderate
  • Compression Ratio: High (3-5x typical, up to 8x at level 22)
  • CPU Usage: Medium to high
  • Memory Usage: Medium

Compression Levels

Level  Speed      Ratio    Use Case
1      Fastest    Lowest   Real-time compression
3      Fast       Good     Default, balanced
9      Moderate   Better   S3 storage
15     Slow       High     Archival
22     Very slow  Highest  Long-term cold storage
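The speed-versus-ratio trade-off behind these levels can be seen in miniature with the Python standard library. Zstd itself is not in the stdlib, so this sketch uses zlib, which exposes an analogous 1-9 level knob; absolute numbers differ from Zstd, but the shape of the trade-off is the same.

```python
import time
import zlib

# Sample payload with repeated structure, as compressible data typically has.
data = b"hexz block payload with repeated structure " * 2000

# Higher levels spend more CPU time to squeeze out a better ratio.
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level={level}  ratio={ratio:.1f}x  time={elapsed * 1000:.2f} ms")
```

On redundant data like this, even level 1 compresses well; the higher levels mainly matter for cold data where the extra CPU cost is paid once.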

When to Use Zstandard

  • S3 streaming (save bandwidth)
  • Archival storage
  • Sequential access patterns
  • Storage-constrained environments
  • Data with high redundancy

Configuration

# CLI with compression level
hexz data pack \
  --disk data/ \
  --output out.hxz \
  --compression zstd \
  --compression-level 9
# Python with compression level
hexz.open("out.hxz", mode="w", compression="zstd", compression_level=9)

Comparison

Performance

Validated on i7-14700K (single-threaded):

Algorithm  Compress        Decompress      Ratio                          CPU
LZ4        22 GB/s         32 GB/s         2-3x [measured in benchmarks]  Low
Zstd-3     8-9 GB/s        [not measured]  3-4x [estimated]               Medium
Zstd-9+    [not measured]  [not measured]  [not measured]                 High

Note: Actual performance varies by data characteristics. Compression ratios need validation with real datasets.

Storage Savings

[BENCHMARK NOT YET VALIDATED]

Compression ratios depend heavily on data characteristics. Typical ranges:

  • LZ4: 2-3× (measured in micro-benchmarks)
  • Zstd-3: 3-4× (estimated)
  • Zstd-9+: 4-5× (estimated)

Actual savings on real datasets (images, VMs, code) need validation through macro-benchmarks.
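One way to run such a validation is to measure the ratio directly on your own payloads. The sketch below uses zlib as a stand-in codec (the `compression_ratio` helper is illustrative, not part of hexz); swapping in LZ4 or Zstd bindings would give the ratios this page quotes. It also shows why per-dataset measurement matters: redundant data compresses far better than random-looking data.

```python
import os
import zlib

def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return original_size / compressed_size for one payload.

    Illustrative helper using zlib as a stand-in codec; not a hexz API.
    """
    return len(payload) / len(zlib.compress(payload, level))

# Highly redundant data -> high ratio; random data -> roughly 1x (incompressible).
print(compression_ratio(b"A" * 65536))
print(compression_ratio(os.urandom(65536)))
```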

Use Case Recommendations

Scenario             Algorithm  Level  Reason
ML training (local)  LZ4        N/A    Fast decompression, random access
ML training (S3)     Zstd       3-9    Balances bandwidth and decompression cost
VM boot              LZ4        N/A    Low latency required
Archival             Zstd       15-22  Maximum compression
Frequent updates     LZ4        N/A    Faster repack operations
Infrequent access    Zstd       9+     Optimizes storage cost

Block Size Impact

[BENCHMARK NOT YET VALIDATED]

Compression ratio generally improves with larger blocks, but this trades off against random access latency. A dedicated benchmark is needed to measure the actual relationship between block size, compression ratio, and access latency.

Default block size is 64KB as a balanced compromise. See ADR-0002 for block-level compression rationale.
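The basic mechanism behind this trade-off can be sketched in a few lines: compress each block independently, and observe how the overall ratio changes with block size. This uses zlib as a stand-in for the per-block codec (the `ratio_at_block_size` helper is illustrative, not a hexz API); per-block headers and dictionary resets are what penalize small blocks.

```python
import zlib

def ratio_at_block_size(payload: bytes, block_size: int) -> float:
    """Compress payload independently per block; return the overall ratio.

    Sketch of block-level compression using zlib as a stand-in codec.
    """
    total = 0
    for off in range(0, len(payload), block_size):
        total += len(zlib.compress(payload[off:off + block_size]))
    return len(payload) / total

payload = b"some structured record that repeats across blocks " * 8192
for kib in (4, 64, 1024):
    print(f"{kib:5d} KiB blocks -> {ratio_at_block_size(payload, kib * 1024):.1f}x")
```

Larger blocks amortize per-block overhead and give the codec more context, but every random read must decompress a whole block, which is the latency side of the trade-off.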

Choosing an Algorithm

Decision tree:

  1. Is storage/bandwidth the primary constraint? Yes: use Zstd (level 9+). No: continue.
  2. Is random access latency critical? Yes: use LZ4. No: continue.
  3. Is data accessed from S3? Yes: use Zstd (level 3-9). No: continue.
  4. Is CPU limited? Yes: use LZ4. No: use Zstd.
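The decision tree can be encoded as a small helper, evaluating the questions in order and returning at the first decisive answer. This is an illustrative sketch; the function name and signature are not part of the hexz API.

```python
def choose_compression(storage_constrained: bool,
                       latency_critical: bool,
                       on_s3: bool,
                       cpu_limited: bool) -> str:
    """Walk the decision tree top-down; first matching question wins."""
    if storage_constrained:          # storage/bandwidth is the primary constraint
        return "zstd (level 9+)"
    if latency_critical:             # random access latency is critical
        return "lz4"
    if on_s3:                        # data is accessed from S3
        return "zstd (level 3-9)"
    if cpu_limited:                  # CPU is the bottleneck
        return "lz4"
    return "zstd"

# Example: local ML training with latency-sensitive random access.
print(choose_compression(False, True, False, False))
```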

Changing Compression

To change compression algorithm on existing snapshot:

# Repack with different compression
hexz data pack \
  --disk original.hxz \
  --output recompressed.hxz \
  --compression lz4

Note: This requires reading and rewriting all data.

Future Algorithms

Potential additions in future versions:

  • LZMA (higher compression ratio)
  • Brotli (web-optimized)
  • Snappy (Google's fast compression)

See Also