Understanding Compression and Deduplication
Time to Complete: 15 minutes
What You'll Learn: How Hexz's block-level compression and deduplication work through hands-on experimentation.
What You'll Build: Multiple snapshots demonstrating compression ratios and deduplication benefits.
Prerequisites
- Completed Getting Started
- Hexz CLI and Python package installed
- Basic understanding of file compression
Learning Objectives
By the end of this tutorial, you will:
- Understand the difference between file-level and block-level compression
- See how different data types compress
- Understand content-defined chunking (CDC)
- Observe deduplication in action
- Learn when to enable different compression features
Step 1: Observe Basic Compression
Let's create files with different characteristics and see how they compress.
Create compression_test.py:
import os
import random

# Create test files
os.makedirs("/tmp/compression_test", exist_ok=True)

# File 1: Highly repetitive data (compresses well)
with open("/tmp/compression_test/repetitive.bin", "wb") as f:
    f.write(b"AAAA" * 25000)  # 100KB of repeated pattern

# File 2: Random data (compresses poorly)
with open("/tmp/compression_test/random.bin", "wb") as f:
    f.write(bytes([random.randint(0, 255) for _ in range(100000)]))

# File 3: Text data (compresses moderately)
with open("/tmp/compression_test/text.txt", "w") as f:
    text = "The quick brown fox jumps over the lazy dog. " * 2000
    f.write(text)

print("Test files created")
print(f"repetitive.bin: {os.path.getsize('/tmp/compression_test/repetitive.bin'):,} bytes")
print(f"random.bin: {os.path.getsize('/tmp/compression_test/random.bin'):,} bytes")
print(f"text.txt: {os.path.getsize('/tmp/compression_test/text.txt'):,} bytes")
Run it:
python compression_test.py
Expected Output:
Test files created
repetitive.bin: 100,000 bytes
random.bin: 100,000 bytes
text.txt: 90,000 bytes
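You can verify this compressibility spread before involving Hexz at all, using Python's built-in zlib module. The exact byte counts depend on the zlib version, so treat the output as illustrative — only the relative ordering matters.

```python
# Reproduce the compressibility spread with Python's built-in zlib,
# independent of Hexz.
import random
import zlib

repetitive = b"AAAA" * 25000                             # 100KB repeated pattern
rng = random.Random(0)                                   # seeded for reproducibility
rand = bytes(rng.randrange(256) for _ in range(100000))  # 100KB of noise
text = ("The quick brown fox jumps over the lazy dog. " * 2000).encode()

for name, data in [("repetitive", repetitive), ("random", rand), ("text", text)]:
    compressed = len(zlib.compress(data))
    print(f"{name:10s}: {len(data):,} -> {compressed:,} bytes "
          f"(ratio: {len(data) / compressed:.1f}x)")
```

Repetitive data collapses to a few hundred bytes, random data stays essentially the same size, and text lands in between — the same pattern you will see in the Hexz archives below.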
Step 2: Compare LZ4 vs Zstandard
Now pack these files with different compression algorithms.
import hexz
import os

def pack_and_report(input_dir, output_file, compression):
    with hexz.open(output_file, mode="w", compression=compression) as writer:
        for filename in os.listdir(input_dir):
            filepath = os.path.join(input_dir, filename)
            if os.path.isfile(filepath):
                writer.add(filepath)
    original_size = sum(
        os.path.getsize(os.path.join(input_dir, f))
        for f in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, f))
    )
    compressed_size = os.path.getsize(output_file)
    ratio = original_size / compressed_size
    print(f"{compression.upper():6s}: {compressed_size:,} bytes (ratio: {ratio:.2f}x)")

print("\nCompression comparison:")
original = sum(
    os.path.getsize(f"/tmp/compression_test/{f}")
    for f in ["repetitive.bin", "random.bin", "text.txt"]
)
print(f"Original: {original:,} bytes\n")

pack_and_report("/tmp/compression_test", "/tmp/test_lz4.hxz", "lz4")
pack_and_report("/tmp/compression_test", "/tmp/test_zstd.hxz", "zstd")
Expected Output:
Compression comparison:
Original: 290,000 bytes
LZ4   : 118,000 bytes (ratio: 2.46x)
ZSTD  : 102,000 bytes (ratio: 2.84x)
What Just Happened:
- Repetitive data compressed extremely well
- Random data barely compressed (a fundamental limit of compression)
- Text compressed moderately well
- Zstandard achieved a better ratio than LZ4, but is slower
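The same ratio-versus-speed trade-off can be sketched without Hexz by using zlib's compression levels as a rough stand-in: level 1 plays the "fast" role that LZ4 fills, level 9 the "dense" role of Zstandard. The absolute numbers differ from LZ4 and Zstandard, but the shape of the trade-off is the same.

```python
# Illustrate the speed-vs-ratio trade-off with zlib levels 1 and 9.
import time
import zlib

data = ("The quick brown fox jumps over the lazy dog. " * 20000).encode()  # ~900KB

for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(out):,} bytes in {elapsed:.1f} ms "
          f"(ratio: {len(data) / len(out):.2f}x)")
```

Higher levels spend more CPU searching for matches to shave off bytes — the same reason Zstandard beats LZ4 on ratio but not on speed.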
Step 3: Understand Block-Level Compression
Block-level compression enables random access. Let's see how it works.
import hexz

# Create a large file with distinct sections
with open("/tmp/large_file.bin", "wb") as f:
    f.write(b"A" * 100000)  # Section 1: repeated 'A's
    f.write(b"B" * 100000)  # Section 2: repeated 'B's
    f.write(b"C" * 100000)  # Section 3: repeated 'C's

# Pack with block-level compression
with hexz.open("/tmp/blocks.hxz", mode="w", compression="lz4", block_size=65536) as writer:
    writer.add("/tmp/large_file.bin")

# Now we can read any section without decompressing everything
with hexz.open("/tmp/blocks.hxz") as reader:
    # Jump to the middle (section 2) without decompressing section 1
    reader.seek(100000)
    data = reader.read(10)
    print(f"Data at offset 100000: {data}")  # Should be b'BBBBBBBBBB'

    # Jump to the end (section 3)
    reader.seek(200000)
    data = reader.read(10)
    print(f"Data at offset 200000: {data}")  # Should be b'CCCCCCCCCC'
Expected Output:
Data at offset 100000: b'BBBBBBBBBB'
Data at offset 200000: b'CCCCCCCCCC'
Key Insight: Each seek only decompresses the blocks needed, not the entire file.
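To make the mechanism concrete, here is a minimal sketch of the same idea using zlib and an offset index: compress fixed-size blocks, remember where each compressed block starts, and serve a seek+read by decompressing only the blocks that cover the requested range. This illustrates the concept only; it is not Hexz's actual on-disk format.

```python
# Block-level compression sketch: per-block zlib streams plus an offset index.
import io
import zlib

BLOCK = 65536

def pack(data: bytes):
    """Compress data block by block; return (blob, list of block offsets)."""
    blob, index = io.BytesIO(), []
    for i in range(0, len(data), BLOCK):
        index.append(blob.tell())
        blob.write(zlib.compress(data[i:i + BLOCK]))
    return blob.getvalue(), index

def read_at(blob: bytes, index: list, offset: int, size: int) -> bytes:
    """Read `size` bytes at `offset`, decompressing only the needed blocks."""
    out = b""
    block_no = offset // BLOCK
    while len(out) < size and block_no < len(index):
        lo = index[block_no]
        hi = index[block_no + 1] if block_no + 1 < len(index) else len(blob)
        plain = zlib.decompress(blob[lo:hi])       # this one block only
        skip = offset + len(out) - block_no * BLOCK
        out += plain[skip:skip + size - len(out)]
        block_no += 1
    return out

data = b"A" * 100000 + b"B" * 100000 + b"C" * 100000
blob, index = pack(data)
print(read_at(blob, index, 100000, 10))  # b'BBBBBBBBBB'
print(read_at(blob, index, 200000, 10))  # b'CCCCCCCCCC'
```

A read at offset 100,000 lands in block 1 (of five 64KB blocks), so only that block is decompressed — the cost of a seek is bounded by the block size, not the file size.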
Step 4: Observe Deduplication with CDC
Content-Defined Chunking (CDC) finds duplicate content even after an insertion shifts the rest of the data.
import hexz
import os

# Create base file
with open("/tmp/version1.txt", "w") as f:
    f.write("Line 1\n" * 1000)
    f.write("Line 2\n" * 1000)
    f.write("Line 3\n" * 1000)

# Pack without CDC
with hexz.open("/tmp/v1_no_cdc.hxz", mode="w") as writer:
    writer.add("/tmp/version1.txt")

# Pack with CDC
with hexz.open("/tmp/v1_with_cdc.hxz", mode="w", cdc=True) as writer:
    writer.add("/tmp/version1.txt")

print("Version 1:")
print(f"  Without CDC: {os.path.getsize('/tmp/v1_no_cdc.hxz'):,} bytes")
print(f"  With CDC: {os.path.getsize('/tmp/v1_with_cdc.hxz'):,} bytes")

# Now create version 2 with an insertion at the beginning
with open("/tmp/version2.txt", "w") as f:
    f.write("NEW LINE INSERTED\n")  # New content at start
    f.write("Line 1\n" * 1000)
    f.write("Line 2\n" * 1000)
    f.write("Line 3\n" * 1000)

# Pack version 2 without CDC (everything shifts, no deduplication)
with hexz.open("/tmp/v2_no_cdc.hxz", mode="w") as writer:
    writer.add("/tmp/version2.txt")

# Pack version 2 with CDC (finds common blocks despite the shift)
with hexz.open("/tmp/v2_with_cdc.hxz", mode="w", cdc=True) as writer:
    writer.add("/tmp/version2.txt")

print("\nVersion 2 (with insertion at start):")
print(f"  Without CDC: {os.path.getsize('/tmp/v2_no_cdc.hxz'):,} bytes")
print(f"  With CDC: {os.path.getsize('/tmp/v2_with_cdc.hxz'):,} bytes")

# Show the benefit
v1_size = os.path.getsize("/tmp/v1_with_cdc.hxz")
v2_size = os.path.getsize("/tmp/v2_with_cdc.hxz")
print(f"\nWith CDC, version 2 is only {v2_size - v1_size:,} bytes larger")
print("(Just the new content, not the shifted content)")
Expected Output:
Version 1:
Without CDC: 15,234 bytes
With CDC: 15,234 bytes
Version 2 (with insertion at start):
Without CDC: 15,250 bytes
With CDC: 15,245 bytes
With CDC, version 2 is only 11 bytes larger
(Just the new content, not the shifted content)
What Just Happened:
- Without CDC: fixed block boundaries mean the insertion shifts everything, so deduplication is poor
- With CDC: content-defined boundaries mean the insertion only affects nearby blocks, so deduplication stays excellent
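A toy chunker makes the contrast concrete: a rolling hash over the last few bytes cuts a chunk whenever the hash matches a bit pattern, so boundaries follow content rather than absolute offsets, and an insertion only disturbs the chunk it lands in. The WINDOW and MASK values here are arbitrary illustrative choices, and this is not the chunking algorithm Hexz actually uses.

```python
# Fixed-size vs content-defined chunking under an insertion at the start.
import hashlib
import random

WINDOW, MASK = 16, (1 << 11) - 1   # min chunk 16 bytes, ~2KB average chunks

def cdc_chunks(data):
    """Cut a chunk wherever a rolling hash of the last 32 bytes hits MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # bytes older than 32 shift out
        if i - start >= WINDOW and (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data, size=2048):
    """Fixed-size chunking for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def shared(old, new):
    """Count chunks of `new` already present in `old` (matched by hash)."""
    stored = {hashlib.sha256(c).digest() for c in old}
    return sum(1 for c in new if hashlib.sha256(c).digest() in stored)

rng = random.Random(0)
v1 = bytes(rng.randrange(256) for _ in range(65536))
v2 = b"NEW LINE INSERTED\n" + v1   # 18 bytes inserted at the start

for name, chunker in [("fixed", fixed_chunks), ("CDC", cdc_chunks)]:
    c1, c2 = chunker(v1), chunker(v2)
    print(f"{name:5s}: {shared(c1, c2)} of {len(c2)} v2 chunks already stored")
```

With fixed chunks, the 18-byte shift misaligns every boundary and nothing deduplicates; with CDC, the chunker resynchronizes right after the insertion and almost every chunk of v2 is already stored.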
Step 5: Block Size Impact
Block size affects compression ratio and access latency.
# Create test data
dd if=/dev/urandom of=/tmp/test_data.bin bs=1M count=10
# Pack with different block sizes
hexz data pack --disk /tmp/test_data.bin --output /tmp/4kb.hxz --block-size 4096
hexz data pack --disk /tmp/test_data.bin --output /tmp/64kb.hxz --block-size 65536
hexz data pack --disk /tmp/test_data.bin --output /tmp/256kb.hxz --block-size 262144
# Compare sizes
ls -lh /tmp/*kb.hxz
Expected Output:
-rw-r--r-- 1 user user 10.2M 4kb.hxz
-rw-r--r-- 1 user user 10.1M 64kb.hxz
-rw-r--r-- 1 user user 10.0M 256kb.hxz
Observation: Larger blocks shrink the archive slightly because there are fewer per-block headers; on compressible data they also give the compressor more context. Note that /dev/urandom output is essentially incompressible, so the differences above come almost entirely from per-block overhead. The trade-off is slower random access, since each read must decompress a whole block.
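You can isolate the per-block overhead effect with plain zlib: compress the same compressible buffer in blocks of each size and sum the results. Every zlib stream carries fixed overhead and starts with no history, so more blocks means a larger total. This is a sketch of the trade-off, not a measurement of Hexz itself.

```python
# Sum of per-block zlib streams for three block sizes on compressible data.
import zlib

data = ("The quick brown fox jumps over the lazy dog. " * 20000).encode()  # ~900KB

for block in (4096, 65536, 262144):
    total = sum(len(zlib.compress(data[i:i + block]))
                for i in range(0, len(data), block))
    print(f"{block // 1024:3d}KB blocks: {total:,} bytes")
```

The 4KB run pays the per-block cost 220 times, the 256KB run only 4 times — the same effect that makes 4kb.hxz the largest archive above.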
What You've Accomplished
- [x] Observed different compression ratios for different data types
- [x] Compared LZ4 vs Zstandard compression
- [x] Understood block-level compression enables random access
- [x] Saw content-defined chunking (CDC) maintain deduplication despite insertions
- [x] Learned block size affects compression ratio and access speed
Key Takeaways
Compression Algorithm Choice:
- LZ4: fast decompression; use for random-access workloads
- Zstandard: better ratio; use for storage-constrained scenarios
Block Size Selection:
- Small (4KB): fastest random access, lower ratio
- Medium (64KB): balanced (the default)
- Large (256KB+): best ratio, slowest random access
CDC (Content-Defined Chunking):
- Enable with --cdc flag
- Essential for deduplicating multiple versions
- Slight CPU overhead but major storage savings
When to Use What:
- ML training (local): LZ4, 64KB blocks, no CDC
- ML training (S3): Zstd, 64KB blocks, with CDC
- VM boot: LZ4, 4-16KB blocks, no CDC
- Archival: Zstd level 9+, 256KB blocks, with CDC
- Dataset versions: any compression, any block size, with CDC (essential)
Next Steps
- Tutorial: First ML Pipeline — Apply compression to real datasets
- How-To: Performance Tuning — Optimize for your workload
- Reference: Compression Algorithms — Detailed specs
- Explanation: Compression Strategy — Design rationale
Cleanup
rm -rf /tmp/compression_test /tmp/*.hxz /tmp/*.bin /tmp/*.txt