# Python API Reference
Complete reference for the Hexz Python package.
## Installation

```bash
pip install hexz
```

Or build from source:

```bash
git clone https://github.com/Alethic-Systems/hexz.git
cd hexz
make develop
```
## Opening Snapshots

The primary way to open snapshots is using `hexz.open()`:

### open

```python
open(path: PathLike, *, mode: str = 'r', **options: Any) -> Union[Reader, Writer]
```

Open a Hexz snapshot for reading or writing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `PathLike` | Path to a `.hxz` file. Supports local paths, HTTP/HTTPS URLs, and S3 URIs. | required |
| `mode` | `str` | `'r'` for reading, `'w'` for writing. | `'r'` |
| `**options` | `Any` | Additional options for `Reader` or `Writer`. | `{}` |
Keyword Arguments (Read Mode):

- `cache_size` (`str`): Block cache size (e.g., `"512M"`, `"1G"`, `"2GB"`). Default: ~4 MB.
- `prefetch` (`bool`): Enable background prefetching for sequential reads. Default: `True`.
- `s3_region` (`str`): AWS region for S3 URLs.
- `endpoint_url` (`str`): Custom S3 endpoint URL (for MinIO, Ceph, etc.).
- `allow_restricted` (`bool`): Allow connections to private/internal IPs. Default: `False`.

Keyword Arguments (Write Mode):

- `compression` (`str`): Compression algorithm (`'lz4'` or `'zstd'`).
- `block_size` (`int`): Block size in bytes.
- `packing` (`str`): Packing strategy (`'fast'`, `'tight'`, etc.).
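Size strings such as `"512M"` or `"2GB"` follow the usual binary-unit shorthand. As a rough illustration of that convention (this is not the library's actual parser, and `parse_size` is a hypothetical helper), such a string maps to a byte count like this:

```python
# Illustration only: a hypothetical parser for size strings like "512M",
# "1G", or "2GB"; binary (1024-based) units assumed. Not part of the hexz API.
def parse_size(spec: str) -> int:
    spec = spec.strip().upper().removesuffix("B")
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    if spec and spec[-1] in units:
        return int(spec[:-1]) * units[spec[-1]]
    return int(spec)  # plain byte count, no unit suffix
```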
Returns:

| Type | Description |
|---|---|
| `Union[Reader, Writer]` | `Reader` or `Writer` instance |
Example

```python
# Read with default settings (cache_size=default, prefetch=True)
with hexz.open("data.hxz") as reader:
    data = reader.read(4096)
    chunk = reader.read(100, offset=0)  # random access

# Read with custom cache and prefetch disabled
with hexz.open("data.hxz", cache_size="2G", prefetch=False) as reader:
    data = reader.read(4096)

# Write a new snapshot
with hexz.open("out.hxz", mode="w", packing="tight") as writer:
    writer.add("input.img")
```
## Reading Snapshots

The `Reader` class is returned by `hexz.open(path, mode='r')` and provides methods for reading data:

### Reader

High-level reader for Hexz snapshots with a pythonic interface.

Provides a file-like interface with additional random-access capabilities. Supports context managers, pickle serialization, and slice notation.

Example

```python
with hexz.Reader("dataset.hxz") as reader:
    data = reader.read(4096)
    # Zero-copy into buffer
    buf = bytearray(4096)
    n = reader.read(buffer=buf)
    chunk = reader.read(100, offset=1000)  # random access
    # Or slice notation
    chunk = reader[1000:1100]
```
#### size

*property*

```python
size: int
```

Total size of the snapshot in bytes.

#### metadata

*property*

```python
metadata: Metadata
```

File metadata (version, compression, etc.).
#### read

```python
read(size: int = -1, *, offset: Optional[int] = None, buffer: Optional[Union[bytearray, memoryview]] = None) -> Union[bytes, int]
```

Read bytes or fill a buffer. A single method for both stream and random access.

Reads from the current position (default) or at a specific offset. When a buffer is given, fills it and returns the number of bytes read.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `size` | `int` | Number of bytes to read (`-1` for all remaining). Ignored when `buffer` is provided (then up to `len(buffer)` bytes are read). | `-1` |
| `offset` | `Optional[int]` | If given, read from this byte offset without moving the cursor. If `None` (default), read from the current position and advance the cursor. | `None` |
| `buffer` | `Optional[Union[bytearray, memoryview]]` | If provided, fill this writable buffer and return the number of bytes read (`int`). Use when reusing one buffer in a loop; combine with `offset=` for random access. | `None` |
Returns:

| Type | Description |
|---|---|
| `Union[bytes, int]` | If `buffer` is `None`: `bytes`. If `buffer` is provided: `int` (bytes read). |
Example

```python
data = reader.read(4096)
chunk = reader.read(100, offset=1000)
n = reader.read(buffer=buf)
n = reader.read(buffer=buf, offset=0)
```
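The `buffer=` form mirrors the standard `readinto` pattern: one preallocated buffer is reused across iterations instead of allocating a fresh `bytes` object per read. A sketch of the same loop shape using `io.BytesIO` as a stand-in for a `Reader`:

```python
import io

src = io.BytesIO(bytes(range(256)) * 16)  # 4 KiB stand-in for a snapshot
buf = bytearray(1024)                     # one buffer, reused every iteration
total = 0
while True:
    n = src.readinto(buf)                 # analogous to reader.read(buffer=buf)
    if n == 0:
        break                             # end of stream
    total += n                            # process buf[:n] here
```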
#### seek

```python
seek(offset: int, whence: int = 0) -> int
```

Seek to a position in the file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `offset` | `int` | Offset to seek to. | required |
| `whence` | `int` | `0` (absolute), `1` (relative), `2` (from end). | `0` |
Returns:

| Type | Description |
|---|---|
| `int` | New absolute position |
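The `whence` values follow standard Python file semantics (`os.SEEK_SET`, `os.SEEK_CUR`, `os.SEEK_END`), shown here with `io.BytesIO`, which behaves the same way:

```python
import io
import os

f = io.BytesIO(b"0123456789")       # 10-byte stand-in for a snapshot
abs_pos = f.seek(4)                 # whence=0 (default): absolute offset
rel_pos = f.seek(2, os.SEEK_CUR)    # whence=1: relative to current position
end_pos = f.seek(-3, os.SEEK_END)   # whence=2: relative to end of file
```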
#### tell

```python
tell() -> int
```

Get the current position in the file.

Returns:

| Type | Description |
|---|---|
| `int` | Current byte offset |
#### analyze

```python
analyze() -> AnalysisReport
```

Analyze the snapshot for deduplication statistics.

Returns:

| Type | Description |
|---|---|
| `AnalysisReport` | `AnalysisReport` with dedup ratio and savings information |

Example

```python
with hexz.open("snapshot.hxz") as reader:
    report = reader.analyze()
    print(f"Dedup savings: {report.savings_percent:.1f}%")
```
## Async Reading

For async/await support, use `AsyncReader`:

### AsyncReader

Async reader for Hexz snapshots.

Use as an async context manager; the snapshot is opened when you enter the context.

Example

```python
async with hexz.AsyncReader("dataset.hxz") as reader:
    data = await reader.read(4096)
    chunk = await reader.read(100, offset=0)
```
#### size

```python
size() -> int
```

Size of the primary stream in bytes.

#### read

*async*

```python
read(size: Optional[int] = None, *, offset: Optional[int] = None) -> bytes
```

Read bytes from the current position (default) or at a specific offset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `size` | `Optional[int]` | Number of bytes to read (`None` for all remaining). | `None` |
| `offset` | `Optional[int]` | If given, read from this byte offset without moving the cursor. If `None` (default), read from the current position and advance the cursor. | `None` |
Returns:

| Type | Description |
|---|---|
| `bytes` | Bytes read from the snapshot |

#### seek

*async*

```python
seek(offset: int, whence: int = 0) -> int
```

Seek to a position. Returns the new position.

#### tell

```python
tell() -> int
```

Current read position.
## Writing Snapshots

The `Writer` class is returned by `hexz.open(path, mode='w')`:

### Writer

High-level writer for creating Hexz snapshots with a pythonic interface.

Provides a fluent API for building snapshots, with automatic finalization via context managers.

Example

```python
with hexz.Writer("output.hxz", compression="zstd") as writer:
    writer.add("disk.img")
    writer.add_metadata({"created": "2026-02-09"})
    # Automatically finalized on exit
```
#### bytes_written

*property*

```python
bytes_written: int
```

Total bytes written so far.
#### add

```python
add(source: Any, *, kind: Optional[str] = None) -> Writer
```

Add any source to the snapshot (fluent API).

Dispatches to a specific method based on the source type:

- `str`/`Path`: `add_file()`
- `bytes`: `add_bytes()`
- numpy array: `add_array()`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Any` | Source to add (file path, bytes, array, etc.). | required |
| `kind` | `Optional[str]` | Optional hint about the source type (`"disk"`, `"memory"`, etc.). | `None` |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining |
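The type-based dispatch described above can be pictured in plain Python. This is a simplified model, not the package's implementation, and the returned method names are just labels for which `add_*` method would handle each source:

```python
from pathlib import Path

# Simplified model of Writer.add()'s dispatch logic (illustration only).
def dispatch(source) -> str:
    if isinstance(source, (str, Path)):
        return "add_file"
    if isinstance(source, (bytes, bytearray)):
        return "add_bytes"
    if hasattr(source, "__array_interface__"):  # duck-typed numpy check
        return "add_array"
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```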
#### add_file

```python
add_file(path: PathLike, *, kind: Optional[str] = None, **kwargs: Any) -> Writer
```

Add a file to the snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `PathLike` | Path to the file. | required |
| `kind` | `Optional[str]` | Optional kind hint (`"disk"`, `"memory"`, etc.). | `None` |
| `**kwargs` | `Any` | Extra arguments (e.g., `name`); currently ignored. | `{}` |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining |
#### add_bytes

```python
add_bytes(data: bytes, **kwargs: Any) -> Writer
```

Add raw bytes to the snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `bytes` | Bytes to add. | required |
| `**kwargs` | `Any` | Extra arguments (e.g., `name`); currently ignored. | `{}` |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining |
#### add_array

```python
add_array(array: Any, *, offset: Optional[int] = None, name: Optional[str] = None, **kwargs: Any) -> Writer
```

Add a NumPy array to the snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `array` | `Any` | NumPy array to add. | required |
| `offset` | `Optional[int]` | Optional byte offset (currently ignored). | `None` |
| `name` | `Optional[str]` | Optional name for the array (currently ignored). | `None` |
| `**kwargs` | `Any` | Extra arguments; currently ignored. | `{}` |
|
Returns:
| Type | Description |
|---|---|
Writer
|
Self for method chaining |
Note
Current implementation converts array to bytes. Named arrays and metadata require Rust support (TODO).
#### add_xor_delta

```python
add_xor_delta(data, base_offset: int, base_length: int, element_size: int = 1) -> Writer
```

Add XOR delta bytes against the parent snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | Buffer-protocol object with the new tensor bytes. | required |
| `base_offset` | `int` | Byte offset of the base tensor in the parent. | required |
| `base_length` | `int` | Byte length of the base tensor in the parent. | required |
| `element_size` | `int` | Dtype width in bytes (e.g. 4 for `float32`). Used for byte-shuffle pre-processing that dramatically improves compression of XOR deltas. | `1` |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining. |
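The effect of `element_size` is easiest to see in pure Python. XOR-ing nearly identical tensors yields mostly zero bytes; byte-shuffling the delta then regroups the few changed bytes together, leaving long zero runs that generic compressors handle far better. A conceptual sketch (not the library's Rust implementation):

```python
# Conceptual sketch of XOR delta + byte shuffle (illustration only).
def xor_delta(new: bytes, base: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(new, base))

def byte_shuffle(data: bytes, element_size: int) -> bytes:
    # Regroup byte 0 of every element, then byte 1 of every element, etc.
    n = len(data) // element_size
    return bytes(
        data[i * element_size + lane]
        for lane in range(element_size)
        for i in range(n)
    )

base = b"\x01\x02\x03\x04" * 4   # pretend 4-byte-wide tensor (element_size=4)
new = b"\x01\x02\x03\x05" * 4    # only the last byte of each element changed
delta = xor_delta(new, base)     # mostly zeros
shuffled = byte_shuffle(delta, 4)  # zeros collected into one long run
```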
#### add_xor_delta_from_buffers

```python
add_xor_delta_from_buffers(data, base_data, element_size: int = 1) -> Writer
```

Add an XOR delta from two explicit buffers (no parent file read).

Used when the parent's tensor is itself stored as an XOR delta, so the caller must reconstruct the actual parent bytes first.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | Buffer with the new tensor bytes. | required |
| `base_data` | | Buffer with the reconstructed parent tensor bytes. | required |
| `element_size` | `int` | Dtype width in bytes (e.g. 4 for `float32`). | `1` |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining. |
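Because XOR is its own inverse, reconstructing a parent that is itself stored as a delta is just another XOR against its base. A hedged sketch of the reconstruct-then-diff pattern (`xor` here is a local helper, not hexz API):

```python
# Sketch of the reconstruct-then-diff pattern for chained XOR deltas.
def xor(a: bytes, b: bytes) -> bytes:  # local helper, not hexz API
    return bytes(x ^ y for x, y in zip(a, b))

grandparent = b"\x10\x20\x30\x40"
parent_delta = b"\x00\x00\x00\x01"      # parent stored as delta vs grandparent
parent = xor(grandparent, parent_delta)  # reconstruct actual parent bytes
new = b"\x10\x20\x31\x41"
delta = xor(new, parent)                 # what would be handed to the writer
```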
#### add_metadata

```python
add_metadata(metadata: Dict[str, Any]) -> Writer
```

Add custom metadata to the snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `Dict[str, Any]` | Dictionary of metadata. | required |
Returns:

| Type | Description |
|---|---|
| `Writer` | Self for method chaining |
#### write

```python
write(data: bytes, *, offset: Optional[int] = None) -> int
```

Write bytes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `bytes` | Bytes to write. | required |
| `offset` | `Optional[int]` | Optional byte offset (currently ignored). | `None` |
Returns:

| Type | Description |
|---|---|
| `int` | Number of bytes written |

#### tell

```python
tell() -> int
```

Get the current write position.
#### merge_overlay

```python
merge_overlay(*, base: PathLike, overlay: PathLike, thin: bool = False) -> Writer
```

Merge a copy-on-write overlay with a base snapshot.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base` | `PathLike` | Path to the base `.hxz` snapshot. | required |
| `overlay` | `PathLike` | Path to the overlay data file. | required |
| `thin` | `bool` | If `True`, create a thin snapshot that references the base. | `False` |
|
Returns:
| Type | Description |
|---|---|
Writer
|
Self for method chaining |
Example

```python
with hexz.Writer("merged.hxz") as writer:
    writer.merge_overlay(base="base.hxz", overlay="overlay.img")

# Thin snapshot (references base for unmodified blocks)
with hexz.Writer("thin.hxz") as writer:
    writer.merge_overlay(base="base.hxz", overlay="overlay.img", thin=True)
```
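Conceptually, a non-thin merge takes each block from the overlay where one was written and from the base otherwise (with `thin=True`, unmodified blocks become references instead of copies). A toy block-level model of the non-thin case (illustration only; hexz operates on `.hxz` snapshots, not dicts):

```python
# Toy model of a copy-on-write merge at block granularity (illustration only).
base = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}   # block index -> block bytes
overlay = {1: b"bbbb"}                         # only block 1 was written

# Prefer the overlay's version of a block; fall back to the base.
merged = {idx: overlay.get(idx, data) for idx, data in base.items()}
```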
#### finalize

```python
finalize() -> None
```

Finalize the snapshot and write all metadata.

This must be called to complete snapshot creation. It:

- Writes the master index
- Updates the header
- Flushes all buffers
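The automatic finalization that `Writer`'s context manager provides follows the standard `__exit__` pattern. A minimal sketch of the mechanism (not the real `Writer` class):

```python
# Minimal sketch of context-managed finalization (not the real hexz Writer).
class SketchWriter:
    def __init__(self):
        self.finalized = False

    def finalize(self):
        # Real Writer: writes master index, updates header, flushes buffers.
        self.finalized = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.finalize()   # runs on normal exit and on exceptions alike
        return False      # do not swallow exceptions

with SketchWriter() as w:
    pass                  # writer.add(...) calls would go here
```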
## Building Snapshots

### build

```python
build(source: PathLike, output: PathLike, *, profile: BuildProfile = 'generic', **overrides: Any) -> Metadata
```

Build a snapshot using a preset profile.

This is a convenience function that combines `Writer` configuration with common build patterns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `PathLike` | Source file, directory, or data. | required |
| `output` | `PathLike` | Output `.hxz` file path. | required |
| `profile` | `BuildProfile` | Build profile to use. | `'generic'` |
| `**overrides` | `Any` | Override any profile settings. | `{}` |
Returns:

| Type | Description |
|---|---|
| `Metadata` | `Metadata` object with snapshot information |
Example

```python
# ML dataset with defaults
meta = hexz.build("imagenet/", "imagenet.hxz", profile="ml")
print(f"Compressed to {meta.size_compressed / 1e9:.1f} GB")

# Archival with encryption
meta = hexz.build(
    "backup/",
    "backup.hxz",
    profile="archival",
    encrypt=True,
    password="secret",
)
```
### PROFILES

*module-attribute*

```python
PROFILES: Dict[BuildProfile, Dict[str, Any]] = {
    'ml':       {'mode': 'fast',     'block_size': 128 * 1024, 'dedup': True, 'compression': 'lz4'},
    'eda':      {'mode': 'balanced', 'block_size': 64 * 1024,  'dedup': True, 'compression': 'lz4'},
    'embedded': {'mode': 'tight',    'block_size': 32 * 1024,  'dedup': True, 'compression': 'zstd'},
    'generic':  {'mode': 'balanced', 'block_size': 64 * 1024,  'dedup': True, 'compression': 'lz4'},
    'archival': {'mode': 'tight',    'block_size': 256 * 1024, 'dedup': True, 'compression': 'zstd'},
}
```
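Overrides layer on top of the chosen profile's settings: any keyword you pass replaces the corresponding profile default. The resolution can be pictured as a plain dict merge (the profile values below are copied from `PROFILES` above; `resolve` itself is an illustration of the behavior, not `build()`'s source):

```python
# Profile defaults copied from PROFILES above; resolve() illustrates how
# build(..., profile="ml", compression="zstd") settles its settings.
PROFILES = {
    "ml": {"mode": "fast", "block_size": 128 * 1024, "dedup": True, "compression": "lz4"},
    "archival": {"mode": "tight", "block_size": 256 * 1024, "dedup": True, "compression": "zstd"},
}

def resolve(profile: str, **overrides):
    return {**PROFILES[profile], **overrides}  # overrides win on key collision

settings = resolve("ml", compression="zstd")
```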
## Inspection & Verification

### inspect

```python
inspect(path: PathLike) -> Metadata
```

Inspect a Hexz snapshot and return structured metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `PathLike` | Path to a `.hxz` file. | required |
Returns:

| Type | Description |
|---|---|
| `Metadata` | `Metadata` object with snapshot information |
Example

```python
meta = hexz.inspect("snapshot.hxz")
print(f"Version: {meta.version}")
print(f"Compression: {meta.compression}")
print(f"Size: {meta.primary_size:,} bytes")
print(meta)   # Human-readable output
meta.print()  # Same as above
```
### verify

```python
verify(path: PathLike, *, checksum: bool = True, structure: bool = True, public_key: Optional[PathLike] = None) -> bool
```

Verify snapshot integrity and, optionally, its signature.

Performs structural validation, checksum verification, and optional cryptographic signature verification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `PathLike` | Path to the snapshot to verify. | required |
| `checksum` | `bool` | Verify block checksums by reading the entire file. | `True` |
| `structure` | `bool` | Verify the file structure (header and index). | `True` |
| `public_key` | `Optional[PathLike]` | Optional path to a public key for signature verification. | `None` |
Returns:

| Type | Description |
|---|---|
| `bool` | `True` if all checks pass, `False` otherwise |
Example

```python
# Basic integrity check
valid = hexz.verify("snapshot.hxz")

# With signature verification
valid = hexz.verify("snapshot.hxz", public_key="key.pub")
if not valid:
    print("Snapshot verification failed!")
```