This document provides a high-level overview of the Apache Iceberg C++ library, its architecture, and its core components. It introduces the fundamental concepts, modular design, and key subsystems that make up the implementation.
Sources: src/iceberg/type_fwd.h:1-195 src/iceberg/CMakeLists.txt:1-227
Apache Iceberg C++ is a native C++ implementation of the Apache Iceberg table format. Apache Iceberg is an open table format for large analytic datasets that provides:
This C++ library enables native applications to read, write, and manage Iceberg tables without depending on Java Virtual Machine-based implementations.
Sources: src/iceberg/table_metadata.h src/iceberg/snapshot.h
The library follows a layered architecture with clear separation of concerns:
Sources: src/iceberg/CMakeLists.txt:18-92 src/iceberg/type_fwd.h:27-194
The library uses a two-tier modular design to balance functionality and minimal dependencies:
| Component | Library | Purpose | Dependencies |
|---|---|---|---|
| Core | libiceberg | Fundamental Iceberg operations, metadata management, schema evolution, expression system | nanoarrow, nlohmann_json, CRoaring, spdlog, zlib |
| Bundle | libiceberg_bundle | Data format support for Arrow, Parquet, and Avro | Core + Apache Arrow + Apache Parquet + Apache Avro |
Core library (libiceberg): always built.
Bundle library (libiceberg_bundle): optional, enabled via the ICEBERG_BUILD_BUNDLE flag.
Sources: src/iceberg/CMakeLists.txt:99-220 src/iceberg/meson.build:137-161
The type system forms the foundation of Iceberg's data model:
- TypeId enum - Identifies primitive and nested types (src/iceberg/type_fwd.h:35-53)
- Type class hierarchy - Base class with derived types like StructType, ListType, MapType, primitives (src/iceberg/type.h)
- Schema class - Inherits from StructType, manages field IDs and schema evolution (src/iceberg/schema.h)
- SchemaField - Individual fields with id, name, type, required/optional flag (src/iceberg/schema_field.h)

Tables are the primary interface for interacting with Iceberg datasets:
- Table class - Main table interface (src/iceberg/table.h:37-154)
  - Table::Make() - Factory method for creating tables
  - Table::schema(), Table::spec(), Table::sort_order() - Access current metadata
  - Table::NewScan() - Create table scan builders
  - Table::NewTransaction() - Begin multi-operation transactions
- StagedTable - Table created but not yet committed (src/iceberg/table.h:157-172)
- StaticTable - Read-only table without catalog (src/iceberg/table.h:176-192)
- TableMetadata - Immutable metadata structure containing schemas, partition specs, sort orders, snapshots (src/iceberg/table_metadata.h)
- TableMetadataBuilder - Fluent API for building metadata updates

Catalogs manage namespace and table lifecycles:
- Catalog abstract interface - Namespace and table operations (src/iceberg/catalog.h)
- InMemoryCatalog - Thread-safe in-memory implementation (src/iceberg/catalog/memory/in_memory_catalog.cc:22)
- RestCatalog - HTTP-based catalog client (src/iceberg/catalog/rest/)

Type-safe expression trees for filtering and optimization:
- Expression base class - Root of expression hierarchy (src/iceberg/expression/expression.h)
- Predicate classes - UnaryPredicate, LiteralPredicate, SetPredicate (src/iceberg/expression/predicate.h)
- Term classes - UnboundTerm, BoundTerm, BoundReference, BoundTransform (src/iceberg/expression/term.h)
- Transform enum and classes - identity, bucket, truncate, temporal transforms (src/iceberg/transform.h)
- Expressions factory - Static methods to build expressions (src/iceberg/expression/expressions.h)

Sources: src/iceberg/type_fwd.h:83-194 src/iceberg/table.h:37-154 src/iceberg/CMakeLists.txt:20-92
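To make the idea of an expression tree with predicate leaves and a factory of builder functions more concrete, here is a minimal standalone sketch. It does not use the library: every name in it (`Expr`, `LiteralPredicateSketch`, `AndExpr`, `NotExpr`, the `Row` alias, and the free functions) is hypothetical, and it evaluates directly against a map of integer columns rather than binding terms to a schema as the real expression system is described as doing.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical row model: column name -> integer value.
using Row = std::map<std::string, std::int64_t>;

// Root of the (hypothetical) expression hierarchy.
class Expr {
 public:
  virtual ~Expr() = default;
  virtual bool Eval(const Row& row) const = 0;
};
using ExprPtr = std::shared_ptr<const Expr>;

enum class Op { kLt, kEq, kGt };

// Leaf node: compares one column against a literal.
class LiteralPredicateSketch : public Expr {
 public:
  LiteralPredicateSketch(Op op, std::string column, std::int64_t literal)
      : op_(op), column_(std::move(column)), literal_(literal) {}

  bool Eval(const Row& row) const override {
    auto it = row.find(column_);
    if (it == row.end()) return false;  // assumption: missing column never matches
    switch (op_) {
      case Op::kLt: return it->second < literal_;
      case Op::kEq: return it->second == literal_;
      case Op::kGt: return it->second > literal_;
    }
    return false;
  }

 private:
  Op op_;
  std::string column_;
  std::int64_t literal_;
};

// Interior node: logical conjunction of two subtrees.
class AndExpr : public Expr {
 public:
  AndExpr(ExprPtr left, ExprPtr right)
      : left_(std::move(left)), right_(std::move(right)) {}
  bool Eval(const Row& row) const override {
    return left_->Eval(row) && right_->Eval(row);
  }

 private:
  ExprPtr left_;
  ExprPtr right_;
};

// Interior node: logical negation of one subtree.
class NotExpr : public Expr {
 public:
  explicit NotExpr(ExprPtr child) : child_(std::move(child)) {}
  bool Eval(const Row& row) const override { return !child_->Eval(row); }

 private:
  ExprPtr child_;
};

// Factory-style helpers, loosely analogous in spirit to a static factory class.
inline ExprPtr LessThan(std::string column, std::int64_t value) {
  return std::make_shared<LiteralPredicateSketch>(Op::kLt, std::move(column), value);
}
inline ExprPtr Equal(std::string column, std::int64_t value) {
  return std::make_shared<LiteralPredicateSketch>(Op::kEq, std::move(column), value);
}
inline ExprPtr GreaterThan(std::string column, std::int64_t value) {
  return std::make_shared<LiteralPredicateSketch>(Op::kGt, std::move(column), value);
}
inline ExprPtr And(ExprPtr left, ExprPtr right) {
  return std::make_shared<AndExpr>(std::move(left), std::move(right));
}
inline ExprPtr Not(ExprPtr child) {
  return std::make_shared<NotExpr>(std::move(child));
}

}  // namespace sketch
```

A filter such as `And(GreaterThan("id", 5), Not(Equal("year", 2023)))` then composes from the helpers and can be evaluated with `filter->Eval(row)`. Sharing immutable nodes through `std::shared_ptr<const Expr>` is one simple way to let subtrees be reused safely; whether the library makes the same choice is not stated in this overview.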
The following diagram shows how a typical table operation flows through the system components:
Sources: src/iceberg/table.cc:39-83 src/iceberg/catalog/memory/in_memory_catalog.cc
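One step that any such flow plausibly passes through, a catalog resolving a table identifier under a lock, can be sketched in isolation. The class below is an illustrative assumption and not the code in in_memory_catalog.cc: the name `InMemoryCatalogSketch`, its three methods, and the choice to store only a metadata location string per table are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical thread-safe registry: table identifier -> metadata file location.
class InMemoryCatalogSketch {
 public:
  // Registers a table; returns false if the identifier already exists.
  bool CreateTable(const std::string& identifier, std::string metadata_location) {
    std::lock_guard<std::mutex> lock(mutex_);
    return tables_.emplace(identifier, std::move(metadata_location)).second;
  }

  // Returns the metadata location, or std::nullopt if the table is unknown.
  std::optional<std::string> LoadTable(const std::string& identifier) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = tables_.find(identifier);
    if (it == tables_.end()) return std::nullopt;
    return it->second;
  }

  // Removes a table; returns true if something was actually removed.
  bool DropTable(const std::string& identifier) {
    std::lock_guard<std::mutex> lock(mutex_);
    return tables_.erase(identifier) > 0;
  }

 private:
  mutable std::mutex mutex_;  // guards tables_ for both readers and writers
  std::map<std::string, std::string> tables_;
};

}  // namespace sketch
```

A single mutex around every operation is the simplest way to get thread safety. It serializes readers as well as writers, which a production catalog may or may not accept.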
The following table lists the core dependencies and their roles:
| Dependency | Purpose | Usage |
|---|---|---|
| nanoarrow | Apache Arrow C interface | Lightweight Arrow data structure handling |
| nlohmann_json | JSON parsing and serialization | Metadata file format, REST API communication |
| CRoaring | Compressed bitmap library | Efficient data filtering and metrics |
| spdlog | Logging framework | Diagnostic logging throughout the library |
| zlib | Compression | GZIP compression for metadata files |
Bundle-only dependencies:
| Dependency | Purpose | Build Flag |
|---|---|---|
| Apache Arrow | Columnar data format and I/O | ICEBERG_BUILD_BUNDLE |
| Apache Parquet | Columnar file format | ICEBERG_BUILD_BUNDLE |
| Apache Avro | Row-oriented data format | ICEBERG_BUILD_BUNDLE |
Sources: src/iceberg/CMakeLists.txt:99-124 src/iceberg/meson.build:123-135
The library supports both CMake and Meson build systems with identical functionality:
| Flag | Default | Description |
|---|---|---|
| ICEBERG_BUILD_BUNDLE | OFF | Build data format support (Arrow, Parquet, Avro) |
| ICEBERG_BUILD_REST | OFF | Build REST catalog client |
| ICEBERG_BUILD_TESTS | OFF | Build test suite |
| ICEBERG_BUILD_BENCHMARKS | OFF | Build performance benchmarks |
- Static libraries: libiceberg_static.a, libiceberg_bundle_static.a
- Shared libraries: libiceberg_shared.so, libiceberg_bundle_shared.so (or .dll/.dylib)

Sources: src/iceberg/CMakeLists.txt:1-227 src/iceberg/meson.build:1-216 src/iceberg/test/CMakeLists.txt:31-53
The library uses a Result<T> pattern for error handling instead of exceptions, which keeps error paths explicit and avoids exception overhead.
Key macros for error propagation:
- ICEBERG_ASSIGN_OR_RAISE(lhs, rhs) - Assigns result or returns early on error
- ICEBERG_RETURN_IF_ERROR(expr) - Returns early if expression is an error status

Sources: src/iceberg/result.h src/iceberg/util/macros.h
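The early-return behavior of these macros is easier to see in a small standalone sketch. It deliberately does not reproduce the library's definitions: `Result`, `Error`, `Status`, and the `SKETCH_`-prefixed macros are hypothetical stand-ins built on `std::variant`, and the actual types in result.h and macros.h may differ.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

namespace sketch {

// Hypothetical error payload.
struct Error {
  std::string message;
};

// Hypothetical Result<T>: holds either a value or an Error, never both.
template <typename T>
class Result {
 public:
  Result(T value) : data_(std::move(value)) {}
  Result(Error error) : data_(std::move(error)) {}

  bool ok() const { return std::holds_alternative<T>(data_); }
  T& value() { return std::get<T>(data_); }
  const Error& error() const { return std::get<Error>(data_); }

 private:
  std::variant<T, Error> data_;
};

// Hypothetical "no value, just success or failure" alias.
using Status = Result<std::monostate>;

}  // namespace sketch

#define SKETCH_CONCAT_INNER(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_INNER(a, b)

// Evaluates rexpr once; on error returns it from the enclosing function,
// otherwise moves the value into lhs (which may be a declaration).
#define SKETCH_ASSIGN_OR_RAISE_IMPL(tmp, lhs, rexpr) \
  auto tmp = (rexpr);                                \
  if (!tmp.ok()) return tmp.error();                 \
  lhs = std::move(tmp.value())
#define SKETCH_ASSIGN_OR_RAISE(lhs, rexpr) \
  SKETCH_ASSIGN_OR_RAISE_IMPL(SKETCH_CONCAT(result_, __LINE__), lhs, rexpr)

// Evaluates expr once and returns its error from the enclosing function if any.
#define SKETCH_RETURN_IF_ERROR(expr)           \
  do {                                         \
    auto status_ = (expr);                     \
    if (!status_.ok()) return status_.error(); \
  } while (false)

namespace sketch {

Status RequireNonEmpty(const std::string& text) {
  if (text.empty()) return Error{"empty input"};
  return std::monostate{};
}

// Parses a string of decimal digits without throwing.
Result<int> ParseCount(const std::string& text) {
  int value = 0;
  for (char c : text) {
    if (c < '0' || c > '9') return Error{"not a number: " + text};
    value = value * 10 + (c - '0');
  }
  return value;
}

// Each failing step short-circuits and propagates its Error to the caller.
Result<int> SumCounts(const std::string& a, const std::string& b) {
  SKETCH_RETURN_IF_ERROR(RequireNonEmpty(a));
  SKETCH_RETURN_IF_ERROR(RequireNonEmpty(b));
  SKETCH_ASSIGN_OR_RAISE(int left, ParseCount(a));
  SKETCH_ASSIGN_OR_RAISE(int right, ParseCount(b));
  return left + right;
}

}  // namespace sketch
```

Callers check `ok()` before reading `value()`, so a failure in any step of `SumCounts` surfaces as an `Error` at the call site without unwinding the stack.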
The source code is organized into logical subsystems:
src/iceberg/
├── catalog/ # Catalog implementations
│ ├── memory/ # In-memory catalog
│ └── rest/ # REST catalog (optional)
├── expression/ # Expression and predicate system
├── manifest/ # Manifest file handling
├── row/ # Row-based data structures
├── update/ # Table update operations
├── util/ # Utility functions
├── arrow/ # Arrow integration (bundle only)
├── avro/ # Avro integration (bundle only)
└── parquet/ # Parquet integration (bundle only)
Sources: src/iceberg/CMakeLists.txt:20-150 src/iceberg/meson.build:42-115