This document details the transform types available in the Iceberg C++ library and their implementations. Transforms are used for partitioning data and support predicate pushdown for efficient query planning. For information about how transforms are applied to predicates and used in partition specifications, see Transform System Overview. For details on predicate projection through transforms, see Predicate Projection.
The Iceberg transform system provides eight distinct transform types that can be applied to table columns for partitioning. Each transform type has specific source type compatibility requirements and produces predictable result types. Transforms are represented by two key abstractions:
- Transform class: A lightweight descriptor that can be serialized and bound to source types
- TransformFunction class: The executable implementation that performs the actual data transformation

Sources: src/iceberg/transform.h38-230 src/iceberg/transform_function.h1-206
All transform types are defined in the TransformType enum:
| TransformType | String Name | Parameters | Description |
|---|---|---|---|
| kUnknown | "unknown" | None | Placeholder for unrecognized transforms |
| kIdentity | "identity" | None | Returns source value unchanged |
| kBucket | "bucket" | num_buckets: int32 | Hash of value mod N |
| kTruncate | "truncate" | width: int32 | Value truncated to width W |
| kYear | "year" | None | Extract years from 1970 |
| kMonth | "month" | None | Extract months from 1970-01 |
| kDay | "day" | None | Extract days from 1970-01-01 |
| kHour | "hour" | None | Extract hours from 1970-01-01 00:00:00 |
| kVoid | "void" | None | Always produces null |
The TransformTypeToString() function converts enum values to their string representations, and TransformFromString() parses string representations back into Transform objects.
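The name mapping in the table can be illustrated with a small standalone sketch. The enum below only mirrors the table; ToName() and FromName() are hypothetical helpers written for this example, not the library's TransformTypeToString() or TransformFromString(), and the sketch handles bare names only (no bucket or truncate parameters).

```cpp
#include <cassert>
#include <initializer_list>
#include <string_view>

// Hypothetical mirror of the enum described in the table above.
enum class TransformType {
  kUnknown, kIdentity, kBucket, kTruncate, kYear, kMonth, kDay, kHour, kVoid
};

// Maps an enum value to its string name, following the table.
constexpr std::string_view ToName(TransformType type) {
  switch (type) {
    case TransformType::kIdentity: return "identity";
    case TransformType::kBucket:   return "bucket";
    case TransformType::kTruncate: return "truncate";
    case TransformType::kYear:     return "year";
    case TransformType::kMonth:    return "month";
    case TransformType::kDay:      return "day";
    case TransformType::kHour:     return "hour";
    case TransformType::kVoid:     return "void";
    case TransformType::kUnknown:  break;
  }
  return "unknown";
}

// Reverse lookup for bare names; anything unrecognized maps to kUnknown.
inline TransformType FromName(std::string_view name) {
  for (TransformType type :
       {TransformType::kIdentity, TransformType::kBucket,
        TransformType::kTruncate, TransformType::kYear, TransformType::kMonth,
        TransformType::kDay, TransformType::kHour, TransformType::kVoid}) {
    if (ToName(type) == name) return type;
  }
  return TransformType::kUnknown;
}
```

A round trip such as FromName(ToName(TransformType::kHour)) yields kHour again, while an unrecognized name falls back to kUnknown.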
Sources: src/iceberg/transform.h39-84 src/iceberg/transform.cc428-454
The Transform class provides static factory methods for creating transform instances.
Non-parameterized transforms (identity, year, month, day, hour, void) are implemented as singletons for memory efficiency. Parameterized transforms (bucket, truncate) create new instances for each unique parameter value.
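The singleton-versus-new-instance distinction can be sketched as follows. MiniTransform, Identity() and Bucket() are hypothetical stand-ins invented for this example; the real factory signatures and return types live in src/iceberg/transform.h and may differ (the use of std::shared_ptr here is an assumption).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a transform descriptor.
enum class MiniKind { kIdentity, kBucket };

struct MiniTransform {
  MiniKind kind;
  int32_t param;  // unused for non-parameterized kinds
};

// Non-parameterized: a single shared instance, created on first use.
inline std::shared_ptr<MiniTransform> Identity() {
  static const std::shared_ptr<MiniTransform> instance =
      std::make_shared<MiniTransform>(MiniTransform{MiniKind::kIdentity, 0});
  return instance;
}

// Parameterized: each call builds a descriptor carrying its parameter.
inline std::shared_ptr<MiniTransform> Bucket(int32_t num_buckets) {
  return std::make_shared<MiniTransform>(
      MiniTransform{MiniKind::kBucket, num_buckets});
}
```

In this sketch, two calls to Identity() return pointers to the same object, whereas Bucket(16) and Bucket(32) are distinct objects with different parameters.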
Sources: src/iceberg/transform.h92-140 src/iceberg/transform.cc48-85
The identity transform returns the source value unchanged and works with any primitive type.
Sources: src/iceberg/transform.h92-96 src/iceberg/transform_function.h27-44 src/iceberg/test/transform_test.cc210-287
The bucket transform hashes input values into a fixed number of buckets using the Iceberg-specified 32-bit hash function.
The bucket transform implements the hash requirements specified in Appendix B of the Iceberg specification. Hash values are computed differently for each source type:
| Source Type | Hash Method |
|---|---|
| int, long, date, time, timestamp, timestamptz | BucketUtils::HashInt() or HashLong() |
| decimal, string, uuid, fixed, binary | BucketUtils::HashBytes() |
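The final bucket value is computed as: (hash & Integer.MAX_VALUE) % num_buckets. Integer.MAX_VALUE is the Java name used by the specification for the largest signed 32-bit value (INT32_MAX in C++); the mask clears the sign bit so that the result is never negative.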
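The hash-then-mod computation can be sketched in standalone C++. This is not the library's BucketUtils: the function names below are hypothetical, and the code is a plain 32-bit Murmur3 (x86 variant, seed 0), which is the hash the Iceberg specification names. Following the specification, integers are hashed as 8-byte little-endian longs and strings as their UTF-8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

inline uint32_t Rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// 32-bit Murmur3 (x86 variant) with seed 0.
inline int32_t Murmur3_32(const uint8_t* data, size_t len) {
  constexpr uint32_t c1 = 0xcc9e2d51u;
  constexpr uint32_t c2 = 0x1b873593u;
  uint32_t h = 0;
  const size_t nblocks = len / 4;
  for (size_t i = 0; i < nblocks; ++i) {
    const uint8_t* p = data + 4 * i;
    // Assemble each 4-byte block as a little-endian word.
    uint32_t k = uint32_t{p[0]} | (uint32_t{p[1]} << 8) |
                 (uint32_t{p[2]} << 16) | (uint32_t{p[3]} << 24);
    k *= c1; k = Rotl32(k, 15); k *= c2;
    h ^= k; h = Rotl32(h, 13); h = h * 5 + 0xe6546b64u;
  }
  // Mix in the 1 to 3 trailing bytes, if any.
  const uint8_t* tail = data + 4 * nblocks;
  uint32_t k = 0;
  switch (len & 3) {
    case 3: k ^= uint32_t{tail[2]} << 16; [[fallthrough]];
    case 2: k ^= uint32_t{tail[1]} << 8; [[fallthrough]];
    case 1: k ^= tail[0]; k *= c1; k = Rotl32(k, 15); k *= c2; h ^= k;
  }
  // Finalization mix.
  h ^= static_cast<uint32_t>(len);
  h ^= h >> 16; h *= 0x85ebca6bu; h ^= h >> 13; h *= 0xc2b2ae35u; h ^= h >> 16;
  return static_cast<int32_t>(h);
}

// Hashes an integer as its 8-byte little-endian representation.
inline int32_t HashLongLE(int64_t value) {
  uint8_t bytes[8];
  const uint64_t u = static_cast<uint64_t>(value);
  for (int i = 0; i < 8; ++i) bytes[i] = static_cast<uint8_t>(u >> (8 * i));
  return Murmur3_32(bytes, 8);
}

// Hashes the raw bytes of a UTF-8 string.
inline int32_t HashUtf8(std::string_view s) {
  return Murmur3_32(reinterpret_cast<const uint8_t*>(s.data()), s.size());
}

// Clears the sign bit, then reduces modulo the bucket count.
inline int32_t BucketOf(int32_t hash, int32_t num_buckets) {
  return (hash & std::numeric_limits<int32_t>::max()) % num_buckets;
}
```

Appendix B of the specification lists 2017239379 as the hash of the integer 34 and 1210000089 as the hash of the string "iceberg", which makes both values convenient checks for a sketch like this one, for example assert(HashLongLE(34) == 2017239379).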
Sources: src/iceberg/transform.h98-103 src/iceberg/transform_function.h46-74 src/iceberg/test/bucket_util_test.cc34-108 src/iceberg/test/transform_test.cc289-375
The truncate transform reduces values to a specified width, with behavior depending on the source type.
For integers and decimals, truncation rounds down (toward negative infinity) to the nearest multiple of the width. With a width of 10, for example, 1 becomes 0 and -1 becomes -10.
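A minimal sketch of the integer case follows. TruncateInt is a hypothetical helper, not the library's function; it applies the formula from the Iceberg specification, v - (((v % W) + W) % W), which avoids the toward-zero behavior of the C++ % operator on negative values. It assumes a positive width.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `value` down (toward negative infinity) to a multiple of `width`.
// Assumes width > 0.
constexpr int64_t TruncateInt(int64_t value, int64_t width) {
  return value - (((value % width) + width) % width);
}
```

For instance, assert(TruncateInt(-1, 10) == -10) holds, whereas a naive value - value % width would give 0 for the same input.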
String truncation counts Unicode code points, not bytes, so a multi-byte UTF-8 character is never split.
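Counting code points rather than bytes can be sketched as below. TruncateUtf8 is a hypothetical helper for illustration, not the library's implementation, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Keeps at most `width` code points of a UTF-8 string.
// Assumes `input` is valid UTF-8.
inline std::string TruncateUtf8(std::string_view input, size_t width) {
  size_t code_points = 0;
  size_t i = 0;
  for (; i < input.size(); ++i) {
    // Bytes of the form 10xxxxxx continue the previous code point;
    // every other byte starts a new one.
    const bool is_start =
        (static_cast<unsigned char>(input[i]) & 0xC0) != 0x80;
    // Stop at the first byte of code point number width + 1.
    if (is_start && code_points++ == width) break;
  }
  return std::string(input.substr(0, i));
}
```

With a width of 2, the string "héllo" (where "é" occupies two bytes) is cut to "hé", which is three bytes long.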
Sources: src/iceberg/transform.h105-110 src/iceberg/transform_function.h76-101 src/iceberg/test/transform_test.cc377-430
The temporal transforms extract time components from date and timestamp values, representing them as counts from the Unix epoch (1970-01-01).
Note: For the day transform, the result type is DateType rather than int32 for improved human readability, though the physical representation is identical.
Temporal transforms use the TemporalUtils helper class which leverages C++20 <chrono> facilities for date arithmetic. All calculations handle timezone-aware timestamps by converting to UTC before extraction.
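The epoch-relative counts can be illustrated without the library. The sketch below deliberately uses plain integer arithmetic instead of <chrono> so that it compiles without C++20; it is not the TemporalUtils code. All names are hypothetical, timestamps are assumed to be microseconds since the epoch in UTC, and the calendar step uses the well-known civil-from-days algorithm.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kMicrosPerHour = 3600LL * 1000 * 1000;
constexpr int64_t kMicrosPerDay = 24 * kMicrosPerHour;

// Floor division, so values before 1970 round toward negative infinity.
constexpr int64_t FloorDiv(int64_t a, int64_t b) {
  return a / b - ((a % b != 0 && (a < 0) != (b < 0)) ? 1 : 0);
}

constexpr int32_t HoursFromMicros(int64_t micros) {
  return static_cast<int32_t>(FloorDiv(micros, kMicrosPerHour));
}

constexpr int32_t DaysFromMicros(int64_t micros) {
  return static_cast<int32_t>(FloorDiv(micros, kMicrosPerDay));
}

// Months since 1970-01 for a count of days since 1970-01-01.
constexpr int32_t MonthsFromDays(int32_t days) {
  const int64_t z = int64_t{days} + 719468;      // days since 0000-03-01
  const int64_t era = FloorDiv(z, 146097);       // 400-year cycles
  const int64_t doe = z - era * 146097;          // day of era, [0, 146096]
  const int64_t yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;  // [0, 399]
  const int64_t doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const int64_t mp = (5 * doy + 2) / 153;        // month index, March = 0
  const int64_t month = mp < 10 ? mp + 3 : mp - 9;             // [1, 12]
  const int64_t year = yoe + era * 400 + (month <= 2 ? 1 : 0);
  return static_cast<int32_t>((year - 1970) * 12 + (month - 1));
}

// Years since 1970 for a count of days since 1970-01-01.
constexpr int32_t YearsFromDays(int32_t days) {
  return static_cast<int32_t>(FloorDiv(MonthsFromDays(days), 12));
}
```

As a worked case, 2017-12-01 is day 17501, which gives year 47 and month 575; 10:00 on that day is hour 17501 * 24 + 10 = 420034. Values before the epoch come out negative: one microsecond before 1970-01-01 falls in hour -1, day -1, month -1 and year -1.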
Sources: src/iceberg/transform.h112-134 src/iceberg/transform_function.h103-185 src/iceberg/util/temporal_util.cc30-239 src/iceberg/test/transform_test.cc432-593
The void transform ignores the input value and always returns null, maintaining the source type.
Sources: src/iceberg/transform.h136-140 src/iceberg/transform_function.h187-204 src/iceberg/test/transform_test.cc595-672
The TransformFunction abstract base class defines the interface that all transform implementations must provide:
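- Transform(const Literal& literal): Returns a Result<Literal> containing the transformed value or an error
- ResultType(): Returns the type of the transformed values
- transform_type(): Returns the TransformType enum value
- source_type(): Returns the source type the function is bound to
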
All transforms must preserve null values: if the input literal is null, the output must also be null (though the type may differ).
Sources: src/iceberg/transform_function.h24-275 src/iceberg/transform.cc414-422
The Transform::CanTransform() method determines whether a transform can be applied to a given source type:
| Transform | Boolean | Int | Long | Float | Double | Decimal | Date | Time | Timestamp | TimestampTz | String | UUID | Fixed | Binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Identity | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Bucket | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Truncate | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| Year | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Month | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Day | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Hour | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Void | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
When binding a transform to a source type, the Bind() method checks type compatibility. If the types are incompatible, it returns a NotSupported error.
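The compatibility matrix and the bind-time check can be expressed compactly in code. The sketch below is a standalone illustration: the enums, IsCompatible(), BindOutcome and TryBind() are hypothetical names, and the real Transform::CanTransform() and Bind() operate on the library's own type objects and Result<> type.

```cpp
#include <cassert>
#include <string>

// Hypothetical enums mirroring the rows and columns of the table above.
enum class Xf { kIdentity, kBucket, kTruncate, kYear, kMonth, kDay, kHour, kVoid };
enum class Ty {
  kBoolean, kInt, kLong, kFloat, kDouble, kDecimal, kDate, kTime,
  kTimestamp, kTimestampTz, kString, kUuid, kFixed, kBinary
};

// Encodes the compatibility matrix row by row.
constexpr bool IsCompatible(Xf xf, Ty ty) {
  switch (xf) {
    case Xf::kIdentity:
    case Xf::kVoid:
      return true;
    case Xf::kBucket:
      return ty != Ty::kBoolean && ty != Ty::kFloat && ty != Ty::kDouble;
    case Xf::kTruncate:
      return ty == Ty::kInt || ty == Ty::kLong || ty == Ty::kDecimal ||
             ty == Ty::kString || ty == Ty::kBinary;
    case Xf::kYear:
    case Xf::kMonth:
    case Xf::kDay:
      return ty == Ty::kDate || ty == Ty::kTimestamp || ty == Ty::kTimestampTz;
    case Xf::kHour:
      return ty == Ty::kTimestamp || ty == Ty::kTimestampTz;
  }
  return false;
}

// Hypothetical stand-in for a Result<> carrying either success or an error.
struct BindOutcome {
  bool ok;
  std::string error;
};

// Rejects incompatible pairs before any function object would be created.
inline BindOutcome TryBind(Xf xf, Ty ty) {
  if (!IsCompatible(xf, ty)) {
    return {false, "NotSupported: transform cannot be applied to source type"};
  }
  return {true, ""};
}
```

Checking compatibility up front means an invalid combination such as hour on a date column is reported at bind time rather than when values are transformed.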
Sources: src/iceberg/transform.cc134-201
Each transform produces a specific result type based on the transform type and source type:
- Bucket: int32 regardless of source type
- Year, Month, Hour: int32 representing counts from epoch
- Day: date type for better human readability (physically stored as int32)

The ResultType() method on TransformFunction implementations returns the appropriate type for each transform.
Sources: src/iceberg/test/transform_test.cc123-169
Transforms can preserve the ordering of values, which is important for partition pruning and sort order optimization.
The PreservesOrder() method indicates whether a transform maintains the relative ordering of input values:
| Transform | Preserves Order | Explanation |
|---|---|---|
| Identity | Yes | Output equals input |
| Bucket | No | Hash values don't preserve order |
| Truncate | Yes | Truncation maintains relative ordering |
| Year | Yes | Coarser granularity, but monotonic |
| Month | Yes | Coarser granularity, but monotonic |
| Day | Yes | Coarser granularity, but monotonic |
| Hour | Yes | Coarser granularity, but monotonic |
| Void | No | All values become null |
The SatisfiesOrderOf() method determines whether ordering by one transform satisfies ordering by another. This is used to determine if a sort order can be simplified.
For example, sorting by day(timestamp) produces an ordering that also satisfies month(timestamp) and year(timestamp), since days are finer-grained than months and years.
For truncate transforms, ordering by truncate[W1] satisfies ordering by truncate[W2] if and only if W1 >= W2. For example, truncate[10] satisfies truncate[5] but not vice versa.
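The two rules described here can be sketched as a single predicate. The types below are hypothetical and cover only temporal and truncate transforms. Ranking hour as finer than day is an assumption extrapolated from the day, month and year example; the real logic is in Transform::SatisfiesOrderOf().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced model of a transform for ordering purposes.
enum class Kind { kYear, kMonth, kDay, kHour, kTruncate };

struct OrderXf {
  Kind kind;
  int32_t width;  // only meaningful for kTruncate
};

// Higher rank means finer granularity; truncate has no temporal rank.
constexpr int Granularity(Kind kind) {
  switch (kind) {
    case Kind::kYear:  return 1;
    case Kind::kMonth: return 2;
    case Kind::kDay:   return 3;
    case Kind::kHour:  return 4;
    case Kind::kTruncate: break;
  }
  return 0;
}

// True if ordering by `self` also satisfies ordering by `other`.
constexpr bool SatisfiesOrderOf(OrderXf self, OrderXf other) {
  if (self.kind == Kind::kTruncate || other.kind == Kind::kTruncate) {
    // Truncate only relates to truncate, and needs W1 >= W2.
    return self.kind == other.kind && self.width >= other.width;
  }
  // A finer (or equal) temporal granularity satisfies a coarser one.
  return Granularity(self.kind) >= Granularity(other.kind);
}
```

Under this model a sort on day satisfies month and year but not the reverse, and truncate[10] satisfies truncate[5] but not the reverse.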
Sources: src/iceberg/transform.cc203-247
The binding process connects a Transform descriptor with a source type to create an executable TransformFunction.
1. The CanTransform() method checks if the source type is compatible
2. Based on the TransformType, the appropriate Make() factory method is called
3. A TransformFunction subclass instance is created
4. The result is returned wrapped in Result<>

Sources: src/iceberg/transform.cc94-132
Transforms generate human-readable partition names using GeneratePartitionName():
| Transform | Source Name | Generated Partition Name |
|---|---|---|
| Identity | "id" | "id" |
| Bucket[16] | "user_id" | "user_id_bucket_16" |
| Truncate[4] | "name" | "name_trunc_4" |
| Year | "timestamp" | "timestamp_year" |
| Month | "timestamp" | "timestamp_month" |
| Day | "timestamp" | "timestamp_day" |
| Hour | "timestamp" | "timestamp_hour" |
| Void | "hidden" | "hidden_null" |
These names are used when creating partition fields in a partition specification.
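The naming scheme in the table reduces to a suffix per transform. The sketch below reproduces it with a hypothetical PartitionName() helper and enum; it is not the library's GeneratePartitionName(), whose signature is not shown in this document.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical enum for the purposes of this example.
enum class NameKind {
  kIdentity, kBucket, kTruncate, kYear, kMonth, kDay, kHour, kVoid
};

// Builds a partition field name from the source column name, following
// the table above. `param` is the bucket count or truncate width.
inline std::string PartitionName(NameKind kind, const std::string& source,
                                 int32_t param = 0) {
  switch (kind) {
    case NameKind::kIdentity: return source;
    case NameKind::kBucket:   return source + "_bucket_" + std::to_string(param);
    case NameKind::kTruncate: return source + "_trunc_" + std::to_string(param);
    case NameKind::kYear:     return source + "_year";
    case NameKind::kMonth:    return source + "_month";
    case NameKind::kDay:      return source + "_day";
    case NameKind::kHour:     return source + "_hour";
    case NameKind::kVoid:     return source + "_null";
  }
  return source;
}
```

For example, PartitionName(NameKind::kBucket, "user_id", 16) yields "user_id_bucket_16", matching the Bucket[16] row.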
Sources: src/iceberg/transform.cc393-412