Apache Iceberg - Java and Python APIs
This chapter is a hands-on exploration of utilizing Apache Iceberg with Java and
Python, providing practical insights into table management, schema, and datafile
operations. It covers aspects such as creating tables, schemas, and partition specs,
as well as reading and updating data. The chapter aims to equip readers with the
necessary skills to efficiently interact with Apache Iceberg using these APIs.
Create a Table
As with any other compute engine, the first step to creating and interacting with
Apache Iceberg tables is to create a catalog. In general, with the Java API, tables are
created using either a Catalog or a Table interface implementation. In this chapter,
you will see how to use two of the built-in catalogs to create Iceberg tables. However,
you can also implement and use a custom catalog of your own.
Initializing a Hive catalog involves providing a name and catalog-specific properties.
Here is an example of creating an Iceberg table using the Hive catalog:
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import java.util.HashMap;
import java.util.Map;
Map<String, String> properties = new HashMap<String, String>();
properties.put("warehouse", "...");
properties.put("uri", "...");

HiveCatalog catalog = new HiveCatalog();
catalog.initialize("hive", properties);
Create a Schema
To create a table schema, you will need to create an object of the Schema class in
Iceberg. Here is an example of defining a schema:
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
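For example, here is a minimal sketch of a schema; the column names and field IDs are illustrative, chosen to match the examples later in this chapter:
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "name", Types.StringType.get()),
    Types.NestedField.required(3, "role", Types.StringType.get()),
    Types.NestedField.optional(4, "hire_date", Types.DateType.get())
);
With a schema in place, you can describe how the table should be partitioned by building a PartitionSpec against it:
import org.apache.iceberg.PartitionSpec;

// Partition the table by day on the hire_date column
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .day("hire_date")
    .build();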
Here, we use .day("hire_date") to define the partitioning strategy for our table.
This means Iceberg will create a separate partition for each distinct day.
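With the catalog, schema, and partition spec in hand, the table itself is created through the catalog. Here is a minimal sketch, assuming the illustrative identifier company.employees:
// Create the Iceberg table in the Hive catalog
Table table = catalog.createTable(
    TableIdentifier.of("company", "employees"),
    schema,
    spec);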
Read a Table
The Table interface in the Java API provides methods to interact with an Iceberg
table to get information on things such as the schema, partition spec, metadata, and
datafiles. Here are a few common methods:
schema()
Returns the current table schema
spec()
Provides the current table partition spec
currentSnapshot()
Retrieves the current snapshot of the table
location()
Provides the base location of the table
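As a quick illustration, assuming the Table handle created earlier in this chapter, these methods can be called directly:
// Inspect the table's current schema, partition spec, snapshot, and location
System.out.println(table.schema());
System.out.println(table.spec());
System.out.println(table.currentSnapshot());
System.out.println(table.location());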
File-level scan
Reading from an Iceberg table begins with creating a TableScan object using the
newScan() method. An Iceberg TableScan returns a list of tasks (datafiles, delete files,
etc.) that are related to the desired scan. This is useful if you want your compute
engine to support Apache Iceberg, as it can use a table scan to generate the list of files
for the query engine to scan and process without having to reimplement the process
of reading the Iceberg metadata. Here is an example:
TableScan scan = table.newScan();
You can also use the filter() method on the TableScan object to filter specific data:
import org.apache.iceberg.expressions.Expressions;

TableScan filteredScan = scan.filter(Expressions.equal("role", "Engineer"));
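Planning the scan then produces the file-level tasks. Here is a minimal sketch that iterates over them, assuming the filteredScan object from above:
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.io.CloseableIterable;

// Plan the scan and print the path of each matching datafile
try (CloseableIterable<FileScanTask> tasks = filteredScan.planFiles()) {
    for (FileScanTask task : tasks) {
        System.out.println(task.file().path());
    }
}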
Row-level scan
With the Java API, initiating a row-level scan that returns individual records of data
begins by creating a ScanBuilder object with IcebergGenerics.read(). This may not be practical for very large datasets, which are better handled by a query engine; those query engines will likely use the TableScan construct discussed previously to plan their queries. Here is an example:
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.IcebergGenerics.ScanBuilder;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.CloseableIterable;

ScanBuilder scanBuilder = IcebergGenerics.read(table);
You can then use the where() method on the ScanBuilder object to filter out records:
CloseableIterable<Record> result = IcebergGenerics.read(table)
.where(Expressions.lessThan("id", 5))
.build();
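The result is an iterable of Iceberg Record objects that can be consumed directly; for example:
// Print each matching record, closing the iterable when finished
try (CloseableIterable<Record> records = result) {
    for (Record record : records) {
        System.out.println(record);
    }
}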
Update a Table
Apache Iceberg’s Table interface exposes methods that allow updates to the table.
These operations follow a builder pattern. For example, to update a table’s schema,
you have to first call updateSchema(), then add the updates to the builder, and finally
call commit() to push the changes to the table. An example follows:
table.updateSchema()
.addColumn("location", Types.StringType.get())
.commit();
Here, we add a new column called location of type String to the Iceberg table
using the addColumn() method. Other available methods include updateLocation,
updateProperties, newAppend, newOverwrite, and rewriteManifests.
For example, newOverwrite() replaces existing datafiles that match a row filter with new ones:
// Overwrite files matching a row filter with a new datafile
// (newDataFile is a DataFile built separately)
table.newOverwrite()
    .overwriteByRowFilter(Expressions.equal("id", 5))
    .addFile(newDataFile)
    .commit();
Install PyIceberg
The latest version of PyIceberg can be installed from PyPI using pip:
pip3 install "pyiceberg[s3fs,hive]"
The preceding command will install PyIceberg with s3fs as a FileIO implementation
and support for the Hive Metastore. Note that you can specify various dependencies
to be installed with PyIceberg within the brackets, [ ], including AWS Glue, Amazon
Simple Storage Service (Amazon S3) readers, PyArrow, and more. Also, you will need
to install either s3fs, adlfs, or PyArrow to access files.
Load a Table
You can load a table with PyIceberg either from a catalog or directly from a metadata
file. Let’s see how to do this in action.
From a catalog
To load a table with PyIceberg from a catalog, start by connecting to a catalog. The
following code connects to a Glue catalog:
from pyiceberg.catalog import load_catalog
catalog = load_catalog("glue_catalog")
catalog.list_namespaces()
This will return the available namespaces. In this example, it returns the namespace company.
Now, to list all the tables in the company namespace, you can call the list_tables()
method:
catalog.list_tables("company")
This will return all the tables that exist within the namespace as a list with tuples—for
example, [("company", "employees")].
Finally, to load the employees table, you can use the load_table() method:
catalog.load_table("company.employees")
# Alternatively:
catalog.load_table(("company", "employees"))
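To load a table directly from a metadata file instead, PyIceberg provides StaticTable. Here is a minimal sketch; the metadata file path is illustrative:
from pyiceberg.table import StaticTable

# Load a read-only table straight from its metadata file
table = StaticTable.from_metadata(
    "s3://warehouse/company/employees/metadata/v2.metadata.json"
)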
Create a Table
To create a table with PyIceberg, you need to define the schema, any partition
specification or sort order, and the catalog from which you’re creating the table. The
create_table() method then takes these arguments and creates the table. Here is an
example that creates an employee table with five fields:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, StringType, NestedField
catalog = load_catalog("glue_catalog")
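Here is a minimal sketch of the create_table() call; the field names and the company.employees identifier are illustrative, chosen to line up with the scan example that follows:
from pyiceberg.types import LongType  # in addition to the types imported above

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="name", field_type=StringType(), required=True),
    NestedField(field_id=3, name="role", field_type=StringType(), required=True),
    NestedField(field_id=4, name="salary", field_type=DoubleType(), required=True),
    NestedField(field_id=5, name="department", field_type=StringType(), required=False),
)

# Create the table in the Glue catalog using the schema defined above
table = catalog.create_table(
    identifier="company.employees",
    schema=schema,
)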
Once the table exists, you can scan it with PyIceberg. The scan() method lets you filter rows, select specific fields, and limit the number of results:
from pyiceberg.expressions import GreaterThanOrEqual

table = catalog.load_table("company.employees")

scan = table.scan(
    row_filter=GreaterThanOrEqual("salary", 50000.0),
    selected_fields=("id", "role", "salary"),
    limit=100,
)
After you perform a table scan, you must pass the list of tasks/files to something that
can process and read the list of files. PyIceberg has integrations to pass the scan plan
to PyArrow, DuckDB, and Ray to query and process the data.
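For example, assuming the scan from the previous section, the results can be handed off as follows:
# Materialize the scan as an in-memory PyArrow table
arrow_table = scan.to_arrow()

# Or register the scan with DuckDB and query it with SQL
con = scan.to_duckdb(table_name="employees")
results = con.execute(
    "SELECT role, AVG(salary) FROM employees GROUP BY role"
).fetchall()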
Conclusion
This chapter covered the Java and Python APIs for working with Apache Iceberg
tables. These native table APIs allow you to create tables, create scan plans for
compute engines, create transactions, and more. New native table APIs in additional languages, such as Rust and Go, are in development, expanding the reach of Apache Iceberg across more languages and the tools built in them.