This quickstart shows you how to set up a Java development environment and run
an example pipeline written with the
Apache Beam Java SDK, using a
runner of your choice.
If you’re interested in contributing to the Apache Beam Java codebase, see the
Contribution Guide.
Download and install the
Java Development Kit (JDK)
version 8, 11, or 17. Verify that the
JAVA_HOME
environment variable is set and points to your JDK installation.
The example used in this tutorial, WordCount.java, defines a
Beam pipeline that counts words from an input file (by default, a .txt
file containing Shakespeare’s “King Lear”). To learn more about the examples,
see the WordCount Example Walkthrough.
Optional: Convert from Maven to Gradle
The steps below explain how to convert the build from Maven to Gradle for the
following runners:
In the directory with the pom.xml file, run the automated Maven-to-Gradle
conversion:
gradle init
You’ll be asked if you want to generate a Gradle build. Enter yes. You’ll
also be prompted to choose a DSL (Groovy or Kotlin). For this tutorial, enter
2 for Kotlin.
Open the generated build.gradle.kts file and make the following changes:
In repositories, replace mavenLocal() with mavenCentral().
In repositories, declare a repository for Confluent Kafka dependencies:
If you’re planning to use the DataflowRunner, you can skip this step. The
runner will pull text directly from Google Cloud Storage.
In the word-count-beam directory, create a file called sample.txt.
Add some text to the file. For this example, use the text of Shakespeare’s
King Lear.
Run a pipeline
A single Beam pipeline can run on multiple Beam
runners. The
DirectRunner is useful for getting started,
because it runs on your machine and requires no specific setup. If you’re just
trying out Beam and you’re not sure what to use, use the
DirectRunner.
The general process for running a pipeline goes like this:
Complete any runner-specific setup.
Build your command line:
Specify a runner with --runner=<runner> (defaults to the
DirectRunner).
Add any runner-specific required options.
Choose input files and an output location that are accessible to the
runner. (For example, you can’t access a local file if you are running
the pipeline on an external cluster.)
TODO: document Samza on Gradle: https://github.com/apache/beam/issues/21500
TODO: document Nemo on Gradle: https://github.com/apache/beam/issues/21503
TODO: document Jet on Gradle: https://github.com/apache/beam/issues/21501
Inspect the results
After the pipeline has completed, you can view the output. There might be
multiple output files prefixed by count. The number of output files is decided
by the runner, giving it the flexibility to do efficient, distributed execution.
View the output files in a Unix shell:
ls counts*
ls counts*
ls /tmp/counts*
ls counts*
gsutil ls gs://<your-gcs-bucket>/counts*
ls /tmp/counts*
ls counts*
ls counts*
The output files contain unique words and the number of occurrences of each
word.
View the output content in a Unix shell:
more counts*
more counts*
more /tmp/counts*
more counts*
gsutil cat gs://<your-gcs-bucket>/counts*
more /tmp/counts*
more counts*
more counts*
The order of elements is not guaranteed, to allow runners to optimize for
efficiency. But the output should look something like this: