An Apache Beam streaming pipeline built in Python. It takes JSON messages from Cloud PubSub and streams them to BigQuery after performing a validation step.
The solution was built using the following Google Cloud Platform products:
- Cloud PubSub: Serverless message queue.
- Cloud Dataflow: Managed runner for Apache Beam jobs, built for large-scale distributed data processing.
- BigQuery: Serverless data warehouse with a SQL interface.
I used Cloud Scheduler and Compute Engine to simulate a message stream arriving in PubSub. Cloud Scheduler was used for development purposes, and pubsub/pubsub_publisher.py ran on a Compute Engine virtual machine to publish each message from the sample data to PubSub.
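For illustration, a publisher along these lines could look like the sketch below. The project, topic, and sample-data file names are placeholders and not necessarily what pubsub/pubsub_publisher.py actually contains.

```python
# Hypothetical sketch of a PubSub publisher, assuming the sample data is a
# newline-delimited JSON file. All names below are placeholders.
import time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "appointment-events")

with open("sample_data.jsonl") as sample_file:
    for line in sample_file:
        # Publish the raw JSON string as the message body.
        future = publisher.publish(topic_path, line.strip().encode("utf-8"))
        future.result()   # block until the publish is acknowledged
        time.sleep(1)     # throttle to simulate a live stream
```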
Dataflow acts as the glue in this architecture. It reads messages from PubSub and validates them. Invalid messages are inserted into a failed_rows table, while valid messages are inserted into a table called events_raw. A view is built on top of the events_raw table in BigQuery which shows the latest state of each appointment.
There are two validation steps in the Apache Beam pipeline. The first, just after reading from PubSub, checks whether the message is a valid JSON string; it filters out messages like { "Type": "AppointmentBooked", "Data": { . After this filter, valid messages get inserted into a BigQuery table. The second step captures unsuccessful inserts and writes them to a separate BigQuery table as strings. The processing steps are shown on the Dataflow DAG below.
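The snippet below is a minimal sketch of how such a JSON-validation step can be expressed in Beam with tagged outputs. The DoFn name and output tags are assumptions; the actual implementation lives in dataflow/streaming_pipeline.py.

```python
# Minimal sketch of a JSON-validation DoFn with tagged outputs.
# Class and tag names are assumptions, not the actual pipeline code.
import json

import apache_beam as beam


class ParseJson(beam.DoFn):
    FAILED_TAG = "failed"

    def process(self, element):
        # ReadFromPubSub yields bytes by default; work with a string copy.
        raw = element.decode("utf-8") if isinstance(element, bytes) else element
        try:
            # Valid JSON goes to the main output as a dict, ready for BigQuery.
            yield json.loads(raw)
        except json.JSONDecodeError:
            # Malformed messages go to a side output as the raw string.
            yield beam.pvalue.TaggedOutput(self.FAILED_TAG, raw)


# Usage inside the pipeline, splitting the stream into two PCollections:
# parsed = messages | beam.ParDo(ParseJson()).with_outputs("failed", main="valid")
# parsed.valid  -> rows destined for events_raw
# parsed.failed -> strings destined for failed_rows
```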
A long-lived PubSub subscription is used to read from the PubSub topic. By default, Beam creates a short-lived subscription for each pipeline run. A long-lived subscription is beneficial because messages are retained in the subscription after a pipeline is cancelled or fails.
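As an illustration, reading from the existing subscription rather than the topic looks roughly like this (the subscription path is a placeholder):

```python
# Sketch: reading from an existing, long-lived subscription instead of letting
# Beam create a temporary subscription from the topic.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    messages = (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/{project}/subscriptions/{subscription}"
        )
    )
    # Passing topic=... instead would make Beam create a fresh subscription per
    # run, so messages published while no pipeline is running would not be
    # available to a later run.
```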
Finally, I chose Python over Java for this streaming pipeline because of the prototype nature of the project and the time constraints. Java has a lot of benefits over Python when it comes to Dataflow streaming, so I would probably rewrite the pipeline in Java if I had to productionise it.
BigQuery tables created:
- events_raw: Stores valid messages. Dataflow inserts rows into this table using streaming inserts.
- failed_rows: Stores invalid messages as strings. Dataflow inserts rows into this table using streaming inserts.
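A rough sketch of how the streaming inserts into these two tables might be wired up is shown below. `valid_rows` and `failed_messages` stand for the outputs of the validation step; the table names, options, and the failed_rows schema are assumptions rather than the actual pipeline code.

```python
# Sketch of the streaming-insert writes for the two tables (placeholder names).
import apache_beam as beam

_ = valid_rows | "WriteEventsRaw" >> beam.io.WriteToBigQuery(
    "{project}:{dataset}.events_raw",
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
)

_ = (
    failed_messages
    | "ToFailedRow" >> beam.Map(lambda raw: {"message": raw})  # assumed column name
    | "WriteFailedRows" >> beam.io.WriteToBigQuery(
        "{project}:{dataset}.failed_rows",
        schema="message:STRING",
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
)
```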
BigQuery views created:
- state_view: A BigQuery view on top of the events_raw table which shows the latest row for each AppointmentId. This view is based on the SQL query in bigquery/appointment_state_view.sql.
Average appointment duration and average appointment duration by discipline can easily be calculated by querying the events_raw table. The SQL queries for these metrics can be found in bigquery/appointment_avg_duration.sql and bigquery/appointment_avg_duration_by_discipline.sql.
The screenshot below shows a preview of the events_raw table in BigQuery.
You can run the Apache Beam pipeline either locally or on Dataflow in Google Cloud Platform. First, install the dependencies:
pip install -r requirements.txt
To run the pipeline locally:
cd dataflow
python3 -m streaming_pipeline \
--subscription=projects/{project}/subscriptions/{subscription} \
--bigquery_table={project}:{dataset}.{table} \
--bigquery_table_for_failed_rows={project}:{dataset}.{table}
To run the pipeline on Dataflow:
cd dataflow
python3 -m streaming_pipeline \
--project={project} \
--temp_location={bucket}/temp \
--staging_location={bucket}/staging \
--runner=DataflowRunner \
--subscription=projects/{project}/subscriptions/{subscription} \
--bigquery_table={project}:{dataset}.{table} \
--bigquery_table_for_failed_rows={project}:{dataset}.{table}
Handling changing data schema
Some fields in the incoming JSON messages don't always exist. Fortunately, I didn't have to worry too much about this, because BigQuery can handle the absence of a field if the field's mode in the table schema is NULLABLE. If a field is absent from a message, BigQuery automatically inserts null as the value. Additional fields, on the other hand, can cause issues: an insert fails when a row has extra fields that don't match the schema. With the current solution these inserts fail on the events_raw table and the offending rows end up in the failed_rows table.
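As an illustration, the NULLABLE behaviour comes from how the table schema is declared. The field names below are made up for the example and are not the actual events_raw schema.

```python
# Illustrative schema dict with NULLABLE fields, in the dictionary format
# accepted by beam.io.WriteToBigQuery. Field names are assumptions only.
EVENTS_RAW_SCHEMA = {
    "fields": [
        {"name": "Type", "type": "STRING", "mode": "NULLABLE"},
        {"name": "AppointmentId", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Discipline", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Timestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
    ]
}
# A message missing "Discipline" still inserts cleanly (the column becomes NULL),
# but a message with an extra, undeclared field makes the streaming insert fail.
```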
Handling broken events
Malformed JSON events are handled by a validation step in the Apache Beam pipeline. These events get inserted into the failed_rows BigQuery table as a string.
This prototype doesn't have unit tests or integration tests. It could be improved by unit testing the transformations of the Beam pipeline, and also by testing the pipeline end-to-end.
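For example, a unit test for the JSON-validation step could look roughly like this, assuming a ParseJson DoFn with "valid" and "failed" outputs as in the sketch earlier in this README:

```python
# Sketch of a unit test for the JSON-validation step using Beam's testing
# utilities. ParseJson and its output tags are the assumed names from the
# earlier sketch, not necessarily the real pipeline code.
import json

from apache_beam import Create, ParDo
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

# from streaming_pipeline import ParseJson  # hypothetical import path


def test_parse_json_splits_valid_and_invalid():
    good = '{"Type": "AppointmentBooked", "Data": {"AppointmentId": "1"}}'
    bad = '{ "Type": "AppointmentBooked", "Data": {'

    with TestPipeline() as p:
        results = (
            p
            | Create([good, bad])
            | ParDo(ParseJson()).with_outputs("failed", main="valid")
        )
        assert_that(results.valid, equal_to([json.loads(good)]), label="valid rows")
        assert_that(results.failed, equal_to([bad]), label="failed rows")
```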