ADE Mod 1: Incremental Processing with Spark Structured Streaming
Databricks Academy, 2023
Traditional batch-oriented data processing is one-off and bounded. Stream processing is continuous and unbounded.
Data Velocity & Volumes: rising data velocity and volumes require continuous, incremental processing; all of the data cannot be processed in one batch on a schedule.
Bounded dataset:
• Has a finite and unchanging structure at the time of processing.
• The order is static.
• Analogy: vehicles in a parking lot.
Unbounded dataset:
• Has an infinite and continuously changing structure at the time of processing.
• The order is not always sequential.
• Analogy: vehicles on a highway.
Batch Processing (bounded dataset, batch processing engine):
DBMS, apps, collection agents, IoT devices, and logs → ingestion (e.g. Fivetran) lands data in staging files (S3, ADLS) → move into Delta tables → ETL: clean and transform data into Gold tables → query from Gold tables.
Batch processing runs on a schedule.
Stream Processing (unbounded dataset):
Stream data sources → stream data ingestion layer → stream processing engine (micro-batch or 1-by-1: streaming transformations, pattern detection) → data storage → query engine.
Differences: how is the data processed in one run?
• Bounded dataset: a batch processing engine processes one big batch; a stream processing engine processes row by row or in mini-batches.
• Unbounded dataset: not applicable to a batch processing engine (requires multiple runs); a stream processing engine processes row by row or in mini-batches.
• Query computation: computed only once by a batch processing engine; computed multiple times by a stream processing engine.
Similarities:
• Both have a data transformation engine.
• The output of a streaming job is often queried in batch jobs.
• Stream processing often includes batch processing (micro-batch).
Better fault tolerance through checkpointing.
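As a minimal sketch of how checkpointing is wired in (the table names and checkpoint path are hypothetical; the streaming API itself is introduced on the following slides):

# Sketch: a streaming query resumes from its checkpoint after a failure or
# restart. Table names and the checkpoint path are hypothetical.
query = (spark.readStream
    .table("bronze_events")
    .writeStream
    .option("checkpointLocation", "/checkpoints/bronze_to_silver")
    .toTable("silver_events"))

# Restarting this same code with the same checkpointLocation resumes from the
# last committed offsets instead of reprocessing the whole source.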
Continuous: immediately process any car reaching this point.
Micro-batch: 1-minute batches, processed in parallel.
Transformations & Actions:
• Parse nested JSON
• Store in a structured Delta Lake table
Core concepts:
• Input sources
• Sinks
• Transformations & actions
• Triggers
Source:
• Specify where to read data from.
• OS Spark supports Kafka and file sources.
• Databricks runtimes include connector libraries supporting Delta, Event Hubs, and Kinesis.
• Returns a Spark DataFrame (a common API for batch and streaming data).

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
spark.readStream.format(<source>)
  .option(<>, <>)...
  .load()
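The file source follows the same pattern; a minimal sketch, assuming a directory of JSON files and a hypothetical schema (streaming file sources require an explicit schema):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the incoming JSON events
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("timestamp", LongType()),
    StructField("reading", StringType()),
])

# New files landing in the directory are picked up incrementally
events_df = (spark.readStream
    .format("json")
    .schema(event_schema)
    .load("/landing/events/"))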
Transformations:
• 100s of built-in, optimized SQL functions like from_json.
• In this example: cast the bytes from Kafka records to a string, parse it as JSON, and generate nested columns.

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
Sink: write transformed output to external storage systems.
• Databricks runtimes include a connector library supporting Delta.
• OS Spark supports:
  • Files and Kafka for production
  • Console and memory for development and debugging
  • foreachBatch to execute arbitrary code with the output data (sketched below)

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
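As a hedged sketch of the foreachBatch option, the function below merges each micro-batch into a Delta table; the merge key, checkpoint path, and the transformed_df DataFrame are illustrative, not part of the course example:

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target table instead of blindly appending
    target = DeltaTable.forPath(spark, "/deltaTable/")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.data.id = s.data.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(transformed_df.writeStream
    .foreachBatch(upsert_to_delta)       # arbitrary batch code runs per micro-batch
    .option("checkpointLocation", "/checkpoints/upsert")
    .start())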
• Checkpoint location: for tracking the progress of the query.
• Output mode: defines how the data is written to the sink; equivalent to the "save" mode on static DataFrames.
• Trigger: defines how frequently the input table is checked for new data; each time a trigger fires, Spark checks for new data and updates the result.

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "…")
  .start()
Trigger Types:
• Fixed-interval micro-batches: .trigger(processingTime="2 minutes") kicks off micro-batch processing at the user-specified interval.
• Triggered one-time micro-batch: .trigger(once=True) processes all of the available data as a single micro-batch and then automatically stops the query.
• Triggered one-time micro-batches: .trigger(availableNow=True) processes all of the available data as multiple micro-batches and then automatically stops the query.
• Continuous processing: .trigger(continuous="2 seconds") runs long-running tasks that continuously read, process, and write data as soon as events are available, with checkpoints at the specified frequency.
• Default: Databricks uses a 500 ms fixed interval; OS Apache Spark processes each micro-batch as soon as the previous one has been processed.
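The trigger types above map directly onto the .trigger() call; a sketch with a placeholder streaming DataFrame df and placeholder paths (each query would normally use exactly one trigger setting):

# Fixed-interval micro-batches every 2 minutes
(df.writeStream.format("delta")
    .option("checkpointLocation", "/chk/fixed")
    .trigger(processingTime="2 minutes")
    .start("/out/fixed"))

# Process all currently available data as multiple micro-batches, then stop
(df.writeStream.format("delta")
    .option("checkpointLocation", "/chk/available_now")
    .trigger(availableNow=True)
    .start("/out/available_now"))

# .trigger(once=True) and .trigger(continuous="2 seconds") follow the same
# pattern; continuous processing supports only a limited set of sources,
# sinks, and operations.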
Output Modes:
• Complete: the entire updated Result Table is written to the sink; the individual sink implementation decides how to handle writing the entire table.
• Append: only the new rows appended to the Result Table since the last trigger are written to the sink.
• Update: only new rows and the rows in the Result Table that were updated since the last trigger are written to the sink.
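Output mode interacts with the shape of the query: append suits simple projections, while complete is needed for a full running aggregation. A sketch with hypothetical DataFrame and table names:

# Append mode on a simple projection: only new rows are written each trigger
(events_df.select("device_id", "reading")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/append")
    .toTable("silver_readings"))

# Complete mode on a running aggregation: the whole Result Table is rewritten
# each trigger
(events_df.groupBy("device_id").count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/chk/complete")
    .toTable("device_counts"))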
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "…")
  .outputMode("append")
  .start()

Raw data from Kafka is available as structured data in seconds, ready for querying.
Extensibility
Streaming DataFrame API — Open file access — SQL API — Python API
• Stateless
  • Typically trivial transformations; the way records are handled does not depend on previously seen records.
  • Example: data ingest (map-only), simple dimensional joins.
• Stateful
  • Previously seen records can influence new records.
  • Example: aggregations over time, fraud/anomaly detection.
Watermark: Handle late data and limit how long to remember old data
• Analogy: Highway minimum speed limit
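A sketch of a stateful windowed aggregation with a watermark; the event_time column, 5-minute window, and 10-minute lateness threshold are illustrative:

from pyspark.sql.functions import window, col

windowed_counts = (events_df
    # State for windows more than 10 minutes older than the latest seen event
    # time is dropped; data arriving later than that may be ignored
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
    .count())

(windowed_counts.writeStream
    .format("delta")
    .outputMode("append")            # finalized windows are appended once closed
    .option("checkpointLocation", "/chk/windowed")
    .toTable("device_counts_5m"))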
[Diagram: CSV, JSON, and TXT sources feeding streaming analytics, data quality, and AI and reporting workloads]
• Difficult to switch between batch and stream processing.
• Impossible to trace data lineage.
• Error handling and recovery is laborious.
• Table definitions are written (but not run) in notebooks. Databricks Repos allow you to version control your table definitions.
• A Pipeline picks one or more notebooks of table definitions, as well as any configuration required.
• DLT will create or update all the tables in the pipelines.
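A hedged sketch of what such a table definition notebook might contain, using the dlt Python module; the table names and source path are illustrative:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested incrementally from cloud storage")
def bronze_events():
    return (spark.readStream
        .format("cloudFiles")                  # Auto Loader on Databricks
        .option("cloudFiles.format", "json")
        .load("/landing/events/"))

@dlt.table(comment="Cleaned events")
def silver_events():
    # Reads the bronze table as a stream; the pipeline resolves the dependency
    return dlt.read_stream("bronze_events").where(col("device_id").isNotNull())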
Development vs Production
Fast iteration or enterprise-grade reliability.

In the Pipelines UI:
A. Desired latency
B. Total cost of operation (TCO)
C. Maximum throughput
D. Cloud object storage
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
_____
Select one response.
A. .load()
B. .print()
C. .return()
D. .merge()