ADE Mod 1: Incremental Processing with Spark Structured Streaming
Databricks Academy, 2023
Traditional batch-oriented data processing is one-off and bounded. Stream processing is continuous and unbounded.
Data Velocity & Volumes: rising data velocity and volumes require continuous, incremental processing; all of the data cannot be processed in one batch on a schedule.
Bounded dataset:
• Has a finite and unchanging structure at the time of processing.
• The order is static.
• Analogy: vehicles in a parking lot.
Unbounded dataset:
• Has an infinite and continuously changing structure at the time of processing.
• The order is not always sequential.
• Analogy: vehicles on a highway.
Batch Processing (bounded dataset, batch processing engine):
DBMS, apps, collection agents, IoT devices, and logs → ingestion (e.g. Fivetran) lands data in staging files (S3, ADLS) → move into Delta tables → ETL: clean and transform data into Gold tables → query from Gold tables.
Batch processing runs on a schedule.
Stream Processing (unbounded dataset):
Stream data sources → stream data ingestion layer → stream processing engine (micro-batch or 1-by-1: streaming transformations, pattern detection) → data storage → query engine.
Differences: how is the data processed in one run?
• Bounded dataset: a batch processing engine processes one big batch; a stream processing engine processes row by row or in mini-batches.
• Unbounded dataset: not applicable to a batch processing engine (requires multiple runs); a stream processing engine processes row by row or in mini-batches.
• Query computation: computed only once by a batch processing engine; computed multiple times by a stream processing engine.
Similarities:
• Both have a data transformation engine.
• The output of a streaming job is often queried in batch jobs.
• Stream processing often includes batch processing (micro-batch).
Better fault tolerance through checkpointing.
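As a minimal sketch of how checkpointing is wired in (the table names and checkpoint path are hypothetical; the streaming API itself is introduced on the following slides):

# Sketch: a streaming query resumes from its checkpoint after a failure or
# restart. Table names and the checkpoint path are hypothetical.
query = (spark.readStream
    .table("bronze_events")
    .writeStream
    .option("checkpointLocation", "/checkpoints/bronze_to_silver")
    .toTable("silver_events"))

# Restarting this same code with the same checkpointLocation resumes from the
# last committed offsets instead of reprocessing the whole source.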
Continuous: immediately process any car reaching this point.
Micro-batch: 1-minute batches, processed in parallel.
Transformations & Actions:
• Parse nested JSON
• Store in a structured Delta Lake table
Core concepts:
• Input sources
• Sinks
• Transformations & actions
• Triggers
Source:
• Specify where to read data from.
• OS Spark supports Kafka and file sources.
• Databricks runtimes include connector libraries supporting Delta, Event Hubs, and Kinesis.
• Returns a Spark DataFrame (a common API for batch and streaming data).

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
spark.readStream.format(<source>)
  .option(<>, <>)...
  .load()
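The file source follows the same pattern; a minimal sketch, assuming a directory of JSON files and a hypothetical schema (streaming file sources require an explicit schema):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the incoming JSON events
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("timestamp", LongType()),
    StructField("reading", StringType()),
])

# New files landing in the directory are picked up incrementally
events_df = (spark.readStream
    .format("json")
    .schema(event_schema)
    .load("/landing/events/"))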
Transformations:
• 100s of built-in, optimized SQL functions like from_json.
• In this example: cast the bytes from Kafka records to a string, parse it as JSON, and generate nested columns.

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
Sink: write transformed output to external storage systems.
• Databricks runtimes include a connector library supporting Delta.
• OS Spark supports:
  • Files and Kafka for production
  • Console and memory for development and debugging
  • foreachBatch to execute arbitrary code with the output data (sketched below)

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
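As a hedged sketch of the foreachBatch option, the function below merges each micro-batch into a Delta table; the merge key, checkpoint path, and the transformed_df DataFrame are illustrative, not part of the course example:

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target table instead of blindly appending
    target = DeltaTable.forPath(spark, "/deltaTable/")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.data.id = s.data.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(transformed_df.writeStream
    .foreachBatch(upsert_to_delta)       # arbitrary batch code runs per micro-batch
    .option("checkpointLocation", "/checkpoints/upsert")
    .start())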
• Checkpoint location: for tracking the progress of the query.
• Output mode: defines how the data is written to the sink; equivalent to the "save" mode on static DataFrames.
• Trigger: defines how frequently the input table is checked for new data; each time a trigger fires, Spark checks for new data and updates the result.

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "…")
  .start()
Trigger Types:
• Fixed-interval micro-batches: .trigger(processingTime="2 minutes") kicks off micro-batch processing at the user-specified interval.
• Triggered one-time micro-batch: .trigger(once=True) processes all of the available data as a single micro-batch and then automatically stops the query.
• Triggered one-time micro-batches: .trigger(availableNow=True) processes all of the available data as multiple micro-batches and then automatically stops the query.
• Continuous processing: .trigger(continuous="2 seconds") runs long-running tasks that continuously read, process, and write data as soon as events are available, with checkpoints at the specified frequency.
• Default: Databricks uses a 500 ms fixed interval; OS Apache Spark processes each micro-batch as soon as the previous one has been processed.
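The trigger types above map directly onto the .trigger() call; a sketch with a placeholder streaming DataFrame df and placeholder paths (each query would normally use exactly one trigger setting):

# Fixed-interval micro-batches every 2 minutes
(df.writeStream.format("delta")
    .option("checkpointLocation", "/chk/fixed")
    .trigger(processingTime="2 minutes")
    .start("/out/fixed"))

# Process all currently available data as multiple micro-batches, then stop
(df.writeStream.format("delta")
    .option("checkpointLocation", "/chk/available_now")
    .trigger(availableNow=True)
    .start("/out/available_now"))

# .trigger(once=True) and .trigger(continuous="2 seconds") follow the same
# pattern; continuous processing supports only a limited set of sources,
# sinks, and operations.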
Output Modes:
• Complete: the entire updated Result Table is written to the sink; the individual sink implementation decides how to handle writing the entire table.
• Append: only the new rows appended to the Result Table since the last trigger are written to the sink.
• Update: only new rows and the rows in the Result Table that were updated since the last trigger are written to the sink.
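Output mode interacts with the shape of the query: append suits simple projections, while complete is needed for a full running aggregation. A sketch with hypothetical DataFrame and table names:

# Append mode on a simple projection: only new rows are written each trigger
(events_df.select("device_id", "reading")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/append")
    .toTable("silver_readings"))

# Complete mode on a running aggregation: the whole Result Table is rewritten
# each trigger
(events_df.groupBy("device_id").count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/chk/complete")
    .toTable("device_counts"))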
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/deltaTable/")
  .trigger(processingTime="1 minute")
  .option("checkpointLocation", "…")
  .outputMode("append")
  .start()

Raw data from Kafka is available as structured data in seconds, ready for querying.
Extensibility
Streaming DataFrame API — Open file access — SQL API — Python API
• Stateless
  • Typically trivial transformations; the way records are handled does not depend on previously seen records.
  • Example: data ingest (map-only), simple dimensional joins.
• Stateful
  • Previously seen records can influence new records.
  • Example: aggregations over time, fraud/anomaly detection.
Watermark: Handle late data and limit how long to remember old data
• Analogy: Highway minimum speed limit
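A sketch of a stateful windowed aggregation with a watermark; the event_time column, 5-minute window, and 10-minute lateness threshold are illustrative:

from pyspark.sql.functions import window, col

windowed_counts = (events_df
    # State for windows more than 10 minutes older than the latest seen event
    # time is dropped; data arriving later than that may be ignored
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
    .count())

(windowed_counts.writeStream
    .format("delta")
    .outputMode("append")            # finalized windows are appended once closed
    .option("checkpointLocation", "/chk/windowed")
    .toTable("device_counts_5m"))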
[Diagram: CSV, JSON, and TXT sources feeding streaming analytics, data quality, and AI and reporting workloads]
• Difficult to switch between batch and stream processing.
• Impossible to trace data lineage.
• Error handling and recovery is laborious.
• Table definitions are written (but not run) in notebooks. Databricks Repos allow you to version control your table definitions.
• A Pipeline picks one or more notebooks of table definitions, as well as any configuration required.
• DLT will create or update all the tables in the pipelines.
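A hedged sketch of what such a table definition notebook might contain, using the dlt Python module; the table names and source path are illustrative:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested incrementally from cloud storage")
def bronze_events():
    return (spark.readStream
        .format("cloudFiles")                  # Auto Loader on Databricks
        .option("cloudFiles.format", "json")
        .load("/landing/events/"))

@dlt.table(comment="Cleaned events")
def silver_events():
    # Reads the bronze table as a stream; the pipeline resolves the dependency
    return dlt.read_stream("bronze_events").where(col("device_id").isNotNull())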
Development vs Production
Fast iteration or enterprise-grade reliability.

In the Pipelines UI:
A. Desired latency
B. Total cost of operation (TCO)
C. Maximum throughput
D. Cloud object storage
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
_____
Select one response.
A. .load()
B. .print()
C. .return()
D. .merge()