JSON Functions in PySpark
JSON IN DATA PROCESSING
1. JSON (JavaScript Object Notation) is a lightweight data-interchange format.
2. Easy for humans to read and write, and easy for machines to parse and generate.
3. Widely used in web applications and APIs.
4. Working with JSON is unavoidable in real-world data pipelines: logs, APIs, and nested datasets all love JSON.
1. Reading JSON
▪ Use spark.read.json to read JSON files into a DataFrame.
▪ Automatically infers the schema.
▪ Handles nested structures.
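A minimal sketch (the file name people.json is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read JSON; Spark infers the schema, including nested fields.
df = spark.read.json("people.json")
df.printSchema()
df.show()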
2. Writing DataFrame to JSON
▪ df.write.json() writes a DataFrame to a JSON file or directory.
▪ Output options such as compression (e.g., gzip, snappy) and ignoreNullFields can be specified when writing JSON files.
▪ Supports overwrite, append, partitioning, etc.
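For example (the output path is illustrative):

# Write gzipped JSON, overwriting any existing output and dropping null fields.
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .option("ignoreNullFields", "true") \
    .json("output/people_json")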
3. Working with JSON Columns (Strings Stored as JSON)
▪ from_json(col("json_col"), schema) → parses a JSON string into a struct.
▪ to_json(col("struct_col")) → converts a struct back into a JSON string.
from_json Function
▪ The from_json function parses a column containing JSON strings into a structured format (e.g., StructType, ArrayType).
▪ This is useful when JSON data is stored as strings in a DataFrame and you want to work with it in a more manageable, queryable form.
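A sketch with a small inline dataset (the sample data and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('{"name": "Alice", "age": 30}',)], ["json_col"])

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# Parse the JSON string column into a struct column.
parsed = df.withColumn("parsed", from_json(col("json_col"), schema))
parsed.select("parsed.name", "parsed.age").show()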
to_json Function
▪ The to_json function converts a structured column (e.g., StructType, ArrayType) back into a JSON string.
▪ This is useful when you want to serialize structured data into JSON for storage or transmission.
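A sketch along the same lines (the sample data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct, col

spark = SparkSession.builder.getOrCreate()

# Build a struct from plain columns, then serialize it to a JSON string.
df = spark.createDataFrame([("Alice", 30)], ["name", "age"])
df.withColumn("json_str", to_json(struct(col("name"), col("age")))).show(truncate=False)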
Complete Example
1. Reading JSON strings: the JSON strings in the json_col column are parsed into a structured format using the from_json function with a defined schema.
2. Selecting fields: specific fields from the parsed JSON are selected and aliased for easier access, while retaining the original parsed JSON column.
3. Converting back to JSON: the structured data in the parsed_json column is converted back into a JSON string using the to_json function.
4. Displaying data: the resulting DataFrame, which includes both the structured fields and the JSON string, is displayed.
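Putting the steps together in one runnable sketch (the sample data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# JSON strings stored in a column.
df = spark.createDataFrame(
    [('{"name": "Alice", "age": 30}',), ('{"name": "Bob", "age": 25}',)],
    ["json_col"],
)

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# 1. Parse the JSON strings with the defined schema.
parsed = df.withColumn("parsed_json", from_json(col("json_col"), schema))

# 2. Select and alias fields, keeping the parsed struct column.
selected = parsed.select(
    col("parsed_json.name").alias("name"),
    col("parsed_json.age").alias("age"),
    col("parsed_json"),
)

# 3. Convert the struct back into a JSON string.
result = selected.withColumn("json_str", to_json(col("parsed_json")))

# 4. Display the structured fields alongside the JSON string.
result.show(truncate=False)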
4. Handling Multi-line JSON Files
▪ Use the multiline option to read multi-line JSON files.
▪ Automatically infers the schema.
▪ Handles nested structures.
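By default, spark.read.json expects one JSON record per line; enabling multiline lets a single record span several lines (the file name is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single JSON record may span multiple lines when multiline is enabled.
df = spark.read.option("multiline", "true").json("nested_records.json")
df.printSchema()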
5. Creating a Temporary View with JSON Data
1. Reading JSON data: read JSON data into a DataFrame.
2. Parsing JSON strings: use from_json to parse JSON strings into structured columns.
3. Creating a temporary view: create a temporary view to run SQL queries on the DataFrame (see the sketch below).
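A sketch (the view name and sample data are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('{"name": "Alice", "age": 30}',), ('{"name": "Bob", "age": 25}',)],
    ["json_col"],
)
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])
parsed = df.withColumn("parsed", from_json(col("json_col"), schema))

# Register a temporary view and query it with SQL.
parsed.createOrReplaceTempView("people")
spark.sql("SELECT parsed.name, parsed.age FROM people WHERE parsed.age > 25").show()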
6. Exploding Nested JSON
▪ This is particularly useful for nested JSON structures where you need to flatten the data for analysis.
▪ Steps involved (sketched below):
• Reading JSON data: read JSON data into a DataFrame.
• Parsing JSON strings: use from_json to parse JSON strings into structured columns.
• Exploding nested arrays: use the explode function to transform array elements into individual rows.
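A sketch of the steps (the sample data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json, col
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

spark = SparkSession.builder.getOrCreate()

# JSON strings containing a nested array.
df = spark.createDataFrame(
    [('{"name": "Alice", "hobbies": ["reading", "hiking"]}',)],
    ["json_col"],
)

schema = StructType([
    StructField("name", StringType()),
    StructField("hobbies", ArrayType(StringType())),
])

parsed = df.withColumn("parsed", from_json(col("json_col"), schema))

# explode() turns each array element into its own row.
flat = parsed.select(col("parsed.name"), explode(col("parsed.hobbies")).alias("hobby"))
flat.show()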
Summary:
➢ JSON is widely used in APIs and log data; Spark provides rich support to parse and handle it.
➢ read.json() helps in directly loading JSON files into DataFrames.
➢ Use from_json() to parse JSON strings into struct types and to_json() to convert structs to JSON strings.
➢ explode() is used to flatten arrays within nested JSON structures.
➢ Define schemas using StructType to handle complex or deeply nested JSON fields efficiently.
Function      Purpose
read.json()   Load a JSON file as a DataFrame
from_json()   Parse a JSON string into a struct
to_json()     Convert a struct to a JSON string
explode()     Flatten nested arrays