PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
These snippets are licensed under the CC0 1.0 Universal License. That means you can freely copy and adapt these code snippets and you don't
need to give attribution or include any notices.
"Auto MPG Data Set" available from the UCI Machine Learning Repository.
customer_spend.csv, a generated time series dataset.
date_examples.csv, a generated dataset with various date and time formats.
weblog.csv, a cleaned version of this web log dataset.
These snippets were tested against the Spark 3.2.2 API. This page was last updated 2022-09-19 15:31:03.
Try in a Notebook
See the Notebook How-To for instructions on running in a Jupyter notebook.
Table of contents
Accessing Data Sources
Load a DataFrame from CSV
Load a DataFrame from a Tab Separated Value (TSV) file
Save a DataFrame in CSV format
Load a DataFrame from Parquet
Save a DataFrame in Parquet format
Load a DataFrame from JSON Lines (jsonl) Formatted Data
Save a DataFrame into a Hive catalog table
Load a Hive catalog table into a DataFrame
Load a DataFrame from a SQL query
Load a CSV file from Amazon S3
Load a CSV file from Oracle Cloud Infrastructure (OCI) Object Storage
Save a model
Load a model and use it for transformations
Load a model and use it for predictions
Load a classification model and use it to compute confidences for output labels
Performance
Get the Spark version
Log messages using Spark's Log4J
Cache a DataFrame
Show the execution plan, with costs
Partition by a Column Value
Range Partition a DataFrame
Change Number of DataFrame Partitions
Coalesce DataFrame partitions
Set the number of shuffle partitions
Sample a subset of a DataFrame
Run multiple concurrent jobs in different pools
Print Spark configuration properties
Set Spark configuration properties
Publish Metrics to Graphite
Increase Spark driver/executor heap space
df = spark.read.format("csv").option("header", True).load("data/auto-mpg.csv")
df = (
spark.read.format("csv")
.option("header", True)
.option("sep", "\t")
.load("data/auto-mpg.tsv")
)
auto_df.write.csv("output.csv")
df = spark.read.format("parquet").load("data/auto-mpg.parquet")
auto_df.write.parquet("output.parquet")
JSON Lines / jsonl format uses one JSON document per line. If you have data with mostly regular structure this is better than nesting it in an
array. See jsonlines.org
df = spark.read.json("data/weblog.jsonl")
| client| country| session| timestamp| uri| user|
+----------+----------+--------+----------+----------+------+
|{false,...|Bangladesh|55fa8213| 869196249|http://...|dde312|
|{true, ...| Niue|2fcd4a83|1031238717|http://...|9d00b9|
|{true, ...| Rwanda|013b996e| 628683372|http://...|1339d4|
|{false,...| Austria|07e8a71a|1043628668|https:/...|966312|
|{false,...| Belize|b23d05d8| 192738669|http://...|2af1e1|
|{false,...|Lao Peo...|d83dfbae|1066490444|http://...|844395|
|{false,...|French ...|e77dfaa2|1350920869|https:/...| null|
|{false,...|Turks a...|56664269| 280986223|http://...| null|
|{false,...| Ethiopia|628d6059| 881914195|https:/...|8ab45a|
|{false,...|Saint K...|85f9120c|1065114708|https:/...| null|
+----------+----------+--------+----------+----------+------+
only showing top 10 rows
auto_df.write.mode("overwrite").saveAsTable("autompg")
df = spark.table("autompg")
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
df = sqlContext.sql(
"select carname, mpg, horsepower from autompg where horsepower > 100 and mpg > 25"
)
|oldsmob...|26.6| 105.0|
+----------+----+----------+
import configparser
import os
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("default", "aws_access_key_id").replace('"', "")
secret_key = config.get("default", "aws_secret_access_key").replace('"', "")
df = (
spark.read.format("csv")
.option("header", True)
.load("s3a://cheatsheet111/auto-mpg.csv")
)
Load a CSV file from Oracle Cloud Infrastructure (OCI) Object Storage
This example shows loading data from Oracle Cloud Infrastructure Object Storage using an API key.
import oci
oci_config = oci.config.from_file()
conf = spark.sparkContext.getConf()
conf.set("fs.oci.client.auth.tenantId", oci_config["tenancy"])
conf.set("fs.oci.client.auth.userId", oci_config["user"])
conf.set("fs.oci.client.auth.fingerprint", oci_config["fingerprint"])
conf.set("fs.oci.client.auth.pemfilepath", oci_config["key_file"])
conf.set(
"fs.oci.client.hostname",
"https://objectstorage.{0}.oraclecloud.com".format(oci_config["region"]),
)
PATH = "oci://<your_bucket>@<your_namespace/<your_path>"
df = spark.read.format("csv").option("header", True).load(PATH)
password = "my_password"
table = "source_table"
tnsname = "my_tns_name"
user = "ADMIN"
wallet_path = "/path/to/your/wallet"
properties = {
"driver": "oracle.jdbc.driver.OracleDriver",
"oracle.net.tns_admin": tnsname,
"password": password,
"user": user,
}
url = f"jdbc:oracle:thin:@{tnsname}?TNS_ADMIN={wallet_path}"
df = spark.read.jdbc(url=url, table=table, properties=properties)
password = "my_password"
table = "target_table"
tnsname = "my_tns_name"
user = "ADMIN"
wallet_path = "/path/to/your/wallet"
properties = {
"driver": "oracle.jdbc.driver.OracleDriver",
"oracle.net.tns_admin": tnsname,
"password": password,
"user": user,
}
url = f"jdbc:oracle:thin:@{tnsname}?TNS_ADMIN={wallet_path}"
If you use Delta Lake there is a special procedure for specifying spark.jars.packages; see the source code that generates this file for details.
properties = {
"driver": "org.postgresql.Driver",
"user": pg_user,
"password": pg_password,
}
url = f"jdbc:postgresql://{pg_host}:5432/{pg_database}"
auto_df.write.jdbc(url=url, table=table, mode="Append", properties=properties)
properties = {
"driver": "org.postgresql.Driver",
"user": pg_user,
"password": pg_password,
}
url = f"jdbc:postgresql://{pg_host}:5432/{pg_database}"
df = spark.read.jdbc(url=url, table=table, properties=properties)
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)
schema = StructType(
[
StructField("mpg", DoubleType(), True),
StructField("cylinders", IntegerType(), True),
StructField("displacement", DoubleType(), True),
StructField("horsepower", DoubleType(), True),
StructField("weight", DoubleType(), True),
StructField("acceleration", DoubleType(), True),
StructField("modelyear", IntegerType(), True),
StructField("origin", IntegerType(), True),
StructField("carname", StringType(), True),
]
)
df = (
spark.read.format("csv")
.option("header", "true")
.schema(schema)
.load("data/auto-mpg.csv")
)
auto_df.write.mode("overwrite").csv("output.csv")
auto_df.coalesce(1).write.csv("header.csv", header="true")
If you need to write to a single file with a name you choose, consider converting it to a Pandas dataframe and saving it using Pandas.
Either way all data will be collected on one node before being written so be careful not to run out of memory.
auto_df.coalesce(1).write.csv("single.csv")
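A minimal sketch of the Pandas route mentioned above, assuming the DataFrame fits in driver memory (the output file name is arbitrary):

```python
# Collect everything to the driver as a Pandas DataFrame, then write a single file.
auto_df.toPandas().to_csv("single_file.csv", index=False)
```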
The values of the partitions appear as subdirectories and are not contained in the output files, i.e. they become "virtual columns". When you read a partitioned table these virtual columns become part of the DataFrame.

Dynamic partitioning has the potential to create many small files, which will impact performance negatively. Be sure the partition columns do not have too many distinct values and limit the use of multiple virtual columns.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
auto_df.write.mode("append").partitionBy("modelyear").saveAsTable(
"autompg_partitioned"
)
With dynamic partitioning, partitions with keys in the DataFrame are overwritten, but partitions not in the DataFrame are untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
your_dataframe.write.mode("overwrite").insertInto("your_table")
money_convert = udf(
lambda x: Decimal(price_str(x)) if x is not None else None,
DecimalType(8, 4),
)
df = df.withColumn("spend_dollars", money_convert(df.spend_dollars))
|2020-02-29| 4| 2.1300|
|2020-02-29| 5| 0.8200|
+----------+-----------+-------------+
only showing top 10 rows
DataFrame Operations
Adding, removing and modifying DataFrame columns.
df = auto_df.withColumn("upper", upper(auto_df.carname)).withColumn(
"lower", lower(auto_df.carname)
)
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|AMC AMB...|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+----------+----------+
only showing top 10 rows
df = auto_df.withColumn(
"mpg_class",
when(col("mpg") <= 20, "low")
.when(col("mpg") <= 30, "mid")
.when(col("mpg") <= 40, "high")
.otherwise("very high"),
)
df = auto_df.withColumn("one", lit(1))
Concatenate columns
TODO
df = auto_df.withColumn(
"concatenated", concat(col("cylinders"), lit("_"), col("mpg"))
)
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...| 8_18.0|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...| 8_16.0|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...| 8_17.0|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...| 8_15.0|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...| 8_14.0|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...| 8_14.0|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...| 8_14.0|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...| 8_15.0|
+----+---------+------------+----------+------+------------+---------+------+----------+------------+
only showing top 10 rows
Drop a column
df = auto_df.drop("horsepower")
df = auto_df.withColumnRenamed("horsepower", "horses")
df = auto_df.withColumnRenamed("horsepower", "horses").withColumnRenamed(
"modelyear", "year"
)
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+----------+-------------+-----------+-------+-------------+----------+-------+----------+
only showing top 10 rows
Steps below:

- Create a DataFrame with one row and one column; this example uses an average but it could be anything.
- Call the DataFrame's first method, which returns the first Row of the DataFrame.
- Rows can be accessed like arrays, so we extract the zeroth value of the first Row using first()[0].
average = auto_df.agg(dict(mpg="avg")).first()[0]
print(str(average))
first_three = auto_df.limit(3)
for row in first_three.collect():
my_dict = row.asDict()
print(my_dict)
{'mpg': '18.0', 'cylinders': '8', 'displacement': '307.0', 'horsepower': '130.0', 'weight': '3504.', 'acceleration': '12.0',
'modelyear': '70', 'origin': '1', 'carname': 'chevrolet chevelle malibu'}
{'mpg': '15.0', 'cylinders': '8', 'displacement': '350.0', 'horsepower': '165.0', 'weight': '3693.', 'acceleration': '11.5',
'modelyear': '70', 'origin': '1', 'carname': 'buick skylark 320'}
{'mpg': '18.0', 'cylinders': '8', 'displacement': '318.0', 'horsepower': '150.0', 'weight': '3436.', 'acceleration': '11.0',
'modelyear': '70', 'origin': '1', 'carname': 'plymouth satellite'}
from pyspark.sql.types import (
    StructField,
    StructType,
    LongType,
    StringType,
)
schema = StructType(
    [
        StructField("my_id", LongType(), True),
        StructField("my_string", StringType(), True),
    ]
)
df = spark.createDataFrame([], schema)
import datetime
from pyspark.sql.types import (
StructField,
StructType,
LongType,
StringType,
TimestampType,
)
schema = StructType(
[
StructField("my_id", LongType(), True),
StructField("my_string", StringType(), True),
StructField("my_timestamp", TimestampType(), True),
]
)
df = spark.createDataFrame(
[
(1, "foo", datetime.datetime.strptime("2021-01-01", "%Y-%m-%d")),
(2, "bar", datetime.datetime.strptime("2021-01-02", "%Y-%m-%d")),
],
schema,
)
df = auto_df.withColumn("horsepower", col("horsepower").cast("double"))
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
df = auto_df.withColumn("horsepower", col("horsepower").cast("int"))
print("{} rows".format(auto_df.count()))
print("{} columns".format(len(auto_df.columns)))
print("{} partition(s)".format(auto_df.rdd.getNumPartitions()))
print(auto_df.dtypes)
Steps below:
The second example is a variation on the first, modifying source rdd entries while creating the target rdd .
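A minimal sketch of that second variation, assuming the source rdd comes from auto_df and that every numeric field is doubled while the target rdd is built:

```python
from pyspark.sql import Row

def double_row(row):
    # Double every numeric field, leaving the carname string unchanged.
    d = row.asDict()
    for name, value in d.items():
        if name != "carname" and value is not None:
            d[name] = float(value) * 2
    return Row(**d)

# Build the target rdd from auto_df's rdd, then convert back to a DataFrame.
rdd = auto_df.rdd.map(double_row)
df = rdd.toDF()
```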
|28.0| 16.0| 910.0| 450.0|8850.0| 20.0| 140| 2|pontiac...|
|30.0| 16.0| 780.0| 380.0|7700.0| 17.0| 140| 2|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
rdd = auto_df.rdd
print(rdd.take(10))
auto_df.show(10)
import os
def foreach_function(row):
if row.horsepower is not None:
os.system("echo " + row.horsepower)
auto_df.foreach(foreach_function)
def map_function(row):
if row.horsepower is not None:
return [float(row.horsepower) * 10]
else:
return [None]
df = auto_df.rdd.map(map_function).toDF()
Note also that you can yield results rather than returning full lists, which can simplify code considerably.
def flatmap_function(row):
if row.cylinders is not None:
return list(range(int(row.cylinders)))
else:
return [None]
rdd = auto_df.rdd.flatMap(flatmap_function)
row = Row("val")
df = rdd.map(row).toDF()
+---+
only showing top 10 rows
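As the note above mentions, the same flatMap function can yield values instead of building lists; a minimal sketch:

```python
def flatmap_generator(row):
    # Yield one value per cylinder rather than materializing a list.
    if row.cylinders is not None:
        for value in range(int(row.cylinders)):
            yield value
    else:
        yield None

rdd = auto_df.rdd.flatMap(flatmap_generator)
```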
Transforming Data
Data conversions and other modifications.
You can also join DataFrames if you register them. If you're porting complex SQL from another application this can be a lot easier than
converting it to use DataFrame SQL APIs.
auto_df.registerTempTable("auto_df")
df = sqlContext.sql(
"select modelyear, avg(mpg) from auto_df group by modelyear"
)
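A minimal sketch of a SQL join over registered DataFrames; the second DataFrame, origin_labels_df (with columns origin and origin_name), is an assumption for illustration:

```python
# Register both DataFrames, then join them with ordinary SQL.
auto_df.registerTempTable("auto_df")
origin_labels_df.registerTempTable("origin_labels")

df = sqlContext.sql(
    """
    select a.carname, a.mpg, o.origin_name
    from auto_df a
    join origin_labels o on a.origin = o.origin
    """
)
```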
group = 0
df = (
auto_df.withColumn(
"identifier", regexp_extract(col("carname"), "(\S?\d+)", group)
)
.drop("acceleration")
.drop("cylinders")
.drop("displacement")
.drop("modelyear")
.drop("mpg")
.drop("origin")
.drop("horsepower")
.drop("weight")
)
|chevrol...| |
|plymout...| |
|pontiac...| |
|amc amb...| |
+----------+----------+
only showing top 10 rows
df = auto_df.fillna({"horsepower": 0})
df = auto_df.fillna({"horsepower": auto_df.agg(avg("horsepower")).first()[0]})
unmodified_columns = auto_df.columns
unmodified_columns.remove("horsepower")
manufacturer_avg = auto_df.groupBy("cylinders").agg({"horsepower": "avg"})
df = auto_df.join(manufacturer_avg, "cylinders").select(
*unmodified_columns,
coalesce("horsepower", "avg(horsepower)").alias("horsepower"),
)
source = spark.sparkContext.parallelize(
[["1", '{ "a" : 10, "b" : 11 }'], ["2", '{ "a" : 20, "b" : 21 }']]
).toDF(["id", "json"])
df = source.select("id", json_tuple(col("json"), "a", "b"))
source = spark.sparkContext.parallelize(
[["1", '{ "a" : 10, "b" : 11 }'], ["2", '{ "a" : 20, "b" : 21 }']]
).toDF(["id", "json"])
df = (
source.select("id", json_tuple(col("json"), "a", "b"))
.withColumnRenamed("c0", "a")
.withColumnRenamed("c1", "b")
.where(col("b") > 15)
)
df = auto_df.where(col("cylinders") == "8")
df = auto_df.where(col("cylinders").isin(["4", "6"]))
df = auto_df.where(~col("cylinders").isin(["4", "6"]))
|14.0| 8.0| 440.0| 215.0|4312.0| 8.5| 70| 1|plymout...|
|14.0| 8.0| 455.0| 225.0|4425.0| 10.0| 70| 1|pontiac...|
|15.0| 8.0| 390.0| 190.0|3850.0| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
df = auto_df.where(auto_df.carname.contains("custom"))
df = auto_df.where(col("carname").like("%custom%"))
# OR
df = auto_df.filter((col("mpg") > "30") | (col("acceleration") < "10"))
# AND
df = auto_df.filter((col("mpg") > "30") & (col("acceleration") < "13"))
df = auto_df.orderBy("carname")
df = auto_df.orderBy(col("carname").desc())
n = 10
df = auto_df.limit(n)
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
df = auto_df.select("cylinders").distinct()
Remove duplicates
df = auto_df.dropDuplicates(["carname"])
Grouping
Group DataFrame data by key to perform aggregates like counting, sums, averages, etc.
# No sorting.
df = auto_df.groupBy("cylinders").count()
# With sorting.
df = auto_df.groupBy("cylinders").count().orderBy(desc("count"))
df = (
auto_df.groupBy("cylinders")
.agg(avg("horsepower").alias("avg_horsepower"))
.orderBy(desc("avg_horsepower"))
)
df = (
auto_df.groupBy("cylinders")
.count()
.orderBy(desc("count"))
.filter(col("count") > 100)
)
df = (
auto_df.groupBy(["modelyear", "cylinders"])
.agg(avg("horsepower").alias("avg_horsepower"))
.orderBy(desc("avg_horsepower"))
)
The agg method allows you to easily run multiple aggregations by accepting a dictionary with keys being the column name and values being
the aggregation type. This example uses this to aggregate 3 columns in one expression.
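A minimal sketch of this pattern; the specific columns and aggregation types below are illustrative assumptions, not necessarily those of the original example:

```python
# Keys are column names, values are the aggregation to apply to that column.
expressions = dict(horsepower="avg", weight="max", displacement="max")
df = auto_df.groupBy("modelyear").agg(expressions)
```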
+---------+---------------+-----------+-----------------+
| 70| 147.827...| 4732.| 97.00|
| 71| 107.037...| 5140.| 98.00|
| 72| 120.178...| 4633.| 98.00|
| 73| 130.475| 4997.| 98.00|
| 74| 94.2307...| 4699.| 98.00|
| 75| 101.066...| 4668.| 97.00|
| 76| 101.117...| 4380.| 98.00|
| 77| 105.071...| 4335.| 98.00|
| 78| 99.6944...| 4080.| 98.00|
| 79| 101.206...| 4360.| 98.00|
+---------+---------------+-----------+-----------------+
only showing top 10 rows
orderBy doesn't accept a list. If you need to build orderings dynamically, put them in a list and splat them into orderBy's arguments like in the example below.
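A minimal sketch of the splat approach, using columns from the auto-mpg examples above:

```python
from pyspark.sql.functions import col

# Build the ordering dynamically, then unpack it into orderBy's arguments.
columns = ["cylinders", "mpg"]
orderings = [col(name).desc() for name in columns]
df = auto_df.orderBy(*orderings)
```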
df = auto_df.select(max(col("horsepower")).alias("max_horsepower"))
Sum a column
df = auto_df.groupBy("cylinders").agg(sum("weight").alias("total_weight"))
| 4| 470858.0|
+---------+------------+
df = auto_df.groupBy("cylinders").agg(countDistinct("mpg"))
| 5| 3|
| 6| 38|
| 4| 87|
+---------+----------+
Find the top N per row group (use N=1 for maximum)
To find the top N per group we:

- Build a Window
- Partition by the target group
- Order by the value we want to rank
- Use row_number to add the numeric rank
- Use where to filter any row number less than or equal to N
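A minimal sketch of these steps, assuming we want the top N mpg values within each cylinders group:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

n = 5
w = Window.partitionBy("cylinders").orderBy(col("mpg").cast("double").desc())
df = auto_df.withColumn("rank", row_number().over(w)).where(col("rank") <= n)
```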
df = auto_df.groupBy("cylinders").agg(
collect_list(col("carname")).alias("models")
)
| 8|[chevro...|
| 5|[audi 5...|
| 6|[plymou...|
| 4|[toyota...|
+---------+----------+
Compute a histogram
Spark's RDD object supports computing histograms. This example converts the DataFrame column called horsepower to an RDD before calling histogram.
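A minimal sketch of that approach, assuming horsepower is cast to double and nulls are dropped before bucketing:

```python
from pyspark.sql.functions import col

# Convert the column to an RDD of doubles, then compute 10 evenly spaced buckets.
histogram = (
    auto_df.select(col("horsepower").cast("double"))
    .rdd.flatMap(lambda x: x)
    .filter(lambda x: x is not None)
    .histogram(10)
)
print(histogram)  # (bucket_boundaries, counts)
```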
w = Window().orderBy(col("mpg").desc())
df = auto_df.withColumn("ntile4", ntile(4).over(w))
w = Window().partitionBy("cylinders").orderBy(col("mpg").desc())
df = auto_df.withColumn("ntile4", ntile(4).over(w))
grouped = auto_df.groupBy("modelyear").count()
w = Window().orderBy(col("count").desc())
df = grouped.withColumn("ntile4", ntile(4).over(w))
| 73| 40| 1|
| 78| 36| 1|
| 76| 34| 1|
| 82| 31| 2|
| 75| 30| 2|
| 81| 29| 2|
| 80| 29| 3|
| 70| 29| 3|
| 79| 29| 3|
| 72| 28| 4|
+---------+-----+------+
only showing top 10 rows
To filter out all rows with a value outside a target percentile range:

- Get the numeric percentile value using the percentile function and extract it from the resulting DataFrame.
- In a second step, filter anything larger than (or smaller than, depending on what you want) that value.
target_percentile = auto_df.agg(
F.expr("percentile(mpg, 0.9)").alias("target_percentile")
).first()[0]
df = auto_df.filter(col("mpg") > lit(target_percentile))
| 81| 8.0| 105.0| 1|
| 81| 6.0| 100.714...| 7|
| 81| 4.0| 72.95| 21|
| 81| null| 81.0357...| 29|
| 80| 6.0| 111.0| 2|
| 80| 5.0| 67.0| 1|
| 80| 4.0| 74.0434...| 25|
| 80| 3.0| 100.0| 1|
| 80| null| 77.4814...| 29|
| null| null| 80.0588...| 89|
+---------+---------+--------------+-----+
| 82| null| 81.4666...| 31|
| 81| 8.0| 105.0| 1|
| 81| 6.0| 100.714...| 7|
| 81| 4.0| 72.95| 21|
| 81| null| 81.0357...| 29|
| 80| 6.0| 111.0| 2|
| 80| 5.0| 67.0| 1|
| 80| 4.0| 74.0434...| 25|
| 80| 3.0| 100.0| 1|
| 80| null| 77.4814...| 29|
| null| 8.0| 105.0| 1|
| null| 6.0| 102.833...| 12|
| null| 5.0| 67.0| 1|
| null| 4.0| 75.7| 74|
| null| 3.0| 100.0| 1|
| null| null| 80.0588...| 89|
+---------+---------+--------------+-----+
Joining DataFrames
Spark allows DataFrames to be joined similarly to how tables are joined in an RDBMS. Join types available in Spark include inner, left/right/full outer, left semi, left anti and cross joins.
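A minimal sketch of a simple equi-join; the countries_df lookup table (columns manufacturer and country) is an assumption for illustration:

```python
from pyspark.sql.functions import col, split

# Derive a join key: the manufacturer is the first word of carname.
with_key = auto_df.withColumn("manufacturer", split(col("carname"), " ")[0])

# Inner join on the shared manufacturer column.
df = with_key.join(countries_df, "manufacturer", "inner")
```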
+------------+----+---------+------------+----------+------+------------+---------+------+----------+-------+
only showing top 10 rows
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...| chevrolet| chevrolet| us|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...| plymouth| plymouth| us|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...| pontiac| pontiac| us|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...| amc| amc| us|
+----+---------+------------+----------+------+------------+---------+------+----------+------------+------------+-------+
only showing top 10 rows
# Full join.
joined = auto_df.join(auto_df, "carname", "full")
# Cross join.
df = auto_df.crossJoin(auto_df)
If you load a directory, Spark attempts to combine all files in that directory into one DataFrame.
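A minimal sketch, assuming a directory of CSV files that share a schema (the path is hypothetical):

```python
df = (
    spark.read.format("csv")
    .option("header", True)
    .load("data/csv_directory")
)
```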
Subtract DataFrames
Spark's subtract operator is similar to SQL's MINUS operator.
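A minimal sketch: remove the 8-cylinder rows from auto_df with subtract.

```python
from pyspark.sql.functions import col

eight_cylinder = auto_df.where(col("cylinders") == "8")
df = auto_df.subtract(eight_cylinder)
```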
File Processing
Loading File Metadata and Processing Files.
from pyspark.sql.types import (
    StructField,
    StructType,
    LongType,
    StringType,
    TimestampType,
)
import datetime
import glob
import os
]
)
df = spark.createDataFrame(entries, schema)
This example loads details of files in an OCI Object Storage bucket into a DataFrame.
import oci
from pyspark.sql.types import (
StructField,
StructType,
LongType,
StringType,
TimestampType,
)
def get_authenticated_client(client):
config = oci.config.from_file()
authenticated_client = client(config)
return authenticated_client
object_store_client = get_authenticated_client(
oci.object_storage.ObjectStorageClient
)
|usercon...|32006919|2021-01...|igX2QgX...|
|usercon...| 4307|2021-01...|NYXxVUc...|
|usercon...|71183217|2021-01...|HlBkZ/l...|
|usercon...|16906538|2021-01...|K8qMDXT...|
|usercon...| 4774|2021-01...|cXnXiq3...|
|usercon...| 303|2021-01...|5wgh5PJ...|
|usercon...| 1634|2021-01...|3Nqbf6K...|
|usercon...| 2611|2021-01...|B8XLwDe...|
|usercon...| 2017366|2021-01...|XyKoSOA...|
+----------+--------+----------+----------+
only showing top 10 rows
def resize_an_image(row):
width, height = 128, 128
file_name = row._1
new_name = file_name.replace(".png", ".resized.png")
img = Image.open(file_name)
img = img.resize((width, height), Image.ANTIALIAS)
img.save(new_name)
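A minimal sketch of driving this function, assuming the images live under a hypothetical data/images directory and that PIL's Image has been imported for resize_an_image:

```python
import glob

# Build a one-column DataFrame of file names; the column is named _1 by default.
files = [(name,) for name in glob.glob("data/images/*.png")]
image_df = spark.createDataFrame(files)

# Resize each image on the executors, one row (file) at a time.
image_df.foreach(resize_an_image)
```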
df = auto_df.where(col("horsepower").isNull())
df = auto_df.where(col("horsepower").isNotNull())
The class returned by the DataFrame's na property has a drop method that returns a new DataFrame with nulls omitted. thresh sets the minimum number of non-null values a row must contain to be kept and subset controls the columns to consider.
df = auto_df.na.drop(thresh=1, subset=("horsepower",))
df = auto_df.select(
[count(when(isnan(c), c)).alias(c) for c in auto_df.columns]
)
df = auto_df.select(
[count(when(col(c).isNull(), c)).alias(c) for c in auto_df.columns]
)
df = spark.sparkContext.parallelize([["2021-01-01"], ["2022-01-01"]]).toDF(
["date_col"]
)
df = df.withColumn("date_col", col("date_col").cast("date"))
df = spark.sparkContext.parallelize([["20210101"], ["20220101"]]).toDF(
["date_col"]
)
df = df.withColumn("date_col", to_date(col("date_col"), "yyyyddMM"))
df = spark.sparkContext.parallelize([["2020-01-01"], ["1712-02-10"]]).toDF(
["date_col"]
)
df = df.withColumn("date_col", col("date_col").cast("date")).withColumn(
"last_day", last_day(col("date_col"))
)
+----------+----------+
|2020-01-01|2020-01-31|
|1712-02-10|1712-02-29|
+----------+----------+
df = spark.sparkContext.parallelize([["1590183026"], ["2000000000"]]).toDF(
["ts_col"]
)
df = df.withColumn("date_col", from_unixtime(col("ts_col")))
# Use the dateparser module to convert many formats into timestamps.
date_convert = udf(
lambda x: dateparser.parse(x) if x is not None else None, TimestampType()
)
df = (
spark.read.format("csv")
.option("header", True)
.load("data/date_examples.csv")
)
df = df.withColumn("parsed", date_convert(df.date))
Unstructured Analytics
Analyzing unstructured data like JSON, XML, etc.
Spark allows you to traverse complex types in a select operation by providing multiple StructField names separated by a ".". Names used in StructFields will correspond to the JSON member names.
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": "100"
},
"IDs": [116, 943, 234, 38793]
}
}
The resulting DataFrame will have one StructType column named Image. The Image column will have these selectable fields: Image.Width ,
Image.Height , Image.Title , Image.Thumbnail.Url , Image.Thumbnail.Height , Image.Thumbnail.Width , Image.IDs .
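For example, a minimal sketch assuming the JSON document above has been loaded into a DataFrame named df:

```python
# Select nested fields by their dotted paths; the result has flat columns.
thumbnails = df.select("Image.Title", "Image.Thumbnail.Url", "Image.Thumbnail.Height")
```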
col("symbol").alias("symbol"),
col("quoteType.longName").alias("longName"),
col("price.marketCap.raw").alias("marketCap"),
col("summaryDetail.previousClose.raw").alias("previousClose"),
col("summaryDetail.fiftyTwoWeekHigh.raw").alias("fiftyTwoWeekHigh"),
col("summaryDetail.fiftyTwoWeekLow.raw").alias("fiftyTwoWeekLow"),
col("summaryDetail.trailingPE.raw").alias("trailingPE"),
]
df = base.select(target_json_fields)
base = (
spark.read.format("csv")
.option("header", True)
.option("quote", '"')
.option("escape", '"')
.load("data/financial.csv")
)
explode and posexplode are called table generating functions because they produce one output row for each array entry, in other words a
row goes in and a table comes out. The output column has the same data type as the data type in the array. When dealing with JSON this data
type could be a boolean, integer, float or StructType.
The example below uses explode to flatten an array of StructTypes, then selects certain key fields from the output structures.
base = spark.read.json("data/financial.jsonl")
col("col.totalAssets.raw").alias("totalAssets"),
col("col.totalLiab.raw").alias("totalLiab"),
]
# Balance sheet data is in an array, use explode to generate one row per entry.
df = selected.select("symbol", explode("balanceSheetStatements")).select(
target_json_fields
)
Pandas
Using Python's Pandas library to augment Spark. Some operations require the pyarrow library.
pandas_df = auto_df.toPandas()
df = spark.createDataFrame(pandas_df)
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
schema = StructType(
[StructField(name, StringType(), True) for name in pandas_df.columns]
)
df = spark.createDataFrame(pandas_df, schema)
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
Be aware that rows in a Spark DataFrame have no guaranteed order unless you explicitly order them.
N = 10
pdf = auto_df.limit(N).toPandas()
from pandas import DataFrame
from pyspark.sql.functions import pandas_udf

# A pandas UDF used as a grouped aggregate: input batches in, one double out.
@pandas_udf("double")
def mean_udaf(pdf: DataFrame) -> float:
    return pdf.mean()

df = auto_df.groupby("cylinders").agg(mean_udaf(auto_df["mpg"]))
def rescale(pdf):
minv = pdf.horsepower.min()
maxv = pdf.horsepower.max() - minv
return pdf.assign(horsepower=(pdf.horsepower - minv) / maxv * 100)
df = auto_df.groupby("cylinders").applyInPandas(rescale, auto_df.schema)
Data Profiling
Extracting key statistics out of a body of data.
df = auto_df.select(
[count(when(col(c).isNull(), c)).alias(c) for c in auto_df.columns]
)
| 5140.0| 24.8| 8.0| 46.6| 230.0| 455.0|
+-----------+-----------------+--------------+--------+---------------+-----------------+
import pyspark.sql.functions as F
target_column = "mpg"
z_score_threshold = 2
# Compute deviations.
deviations = target_df.crossJoin(profiled).withColumn(
"deviation", sqrt((target_df[target_column] - profiled["median"]) ** 2)
)
deviations.registerTempTable("deviations")
Data Management
Upserts, updates and deletes on data.
Your Spark session needs to be "Delta enabled". See cheatsheet.py (the code that generates this cheatsheet) for more information on how to
do this.
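A minimal sketch of building a Delta-enabled session; the delta-core package coordinate below is an assumption and must match your Spark and Scala versions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta_example")
    # Assumed package version; use the build that matches your Spark release.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```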
auto_df.write.mode("overwrite").format("delta").saveAsTable("delta_table")
Be sure to read Delta Lake's documentation on concurrency control before using transactions in any application.
output_path = "delta_tests"
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
Be sure to read Delta Lake's documentation on concurrency control before using transactions in any application.
)
.whenNotMatchedInsertAll()
.execute()
)
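The fragment above is the tail of a Delta MERGE. A complete upsert built on the same API might look like the sketch below; the source DataFrame and the join condition are assumptions for illustration rather than the cheat sheet's exact values.
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, output_path)
(
    dt.alias("target")
    # source_df is a hypothetical DataFrame of new or changed rows.
    .merge(source_df.alias("source"), "target.carname = source.carname")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)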
# Get versions.
output_path = "delta_tests"
dt = DeltaTable.forPath(spark, output_path)
versions = (
dt.history().select("version timestamp".split()).orderBy(desc("version"))
)
most_recent_version = versions.first()[0]
print("Most recent version is", most_recent_version)
df = (
spark.read.format("delta")
.option("versionAsOf", most_recent_version)
.load(output_path)
)
Usage Notes:
If the timestamp you specify is earlier than any valid timestamp, the table will fail to load with an error like The provided timestamp (...) is before the earliest version available to this table.
Otherwise, Spark rounds the timestamp you specify down to the nearest valid version timestamp that is less than or equal to it.
# Get versions.
output_path = "delta_tests"
dt = DeltaTable.forPath(spark, output_path)
versions = (
    dt.history().select("version timestamp".split()).orderBy(desc("timestamp"))
)
most_recent_timestamp = versions.first()[1]
print("Most recent timestamp is", most_recent_timestamp)
.option("timestampAsOf", most_recent_timestamp)
.load(output_path)
)
output_path = "delta_tests"
# Load table.
dt = DeltaTable.forPath(spark, output_path)
import os
import time
extra_properties = dict(
user=os.environ.get("USER"),
write_timestamp=time.time(),
)
auto_df.write.mode("append").option("userMetadata", extra_properties).format(
"delta"
).save("delta_table_metadata")
dt = DeltaTable.forPath(spark, "delta_table_metadata")
df = dt.history().select("version timestamp userMetadata".split())
Spark Streaming
Spark Streaming (Focuses on Structured Streaming).
This is for test/dev only; for production you should put credentials in a JAAS Login Configuration File.
options = {
"kafka.sasl.jaas.config": 'org.apache.kafka.common.security.plain.PlainLoginModule required username="USERNAME" password="PA
"kafka.sasl.mechanism": "PLAIN",
"kafka.security.protocol": "SASL_SSL",
"kafka.bootstrap.servers": "server:9092",
"group.id": "my_group",
"subscribe": "my_topic",
}
df = spark.readStream.format("kafka").options(**options).load()
input_location = "streaming/input"
schema = StructType(
[
StructField("mpg", DoubleType(), True),
StructField("cylinders", IntegerType(), True),
StructField("displacement", DoubleType(), True),
StructField("horsepower", DoubleType(), True),
StructField("weight", DoubleType(), True),
StructField("acceleration", DoubleType(), True),
StructField("modelyear", IntegerType(), True),
StructField("origin", IntegerType(), True),
StructField("carname", StringType(), True),
StructField("manufacturer", StringType(), True),
]
)
df = spark.readStream.csv(path=input_location, schema=schema).withColumn(
"timestamp", current_timestamp()
)
aggregated = (
df.groupBy(window(df.timestamp, "1 minute"), "manufacturer")
.agg(
avg("horsepower").alias("avg_horsepower"),
avg("timestamp").alias("avg_timestamp"),
count("modelyear").alias("count"),
)
.coalesce(10)
)
summary = aggregated.orderBy("window", "manufacturer")
query = (
summary.writeStream.outputMode("complete")
.format("console")
.option("truncate", False)
.start()
)
query.awaitTermination()
-------------------------------------------
Batch: 0
-------------------------------------------
+------------------------------------------+------------+--------------+----------------+-----+
|window |manufacturer|avg_horsepower|avg_timestamp |count|
+------------------------------------------+------------+--------------+----------------+-----+
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|amc |131.75 |1.616243250178E9|4 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|chevrolet |175.0 |1.616243250178E9|4 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|datsun |88.0 |1.616243250178E9|1 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|dodge |190.0 |1.616243250178E9|2 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|ford |159.5 |1.616243250178E9|4 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|plymouth |155.0 |1.616243250178E9|4 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|pontiac |225.0 |1.616243250178E9|1 |
+------------------------------------------+------------+--------------+----------------+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+------------+------------------+--------------------+-----+
|window |manufacturer|avg_horsepower |avg_timestamp |count|
+------------------------------------------+------------+------------------+--------------------+-----+
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|amc |119.57142857142857|1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|chevrolet |140.875 |1.6162432611729999E9|8 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|datsun |81.66666666666667 |1.616243264838E9 |3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|dodge |186.66666666666666|1.6162432575080001E9|3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|ford |142.125 |1.6162432623946667E9|9 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|plymouth |135.0 |1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|pontiac |168.75 |1.6162432666704998E9|4 |
+------------------------------------------+------------+------------------+--------------------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+------------+------------------+--------------------+-----+
|window |manufacturer|avg_horsepower |avg_timestamp |count|
+------------------------------------------+------------+------------------+--------------------+-----+
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|amc |119.57142857142857|1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|chevrolet |140.875 |1.6162432611729999E9|8 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|datsun |81.66666666666667 |1.616243264838E9 |3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|dodge |186.66666666666666|1.6162432575080001E9|3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|ford |142.125 |1.6162432623946667E9|9 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|plymouth |135.0 |1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|pontiac |168.75 |1.6162432666704998E9|4 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|amc |150.0 |1.616243297163E9 |2 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|chevrolet |128.33333333333334|1.616243297163E9 |3 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|datsun |92.0 |1.616243297163E9 |1 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|dodge |80.0 |1.616243297163E9 |2 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|ford |116.25 |1.616243297163E9 |4 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|plymouth |150.0 |1.616243297163E9 |2 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|pontiac |175.0 |1.616243297163E9 |1 |
+------------------------------------------+------------+------------------+--------------------+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+------------------------------------------+------------+------------------+--------------------+-----+
|window |manufacturer|avg_horsepower |avg_timestamp |count|
+------------------------------------------+------------+------------------+--------------------+-----+
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|amc |119.57142857142857|1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|chevrolet |140.875 |1.6162432611729999E9|8 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|datsun |81.66666666666667 |1.616243264838E9 |3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|dodge |186.66666666666666|1.6162432575080001E9|3 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|ford |142.125 |1.6162432623946667E9|9 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|plymouth |135.0 |1.6162432596022856E9|7 |
|{2021-03-20 05:27:00, 2021-03-20 05:28:00}|pontiac |168.75 |1.6162432666704998E9|4 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|amc |137.5 |1.616243313837E9 |6 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|chevrolet |127.44444444444444|1.6162433138370001E9|9 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|datsun |93.0 |1.6162433096685E9 |2 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|dodge |115.0 |1.6162433096685E9 |4 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|ford |122.22222222222223|1.6162433110579998E9|9 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|plymouth |136.66666666666666|1.616243313837E9 |6 |
|{2021-03-20 05:28:00, 2021-03-20 05:29:00}|pontiac |202.5 |1.6162433096685E9 |2 |
+------------------------------------------+------------+------------------+--------------------+-----+
input_location = "streaming/input"
schema = StructType(
[
StructField("mpg", DoubleType(), True),
StructField("cylinders", IntegerType(), True),
StructField("displacement", DoubleType(), True),
StructField("horsepower", DoubleType(), True),
StructField("weight", DoubleType(), True),
StructField("acceleration", DoubleType(), True),
StructField("modelyear", IntegerType(), True),
StructField("origin", IntegerType(), True),
StructField("carname", StringType(), True),
StructField("manufacturer", StringType(), True),
]
)
df = spark.readStream.csv(path=input_location, schema=schema)
summary = (
df.groupBy("modelyear")
.agg(
avg("horsepower").alias("avg_horsepower"),
count("modelyear").alias("count"),
)
.orderBy(desc("modelyear"))
.coalesce(10)
)
query = summary.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+------------------+-----+
|modelyear| avg_horsepower|count|
+---------+------------------+-----+
| 70|147.82758620689654| 29|
+---------+------------------+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+---------+------------------+-----+
|modelyear| avg_horsepower|count|
+---------+------------------+-----+
| 71|107.03703703703704| 28|
| 70|147.82758620689654| 29|
+---------+------------------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+---------+------------------+-----+
|modelyear| avg_horsepower|count|
+---------+------------------+-----+
| 72|120.17857142857143| 28|
| 71|107.03703703703704| 28|
| 70|147.82758620689654| 29|
+---------+------------------+-----+
df = auto_df.withColumn("timestamp", current_timestamp())
+----+---------+------------+----------+------+------------+---------+------+----------+----------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin|   carname| timestamp|
+----+---------+------------+----------+------+------------+---------+------+----------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|2022-09...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|2022-09...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|2022-09...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|2022-09...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|2022-09...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|2022-09...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|2022-09...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|2022-09...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|2022-09...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|2022-09...|
+----+---------+------------+----------+------+------------+---------+------+----------+----------+
only showing top 10 rows
Session windows divide an input stream by both a time dimension and a grouping key. The length of each window depends on how long the grouping key stays "active": the window is extended each time the grouping key is seen again before the timeout expires.
This example shows weblog traffic split by IP address, with a 5 minute timeout per session. This sessionization would allow you to compute things like the average number of visits per session.
hits_per_session = (
weblog_df.groupBy("ip", session_window("time", "5 minutes"))
.count()
.withColumn("session", hash("ip", "session_window"))
)
To deal with this, add a short-circuit field to the UDF and pass the condition in when calling the UDF. If the short-circuit value is true, return immediately.
@udf(returnType=BooleanType())
def myudf(short_circuit, state, value):
    if short_circuit == True:
        return True
    # The expensive check only runs when the short-circuit is false. This body
    # is a placeholder; the real logic depends on your application.
    return False
df = (
spark.readStream.format("socket")
.option("host", "localhost")
.option("port", "9090")
.load()
)
parsed = (
df.selectExpr(
"split(value,',')[0] as state",
"split(value,',')[1] as zipcode",
"split(value,',')[2] as spend",
)
.groupBy("state")
.agg({"spend": "avg"})
.orderBy(desc("avg(spend)"))
)
tagged = parsed.withColumn(
"below", myudf(col("avg(spend)") < 100, col("state"), col("avg(spend)"))
)
tagged.writeStream.outputMode("complete").format(
"console"
).start().awaitTermination()
pipeline_model = pipeline.fit(train_df)
pipeline_model.write().overwrite().save("path/to/pipeline")
pipeline_model = PipelineModel.load("path/to/pipeline")
df = pipeline_model.transform(input_df)
df.writeStream.format("console").start().awaitTermination()
df.writeStream.outputMode("complete").format("console").trigger(
processingTime="10 seconds"
).start().awaitTermination()
import time
# foreach_batch_function receives each micro-batch as a DataFrame plus an epoch
# id. Its original definition is not shown on this page, so this body is a
# simple illustrative stand-in.
def foreach_batch_function(batch_df, epoch_id):
    print("Processing epoch", epoch_id)
    time.sleep(1)

df = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 100)
    .option("numPartitions", 2)
    .load()
)
query = df.writeStream.foreachBatch(foreach_batch_function).start()
Time Series
Techniques for dealing with time series data.
The customer spend dataset has three columns: a Customer ID, a Year-Month (YYYY-MM), and the amount the customer spent in that month.
This data is useful to compute things like biggest spender, most frequent buyer, etc.
Imagine though that this dataset doesn't contain a record for customers who didn't buy anything in that month. This will create all kinds of problems; for example, the average monthly spend will skew up because no zero values are included in the average. To answer questions like these we need to create zero-value records for anything that is missing.
In SQL this is handled with a Partitioned Outer Join. This example shows you how you can roll your own in Spark:
The output of this is a row for each customer ID / date combination. The value for spend_dollars is either the value from the original DataFrame or 0 if there was no corresponding row in the original DataFrame.
You may want to filter the results to remove any rows belonging to a customer before their first actual purchase. Refer to the code for "First
Time an ID is Seen" for how to find that information.
# Use distinct values of customer and date from the dataset itself.
# In general it's safer to use known reference tables for IDs and dates.
df = spend_df.join(
spend_df.select("customer_id")
.distinct()
.crossJoin(spend_df.select("date").distinct()),
["date", "customer_id"],
"right",
).select("date", "customer_id", coalesce("spend_dollars", lit(0)))
w = Window().partitionBy("customer_id").orderBy("date")
df = spend_df.withColumn("first_seen", first("date").over(w))
Cumulative Sum
A cumulative sum can be computed using the standard sum function windowed from unbounded preceding rows to the current row.
w = (
Window()
.partitionBy("customer_id")
.orderBy("date")
.rangeBetween(Window.unboundedPreceding, 0)
)
df = spend_df.withColumn("running_sum", sum("spend_dollars").over(w))
+----------+-----------+-------------+-----------+
|      date|customer_id|spend_dollars|running_sum|
+----------+-----------+-------------+-----------+
|2020-06-30|          0|       0.0300|     0.2100|
|2020-07-31|          0|       0.0300|     0.2400|
|2020-08-31|          0|       0.0500|     0.2900|
|2020-09-30|          0|       0.0000|     0.2900|
|2020-10-31|          0|       0.0300|     0.3200|
|2020-11-30|          0|       0.0900|     0.4100|
|2020-12-31|          0|       0.0600|     0.4700|
+----------+-----------+-------------+-----------+
only showing top 10 rows
Cumulative Average
A cumulative average can be computed using the standard avg function windowed from unbounded preceding rows to the current row.
w = (
Window()
.partitionBy("customer_id")
.orderBy("date")
.rangeBetween(Window.unboundedPreceding, 0)
)
df = spend_df.withColumn("running_avg", avg("spend_dollars").over(w))
+----------+-----------+-------------+-----------+
|      date|customer_id|spend_dollars|running_avg|
+----------+-----------+-------------+-----------+
|2020-07-31|          0|       0.0300| 0.04800000|
|2020-08-31|          0|       0.0500| 0.04833333|
|2020-09-30|          0|       0.0000| 0.04142857|
|2020-10-31|          0|       0.0300| 0.04000000|
|2020-11-30|          0|       0.0900| 0.04555556|
|2020-12-31|          0|       0.0600| 0.04700000|
+----------+-----------+-------------+-----------+
only showing top 10 rows
Machine Learning
Machine Learning is a deep subject, too much to cover in this cheatsheet, which is intended for code you can easily paste into your apps. The examples below will show the basics of ML in Spark. It is helpful to understand the terminology of ML, like Features, Estimators and Models. If you want some background on these things, consider courses like the "Google crash course in ML" or Udemy's "Machine Learning Course with Python".
Feature: A Feature is an individual measurement. For example, if you want to predict height based on age and sex, then age and sex are each a Feature (together they form a Feature vector).
Vector: A Vector is a special Spark data type similar to an array of numbers. Spark ML algorithms require Features to be loaded into Vectors for training and predictions.
Vector Column: Model training requires considering many Features at the same time. Spark ML operates on DataFrames. Before training can happen you need to construct a DataFrame column of type Vector. See examples below.
Label: Supervised ML algorithms like regression and classification require a label when training. In Spark you put labels in a column of a DataFrame such that each row has both a Feature vector and its associated Label.
Model: A Model is an algorithm capable of turning Feature vectors into values, usually thought of as predictions.
Estimator: An Estimator builds a mathematical model that transforms input values into outputs. Estimators do double duty in Spark: some Estimators, like regression and classification, build statistical models; others are purely for data preparation, like the StringIndexer, which builds a Model containing a dictionary that maps strings to numbers in a deterministic way.
Fitting: Fitting is the process of building a Model using an Estimator and an input DataFrame you provide.
Transformer: Transformers create new DataFrames using the transform API, which applies algorithms to the input DataFrame and outputs a DataFrame with additional columns. The nature of the transform could be statistical or it could be a simple algorithm, depending on the type of Estimator that created the Model.
Pipelines: Pipelines are a series of Estimators that apply a series of transforms to a DataFrame before calling fit on the final Estimator in the Pipeline. The Pipeline is itself an Estimator. When you fit a Pipeline, Spark fits the first Estimator in the Pipeline using an input DataFrame you provide. This produces a Model. If there are additional Estimators in the Pipeline, the newly created Model's transform method is called against the input DataFrame to create a new DataFrame. The process then begins again with the newly created DataFrame being passed to the next Estimator's fit method. Fitting a Pipeline produces a PipelineModel.
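As a concrete illustration of that fit/transform flow, here is a minimal two-stage Pipeline sketch on the Auto MPG data; the column names follow the dataset used throughout this cheat sheet, but the specific stages are chosen for illustration only.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Stage 1: a Transformer that assembles raw columns into a feature vector.
assembler = VectorAssembler(
    inputCols=["cylinders", "displacement", "acceleration"],
    outputCol="features",
    handleInvalid="skip",
)
# Stage 2: a statistical Estimator that learns from the assembled features.
regressor = RandomForestRegressor(featuresCol="features", labelCol="mpg")

pipeline = Pipeline(stages=[assembler, regressor])
pipeline_model = pipeline.fit(auto_df_fixed)  # fits each stage in order
predictions = pipeline_model.transform(auto_df_fixed)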
Instead these Estimators require a special type of column called a Vector Column. The Vector is like an array of numbers packed into a single cell. Combining this with Vectors in other rows in the DataFrame gives a 2-dimensional array.
One essential step in using these estimators is to load the data in your DataFrame into a vector column. This is usually done using a VectorAssembler.
This example assembles the cylinders, displacement and acceleration columns from the Auto MPG dataset into a vector column called features. If you look at the features column in the output you will see it is composed of the values of these source columns. Later examples will use the features column to fit predictive Models.
vectorAssembler = VectorAssembler(
inputCols=[
"cylinders",
"displacement",
"acceleration",
],
outputCol="features",
handleInvalid="skip",
)
assembled = vectorAssembler.transform(auto_df_fixed)
assembled = assembled.select(
["cylinders", "displacement", "acceleration", "features"]
)
print("Data type for features column is:", assembled.dtypes[-1][1])
Fit the RandomForestRegressor estimator using the DataFrame. This produces a RandomForestRegressionModel.
This example does not make predictions; see "Load a model and use it for transformations" or "Load a model and use it for predictions" to see how to make predictions with Models.
vectorAssembler = VectorAssembler(
inputCols=[
"cylinders",
"displacement",
"horsepower",
],
outputCol="features",
handleInvalid="skip",
)
assembled = vectorAssembler.transform(auto_df_fixed)
assembled = assembled.select(["features", "mpg", "carname"])
Hyperparameter tuning
A key factor in the quality of a Random Forest model is a good choice of the number of trees parameter. The previous example arbitrarily set this parameter to 20. In practice it is better to try many parameters and choose the best one.
Spark automates this using a CrossValidator. The CrossValidator performs a parallel search of possible parameters and evaluates their performance using random test/training splits of input data. When all parameter combinations are tried the best Model is identified and made available in a property called bestModel.
Commonly you want to transform your DataFrame before your estimator gets it. This is done using a Pipeline estimator. When you give a Pipeline estimator to a CrossValidator, the CrossValidator will evaluate the final stage in the Pipeline.
# Set up our main ML pipeline.
columns_to_assemble = [
"cylinders",
"displacement",
"acceleration",
]
vector_assembler = VectorAssembler(
inputCols=columns_to_assemble,
outputCol="features",
handleInvalid="skip",
)

# The regressor and the Pipeline that feed the search are not captured on this
# page; they are reconstructed here so the snippet runs end to end.
rf = RandomForestRegressor(numTrees=20, featuresCol="features", labelCol="mpg")
pipeline = Pipeline(stages=[vector_assembler, rf])

# Hyperparameter search.
target_metric = "rmse"
paramGrid = (
ParamGridBuilder().addGrid(rf.numTrees, list(range(20, 100, 10))).build()
)
cross_validator = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(
labelCol="mpg", predictionCol="prediction", metricName=target_metric
),
numFolds=2,
parallelism=4,
)
# The fit call is not captured on this page; it would look like:
fit_cross_validator = cross_validator.fit(auto_df_fixed)
best_pipeline_model = fit_cross_validator.bestModel
best_regressor = best_pipeline_model.stages[1]
print("Best model has {} trees.".format(best_regressor.getNumTrees))
.select(["manufacturer", "manufacturer_encoded"])
)
There is a column called displacement. A displacement of 400 is twice as much as a displacement of 200. This is a continuous feature.
There is a column called origin. Cars originating from the US get a value of 1. Cars originating from Europe get a value of 2. Despite
anything you may have read, Europe is not twice as much as the US. Instead this is a categorical feature and no relationship can be learned
by comparing values.
If your estimator decides that Europe is twice as much as the US it will draw all kinds of other wrong conclusions. There are a few ways of handling categorical features; one popular approach is called "One-Hot Encoding".
If we one-hot encode the origin column we replace it with vectors which encode each possible value of the category. The length of the vector is equal to the number of possible distinct values minus one. At most one of the vector's values is set to 1 at any given time. (This is usually called "dummy encoding", which is slightly different from one-hot encoding.) With values encoded this way your estimator won't draw false linear relationships in the feature.
One-hot encoding is easy and may lead to good results, but there are different ways to handle categorical values depending on the circumstances. For other tools the pros use, look up terms like "deviation coding" or "difference coding".
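A minimal sketch of one-hot encoding the origin column in Spark follows; origin is already a small integer category, so it can be passed to OneHotEncoder directly, and the output column name here is an assumption for illustration.
from pyspark.ml.feature import OneHotEncoder

origin_encoder = OneHotEncoder(inputCol="origin", outputCol="origin_encoded")
encoded = origin_encoder.fit(auto_df_fixed).transform(auto_df_fixed)
encoded.select("origin", "origin_encoded").show(5)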
Here's a simple example of why data preparation should happen before estimator fitting: imagine you have a string measure that can take 100 different possible values. Before you can fit your estimator you need to convert these strings to integers using a StringIndexer. The CrossValidator will randomly partition your DataFrame into training and test sets, then fit a StringIndexer against the training set to produce a StringIndexerModel. If the training set doesn't contain all possible 100 values, when the StringIndexerModel is used to transform the test set it will encounter unknown values and fail. The StringIndexer needs to be fit against the full dataset before any estimator fitting. There are other examples, and in general the safe choice is to do all data preparation before estimator fitting.
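A minimal sketch of that safer ordering, with column names borrowed from the examples above: fit the StringIndexer on the full dataset first, then split for training.
from pyspark.ml.feature import StringIndexer

# Fit the indexer on the FULL dataset so every manufacturer value gets an index.
manufacturer_indexer = StringIndexer(
    inputCol="manufacturer", outputCol="manufacturer_encoded"
)
indexed = manufacturer_indexer.fit(df).transform(df)

# Only then split the indexed data for model fitting and evaluation.
train_df, test_df = indexed.randomSplit([0.8, 0.2], seed=42)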
In complex applications different teams are responsible for data preparation (also called "feature engineering") and model development. In this case the feature engineering team will create feature DataFrames and save them into a so-called "Feature Store". A model development team will load DataFrames from the feature store in entirely different applications. The process below divides feature engineering and model development into 2 separate Pipelines but doesn't go so far as to save between Pipelines 1 and 2.
### Phase 1.
# Add manufacturer name we will use as a string column.
first_word_udf = udf(lambda x: x.split()[0], StringType())
df = auto_df_fixed.withColumn(
"manufacturer", first_word_udf(auto_df_fixed.carname)
)
inputCol="manufacturer", outputCol="manufacturer_encoded"
)
# Turn the model year into categories.
year_encoder = OneHotEncoder(
inputCol="modelyear", outputCol="modelyear_encoded"
)
### Phase 2.
# Assemble vectors.
columns_to_assemble = [
"cylinders",
"displacement",
"horsepower",
"weight",
"acceleration",
"manufacturer_encoded",
"modelyear_encoded",
]
vector_assembler = VectorAssembler(
inputCols=columns_to_assemble,
outputCol="features",
handleInvalid="skip",
)
# Run cross-validation to get the best parameters.
paramGrid = (
ParamGridBuilder().addGrid(rf.numTrees, list(range(20, 100, 10))).build()
)
cross_validator = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(
labelCol="mpg", predictionCol="prediction", metricName="rmse"
),
numFolds=2,
parallelism=4,
)
fit_cross_validator = cross_validator.fit(prepared)
Regression models can be evaluated with RegressionEvaluator, which provides these metrics:
Root mean squared error or rmse.
R squared or r2.
Mean absolute error or mae.
Explained variance or var.
Among these, rmse and r2 are the most commonly used metrics. Lower values of rmse are better; higher values of r2 are better, up to the maximum possible r2 score of 1.
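A quick sketch of computing these metrics for a DataFrame of predictions (the label and prediction column names are assumed to match the examples above):
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="mpg", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("rmse =", rmse, "r2 =", r2)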
Binary classification models are evaluated using BinaryClassificationEvaluator, which provides these measures:
Area under the ROC curve (areaUnderROC).
Area under the precision-recall curve (areaUnderPR).
The ROC curve is commonly plotted when using binary classifiers. A perfect model would look like an upside-down "L", leading to an area under ROC of 1.
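A minimal sketch of computing area under ROC for a binary classifier's predictions (the label and rawPrediction column names are assumptions):
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
print("Area under ROC:", evaluator.evaluate(predictions))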
Multiclass classification evaluation is much more complex; a MulticlassClassificationEvaluator is provided with options including:
F1 score.
Accuracy.
True positive rate by label.
False positive rate by label.
Precision by label.
Recall by label.
F measure by label, which computes the F score for a particular label.
Weighted true positive rate, like true positive rate, but allows each measure to have a weight.
Weighted false positive rate, like false positive rate, but allows each measure to have a weight.
Weighted precision, like precision, but allows each measure to have a weight.
Weighted recall, like recall, but allows each measure to have a weight.
Weighted f measure, like F1 score, but allows each measure to have a weight.
Log loss (short for Logistic Loss)
Hamming loss
Clustering models are evaluated with ClusteringEvaluator, which provides the silhouette measure.
The example below loads 2 regression models fit earlier and compares their metrics. rf_regression_simple.model considered 3 columns of the
input dataset while rf_regression_full.model considered 7. We can expect that rf_regression_full.model will have better metrics.
first_word_udf = udf(lambda x: x.split()[0], StringType())
df = auto_df_fixed.withColumn(
"manufacturer", first_word_udf(auto_df_fixed.carname)
)
manufacturer_encoder = StringIndexer(
inputCol="manufacturer", outputCol="manufacturer_encoded"
)
year_encoder = OneHotEncoder(
inputCol="modelyear", outputCol="modelyear_encoded"
)
data_prep_pipeline = Pipeline(stages=[manufacturer_encoder, year_encoder])
prepared = data_prep_pipeline.fit(df).transform(df)
columns_to_assemble = [
"cylinders",
"displacement",
"horsepower",
"weight",
"acceleration",
"manufacturer_encoded",
"modelyear_encoded",
]
complex_assembler = VectorAssembler(
inputCols=columns_to_assemble,
outputCol="features",
handleInvalid="skip",
)
complex_input = complex_assembler.transform(prepared).select(
["features", "mpg"]
)
cv_model = CrossValidatorModel.load("rf_regression_full.model")
best_pipeline = cv_model.bestModel
rf_complex_model = best_pipeline.stages[-1]
complex_predictions = rf_complex_model.transform(complex_input)
# Evaluate performances.
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="mpg")
performances = [
["simple", simple_predictions, dict()],
["complex", complex_predictions, dict()],
]
# Metric names accepted by RegressionEvaluator; the original list is defined
# earlier in the full cheat sheet and assumed here.
metrics = ["rmse", "r2", "mae"]
for label, predictions, tracker in performances:
    for metric in metrics:
        tracker[metric] = evaluator.evaluate(
            predictions, {evaluator.metricName: metric}
        )
print(performances)
original_columns = assembler.getInputCols()
manufacturer_encoded = StringIndexer(
inputCol="manufacturer", outputCol="manufacturer_encoded"
)
encoded_df = manufacturer_encoded.fit(df).transform(df)
# Hyperparameter search.
target_metric = "rmse"
paramGrid = (
ParamGridBuilder().addGrid(rf.numTrees, list(range(20, 100, 10))).build()
)
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(
labelCol="mpg", predictionCol="prediction", metricName=target_metric
),
numFolds=2,
parallelism=4,
)
import pandas

parameter_grid = [
{k.name: v for k, v in p.items()} for p in model.getEstimatorParamMaps()
]
pdf = pandas.DataFrame(
model.avgMetrics,
index=[x["numTrees"] for x in parameter_grid],
columns=[target_metric],
)
ax = pdf.plot(style="*-")
ax.figure.suptitle("Hyperparameter Search: RMSE by Number of Trees")
ax.figure.savefig("hyperparameters.png")
Save a model
To save a model, call fit on the Estimator to build a Model, then save the Model.
vectorAssembler = VectorAssembler(
inputCols=[
"cylinders",
"displacement",
"horsepower",
],
outputCol="features",
handleInvalid="skip",
)
assembled = vectorAssembler.transform(auto_df_fixed)
# A regression model.
rf_regressor = RandomForestRegressor(
numTrees=50,
featuresCol="features",
labelCol="mpg",
)
# A classification model.
rf_classifier = RandomForestClassifier(
numTrees=50,
featuresCol="features",
labelCol="origin",
)
# train_df is a train/test split of "assembled"; the split and the regressor
# fit/save lines are not captured on this page and are reconstructed here.
rf_regressor_model = rf_regressor.fit(train_df)
rf_regressor_model.write().overwrite().save("rf_regressor_saveload.model")
rf_classifier_model = rf_classifier.fit(train_df)
rf_classifier_model.write().overwrite().save("rf_classifier_saveload.model")
The example below loads a RandomForestRegressionModel which was the output of the fit call on a RandomForestRegressor estimator. Many people try to load this using the RandomForestRegressor class, but that is the Estimator class and won't work; instead we use the RandomForestRegressionModel class.
After we load the Model we can use its transform method to convert a DataFrame into a new DataFrame containing a prediction column. The input DataFrame needs to have the same structure as the DataFrame used to fit the Estimator.
# Model type and assembled features need to agree with the trained model.
rf_model = RandomForestRegressionModel.load("rf_regressor_saveload.model")
# The input DataFrame needs the same structure we used when we fit.
vectorAssembler = VectorAssembler(
inputCols=[
"cylinders",
"displacement",
"horsepower",
],
outputCol="features",
handleInvalid="skip",
)
assembled = vectorAssembler.transform(auto_df_fixed)
predictions = rf_model.transform(assembled).select(
"carname", "mpg", "prediction"
)
# Model type and assembled features need to agree with the trained model.
rf_model = RandomForestRegressionModel.load("rf_regressor_saveload.model")
Load a classification model and use it to compute confidences for output labels
Classification Models let you get confidence levels for each possible output class using the predictProbability method. This is Spark's equivalent of sklearn's predict_proba.
# Model type and assembled features need to agree with the trained model.
rf_model = RandomForestClassificationModel.load("rf_classifier_saveload.model")
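The rest of this example is not captured on this page; a minimal sketch of getting per-class confidences from the loaded model, assuming the same assembled features column used in the earlier examples:
# transform() adds rawPrediction, probability and prediction columns.
with_confidences = rf_model.transform(assembled).select(
    "carname", "origin", "probability", "prediction"
)
with_confidences.show(5)

# predictProbability() returns the confidence vector for one feature Vector.
first_features = assembled.first()["features"]
print(rf_model.predictProbability(first_features))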
Performance
A few performance tips and tricks.
print(spark.sparkContext.version)
_jvm is an internal variable and this could break after a Spark version upgrade.
This is the JVM running on the Driver node. I'm not aware of a way to access Log4J on Executor nodes.
logger = spark.sparkContext._jvm.org.apache.log4j.Logger.getRootLogger()
logger.warn("WARNING LEVEL LOG MESSAGE")
Cache a DataFrame
By default a DataFrame is not stored anywhere and is recomputed whenever it is needed. Caching a DataFrame can improve performance if it is
accessed many times. There are two ways to do this:
The DataFrame cache method sets the DataFrame's persistence mode to the default (Memory and Disk).
For more control you can use persist. persist requires a StorageLevel and is most typically used to control the replication factor.
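For the simple case, caching is a one-liner; a small sketch (df here stands for any DataFrame you reuse):
df = auto_df.groupBy("cylinders").count()
df.cache()  # default storage level: memory and disk
print(df.storageLevel)
df.count()  # the first action materializes the cache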
from pyspark import StorageLevel

print(
    "\nChange storage level to the equivalent of cache using an explicit StorageLevel."
)
df2.persist(storageLevel=StorageLevel(True, True, False, True, 1))
print(df2.storageLevel)
df = auto_df.groupBy("cylinders").count()
# explain() prints the plan (with costs) to stdout and returns None, so call it
# directly rather than capturing its return value.
df.explain(mode="cost")
df = auto_df.repartition(20, "modelyear")
df.foreachPartition(number_in_partition)
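number_in_partition is defined earlier in the full cheat sheet; a simple stand-in that just counts the rows in each partition could look like this. Note that on a real cluster the print output appears in executor logs rather than the driver console.
def number_in_partition(rows):
    # rows is an iterator over the Row objects in a single partition.
    count = sum(1 for _ in rows)
    print(f"Partition has {count} rows")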