
CHEATSHEET: PANDAS VS PYSPARK

Vanessa Afolabi

Import Libraries and Set System Options:


PANDAS:
import pandas as pd
pd.options.display.max_colwidth = 1000

PYSPARK:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SQLContext

Define and create a dataset:


PANDAS:
data = {'col1': [ , , ], 'col2': [ , , ]}
df = pd.DataFrame(data, columns=['col1', 'col2'])

PYSPARK:
schema = StructType([StructField('Col1', IntegerType()),
                     StructField('Col2', StringType())])
df = SQLContext(sc).createDataFrame(sc.emptyRDD(), schema)
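
A minimal, runnable sketch of both constructions; it uses the modern SparkSession entry point in place of the older SQLContext(sc) shown above, and the column names and values are made up for illustration:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.master('local[1]').getOrCreate()

# Pandas: a dict of column -> values becomes a DataFrame directly
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
pdf = pd.DataFrame(data, columns=['col1', 'col2'])

# PySpark: declare the schema explicitly, then attach it to the rows
schema = StructType([StructField('col1', IntegerType()),
                     StructField('col2', StringType())])
sdf = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], schema)
sdf.show()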

Read and Write to CSV:


PANDAS:
df = pd.read_csv()
df.to_csv()

PYSPARK:
df = SQLContext(sc).read.csv()
df.toPandas().to_csv()
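
An end-to-end sketch; it writes a tiny file first so the reads actually succeed, uses a SparkSession rather than SQLContext(sc), and the file names are made up:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()

# Write a small file with pandas, then read it back with both libraries
pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']}).to_csv('data.csv', index=False)
pdf = pd.read_csv('data.csv')

sdf = spark.read.csv('data.csv', header=True, inferSchema=True)
sdf.toPandas().to_csv('out.csv', index=False)            # only for small results
sdf.write.mode('overwrite').csv('out_dir', header=True)  # distributed write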

Indexing and Splitting:


PANDAS:
df.loc[ ]
df.iloc[ ]

PYSPARK:
df.randomSplit(weights=[ ], seed=n)
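
PySpark has no positional .loc/.iloc indexing; randomSplit is the usual way to carve a DataFrame into random fractions. A small sketch, assuming a local SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
sdf = spark.range(100)  # toy DataFrame with one 'id' column

# 80/20 split; a fixed seed makes the split reproducible
train, test = sdf.randomSplit(weights=[0.8, 0.2], seed=42)
print(train.count(), test.count())  # roughly 80 and 20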

Inspect Data:
PANDAS:
df.head()
df.head(n)
df.columns
df.shape

PYSPARK:
df.show()
df.printSchema()
df.columns
df.count()
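
Side by side on a toy frame (the data is made up; note that Spark splits pandas' single df.shape into a row count and a column list):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
sdf = spark.createDataFrame(pdf)

print(pdf.head(2))                    # first n rows
sdf.show(2)                           # first n rows as a printed table
print(pdf.shape)                      # (rows, columns) in one call
print(sdf.count(), len(sdf.columns))  # Spark: rows and columns separately
sdf.printSchema()                     # column names and types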
Handling Duplicate Data:
PANDAS:
df['col'].unique()
df.duplicated()
df.drop_duplicates()

PYSPARK:
df.distinct().count()
df.dropDuplicates()
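
A small sketch with made-up data; pandas' unique() works per column, while Spark's distinct()/dropDuplicates() operate on whole rows by default:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': ['a', 'a', 'b']})
sdf = spark.createDataFrame(pdf)

print(pdf['col'].unique())     # ['a' 'b']
print(pdf.duplicated())        # boolean mask marking repeated rows
print(pdf.drop_duplicates())   # repeated rows removed
print(sdf.distinct().count())  # 2 distinct rows
sdf.dropDuplicates().show()    # repeated rows removed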

Rename Columns:
PANDAS:
df.rename(columns={'old col': 'new col'})

PYSPARK:
df.withColumnRenamed('old col', 'new col')

Handling Missing Data:


PANDAS:
df.dropna()
df.fillna()
df.replace()
df['col'].isna()
df['col'].isnull()
df['col'].notna()
df['col'].notnull()

PYSPARK:
df.na.drop()
df.na.fill()
df.na.replace()
df.col.isNull()
df.col.isNotNull()
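
A minimal sketch, with a missing value injected so the calls have something to act on (pandas stores it as NaN, Spark as null):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': [1.0, None, 3.0]})                    # None becomes NaN
sdf = spark.createDataFrame([(1.0,), (None,), (3.0,)], ['col'])  # None becomes null

print(pdf.dropna())                  # rows with NaN removed
print(pdf.fillna(0.0))               # NaN replaced by 0.0
print(pdf['col'].isna())             # boolean mask of missing values

sdf.na.drop().show()                 # rows with null removed
sdf.na.fill(0.0).show()              # null replaced by 0.0
sdf.filter(sdf.col.isNull()).show()  # keep only the null rows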

Common Column Functions:


PANDAS:
df['col'] = df['col'].str.lower()
df['col'] = df['col'].str.replace()
df['col'] = df['col'].str.split()
df['col'] = df['col'].str.join()
df['col'] = df['col'].str.strip()

PYSPARK:
df = df.withColumn('col', lower(df.col))
df = df.select('*', regexp_replace().alias())
df = df.select('*', regexp_extract().alias())
df = df.withColumn('col', split('col'))
df = df.withColumn('col', UDF_JOIN(df.col, lit(' ')))  # UDF_JOIN: a user-defined join function
df = df.withColumn('col', trim(df.col))
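
A minimal sketch chaining a few of these on made-up text (it covers only the built-in functions above, not the author's UDF_JOIN):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, regexp_replace, split, trim

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': ['  Hello World  ']})
sdf = spark.createDataFrame(pdf)

# Pandas chains string methods through the .str accessor
pdf['col'] = pdf['col'].str.strip().str.lower().str.replace('world', 'there')

# PySpark wraps column functions in withColumn/select
sdf = sdf.withColumn('col', trim(sdf.col))
sdf = sdf.withColumn('col', lower(sdf.col))
sdf = sdf.withColumn('col', regexp_replace('col', 'world', 'there'))
sdf = sdf.withColumn('words', split('col', ' '))
sdf.show(truncate=False)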

Apply User Defined Functions:


PANDAS:
df['col'] = df['col'].map(UDF)
df.apply(f)
df.applymap(f)

PYSPARK:
df = df.withColumn('col', UDF(df.col))
df = df.withColumn('col', when(cond, UDF(df.col)).otherwise())
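
A sketch of the same plain Python function used both ways; in PySpark it must first be wrapped by udf() with an explicit return type (the function name shout is made up):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': ['a', 'b']})
sdf = spark.createDataFrame(pdf)

def shout(s):                         # hypothetical user-defined function
    return s.upper() + '!'

pdf['col'] = pdf['col'].map(shout)    # pandas applies it element-wise

shout_udf = udf(shout, StringType())  # PySpark needs a registered UDF
sdf = sdf.withColumn('col', shout_udf(sdf.col))
sdf = sdf.withColumn('col', when(sdf.col == 'A!', 'first').otherwise(sdf.col))
sdf.show()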

Join two dataset columns:


PANDAS:
df['new col'] = df['col1'] + df['col2']

PYSPARK:
df = df.withColumn('new col', concat_ws(' ', df.col1, df.col2))
df.select('*', concat(df.col1, df.col2).alias('new col'))
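
A short sketch with made-up names; concat_ws inserts the separator between values, while concat joins them with nothing in between:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col1': ['John'], 'col2': ['Doe']})
sdf = spark.createDataFrame(pdf)

pdf['new_col'] = pdf['col1'] + ' ' + pdf['col2']  # plain + on string columns

sdf = sdf.withColumn('new_col', concat_ws(' ', sdf.col1, sdf.col2))
sdf.select('*', concat(sdf.col1, sdf.col2).alias('joined')).show()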
Convert dataset column to a list:
PANDAS:
list(df['col'])

PYSPARK:
df.select('col').rdd.flatMap(lambda x: x).collect()
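
Both one-liners in context; in Spark the select yields Row objects, so flatMap unwraps each to its bare value before collect() brings the list back to the driver:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': [1, 2, 3]})
sdf = spark.createDataFrame(pdf)

print(list(pdf['col']))                                      # [1, 2, 3]
print(sdf.select('col').rdd.flatMap(lambda x: x).collect())  # [1, 2, 3]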

Filter Dataset:
PANDAS:
df = df[df['col'] != ' ']
df = df[df['col'] == val]

PYSPARK:
df = df.filter(df['col'] == val)
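
The same predicate in both (val is a placeholder above; the sketch pins it to a concrete value):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': [1, 2, 2]})
sdf = spark.createDataFrame(pdf)
val = 2

print(pdf[pdf['col'] == val])         # boolean-mask indexing
sdf.filter(sdf['col'] == val).show()  # same predicate, evaluated lazily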

Select Columns:
PANDAS:
df = df[['col1', 'col2', 'col3']]

PYSPARK:
df = df.select('col1', 'col2', 'col3')

Drop Columns:
PANDAS:
df.drop(['B', 'C'], axis=1)
df.drop(columns=['B', 'C'])

PYSPARK:
df.drop('col1', 'col2')

Grouping Data:
PANDAS:
df.groupby(by=['col1', 'col2']).count()

PYSPARK:
df.groupBy('col').count().show()
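
One grouped count in both, on made-up data; pandas counts non-null values per remaining column, while Spark adds a single count column:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': ['a', 'a', 'b'], 'val': [1, 2, 3]})
sdf = spark.createDataFrame(pdf)

print(pdf.groupby(by=['col']).count())  # per-column non-null counts
sdf.groupBy('col').count().show()       # a single 'count' column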

Combining Data:
PANDAS:
pd.concat([df1, df2])
df1.append(df2)
df1.join(df2)

PYSPARK:
df1.union(df2)
df1.join(df2)
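
Row-wise stacking in both, with tiny made-up frames; Spark's union matches columns by position, so the schemas must line up:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf1 = pd.DataFrame({'col': [1, 2]})
pdf2 = pd.DataFrame({'col': [3, 4]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

print(pd.concat([pdf1, pdf2], ignore_index=True))  # 4 rows
sdf1.union(sdf2).show()                            # 4 rows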

Cartesian Product:
PANDAS:
df1['key'] = 1
df2['key'] = 1
df1.merge(df2, how='outer', on='key')

PYSPARK:
df1.crossJoin(df2)
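
Both routes on a 2x2 example; the pandas constant-key trick above predates merge(how='cross'), which newer pandas versions also support:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf1 = pd.DataFrame({'a': [1, 2]})
pdf2 = pd.DataFrame({'b': ['x', 'y']})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

pdf1['key'] = 1
pdf2['key'] = 1
print(pdf1.merge(pdf2, how='outer', on='key').drop(columns='key'))  # 4 rows

sdf1.crossJoin(sdf2).show()  # 4 rows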

Sorting Data:
PANDAS:
df.sort_values()
df.sort_index()

PYSPARK:
df.sort()
df.orderBy()
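
A final sketch on made-up numbers; in Spark, sort() and orderBy() are interchangeable:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
pdf = pd.DataFrame({'col': [3, 1, 2]})
sdf = spark.createDataFrame(pdf)

print(pdf.sort_values(by='col', ascending=False))
print(pdf.sort_index())             # restore the original row order
sdf.orderBy(sdf.col.desc()).show()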
