Pyspark Vs Pandas Cheatsheet
Vanessa Afolabi
Inspect Data:
PANDAS            PYSPARK
df.head()         df.show()
                  df.head(n)
df.columns        df.printSchema()
                  df.columns
df.shape          df.count()
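A minimal sketch of the inspection calls above, side by side. The sample rows, the column names `name` and `age`, and both function names are made up for illustration; the PySpark half assumes a local Spark installation and is imported lazily so the pandas half runs without one.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
ROWS = [("Ann", 31), ("Bob", 25), ("Cy", 40)]
COLS = ["name", "age"]


def inspect_pandas():
    df = pd.DataFrame(ROWS, columns=COLS)
    first_two = df.head(2)       # new DataFrame holding the first 2 rows
    names = list(df.columns)     # column labels
    n_rows, n_cols = df.shape    # (rows, columns) tuple
    return first_two, names, n_rows, n_cols


def inspect_pyspark():
    # Lazy import: only needed when this function is actually called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("inspect").getOrCreate()
    df = spark.createDataFrame(ROWS, COLS)
    df.show(2)                   # prints the first 2 rows and returns None
    df.printSchema()             # prints column names and types
    names = df.columns           # plain Python list of column names
    n_rows = df.count()          # row count only; this triggers a Spark job
    spark.stop()
    return names, n_rows, len(names)
```

Note that `df.shape` gives both dimensions at once, while in PySpark the row count comes from `df.count()` and the column count from `len(df.columns)`.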
Handling Duplicate Data:
PANDAS                   PYSPARK
df['col'].unique()       df.distinct().count()
df.duplicated()
df.drop_duplicates()     df.dropDuplicates()
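A minimal sketch of the duplicate-handling calls, assuming a made-up frame with columns `key` and `val` (hypothetical names, as are the two function names). In pandas, `unique()` is a Series method, so it is applied to a single column here.

```python
import pandas as pd

# Hypothetical sample data: the first two rows are identical.
ROWS = [("a", 1), ("a", 1), ("b", 2)]
COLS = ["key", "val"]


def duplicates_pandas():
    df = pd.DataFrame(ROWS, columns=COLS)
    distinct_keys = df["key"].unique()   # distinct values of one column
    dup_mask = df.duplicated()           # True for each repeat of an earlier row
    deduped = df.drop_duplicates()       # keeps the first occurrence of each row
    return list(distinct_keys), dup_mask.tolist(), len(deduped)


def duplicates_pyspark():
    # Lazy import: requires a local Spark installation when called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("dedupe").getOrCreate()
    df = spark.createDataFrame(ROWS, COLS)
    n_distinct = df.distinct().count()        # number of distinct rows
    n_deduped = df.dropDuplicates().count()   # same rows, duplicates removed
    spark.stop()
    return n_distinct, n_deduped
```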
Rename Columns:
PANDAS PYSPARK
Handling Missing Data:
PANDAS PYSPARK
df.dropna()              df.na.drop()
df.fillna(value)         df.na.fill(value)
df.replace(old, new)     df.na.replace(old, new)
df['col'].isna()         df.col.isNull()
df['col'].isnull()
df['col'].notna()        df.col.isNotNull()
df['col'].notnull()
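A minimal sketch of the missing-data calls, assuming a made-up frame with columns `name` and `age` and the fill values `"unknown"` and `0.0` (all hypothetical, as are the function names).

```python
import pandas as pd


def missing_pandas():
    # Hypothetical data with one missing name and one missing age.
    df = pd.DataFrame({"name": ["Ann", None, "Cy"], "age": [31.0, 25.0, None]})
    dropped = df.dropna()                                # rows with no missing values
    filled = df.fillna({"name": "unknown", "age": 0.0})  # per-column fill values
    null_mask = df["age"].isna()                         # isnull() is an alias
    return len(dropped), filled["age"].tolist(), null_mask.tolist()


def missing_pyspark():
    # Lazy import: requires a local Spark installation when called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("missing").getOrCreate()
    df = spark.createDataFrame(
        [("Ann", 31.0), (None, 25.0), ("Cy", None)],
        "name string, age double",
    )
    n_dropped = df.na.drop().count()                     # rows with no nulls
    filled = df.na.fill({"name": "unknown", "age": 0.0})
    n_null_age = df.filter(df.age.isNull()).count()      # rows where age is null
    n_filled = filled.filter(filled.age.isNotNull()).count()
    spark.stop()
    return n_dropped, n_null_age, n_filled
```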
Filter Dataset:
PANDAS PYSPARK
Select Columns:
PANDAS                               PYSPARK
df = df[['col1','col2','col3']]      df = df.select('col1','col2','col3')
Drop Columns:
PANDAS PYSPARK
Grouping Data:
PANDAS                                     PYSPARK
df.groupby(by=['col1','col2']).count()     df.groupBy('col1','col2').count().show()
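A minimal sketch of grouping on two columns, assuming a made-up frame with an extra `val` column (hypothetical, as are the function names). One difference to keep in mind: pandas `count()` reports non-null counts for every remaining column, while PySpark `count()` adds a single `count` column with the rows per group, which is closer to pandas `size()`.

```python
import pandas as pd

# Hypothetical sample data.
ROWS = [("x", "p", 1), ("x", "p", 2), ("y", "q", 3)]
COLS = ["col1", "col2", "val"]


def group_pandas():
    df = pd.DataFrame(ROWS, columns=COLS)
    grouped = df.groupby(by=["col1", "col2"])
    counts = grouped.count()   # non-null count per remaining column
    sizes = grouped.size()     # rows per group
    return counts["val"].to_dict(), sizes.to_dict()


def group_pyspark():
    # Lazy import: requires a local Spark installation when called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("group").getOrCreate()
    df = spark.createDataFrame(ROWS, COLS)
    rows = df.groupBy("col1", "col2").count().collect()
    spark.stop()
    # Each Row carries the grouping columns plus a "count" column.
    return {(r["col1"], r["col2"]): r["count"] for r in rows}
```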
Combining Data:
PANDAS                              PYSPARK
pd.concat([df1,df2])                df1.union(df2)
df1.append(df2) (pandas < 2.0)
df1.join(df2)                       df1.join(df2)
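A minimal sketch of stacking and joining, assuming made-up frames that share an `id` column (all names here are hypothetical). Two differences worth noting: `pd.concat` aligns columns by name while PySpark `union` matches them by position, and pandas `join` aligns on the index while a PySpark join is normally given an explicit key.

```python
import pandas as pd


def combine_pandas():
    # Hypothetical frames: df1 and df2 share a schema, other adds a column.
    df1 = pd.DataFrame({"id": [1, 2], "a": ["x", "y"]})
    df2 = pd.DataFrame({"id": [3], "a": ["z"]})
    other = pd.DataFrame({"id": [1, 2], "b": [10.0, 20.0]})
    stacked = pd.concat([df1, df2], ignore_index=True)         # rows of df2 below df1
    joined = df1.set_index("id").join(other.set_index("id"))   # join on the index
    return stacked["id"].tolist(), joined["b"].tolist()


def combine_pyspark():
    # Lazy import: requires a local Spark installation when called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("combine").getOrCreate()
    df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a"])
    df2 = spark.createDataFrame([(3, "z")], ["id", "a"])
    other = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "b"])
    n_stacked = df1.union(df2).count()                         # positional column match
    n_joined = df1.join(other, on="id", how="inner").count()   # explicit join key
    spark.stop()
    return n_stacked, n_joined
```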
Cartesian Product:
PANDAS                                     PYSPARK
df1['key'] = 1                             df1.crossJoin(df2)
df2['key'] = 1
df1.merge(df2, how='outer', on='key')
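A minimal sketch of the constant-key trick above, assuming two made-up frames with columns `a` and `b` (hypothetical, as are the function names). It uses `assign` so the inputs are not modified, and compares the result with `how="cross"`, which is available in pandas 1.2 and later.

```python
import pandas as pd


def cross_pandas():
    # Hypothetical inputs: 2 rows x 3 rows should give 6 combinations.
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": ["x", "y", "z"]})
    # Constant-key approach: every row matches every row on key == 1.
    left = df1.assign(key=1)
    right = df2.assign(key=1)
    via_key = left.merge(right, how="outer", on="key").drop(columns="key")
    # Built-in cross join (pandas 1.2+), no helper column needed.
    via_cross = df1.merge(df2, how="cross")
    return via_key, via_cross


def cross_pyspark():
    # Lazy import: requires a local Spark installation when called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("cross").getOrCreate()
    df1 = spark.createDataFrame([(1,), (2,)], ["a"])
    df2 = spark.createDataFrame([("x",), ("y",), ("z",)], ["b"])
    n_rows = df1.crossJoin(df2).count()   # every row of df1 paired with every row of df2
    spark.stop()
    return n_rows
```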
Sorting Data:
PANDAS PYSPARK