PySpark Data Frame Questions PDF
Converting an RDD to a DataFrame-
dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()
root
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()
The above code snippet gives you the DataFrame schema with
the column names-
root
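For context, the snippets above assume an existing rdd; a minimal, hypothetical setup (the data values are illustrative, not from the original) could look like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ProjectPro").getOrCreate()

# Illustrative (language, users_count) records; any RDD of tuples works the same way
data = [("Python", "100000"), ("Scala", "3000"), ("Java", "20000")]
rdd = spark.sparkContext.parallelize(data)

dfFromRDD1 = rdd.toDF(["language", "users_count"])
dfFromRDD1.show()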
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
    .appName('ProjectPro') \
    .getOrCreate()

data = [("James","","William","36636","M",3000),
    ("Michael","Smith","","40288","M",4000),
    ("Robert","","Dawson","42114","M",4000),
    ("Maria","","Jones","39192","F",4000)
]

schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id",StringType(),True), \
    StructField("gender",StringType(),True), \
    StructField("salary",IntegerType(),True) \
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

# df is assumed to have been created earlier, e.g. with spark.createDataFrame(...)
df.printSchema()
df.show(truncate=False)
Output-
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

# df is assumed to hold employee records with department and salary columns
# (a setup sketch follows the dropDuplicates example below)
df.printSchema()
df.show(truncate=False)
#Distinct
distinctDF = df.distinct()
distinctDF.show(truncate=False)
#Drop duplicates
df2 = df.dropDuplicates()
df2.show(truncate=False)
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)
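The distinct() and dropDuplicates() snippets above assume a df with employee_name, department, and salary columns; a minimal, illustrative setup (column names and values are assumptions) would be:
# Hypothetical employee data with an intentional duplicate row
data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("James", "Sales", 3000)]

df = spark.createDataFrame(data, ["employee_name", "department", "salary"])
df.show(truncate=False)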
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

# Illustrative (Seqno, Name) records; the original data list was not shown
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]

column = ["Seqno","Name"]
df = spark.createDataFrame(data=data,schema=column)
df.show(truncate=False)
Output-
def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        # Capitalize the first letter of every word
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
The PySpark SQL udf() function returns an object of the
org.apache.spark.sql.expressions.UserDefinedFunction class.
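As a short sketch, the convertCase function above can be wrapped with udf() and applied to the Name column of the Seqno/Name DataFrame created earlier:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Wrap the plain Python function as a PySpark UDF that returns a string
convertUDF = udf(lambda z: convertCase(z), StringType())

# Apply the UDF to the Name column
df.select(col("Seqno"), convertUDF(col("Name")).alias("Name")).show(truncate=False)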
records = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"]
rdd=spark.sparkContext.parallelize(records)
# Signature of RDD.map(): map(f, preservesPartitioning=False)
rdd2 = rdd.map(lambda x: (x, 1))
for element in rdd2.collect():
    print(element)
Output-
'how': default inner (options are inner, cross, outer, full, full outer, left, left outer, right, right outer, left semi, and left anti.)
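A brief, illustrative sketch of the how parameter in DataFrame.join(); the empDF/deptDF DataFrames and the dept_id column are assumptions made for this example:
# Hypothetical DataFrames: empDF(emp_id, name, dept_id) and deptDF(dept_id, dept_name)
empDF = spark.createDataFrame(
    [(1, "James", 10), (2, "Maria", 20), (3, "Robert", 30)],
    ["emp_id", "name", "dept_id"])
deptDF = spark.createDataFrame(
    [(10, "Finance"), (20, "Marketing")],
    ["dept_id", "dept_name"])

# Inner join (the default when 'how' is omitted)
empDF.join(deptDF, on="dept_id", how="inner").show()

# Left outer join keeps unmatched rows from the left side
empDF.join(deptDF, on="dept_id", how="left").show()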
arrayCol = ArrayType(StringType(),False)
from pyspark.sql.types import StructType, StructField, StringType, MapType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

dataDictionary = [
    ('James', {'hair': 'black', 'eye': 'brown'}),
    ('Michael', {'hair': 'brown', 'eye': None}),
    ('Robert', {'hair': 'red', 'eye': 'black'}),
    ('Washington', {'hair': 'grey', 'eye': 'grey'}),
    ('Jefferson', {'hair': 'brown', 'eye': ''})
]

df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
Output-
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

data = [("Orange",2000,"USA"), ("Orange",2000,"USA"), ("Banana",400,"China"), \
    ("Carrots",1200,"China"), ("Beans",1500,"China"), ("Orange",4000,"China"), \
    ("Banana",2000,"Canada"), ("Carrots",2000,"Canada"), ("Beans",2000,"Mexico")]

columns = ["Product","Amount","Country"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output-
pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.printSchema()
pivotDF.show(truncate=False)
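As a side note, pivot() also accepts an explicit list of pivot values, which avoids an extra pass over the data to discover the distinct countries; a small sketch assuming the same df:
# Supplying the pivot values up front (the four countries in the sample data)
countries = ["USA", "China", "Canada", "Mexico"]
pivotDF2 = df.groupBy("Product").pivot("Country", countries).sum("Amount")
pivotDF2.show(truncate=False)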
The value of a broadcast variable is accessed with broadcastVariable.value.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExample.com').getOrCreate()

# Lookup data shared with every executor via a broadcast variable
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
]

rdd = spark.sparkContext.parallelize(data)

def state_convert(code):
    return broadcastStates.value[code]

res = rdd.map(lambda a: (a[0], a[1], a[2], state_convert(a[3]))).collect()
print(res)
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySpark broadcast variable').getOrCreate()

states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","William","USA","CA"),
    ("Maria","Jones","USA","FL")
]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

def state_convert(code):
    return broadcastStates.value[code]

res = df.rdd.map(lambda a: (a[0], a[1], a[2], state_convert(a[3]))).toDF(columns)
res.show(truncate=False)
Q1. You have a cluster of ten nodes with each node having
24 CPU cores. The following code works, but it may crash
on huge data sets, or at the very least, it may not take
advantage of the cluster's full processing capabilities.
Which aspect is the most difficult to alter, and how would
you go about doing so?
def cal(sparkSession: SparkSession): Unit = {
  val NumNode = 10
  val userActivityRdd: RDD[UserActivity] = readUserActivityData(sparkSession)
    .repartition(NumNode)
  val result = userActivityRdd
    .map(e => (e.userId, 1L))
    .reduceByKey(_ + _)
  result.take(1000)
}
The repartition command creates ten partitions regardless of how many partitions the data was originally loaded with. On large datasets these partitions can grow very large, and they will almost certainly outgrow the RAM allotted to a single executor.
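One way to address this is to size the partition count to the cluster's total cores rather than the node count. A minimal PySpark sketch under the 10-node, 24-core scenario above (read_user_activity_data is a hypothetical loader, and the exact multiplier is a tuning choice, not a fixed rule):
# Partition by total cores (10 nodes * 24 cores), not by node count,
# so each partition stays small enough for a single executor.
num_partitions = 10 * 24

user_activity_rdd = read_user_activity_data(spark)   # hypothetical loader
result = (user_activity_rdd
    .repartition(num_partitions)
    .map(lambda e: (e.userId, 1))
    .reduceByKey(lambda a, b: a + b))
print(result.take(1000))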
Q2. Explain the following code and what output it will yield-
case class User(uId: Long, uName: String)
case class UserActivity(uId: Long, activityTypeId: Int, timestampEpochSec: Long)

val LoginActivityTypeId = 0
val LogoutActivityTypeId = 1

private def readUserData(sparkSession: SparkSession): RDD[User] = {
  sparkSession.sparkContext.parallelize(
    Array(
      User(1, "Doe, John"),
      User(2, "Doe, Jane"),
      User(3, "X, Mr.")))
}

private def readUserActivityData(sparkSession: SparkSession): RDD[UserActivity] = {
  sparkSession.sparkContext.parallelize(
    Array(
      UserActivity(1, LoginActivityTypeId, 1514764800L),
      UserActivity(2, LoginActivityTypeId, 1514808000L),
      UserActivity(1, LogoutActivityTypeId, 1514829600L),
      UserActivity(1, LoginActivityTypeId, 1514894400L)))
}

def calculate(sparkSession: SparkSession): Unit = {
  val userRdd: RDD[(Long, User)] =
    readUserData(sparkSession).map(e => (e.uId, e))
  val userActivityRdd: RDD[(Long, UserActivity)] =
    readUserActivityData(sparkSession).map(e => (e.uId, e))
  val result = userRdd
    .leftOuterJoin(userActivityRdd)
    .filter(e => e._2._2.isDefined && e._2._2.get.activityTypeId == LoginActivityTypeId)
    .map(e => (e._2._1.uName, e._2._2.get.timestampEpochSec))
    .reduceByKey((a, b) => if (a < b) a else b)
  result.foreach(e => println(s"${e._1}: ${e._2}"))
}
The primary function, calculate, reads two datasets. (In this case, they are supplied from constant inline data.) The two RDDs are keyed by user id and left-outer joined, and only the users' login events are kept from the joined dataset. The uName and the event timestamp are then combined into a tuple, and reduceByKey keeps the earliest login timestamp for each user, which is what gets printed.
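For readers more comfortable with Python, a rough PySpark equivalent of the Q2 flow described above (the values mirror the Scala snippet; variable names are illustrative):
LOGIN, LOGOUT = 0, 1

users = spark.sparkContext.parallelize(
    [(1, "Doe, John"), (2, "Doe, Jane"), (3, "X, Mr.")])            # (uId, uName)
activity = spark.sparkContext.parallelize(
    [(1, (LOGIN, 1514764800)), (2, (LOGIN, 1514808000)),
     (1, (LOGOUT, 1514829600)), (1, (LOGIN, 1514894400))])          # (uId, (type, ts))

result = (users.leftOuterJoin(activity)
    .filter(lambda e: e[1][1] is not None and e[1][1][0] == LOGIN)  # keep login events
    .map(lambda e: (e[1][0], e[1][1][1]))                           # (uName, timestamp)
    .reduceByKey(min))                                              # earliest login per user

for name, ts in result.collect():
    print(f"{name}: {ts}")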
.repartition(col(UIdColName))
.join(userActivityRdd, UIdColName)
.select(col(UNameColName))
.groupBy(UNameColName)
.count()
.withColumnRenamed("count", CountColName)
result.show()
}
Q5. What are the elements used by the GraphX library, and
how are they generated from an RDD? To determine page
rankings, fill in the following code-
def calculate(sparkSession: SparkSession): Unit = {
  val pageRdd: RDD[(???, Page)] = readPageData(sparkSession)
    .map(e => (e.pageId, e))
    .cache()
  val pageReferenceRdd: RDD[???[PageReference]] = readPageReferenceData(sparkSession)
  val graph = Graph(pageRdd, pageReferenceRdd)
  val PageRankTolerance = 0.005
  val ranks = graph.???
  ranks.take(1000).foreach(print)
}
The output yielded will be a list of tuples:
(1,1.4537951595091907)
(2,0.7731024202454048)
(3,0.7731024202454048)
Vertex and Edge objects are supplied to the Graph object as RDDs of type RDD[(VertexId, VT)] and RDD[Edge[ET]] respectively (where VT and ET are any user-defined types associated with a given Vertex or Edge). For the Edge type, the constructor is Edge[ET](srcId: VertexId, dstId: VertexId, attr: ET). VertexId is simply an alias for Long.
def keywordExists(line):
    if "my_keyword" in line:   # the keyword to search for; adjust as needed
        return 1
    return 0

lines = sparkContext.textFile("sample_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
• Fault Recovery
• Interactions between memory management and storage
systems
• Monitoring, scheduling, and distributing jobs
• Fundamental I/O functions
val denseVec = Vectors.dense(4405d, 260100d, 400d, 5.0, 4.0, 198.0, 9070d, 1.0, 1.0, 2.0, 0.0)
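For completeness, a PySpark sketch of the same idea using pyspark.ml.linalg (the values mirror the Scala example above):
from pyspark.ml.linalg import Vectors

# Dense vector: every value is stored explicitly
dense_vec = Vectors.dense([4405.0, 260100.0, 400.0, 5.0, 4.0, 198.0, 9070.0, 1.0, 1.0, 2.0, 0.0])
print(dense_vec)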
• Cache method- df.cache() stores the data with the default storage level.
• Persist method- df.persist() lets you specify the storage level explicitly, e.g. df.persist(StorageLevel.MEMORY_AND_DISK)
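A short illustrative sketch of both approaches (the storage level chosen here is only an example):
from pyspark import StorageLevel

# cache(): shorthand for persisting with the default storage level
df.cache()
df.count()        # an action triggers the actual materialization
df.unpersist()    # release the cached data

# persist(): choose a storage level explicitly
df.persist(StorageLevel.MEMORY_AND_DISK)   # example level, not a requirement
df.count()
df.unpersist()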
No. of nodes = 10
No. of cores available per node for executors = 15
No. of cores per executor = 5
Executors per node = 15/5 = 3
Total executors = 10 * 3 = 30
lines = sc.textFile("hdfs://Hadoop/user/sample_file.txt")

def toWords(line):
    return line.split()

words = lines.flatMap(toWords)

def toTuple(word):
    return (word, 1)

wordsTuple = words.map(toTuple)

def sum(x, y):
    return x + y

counts = wordsTuple.reduceByKey(sum)

# Print the result
counts.collect()
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# df is assumed to have been created earlier, e.g. with spark.createDataFrame(...)
df.show()
Level                          Purpose
MEMORY_ONLY                    Stores deserialized Java objects in the JVM. This is the
                               default persistence level in PySpark.
MEMORY_AND_DISK                Stores the RDD as deserialized Java objects. If the RDD is
                               too large to reside in memory, the partitions that don't fit
                               are saved to disk and read from there when needed.
MEMORY_ONLY_SER                Stores the RDD as serialized Java objects. Although this
                               level saves more space in the case of fast serializers, it
                               demands more CPU capacity to read the RDD.
MEMORY_AND_DISK_SER            Acts like MEMORY_ONLY_SER, except that instead of
                               recomputing partitions on the fly each time they're needed,
                               it stores them on disk.
DISK_ONLY                      Saves the RDD partitions only on disk.
MEMORY_ONLY_2,                 Function the same as the levels above, but replicate each
MEMORY_AND_DISK_2, etc.        partition on two cluster nodes.
OFF_HEAP                       Requires off-heap memory to store the RDD.
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

def toWords(line):
    return line.split()

words = lines.flatMap(toWords)

def toTuple(word):
    return (word, 1)

wordsTuple = words.map(toTuple)

def sum(x, y):
    return x + y

counts = wordsTuple.reduceByKey(sum)
• Print it out:
counts.collect()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
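A minimal sketch of what typically follows, assuming a text stream from a local socket on port 9999 (host and port are illustrative):
# Count words over each 1-second batch of the stream
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # start the streaming computation
ssc.awaitTermination()  # wait for it to terminate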
Q10. Consider the following scenario: you have a large text file.
How will you use PySpark to see if a specific keyword exists?
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

def isFound(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

foundBits = lines.map(isFound)
total = foundBits.reduce(lambda a, b: a + b)

if total > 0:
    print("Found")
else:
    print("Not Found")
Name~|Age
Azarudeen, Shahul~|25
George, Bush~|59
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("scenario based") \
    .getOrCreate()
sc = spark.sparkContext

df = spark.read.text("input.csv")
df.show(truncate=0)

# Use the first row as the header and split it on the custom '~|' delimiter
header = df.first()[0]
schema = header.split('~|')

# Drop the header row, split the remaining lines on '~|' and apply the schema
df_input = df.filter(df['value'] != header).rdd \
    .map(lambda x: x[0].split('~|')).toDF(schema)
df_input.show(truncate=0)
Q2. How will you merge two files – File1 and File2 – into a
single DataFrame if they have different schemas?
File -1:
Name|Age
Azarudeen, Shahul|25
Michel, Clarke|26
Virat, Kohli|28
Andrew, Simond|37
File -2:
Name|Age|Gender
Flintoff, David|12|Male
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.master("local").appName('Modes of Dataframereader') \
    .getOrCreate()
sc = spark.sparkContext

df1 = spark.read.option("delimiter","|").csv('input.csv', header=True)
df2 = spark.read.option("delimiter","|").csv("input2.csv", header=True)

# File1 has no Gender column, so add a placeholder column before the union
df_add = df1.withColumn("Gender", lit("null"))
df_add.union(df2).show()
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Gender", StringType(), True),
])

df3 = spark.read.option("delimiter","|").csv("input.csv", header=True, schema=schema)
df4 = spark.read.option("delimiter","|").csv("input2.csv", header=True, schema=schema)
df3.union(df4).show()
103, Mani, IT
104, Pavan, HR
Answer-
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local").appName('Modes of Dataframereader') \
    .getOrCreate()
sc = spark.sparkContext

schm = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", StringType(), True),
    StructField("col_3", StringType(), True),
])

# DROPMALFORMED silently drops rows that do not match the schema
df = spark.read.option("mode", "DROPMALFORMED").csv('input1.csv', header=True, schema=schm)
df.show()
Azar|25| MBA,BE,HSC
Hari|32|
Kumar|35|ME,BE,Diploma
Answer-
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode_outer, posexplode_outer

spark = SparkSession.builder.master("local").appName('scenario based') \
    .getOrCreate()
sc = spark.sparkContext

in_df = spark.read.option("delimiter","|").csv("input4.csv", header=True)
in_df.show()

# explode_outer keeps rows whose Education value is empty or null
in_df.withColumn("Qualification", explode_outer(split("Education", ","))).show()

# posexplode_outer also returns the position of each exploded element
in_df.select("*", posexplode_outer(split("Education", ","))) \
    .withColumnRenamed("col", "Qualification") \
    .withColumnRenamed("pos", "Index") \
    .drop("Education").show()
• get(filename)
• getRootDirectory()
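A small sketch of how these SparkFiles methods are typically used (the file path here is just an example):
from pyspark import SparkFiles

# Distribute a file to every node (local path or URL; this path is illustrative)
spark.sparkContext.addFile("/tmp/sample_file.txt")

# get(filename) returns the absolute path of the node-local copy
print(SparkFiles.get("sample_file.txt"))

# getRootDirectory() returns the directory holding files added via addFile()
print(SparkFiles.getRootDirectory())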
list_num = [1, 2, 5, 4]
tup_num = (1, 2, 5, 4)

list_num[3] = 7      # lists are mutable, so this assignment works
print(list_num)
tup_num[3] = 7       # tuples are immutable, so this raises a TypeError
Output:
[1,2,5,7]
TypeError: 'tuple' object does not support item assignment
spark=SparkSession.builder.master("local[1]") \
.appName('ProjectPro') \
.getOrCreate()
export SPARK_HOME=/Users/abc/apps/spark-3.0.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%/python;%SPARK_HOME%/python/lib/py4j-0.10.9-src.zip;%PYTHONPATH%
# Import PySpark
import pyspark
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
If you get the error message 'No module named pyspark', try
using findspark instead-
# Install findspark
pip install findspark

# Import findspark
import findspark
findspark.init()

# Import pyspark
import pyspark