This document provides a cheat sheet on RDD (Resilient Distributed Dataset) basics in PySpark. It summarizes common operations for retrieving RDD information, reshaping data through reducing, grouping, and aggregating, and applying mathematical and user-defined functions to RDDs. These include counting elements, retrieving summary statistics, grouping and aggregating keys and values, applying map and flatMap, and set operations such as subtraction.
Python For Data Science Cheat Sheet: PySpark - RDD Basics
Learn Python for data science interactively at www.DataCamp.com

Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark

SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')

Inspect SparkContext
>>> sc.version                  # Retrieve SparkContext version
>>> sc.pythonVer                # Retrieve Python version
>>> sc.master                   # Master URL to connect to
>>> str(sc.sparkHome)           # Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())         # Retrieve name of the Spark user running SparkContext
>>> sc.appName                  # Return application name
>>> sc.applicationId            # Retrieve application ID
>>> sc.defaultParallelism       # Return default level of parallelism
>>> sc.defaultMinPartitions     # Default minimum number of partitions for RDDs

Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf=conf)

Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data

Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]), ("b",["p","r"])])

External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information

Basic Information
>>> rdd.getNumPartitions()      # List the number of partitions
>>> rdd.count()                 # Count RDD instances
3
>>> rdd.countByKey()            # Count RDD instances by key
defaultdict(<type 'int'>, {'a': 2, 'b': 1})
>>> rdd.countByValue()          # Count RDD instances by value
defaultdict(<type 'int'>, {('b',2): 1, ('a',2): 1, ('a',7): 1})
>>> rdd.collectAsMap()          # Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum()                  # Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()    # Check whether RDD is empty
True

Summary
>>> rdd3.max()                  # Maximum value of RDD elements
99
>>> rdd3.min()                  # Minimum value of RDD elements
0
>>> rdd3.mean()                 # Mean value of RDD elements
49.5
>>> rdd3.stdev()                # Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()             # Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)           # Compute histogram by bins
([0, 33, 66, 99], [33, 33, 34])
>>> rdd3.stats()                # Summary statistics (count, mean, stdev, max & min)

Applying Functions

>>> rdd.map(lambda x: x+(x[1],x[0])).collect()      # Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))     # Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect()       # Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting
>>> rdd.collect()               # Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)                 # Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()                 # Take first RDD element
('a', 7)
>>> rdd.top(2)                  # Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect()      # Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering
>>> rdd.filter(lambda x: "a" in x).collect()    # Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect()                   # Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()                        # Return (key,value) RDD's keys
['a', 'a', 'b']

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)              # Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data

Reducing
>>> rdd.reduceByKey(lambda x,y: x+y).collect()      # Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a + b)                  # Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()     # Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                  # Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating
>>> from operator import add
>>> seqOp = (lambda x,y: (x[0]+y, x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))
>>> rdd3.aggregate((0,0), seqOp, combOp)            # Aggregate RDD elements of each partition and then the results
(4950, 100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect()      # Aggregate values of each RDD key
[('a',(9,2)), ('b',(2,1))]
>>> rdd3.fold(0, add)                               # Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect()                 # Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect()             # Create tuples of RDD elements by applying a function
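The aggregate pattern above is easiest to read as a running (sum, count) pair: seqOp folds one element into the per-partition accumulator, and combOp merges the accumulators across partitions. The following is a minimal sketch of turning that result into a mean, reusing the rdd3 defined under Parallelized Collections; the float() cast is an added precaution so the division also works under Python 2.
>>> seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)     # fold one element into (sum, count)
>>> combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])    # merge per-partition (sum, count) pairs
>>> total, count = rdd3.aggregate((0, 0), seqOp, combOp)
>>> (total, count)
(4950, 100)
>>> float(total) / count                                # matches rdd3.mean()
49.5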
Mathematical Operations

>>> rdd.subtract(rdd2).collect()        # Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect()   # Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()       # Return the Cartesian product of rdd and rdd2

Sort

>>> rdd2.sortBy(lambda x: x[1]).collect()   # Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect()              # Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]

Repartitioning

>>> rdd.repartition(4)      # New RDD with 4 partitions
>>> rdd.coalesce(1)         # Decrease the number of partitions in the RDD to 1

Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext

>>> sc.stop()

Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
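Putting the pieces together, the snippets above can be combined into a small standalone script and run with spark-submit. This is a minimal sketch rather than part of the original sheet; the file name rdd_basics_demo.py and the local[2] master are illustrative assumptions.

# rdd_basics_demo.py -- minimal sketch combining operations from this sheet (illustrative)
from operator import add
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("RDD basics demo")   # assumed local master
sc = SparkContext(conf=conf)

rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])    # same sample data as the sheet
print(rdd.reduceByKey(add).collect())                   # [('a', 9), ('b', 2)]
print(rdd.countByKey())                                 # counts per key, e.g. {'a': 2, 'b': 1}
print(sc.parallelize(range(100)).stats())               # count, mean, stdev, max, min

sc.stop()

$ ./bin/spark-submit rdd_basics_demo.py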