-
Notifications
You must be signed in to change notification settings - Fork 158
/
Copy pathnote.json
1 lines (1 loc) · 79.2 KB
/
note.json
1
{"paragraphs":[{"text":"%md\n\n# Spark RDDs and DataFrames with Python\n#### Analyzing a Text File\n##### Level: Beginner\nAuthor: Robert Hryniewicz\nTwitter: @RobHryniewicz\n\nLast updated: Aug 1st, 2016 (ver 0.6)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955022_1905768678","id":"20160331-233830_1876799966","result":{"code":"SUCCESS","type":"HTML","msg":"<h1>Spark RDDs and DataFrames with Python</h1>\n<h4>Analyzing a Text File</h4>\n<h5>Level: Beginner</h5>\n<p>Author: Robert Hryniewicz\n<br />Twitter: @RobHryniewicz</p>\n<p>Last updated: Aug 1st, 2016 (ver 0.6)</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:209"},{"text":"%md\n## Introduction\n\nThis lab consists of two parts. In each section you will perform a basic Word Count.\n#\nIn **Part 1**, we will introduce **RDDs**, Spark's primary low-level abstraction, and several core concepts.\nIn **Part 2**, we will introduce **DataFrames**, a higher-level abstraction than RDDs, along with SparkSQL allowing you to use SQL statements to query a temporary table.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":217,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955022_1905768678","id":"20160331-233830_1038788941","result":{"code":"SUCCESS","type":"HTML","msg":"<h2>Introduction</h2>\n<p>This lab consists of two parts. In each section you will perform a basic Word Count.</p>\n<h1></h1>\n<p>In <strong>Part 1</strong>, we will introduce <strong>RDDs</strong>, Spark's primary low-level abstraction, and several core concepts.\n<br />In <strong>Part 2</strong>, we will introduce <strong>DataFrames</strong>, a higher-level abstraction than RDDs, along with SparkSQL allowing you to use SQL statements to query a temporary table.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:210"},{"text":"%md\n### Concepts\n\nAt the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable and fault-tolerant collection of objects that is partitioned and distributed across multiple physical nodes on a cluster and they run in parallel.\n#\nTypically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.\n#\nOnce an RDD is instantiated, you can apply a **[series of operations](https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations)**.\n#\nAll operations fall into one of two types: **[Transformations](https://spark.apache.org/docs/latest/programming-guide.html#transformations)** or **[Actions](https://spark.apache.org/docs/latest/programming-guide.html#actions)**. \n#\nTransformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied on the partitioned dataset across the YARN cluster. An Action operation, on the other hand, executes DAG and returns a value.\n#\nIn this lab we will use the following **[Transformations](https://spark.apache.org/docs/latest/programming-guide.html#transformations)**:\n- map(func)\n- filter(func)\n- flatMap(func)\n- reduceByKey(func)\n\nand **[Actions](https://spark.apache.org/docs/latest/programming-guide.html#actions)**:\n\n- collect()\n- count()\n- take()\n- takeOrdered(n, [ordering])\n- countByKey()\n\nA typical Spark application has the following four phases:\n1. Instantiate Input RDDs\n2. Transform RDDs\n3. Persist Intermediate RDDs\n4. Take Action on RDDs","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955022_1905768678","id":"20160331-233830_2031164924","result":{"code":"SUCCESS","type":"HTML","msg":"<h3>Concepts</h3>\n<p>At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable and fault-tolerant collection of objects that is partitioned and distributed across multiple physical nodes on a cluster and they run in parallel.</p>\n<h1></h1>\n<p>Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.</p>\n<h1></h1>\n<p>Once an RDD is instantiated, you can apply a <strong><a href=\"https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations\">series of operations</a></strong>.</p>\n<h1></h1>\n<p>All operations fall into one of two types: <strong><a href=\"https://spark.apache.org/docs/latest/programming-guide.html#transformations\">Transformations</a></strong> or <strong><a href=\"https://spark.apache.org/docs/latest/programming-guide.html#actions\">Actions</a></strong>.</p>\n<h1></h1>\n<p>Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied on the partitioned dataset across the YARN cluster. An Action operation, on the other hand, executes DAG and returns a value.</p>\n<h1></h1>\n<p>In this lab we will use the following <strong><a href=\"https://spark.apache.org/docs/latest/programming-guide.html#transformations\">Transformations</a></strong>:</p>\n<ul>\n<li>map(func)</li>\n<li>filter(func)</li>\n<li>flatMap(func)</li>\n<li>reduceByKey(func)</li>\n</ul>\n<p>and <strong><a href=\"https://spark.apache.org/docs/latest/programming-guide.html#actions\">Actions</a></strong>:</p>\n<ul>\n<li>collect()</li>\n<li>count()</li>\n<li>take()</li>\n<li>takeOrdered(n, [ordering])</li>\n<li>countByKey()</li>\n</ul>\n<p>A typical Spark application has the following four phases:</p>\n<ol>\n<li>Instantiate Input RDDs</li>\n<li>Transform RDDs</li>\n<li>Persist Intermediate RDDs</li>\n<li>Take Action on RDDs</li>\n</ol>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:211"},{"text":"%md\n### Lab Pre-Check\nBefore we proceed, let's verify Spark Version. You should be running at minimum Spark 1.6.\n#\n**Note**: The first time you run `sc.version` in the paragraph below, several services will initialize in the background. This may take **1~2 min** so please **be patient**. Afterwards, each paragraph should run much more quickly since all the services will already be running.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955023_1905383929","id":"20160331-233830_1388824956","result":{"code":"SUCCESS","type":"HTML","msg":"<h3>Lab Pre-Check</h3>\n<p>Before we proceed, let's verify Spark Version. You should be running at minimum Spark 1.6.</p>\n<h1></h1>\n<p><strong>Note</strong>: The first time you run <code>sc.version</code> in the paragraph below, several services will initialize in the background. This may take <strong>1~2 min</strong> so please <strong>be patient</strong>. Afterwards, each paragraph should run much more quickly since all the services will already be running.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:212"},{"text":"%md\nTo run a paragraph in a Zeppelin notebook you can either click the `play` button (blue triangle) on the right-hand side or simply press `Shift + Enter`.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955023_1905383929","id":"20160331-233830_981276249","result":{"code":"SUCCESS","type":"HTML","msg":"<p>To run a paragraph in a Zeppelin notebook you can either click the <code>play</code> button (blue triangle) on the right-hand side or simply press <code>Shift + Enter</code>.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:213"},{"title":"Check Spark Version","text":"sc.version","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955023_1905383929","id":"20160331-233830_1782991630","result":{"code":"SUCCESS","type":"TEXT","msg":"res49: String = 1.6.2\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:214"},{"text":"%md ####Now let's proceed with our core lab.\n","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955023_1905383929","id":"20160331-233830_122830635","result":{"code":"SUCCESS","type":"HTML","msg":"<h4>Now let's proceed with our core lab.</h4>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:215"},{"text":"%md \n\n## Part 1\n#### Introduction to RDDs with Word Count example","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955023_1905383929","id":"20160331-233830_682697678","result":{"code":"SUCCESS","type":"HTML","msg":"<h2>Part 1</h2>\n<h4>Introduction to RDDs with Word Count example</h4>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:216"},{"text":"%md\nIn this section you will perform a basic word count with RDDs.\n#\nYou will download external text data file to your sandbox. Then you will perform lexical analysis, or tokenization, by breaking up text into words/tokens.\nThe list of tokens then becomes an input for further processing to this and following sections.\n#\nBy the end of this section you should have learned how to perform low-level transformations and actions with Spark RDDs and lambda expressions.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_94748225","result":{"code":"SUCCESS","type":"HTML","msg":"<p>In this section you will perform a basic word count with RDDs.</p>\n<h1></h1>\n<p>You will download external text data file to your sandbox. Then you will perform lexical analysis, or tokenization, by breaking up text into words/tokens.\n<br />The list of tokens then becomes an input for further processing to this and following sections.</p>\n<h1></h1>\n<p>By the end of this section you should have learned how to perform low-level transformations and actions with Spark RDDs and lambda expressions.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:217"},{"text":"%md\nIn the next paragraph we are going to download data using shell commands. A shell command in a Zeppelin notebookcan can be invoked by \nprepending a block of shell commands with a line containing `%sh` characters.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_1148035148","result":{"code":"SUCCESS","type":"HTML","msg":"<p>In the next paragraph we are going to download data using shell commands. A shell command in a Zeppelin notebookcan can be invoked by\n<br />prepending a block of shell commands with a line containing <code>%sh</code> characters.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:218"},{"title":"Prepare Directories and Download a Dataset","text":"%sh\ncd /tmp\n\n# Remove old dataset file if already exists in local /tmp directory\nif [ -e /tmp/About-Apache-NiFi.txt ]\nthen\n rm -f /tmp/About-Apache-NiFi.txt\nfi\n\n# Remove old dataset if already exists in hadoop /tmp directory\nif hadoop fs -stat /tmp/About-Apache-NiFi.txt\nthen\n hadoop fs -rm /tmp/About-Apache-NiFi.txt\nfi\n\n# Download \"About-Apache-NiFi\" text file\nwget https://raw.githubusercontent.com/roberthryniewicz/datasets/master/About-Apache-NiFi.txt\n\n# Move dataset to hadoop /tmp\nhadoop fs -put About-Apache-NiFi.txt /tmp","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/sh","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_2033647788","result":{"code":"SUCCESS","type":"TEXT","msg":"2016-08-11 00:33:39\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:219"},{"title":"Preview Downloaded Text File","text":"%sh\nhadoop fs -cat /tmp/About-Apache-NiFi.txt | head","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/sh","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_168647264","result":{"code":"SUCCESS","type":"TEXT","msg":"Apache NiFi Overview \nTeam [email protected] \n\nWhat is Apache NiFi? Put simply NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we’ll use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip]. \n\nSome of the high-level challenges of dataflow include: \n\nSystems fail \nNetworks fail, disks fail, software crashes, people make mistakes. \n\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:220"},{"text":"%md\nNext we are going to run Spark Python (or PySpark) that can be invoked by prepending a block of Python code with a line containing `%pyspark`.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_1923635655","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Next we are going to run Spark Python (or PySpark) that can be invoked by prepending a block of Python code with a line containing <code>%pyspark</code>.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:221"},{"text":"%md\n#\nThe important thing to notice in the next paragraph is the `sc` object or Spark Context. Spark Context is automatically created by your driver program in Zeppelin.\n#\nSpark Context is the main entry point for Spark functionality. A Spark Context represents the connection to a Spark cluster, and can be used to create RDDs, which we will do next.\n#\nRemember that Spark doesn't have any storage layer, rather it has connectors to HDFS, S3, Cassandra, HBase, Hive etc. to bring data into memory. Thus, in the next paragraph you will read data (that you've just downloaded) from HDFS.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_178148000","result":{"code":"SUCCESS","type":"HTML","msg":"<h1></h1>\n<p>The important thing to notice in the next paragraph is the <code>sc</code> object or Spark Context. Spark Context is automatically created by your driver program in Zeppelin.</p>\n<h1></h1>\n<p>Spark Context is the main entry point for Spark functionality. A Spark Context represents the connection to a Spark cluster, and can be used to create RDDs, which we will do next.</p>\n<h1></h1>\n<p>Remember that Spark doesn't have any storage layer, rather it has connectors to HDFS, S3, Cassandra, HBase, Hive etc. to bring data into memory. Thus, in the next paragraph you will read data (that you've just downloaded) from HDFS.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:222"},{"title":"Read Text File from HDFS and Preview its Contents","text":"%pyspark\n\n# Parallelize text file using pre-initialized Spark context (sc)\nlines = sc.textFile(\"/tmp/About-Apache-NiFi.txt\")\n\n# Take a look at a few lines with a take() action.\nprint lines.take(4)\n\n# Output: Notice that each line has been placed in a seperate array bucket.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"tableHide":false,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_541232082","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'Apache NiFi Overview ', u'Team [email protected] ', u'', u'What is Apache NiFi? Put simply NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we\\u2019ll use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip]. ']\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:223"},{"text":"%md\nIn the next paragraphs we will start using Python lambda (or anonymous) functions. If you're unfamiliar with lambda expressions, \nreview **[Python Lambda Expressions](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions)** before proceeding.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_1894357129","result":{"code":"SUCCESS","type":"HTML","msg":"<p>In the next paragraphs we will start using Python lambda (or anonymous) functions. If you're unfamiliar with lambda expressions,\n<br />review <strong><a href=\"https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions\">Python Lambda Expressions</a></strong> before proceeding.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:224"},{"title":"Extract All Words from the Document","text":"%pyspark\n# Here we're tokenizing our text file by using the split() function. Each original line of text is split into words or tokens on a single space.\n# Also, since each line of the original text occupies a seperate bucket in the array, we need to use\n# a flatMap() transformation to flatten all buckets into a asingle/flat array of tokens.\n\nwords = lines.flatMap(lambda line: line.split(\" \"))","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955024_1915772149","id":"20160331-233830_2015200328","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:225"},{"text":"%md\nNote that after you click 'play' in the paragraph above \"nothing\" appears to happen.\n#\nThat's because `flatMap()` is a transformation and all transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.\n#\nBy default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_1507315859","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Note that after you click 'play' in the paragraph above “nothing” appears to happen.</p>\n<h1></h1>\n<p>That's because <code>flatMap()</code> is a transformation and all transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.</p>\n<h1></h1>\n<p>By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:226"},{"title":"Take a look at first 100 words","text":"%pyspark\nprint words.take(100) # we're using a take(n) action\n\n# Output: As you can see, each word occupies a distinc array bucket.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_1740542201","result":{"code":"SUCCESS","type":"TEXT","msg":"[u'Apache', u'NiFi', u'Overview', u'', u'Team', u'[email protected]', u'', u'', u'What', u'is', u'Apache', u'NiFi?', u'Put', u'simply', u'NiFi', u'was', u'built', u'to', u'automate', u'the', u'flow', u'of', u'data', u'between', u'systems.', u'While', u'the', u'term', u'dataflow', u'is', u'used', u'in', u'a', u'variety', u'of', u'contexts,', u'we\\u2019ll', u'use', u'it', u'here', u'to', u'mean', u'the', u'automated', u'and', u'managed', u'flow', u'of', u'information', u'between', u'systems.', u'This', u'problem', u'space', u'has', u'been', u'around', u'ever', u'since', u'enterprises', u'had', u'more', u'than', u'one', u'system,', u'where', u'some', u'of', u'the', u'systems', u'created', u'data', u'and', u'some', u'of', u'the', u'systems', u'consumed', u'data.', u'The', u'problems', u'and', u'solution', u'patterns', u'that', u'emerged', u'have', u'been', u'discussed', u'and', u'articulated', u'extensively.', u'A', u'comprehensive', u'and', u'readily', u'consumed', u'form', u'is', u'found']\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:227"},{"title":"Remove Empty Words","text":"%pyspark\n\nwordsFiltered = words.filter(lambda w: len(w) > 0)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_270532773","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:228"},{"title":"Get Total Number of Words","text":"%pyspark\n\nprint wordsFiltered.count() # using a count() action","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_229739488","result":{"code":"SUCCESS","type":"TEXT","msg":"2517\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:229"},{"text":"%md\n#### Word Counts\n\nLet's see what are the most popular words by performing a word count using `map()` and `reduceByKey()` transformations to create tuples of type (word, count).","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_55977510","result":{"code":"SUCCESS","type":"HTML","msg":"<h4>Word Counts</h4>\n<p>Let's see what are the most popular words by performing a word count using <code>map()</code> and <code>reduceByKey()</code> transformations to create tuples of type (word, count).</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:230"},{"title":"Word count with a RDD","text":"%pyspark\n\nwordCounts = wordsFiltered.map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"tableHide":false,"title":false,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_216173184","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:231"},{"text":"%md\n#### View Word Count Tuples\nNow let's take a look at top 100 words in descending order with a `takeOrdered()` action.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_1029129342","result":{"code":"SUCCESS","type":"HTML","msg":"<h4>View Word Count Tuples</h4>\n<p>Now let's take a look at top 100 words in descending order with a <code>takeOrdered()</code> action.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:232"},{"text":"%pyspark\nprint wordCounts.takeOrdered(100, lambda (w,c): -c)\n","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_743558056","result":{"code":"SUCCESS","type":"TEXT","msg":"[(u'the', 110), (u'of', 94), (u'and', 89), (u'to', 84), (u'is', 62), (u'a', 60), (u'NiFi', 41), (u'as', 32), (u'The', 28), (u'be', 26), (u'in', 25), (u'are', 22), (u'it', 22), (u'data', 20), (u'that', 20), (u'for', 19), (u'can', 19), (u'or', 19), (u'on', 17), (u'system', 16), (u'which', 14), (u'dataflow', 12), (u'will', 11), (u'flow', 11), (u'more', 11), (u'at', 11), (u'FlowFile', 9), (u'given', 9), (u'Flow', 9), (u'one', 9), (u'very', 8), (u'content', 8), (u'This', 8), (u'with', 8), (u'some', 8), (u'within', 8), (u'all', 7), (u'repository', 7), (u'use', 7), (u'A', 7), (u'Controller', 7), (u'where', 7), (u'how', 7), (u'then', 7), (u'other', 7), (u'even', 6), (u'through', 6), (u'Repository', 6), (u'make', 6), (u'well', 6), (u'each', 6), (u'their', 6), (u'between', 6), (u'an', 6), (u'threads', 5), (u'change', 5), (u'allow', 5), (u'they', 5), (u'For', 5), (u'Data', 5), (u'these', 5), (u'processes', 5), (u'flows', 5), (u'specific', 5), (u'default', 5), (u'becomes', 5), (u'designed', 5), (u'there', 5), (u'also', 5), (u'should', 5), (u'many', 5), (u'point', 5), (u'cluster', 5), (u'by', 5), (u'those', 4), (u'design', 4), (u'These', 4), (u'when', 4), (u'extensions', 4), (u'effective', 4), (u'so', 4), (u'have', 4), (u'able', 4), (u'Nodes', 4), (u'only', 4), (u'been', 4), (u'components', 4), (u'NiFi\\u2019s', 4), (u'JVM', 4), (u'host', 4), (u'about', 4), (u'extension', 4), (u'Processors', 4), (u'new', 4), (u'such', 4), (u'NCM', 4), (u'its', 4), (u'systems', 4), (u'Architecture', 4), (u'provenance', 4)]\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:233"},{"text":"%md\n#### Filter out infrequent words\nWe'll use `filter()` transformation to filter out words that occur less than five times.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955025_1915387400","id":"20160331-233830_772905299","result":{"code":"SUCCESS","type":"HTML","msg":"<h4>Filter out infrequent words</h4>\n<p>We'll use <code>filter()</code> transformation to filter out words that occur less than five times.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:234"},{"text":"%pyspark\n\nfilteredWordCounts = wordCounts.filter(lambda (w,c): c >= 5)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_90779590","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:235"},{"title":"Take a Look at Results","text":"%pyspark\n\nprint filteredWordCounts.collect() # we're using a collect() action to pull everything back to the Spark driver","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_1024657848","result":{"code":"SUCCESS","type":"TEXT","msg":"[(u'all', 7), (u'very', 8), (u'even', 6), (u'repository', 7), (u'FlowFile', 9), (u'threads', 5), (u'change', 5), (u'use', 7), (u'A', 7), (u'data', 20), (u'a', 60), (u'allow', 5), (u'through', 6), (u'they', 5), (u'content', 8), (u'This', 8), (u'given', 9), (u'For', 5), (u'Repository', 6), (u'Data', 5), (u'and', 89), (u'these', 5), (u'which', 14), (u'The', 28), (u'NiFi', 41), (u'with', 8), (u'Controller', 7), (u'processes', 5), (u'where', 7), (u'Flow', 9), (u'will', 11), (u'is', 62), (u'make', 6), (u'flows', 5), (u'well', 6), (u'the', 110), (u'specific', 5), (u'some', 8), (u'for', 19), (u'dataflow', 12), (u'default', 5), (u'flow', 11), (u'as', 32), (u'to', 84), (u'be', 26), (u'more', 11), (u'becomes', 5), (u'can', 19), (u'how', 7), (u'designed', 5), (u'or', 19), (u'then', 7), (u'each', 6), (u'there', 5), (u'one', 9), (u'system', 16), (u'their', 6), (u'that', 20), (u'also', 5), (u'should', 5), (u'are', 22), (u'between', 6), (u'many', 5), (u'point', 5), (u'it', 22), (u'cluster', 5), (u'in', 25), (u'by', 5), (u'on', 17), (u'of', 94), (u'within', 8), (u'an', 6), (u'at', 11), (u'other', 7)]\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:236"},{"title":"","text":"%md\nNow let's use `countByKey()` action for another way of returning a word count.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":false,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_753086043","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Now let's use <code>countByKey()</code> action for another way of returning a word count.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:237"},{"text":"%pyspark\n\nresult = words.map(lambda w: (w,1)).countByKey()\n\n# Print type of data structure\nprint type(result)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_1995992930","result":{"code":"ERROR","type":"TEXT","msg":"Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 4 times, most recent failure: Lost task 0.3 in stage 46.0 (TID 1458, sandbox.hortonworks.com): java.io.FileNotFoundException: File does not exist: /tmp/About-Apache-NiFi.txt\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1860)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1831)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1744)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)\n\n\tat sun.reflect.GeneratedConstructorAccessor16.newInstance(Unknown Source)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:526)\n\tat org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)\n\tat org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)\n\tat org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)\n\tat org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1223)\n\tat org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1211)\n\tat org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)\n\tat org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)\n\tat org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)\n\tat org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1536)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)\n\tat org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:325)\n\tat org.apache.hadoop.fs.FileSystem.open(FileSystem.java:782)\n\tat org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)\n\tat org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)\n\tat org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)\n\tat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)\n\tat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:89)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/About-Apache-NiFi.txt\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1860)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1831)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1744)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)\n\n\tat org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)\n\tat org.apache.hadoop.ipc.Client.call(Client.java:1496)\n\tat org.apache.hadoop.ipc.Client.call(Client.java:1396)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)\n\tat com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:270)\n\tat sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)\n\tat com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)\n\tat org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1236)\n\t... 30 more\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)\n\tat scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)\n\tat scala.Option.foreach(Option.scala:236)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:1953)\n\tat org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:933)\n\tat org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)\n\tat org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)\n\tat py4j.Gateway.invoke(Gateway.java:259)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:209)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: java.io.FileNotFoundException: File does not exist: /tmp/About-Apache-NiFi.txt\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1860)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1831)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1744)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)\n\n\tat sun.reflect.GeneratedConstructorAccessor16.newInstance(Unknown Source)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:526)\n\tat org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)\n\tat org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)\n\tat org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)\n\tat org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1223)\n\tat org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1211)\n\tat org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)\n\tat org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)\n\tat org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)\n\tat org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1536)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)\n\tat org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)\n\tat org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:325)\n\tat org.apache.hadoop.fs.FileSystem.open(FileSystem.java:782)\n\tat org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)\n\tat org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)\n\tat org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)\n\tat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)\n\tat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:277)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:89)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\t... 1 more\nCaused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/About-Apache-NiFi.txt\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1860)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1831)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1744)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)\n\n\tat org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)\n\tat org.apache.hadoop.ipc.Client.call(Client.java:1496)\n\tat org.apache.hadoop.ipc.Client.call(Client.java:1396)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)\n\tat com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:270)\n\tat sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)\n\tat org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)\n\tat com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)\n\tat org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1236)\n\t... 30 more\n\n(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\\n', JavaObject id=o140), <traceback object at 0x3257c20>)"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:238"},{"text":"%md\n**Note** that the **result** is an **unordered dictionary of type {word, count}**.\nSince this is a small set we can apply a simple (non-parallelizeable) python built-in function.\n","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_811124723","result":{"code":"SUCCESS","type":"HTML","msg":"<p><strong>Note</strong> that the <strong>result</strong> is an <strong>unordered dictionary of type {word, count}</strong>.\n<br />Since this is a small set we can apply a simple (non-parallelizeable) python built-in function.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:239"},{"text":"%md\nTake a look at first 20 items in our dictionary.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_347028305","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Take a look at first 20 items in our dictionary.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:240"},{"text":"%pyspark\n# Print first 20 items\nprint result.items()[0:20]","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_2086620530","result":{"code":"ERROR","type":"TEXT","msg":"Traceback (most recent call last):\n File \"/tmp/zeppelin_pyspark-7062069166207434177.py\", line 239, in <module>\n eval(compiledCode)\n File \"<string>\", line 1, in <module>\nNameError: name 'result' is not defined\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:241"},{"text":"%md\nApply a python `sorted()` function on the **result** dictionary values.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955026_1916541647","id":"20160331-233830_1423292200","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Apply a python <code>sorted()</code> function on the <strong>result</strong> dictionary values.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:242"},{"text":"%pyspark\nimport operator\n\n# Sort in descending order\nsortedResult = sorted(result.items(), key=operator.itemgetter(1), reverse=True)\n\n# Print top 20 items\nprint sortedResult[0:20]","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_451661467","result":{"code":"ERROR","type":"TEXT","msg":"Traceback (most recent call last):\n File \"/tmp/zeppelin_pyspark-7062069166207434177.py\", line 239, in <module>\n eval(compiledCode)\n File \"<string>\", line 2, in <module>\nNameError: name 'result' is not defined\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:243"},{"text":"%md\n## Part 2\n#### Introduction to DataFrames and SparkSQL","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_1867067371","result":{"code":"SUCCESS","type":"HTML","msg":"<h2>Part 2</h2>\n<h4>Introduction to DataFrames and SparkSQL</h4>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:244"},{"text":"%md\nIn this section we will cover the concept of a DataFrame. You will convert RDDs from a previous section and then use higher level \noperations to demonstrate a different way of counting words. Then you will register a temporary table and perform a word count by \nexecuting a SQL query on that table.\n#\nBy the end of the section you will have learned higher-level Spark abstractions that hide lower-level details, speed up prototyping and execution. ","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_770254433","result":{"code":"SUCCESS","type":"HTML","msg":"<p>In this section we will cover the concept of a DataFrame. You will convert RDDs from a previous section and then use higher level\n<br />operations to demonstrate a different way of counting words. Then you will register a temporary table and perform a word count by\n<br />executing a SQL query on that table.</p>\n<h1></h1>\n<p>By the end of the section you will have learned higher-level Spark abstractions that hide lower-level details, speed up prototyping and execution.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:245"},{"title":"DataFrame","text":"%md\nA DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. [See SparkSQL docs for more info](http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes).","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_634831315","result":{"code":"SUCCESS","type":"HTML","msg":"<p>A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. <a href=\"http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes\">See SparkSQL docs for more info</a>.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:246"},{"text":"%md\nTransform your RDD into a DataFrame and perform DataFrame specific operations.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_911152909","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Transform your RDD into a DataFrame and perform DataFrame specific operations.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:247"},{"title":"Word Count with a DataFrame","text":"%pyspark\n\n# First, let's transform our RDD to a DataFrame.\n# We will use a Row to define column names.\nwordsCountsDF = (filteredWordCounts.map(lambda (w, c): \n Row(word=w,\n count=c))\n .toDF())\n\n# Print schema\nwordsCountsDF.printSchema()\n\n# Output: As you can see, the count and word types have been inferred without having to explicitly define long and string types respectively.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_41054806","result":{"code":"SUCCESS","type":"TEXT","msg":"root\n |-- count: long (nullable = true)\n |-- word: string (nullable = true)\n\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:248"},{"title":"Show top 20 rows","text":"%pyspark\n\n# Show top 20 rows\nwordsCountsDF.show()","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_665873755","result":{"code":"SUCCESS","type":"TEXT","msg":"+-----+----------+\n|count| word|\n+-----+----------+\n| 7| all|\n| 8| very|\n| 6| even|\n| 7|repository|\n| 9| FlowFile|\n| 5| threads|\n| 5| change|\n| 7| use|\n| 7| A|\n| 20| data|\n| 60| a|\n| 5| allow|\n| 6| through|\n| 5| they|\n| 8| content|\n| 8| This|\n| 9| given|\n| 5| For|\n| 6|Repository|\n| 5| Data|\n+-----+----------+\nonly showing top 20 rows\n\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:249"},{"title":"Register a Temp Table","text":"%pyspark\n\nwordsCountsDF.registerTempTable(\"word_counts\")","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_802915768","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:250"},{"text":"%md\nNow we can query the temporary `word_counts` table with a SQL statement.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955027_1916156898","id":"20160331-233830_1965558675","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Now we can query the temporary <code>word_counts</code> table with a SQL statement.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:251"},{"text":"%md\nTo execute a SparkSQL query we prepend a block of SQL code with a `%sql` line.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_403708924","result":{"code":"SUCCESS","type":"HTML","msg":"<p>To execute a SparkSQL query we prepend a block of SQL code with a <code>%sql</code> line.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:252"},{"title":"","text":"%sql\n\n-- Display word counts in descending order\nSELECT word, count FROM word_counts ORDER BY count DESC","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"tableHide":false,"title":false,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"word","index":0,"aggr":"sum"}],"values":[],"groups":[],"scatter":{"xAxis":{"name":"word","index":0,"aggr":"sum"}}},"editorMode":"ace/mode/sql","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_1235044795","result":{"code":"SUCCESS","type":"TABLE","msg":"word\tcount\nthe\t110\nof\t94\nand\t89\nto\t84\nis\t62\na\t60\nNiFi\t41\nas\t32\nThe\t28\nbe\t26\nin\t25\nare\t22\nit\t22\ndata\t20\nthat\t20\ncan\t19\nfor\t19\nor\t19\non\t17\nsystem\t16\nwhich\t14\ndataflow\t12\nflow\t11\nmore\t11\nwill\t11\nat\t11\nFlow\t9\ngiven\t9\none\t9\nFlowFile\t9\nvery\t8\nThis\t8\nsome\t8\ncontent\t8\nwith\t8\nwithin\t8\nA\t7\nother\t7\nController\t7\nall\t7\nthen\t7\nhow\t7\nwhere\t7\nuse\t7\nrepository\t7\nthrough\t6\nbetween\t6\nmake\t6\nan\t6\neven\t6\neach\t6\ntheir\t6\nRepository\t6\nwell\t6\nthreads\t5\nchange\t5\nthere\t5\nallow\t5\nprocesses\t5\nmany\t5\nalso\t5\nshould\t5\nthey\t5\nflows\t5\nbecomes\t5\npoint\t5\nFor\t5\nData\t5\nspecific\t5\ndesigned\t5\ncluster\t5\nby\t5\nthese\t5\ndefault\t5\n","comment":"","msgTable":[[{"key":"count","value":"the"},{"key":"count","value":"110"}],[{"value":"of"},{"value":"94"}],[{"value":"and"},{"value":"89"}],[{"value":"to"},{"value":"84"}],[{"value":"is"},{"value":"62"}],[{"value":"a"},{"value":"60"}],[{"value":"NiFi"},{"value":"41"}],[{"value":"as"},{"value":"32"}],[{"value":"The"},{"value":"28"}],[{"value":"be"},{"value":"26"}],[{"value":"in"},{"value":"25"}],[{"value":"are"},{"value":"22"}],[{"value":"it"},{"value":"22"}],[{"value":"data"},{"value":"20"}],[{"value":"that"},{"value":"20"}],[{"value":"can"},{"value":"19"}],[{"value":"for"},{"value":"19"}],[{"value":"or"},{"value":"19"}],[{"value":"on"},{"value":"17"}],[{"value":"system"},{"value":"16"}],[{"value":"which"},{"value":"14"}],[{"value":"dataflow"},{"value":"12"}],[{"value":"flow"},{"value":"11"}],[{"value":"more"},{"value":"11"}],[{"value":"will"},{"value":"11"}],[{"value":"at"},{"value":"11"}],[{"value":"Flow"},{"value":"9"}],[{"value":"given"},{"value":"9"}],[{"value":"one"},{"value":"9"}],[{"value":"FlowFile"},{"value":"9"}],[{"value":"very"},{"value":"8"}],[{"value":"This"},{"value":"8"}],[{"value":"some"},{"value":"8"}],[{"value":"content"},{"value":"8"}],[{"value":"with"},{"value":"8"}],[{"value":"within"},{"value":"8"}],[{"value":"A"},{"value":"7"}],[{"value":"other"},{"value":"7"}],[{"value":"Controller"},{"value":"7"}],[{"value":"all"},{"value":"7"}],[{"value":"then"},{"value":"7"}],[{"value":"how"},{"value":"7"}],[{"value":"where"},{"value":"7"}],[{"value":"use"},{"value":"7"}],[{"value":"repository"},{"value":"7"}],[{"value":"through"},{"value":"6"}],[{"value":"between"},{"value":"6"}],[{"value":"make"},{"value":"6"}],[{"value":"an"},{"value":"6"}],[{"value":"even"},{"value":"6"}],[{"value":"each"},{"value":"6"}],[{"value":"their"},{"value":"6"}],[{"value":"Repository"},{"value":"6"}],[{"value":"well"},{"value":"6"}],[{"value":"threads"},{"value":"5"}],[{"value":"change"},{"value":"5"}],[{"value":"there"},{"value":"5"}],[{"value":"allow"},{"value":"5"}],[{"value":"processes"},{"value":"5"}],[{"value":"many"},{"value":"5"}],[{"value":"also"},{"value":"5"}],[{"value":"should"},{"value":"5"}],[{"value":"they"},{"value":"5"}],[{"value":"flows"},{"value":"5"}],[{"value":"becomes"},{"value":"5"}],[{"value":"point"},{"value":"5"}],[{"value":"For"},{"value":"5"}],[{"value":"Data"},{"value":"5"}],[{"value":"specific"},{"value":"5"}],[{"value":"designed"},{"value":"5"}],[{"value":"cluster"},{"value":"5"}],[{"value":"by"},{"value":"5"}],[{"value":"these"},{"value":"5"}],[{"value":"default"},{"value":"5"}]],"columnNames":[{"name":"word","index":0,"aggr":"sum"},{"name":"count","index":1,"aggr":"sum"}],"rows":[["the","110"],["of","94"],["and","89"],["to","84"],["is","62"],["a","60"],["NiFi","41"],["as","32"],["The","28"],["be","26"],["in","25"],["are","22"],["it","22"],["data","20"],["that","20"],["can","19"],["for","19"],["or","19"],["on","17"],["system","16"],["which","14"],["dataflow","12"],["flow","11"],["more","11"],["will","11"],["at","11"],["Flow","9"],["given","9"],["one","9"],["FlowFile","9"],["very","8"],["This","8"],["some","8"],["content","8"],["with","8"],["within","8"],["A","7"],["other","7"],["Controller","7"],["all","7"],["then","7"],["how","7"],["where","7"],["use","7"],["repository","7"],["through","6"],["between","6"],["make","6"],["an","6"],["even","6"],["each","6"],["their","6"],["Repository","6"],["well","6"],["threads","5"],["change","5"],["there","5"],["allow","5"],["processes","5"],["many","5"],["also","5"],["should","5"],["they","5"],["flows","5"],["becomes","5"],["point","5"],["For","5"],["Data","5"],["specific","5"],["designed","5"],["cluster","5"],["by","5"],["these","5"],["default","5"]]},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:253"},{"text":"%md\nNow let's take a step back and perform a word count with SQL","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_1968421310","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Now let's take a step back and perform a word count with SQL</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:254"},{"title":"Convert RDD to a DataFrame and Register a New Temp Table","text":"%pyspark\n\n# Convert wordsFiltered RDD to a Data Frame\nwordsDF = wordsFiltered.map(lambda w: Row(word=w, count=1)).toDF()","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_1271375135","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:255"},{"title":"Use DataFrame Specific Functions to Determine Word Counts","text":"%pyspark\n\n(wordsDF.groupBy(\"word\")\n .sum()\n .orderBy(\"sum(count)\", ascending=0)\n .limit(10).show())","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_539606295","result":{"code":"SUCCESS","type":"TEXT","msg":"+----+----------+\n|word|sum(count)|\n+----+----------+\n| the| 110|\n| of| 94|\n| and| 89|\n| to| 84|\n| is| 62|\n| a| 60|\n|NiFi| 41|\n| as| 32|\n| The| 28|\n| be| 26|\n+----+----------+\n\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:256"},{"title":"Register as Temp Table","text":"%pyspark\n\n# Register as Temp Table\nwordsDF.registerTempTable(\"words\")","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_339558784","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:257"},{"title":"Word Count using SQL","text":"%md\n\nNow let's do a word count using a SQL statement against the `words` table and order the results in a descending order by count.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_1100432609","result":{"code":"SUCCESS","type":"HTML","msg":"<p>Now let's do a word count using a SQL statement against the <code>words</code> table and order the results in a descending order by count.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:258"},{"text":"%sql\n\nSELECT word, count(*) as count FROM words GROUP BY word ORDER BY count DESC LIMIT 10","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"tableHide":false,"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"word","index":0,"aggr":"sum"}],"values":[{"name":"count","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"word","index":0,"aggr":"sum"},"yAxis":{"name":"count","index":1,"aggr":"sum"}}},"editorMode":"ace/mode/sql","editorHide":false,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955028_1914233154","id":"20160331-233830_841691499","result":{"code":"SUCCESS","type":"TABLE","msg":"word\tcount\nthe\t110\nof\t94\nand\t89\nto\t84\nis\t62\na\t60\nNiFi\t41\nas\t32\nThe\t28\nbe\t26\n","comment":"","msgTable":[[{"key":"count","value":"the"},{"key":"count","value":"110"}],[{"value":"of"},{"value":"94"}],[{"value":"and"},{"value":"89"}],[{"value":"to"},{"value":"84"}],[{"value":"is"},{"value":"62"}],[{"value":"a"},{"value":"60"}],[{"value":"NiFi"},{"value":"41"}],[{"value":"as"},{"value":"32"}],[{"value":"The"},{"value":"28"}],[{"value":"be"},{"value":"26"}]],"columnNames":[{"name":"word","index":0,"aggr":"sum"},{"name":"count","index":1,"aggr":"sum"}],"rows":[["the","110"],["of","94"],["and","89"],["to","84"],["is","62"],["a","60"],["NiFi","41"],["as","32"],["The","28"],["be","26"]]},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:259"},{"title":"The End","text":"%md\nYou've reached the end of this lab! We hope you've been able to successfully complete all the sections and learned a thing or two about Spark.","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"title":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955029_1913848405","id":"20160331-233830_293992216","result":{"code":"SUCCESS","type":"HTML","msg":"<p>You've reached the end of this lab! We hope you've been able to successfully complete all the sections and learned a thing or two about Spark.</p>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:260"},{"text":"%md\n### Additional Resources\nThis is just the beggining of your journey with Spark. Make sure to checkout these additional useful resources:\n\n1. [Hortonworks Community Connection](https://hortonworks.com/community/)\n2. [pySpark Reference Guide](https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html)","dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"tableHide":false,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/markdown","editorHide":true,"colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955029_1913848405","id":"20160331-233830_1914786212","result":{"code":"SUCCESS","type":"HTML","msg":"<h3>Additional Resources</h3>\n<p>This is just the beggining of your journey with Spark. Make sure to checkout these additional useful resources:</p>\n<ol>\n<li><a href=\"https://hortonworks.com/community/\">Hortonworks Community Connection</a></li>\n<li><a href=\"https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html\">pySpark Reference Guide</a></li>\n</ol>\n"},"dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:261"},{"dateUpdated":"2016-11-18T06:52:35+0000","config":{"enabled":true,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"editorMode":"ace/mode/scala","colWidth":12},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1479451955029_1913848405","id":"20160331-233830_200815067","dateCreated":"2016-11-18T06:52:35+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:262"}],"name":"Labs / Spark 1.6.x / Data Worker / Python / 101 - Intro to Spark","id":"2C23PDD5H","angularObjects":{"2C25F5DA7:shared_process":[],"2C1YXUAYR:shared_process":[],"2C3W3GZQG:shared_process":[],"2C1VGS774:shared_process":[],"2C3W5NK6J:shared_process":[],"2C3A21492:shared_process":[]},"config":{"looknfeel":"default"},"info":{}}