HDP Developer-Enterprise Spark 1-Python Lab Guide-Rev 1
Enterprise Spark 1
Rev 1
Copyright © 2012 - 2016 Hortonworks, Inc. All rights reserved.
The contents of this course and all its lessons and related materials, including handouts to
audience members, are Copyright © 2012 - 2015 Hortonworks, Inc.
No part of this publication may be stored in a retrieval system, transmitted or reproduced in any
way, including, but not limited to, photocopy, photograph, magnetic, electronic or other record,
without the prior written permission of Hortonworks, Inc.
This instructional program, including all material provided herein, is supplied without any
guarantees from Hortonworks, Inc. Hortonworks, Inc. assumes no liability for damages or legal
action arising from the use or misuse of contents or details contained herein.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
• HDP Certified Developer: for Hadoop developers using frameworks like Pig, Hive, Sqoop and
Flume.
• HDP Certified Administrator: for Hadoop administrators who deploy and manage Hadoop
clusters.
• HDP Certified Developer: Java: for Hadoop developers who design, develop and architect
Hadoop-based solutions written in the Java programming language.
• HDP Certified Developer: Spark: for Hadoop developers who write and deploy applications for
the Spark framework.
How to Register: Visit www.examslocal.com and search for “Hortonworks” to register for an
exam. The cost of each exam is $250 USD, and you can take the exam anytime, anywhere
using your own computer. For more details, including a list of exam objectives and instructions
on how to attempt our practice exams, visit http://hortonworks.com/training/certification/
Earn Digital Badges: Hortonworks Certified Professionals receive a digital badge for each
certification earned. Display your badges proudly on your résumé, LinkedIn profile, email
signature, etc.
On Demand Learning
Hortonworks University courses are designed and developed by Hadoop experts and
provide an immersive and valuable real world experience. In our scenario-based training
courses, we offer unmatched depth and expertise. We prepare you to be an expert with highly valued, practical skills and to successfully complete Hortonworks Technical Certifications.
The online library accelerates time to Hadoop competency. In addition, the content is constantly being expanded with new material.
Visit: http://hortonworks.com/training/class/hortonworks-university-self-paced-learning-
library/
Lab Steps
Perform the following steps:
1. Start the HDP cluster.
a. Connect to the lab environment.
b. Use SSH to connect to the Docker container named "sandbox", which has been configured as a single-node HDP cluster installation.
# ssh sandbox
c. Supply a username and password of admin and admin, then click the Sign in button to
get to the Ambari Web UI dashboard.
d. All services should be running. If not, start any stopped services by clicking on the
Actions button at the bottom left and selecting Start All.
e. If a restart was necessary, give the services a couple of minutes to start. One or more
of them may initially report failure, but after waiting will go green. When everything has
settled, your dashboard list of services should look similar to this:
3. Confirm HDFS (Hadoop Distributed File System) access from the command line.
a. Go back to the terminal window that is connected to the sandbox Docker container
(reopen and reconnect if necessary) and switch users so that you can run HDFS
administrative commands.
# su hdfs
b. To verify HDFS connectivity, run the hdfs dfsadmin -report command. Verify that it
provides output similar to the screenshot provided.
c. Exit the HDFS administrative user and go back to being the root user.
# exit
d. Run the jps command and verify that a process called NameNode is running.
# jps
Result
You have successfully connected to your lab environment, used SSH to connect to the HDP cluster
Docker container, started Ambari and all HDP services, and verified connection to HDFS and operation
of the NameNode process.
Lab Steps
Perform the following steps:
1. View the hdfs dfs command.
a. Open a Terminal window and use ssh to connect to the sandbox virtual machine.
# ssh sandbox
b. From the command line, enter the hdfs dfs command with no arguments to view its
usage.
# hdfs dfs
c. Run the command again, but this time specify the root folder for all of HDFS.
d. Create a directory named dirTest in the current user's home directory in HDFS.
e. Verify that this directory was created in the user's home directory.
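If you need a reference, one possible set of commands for steps c through e (run as the root user on sandbox) is:
# hdfs dfs -ls /
# hdfs dfs -mkdir dirTest
# hdfs dfs -ls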
NOTE:
There is no difference between performing the -ls command when you specify no
directories and when you specify the user’s home directory. All commands will be
executed in the user’s home directory unless otherwise specified.
f. Use -mkdir to create subdirectory dir1 in the dirTest directory. Then run the
command again with the -p option to create an additional subdirectory, dir2, which
also contains its own subdirectory, dir3.
g. Run the hdfs dfs -ls -R command to recursively view the contents of the user’s
home directory, and verify that all three directories from the previous step were
successfully created.
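One possible solution for step f, with the recursive listing from step g shown for completeness:
# hdfs dfs -mkdir dirTest/dir1
# hdfs dfs -mkdir -p dirTest/dir2/dir3
# hdfs dfs -ls -R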
b. This command works because the directory is empty. Run the command again, and
this time try to delete the dir2 directory and note the error message. Then verify that the
directory still exists.
c. To delete a directory and all of its contents, use hdfs dfs -rm -R <directory path>.
WARNING:
Be very careful not to run this without specifying a directory, as the default
behavior would be to delete the user’s home directory and all contents (in our case, the
/user/root directory and everything it contains).
Use this command to delete the dir2 directory and its contents, and verify that the directory
has been deleted.
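One possible way to do this:
# hdfs dfs -rm -R dirTest/dir2
# hdfs dfs -ls -R dirTest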
# cd /root/spark/data/
# ls
d. Create a copy of the data.txt file named datacopy.txt and verify the operation was
successful.
QUESTION:
What do you think would have happened if the dirTest directory had not been explicitly
specified as the location for the datacopy.txt file?
e. Now delete the datacopy.txt file and verify it has been removed.
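For reference, one possible solution for the copy and delete steps above, assuming data.txt was first uploaded into dirTest with hdfs dfs -put data.txt dirTest/ (that upload is an assumption here):
# hdfs dfs -cp dirTest/data.txt dirTest/datacopy.txt
# hdfs dfs -ls dirTest
# hdfs dfs -rm dirTest/datacopy.txt
# hdfs dfs -ls dirTest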
OR
# hdfs dfs -tail dirTest/data.txt
b. Download the data.txt file from HDFS to the /tmp directory on the local file system
and verify the operation was successful.
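One possible command for the download, assuming data.txt was uploaded to dirTest earlier:
# hdfs dfs -get dirTest/data.txt /tmp/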
# ls /tmp/data*
c. View the contents of the small_blocks.txt file on the local file system. It should be
in the current directory.
# cat small_blocks.txt
d. Upload the small_blocks.txt into the dirTest folder in HDFS and verify that you
now have two files in dirTest.
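One possible solution:
# hdfs dfs -put small_blocks.txt dirTest
# hdfs dfs -ls dirTest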
e. Merge and download all of the contents of the dirTest directory in HDFS to a file
named merged.txt in the /tmp directory on the local file system. Verify that the
merged.txt file was successfully created.
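One possible command for the merge and download:
# hdfs dfs -getmerge dirTest /tmp/merged.txt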
# ls /tmp/merged*
View the contents of the merged.txt file to confirm that it contains the contents of both files
that were in the dirTest directory.
# cat /tmp/merged.txt
# cd ~
# pwd
Result
You have successfully created, manipulated, and deleted files and directories in HDFS.
Lab Steps
Perform the following steps:
1. Access the Spark REPLs.
a. Open a Terminal window and use ssh to connect to the sandbox virtual machine.
# ssh sandbox
# spark-shell
scala> sc
scala> sc.appName
scala> sc.version
scala> exit()
# pyspark
>>> sc
>>> sc.appName
>>> sc.version
>>> exit()
b. Click Interpreter in the top menu and note that Zeppelin's default interpreter is set to
Spark and has a number of default settings configured.
c. Click on Notebook in the top menu and select Create new note from the resulting drop
down options.
e. At the top right click on the gear icon to change interpreter binding. Your administrator
has enabled an interpreter called “spark yarn-client” which is configured for the HDP
cluster you are using. Drag it to the top of the list of interpreters, and click the Save
button.
The first interpreter on the list is treated as the default interpreter. Scroll down to find the Save
button.
f. Find the values for Spark version and the Spark home directory. When you type the
commands, run them either by pressing the Shift + Enter keys, or by clicking on the
Play icon to the right of the word Ready.
NOTE:
The first time this is run, it may take a few minutes to complete. Future commands will
run much faster, including this one if repeated.
sc.version
sc.getConf.get("spark.home")
While processing, Zeppelin will display a status of RUNNING. It will also display a Pause icon in case you need to pause the paragraph.
The output may vary slightly from the screenshot below, but should look something like this
when processing is completed:
Run the following commands to demonstrate this flexibility using Shell, Python, Scala,
Markdown, and Spark SQL. Execute each command by clicking on the Play icon or
pressing Shift + Enter when you are finished typing.
Shell:
%sh echo "Introduction to Zeppelin"
Python:
%pyspark
print "Introduction to Zeppelin"
Markdown:
%md Introduction to Zeppelin
Spark SQL:
%sql
show tables
Click on Notebook at the top of the browser window and find and select the notebook
labeled IoT Data Analysis (Keynote Demo) in the resulting drop-down menu.
b. At the top right click on the gear icon to change interpreter binding.
c. For the purposes of this lab, all necessary code has already been entered for you in the
saved notebook. All you have to do is scroll to the appropriate section and click the
Play icon or press Shift + Enter.
d. The first major block of code ingests data from an online source into HDFS and then
displays those files using the shell scripting interpreter. Find and run that code.
NOTE:
The label to the left of the Play icon says FINISHED, but this will not prevent you from running the code again on this machine.
This notebook uses a deprecated command, hadoop fs, rather than the newer hdfs dfs command we used in the previous lab. This should not affect the functionality of the demo.
When the code has finished, the output at the bottom should look like this:
e. The next section of the notebook once again uses the shell scripting interpreter to view
some of the raw data in one of the downloaded files. Scroll down and run this code,
then view its output.
f. The next section of the notebook performs actions necessary to import and use this
data with Spark SQL. You may note that the status to the left of the Play icon is shown
as ERROR. This is due to the fact that the file being manipulated did not exist at the
time the notebook was opened on this system. Run this code and view the output.
g. The next block of code utilizes Spark SQL to view this data. Run this code and examine
the output.
h. Note that at the top of the results there are six buttons that allow you to display the
results using six different visualizations. Click on each one to view the differences
between them.
TIP:
In this lab you ran each section of code, known as a paragraph, individually. The entire
notebook could have been played at once, however, by clicking the Play icon labeled
Run all paragraphs directly to the right of the notebook title at the top of the browser.
Result
You have accessed the Spark REPLs for both Scala and Python, created a Zeppelin notebook and
demonstrated Zeppelin’s ability to interpret multiple languages, and used a pre-built Zeppelin
notebook to briefly explore Zeppelin’s ability to ingest, view, analyze, and visualize data.
Lab Steps
Perform the following steps:
1. View the raw data for this lab.
a. In a new terminal window, ssh to sandbox and change directories to
/home/zeppelin/spark/data. View the files in this directory.
# ssh sandbox
# cd /home/zeppelin/spark/data/
# ls
b. Use less to view the “selfishgiant.txt” data file. Press q to quit when you are finished
reviewing.
# less selfishgiant.txt
b. Click on Notebook and select Create new note on the drop down. Name this note
Create and Manipulate RDDs.
c. At the top right click on the gear icon to change interpreter binding.
d. Place the selfishgiant.txt file into the Zeppelin user’s home directory on HDFS,
/user/zeppelin. (There are no line breaks in the code below after %sh. Please refer
to the screenshot.)
%sh
hdfs dfs -put /home/zeppelin/spark/data/selfishgiant.txt
/user/zeppelin/selfishgiant.txt
REMINDER:
After entering a command, press Shift + Enter keys or press the Play button on the
right side of the paragraph to execute the commands. The text to the left of the Play
button should change from READY to FINISHED when it is complete.
%sh
hdfs dfs -ls /user/zeppelin
f. Create an RDD named baseRdd using this file. Verify the RDD exists by using the
take() function to print the first line of the file.
%pyspark
baseRdd = sc.textFile("/user/zeppelin/selfishgiant.txt")
print baseRdd.take(1)
g. Each line of the file is currently a string. Transform the lines into arrays of individual
elements (words) stored in a new RDD named splitRdd, then take a look at the first
five elements.
%pyspark
splitRdd = baseRdd.flatMap(lambda line: line.split(" "))
print splitRdd.take(5)
h. Create a new RDD named filterRdd that only contains words in splitRdd that are
longer than 10 characters. Use collect() to view the entire output.
%pyspark
filterRdd = splitRdd.filter(lambda word: len(word) > 10)
print filterRdd.collect()
%pyspark
print splitRdd.count()
j. Create an RDD named distinctRdd that eliminates any duplicate words in splitRdd.
Then display a count of the number of distinct words in the RDD.
%pyspark
distinctRdd = splitRdd.distinct()
print distinctRdd.count()
k. Save the contents of distinctRdd to a text file in HDFS. Put the contents in a folder named
“distinct” for future reference.
%pyspark
distinctRdd.saveAsTextFile("/user/zeppelin/distinct")
%sh
hdfs dfs -ls /user/zeppelin/distinct
m. View the contents of one of the part-* files and verify that an array of unique words
has been generated and saved.
%sh
hdfs dfs -cat /user/zeppelin/distinct/part-00001
n. Create an RDD named numbersRdd that contains an array of the following numbers:
15, 20, 95, and 80. View the contents of the RDD to verify it was successfully created.
%pyspark
numbersRdd = sc.parallelize([15, 20, 95, 80])
print numbersRdd.collect()
%pyspark
print numbersRdd.stats()
p. Create a variable named maryFile that contains the string “Mary had a little lamb” and
then convert that variable into an RDD named maryRdd. View the RDD contents when
finished.
%pyspark
maryFile = ("Mary had a little lamb")
maryRdd = sc.parallelize([maryFile])
print maryRdd.collect()
q. Create a new RDD named comboRdd that creates a union between maryRdd and
numbersRdd. Then view the combined RDD.
%pyspark
comboRdd = maryRdd.union(numbersRdd)
print comboRdd.collect()
Result
You have created several RDDs and performed various transformations and actions using the Zeppelin
notebook.
Lab Steps
Perform the following steps:
1. Create a Pair RDD note in Zeppelin.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
b. Click on Notebook and select Create new note on the drop down. Name this note Pair
RDDs.
c. At the top right click on the gear icon to change interpreter binding.
In the code below, there are no line breaks between splitRdd and (" ")). Please refer to
the screenshot.
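If you need a reference, a minimal sketch of that paragraph, assuming the same file used in the previous lab, is:
%pyspark
splitRdd = sc.textFile("/user/zeppelin/selfishgiant.txt").flatMap(lambda line: line.split(" "))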
%pyspark
print splitRdd.take(5)
NOTE:
In the previous lab, this RDD creation was performed over two steps, creating an
intermediary RDD named baseRdd. The creation of the intermediary is not necessary
unless it needs to be used in a future step.
b. Use map() to create an RDD named mappedRdd that converts each element into a key-
value pair with a value of 1. View the first five elements to confirm successful operation.
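One possible way to build mappedRdd (a sketch; your solution may differ):
%pyspark
mappedRdd = splitRdd.map(lambda word: (word, 1))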
%pyspark
print mappedRdd.take(5)
3. Create Pair RDDs using zip functions and perform simple transformations.
a. Create a variable named months that contains the values Jan, Feb, Mar, Apr, May,
Jun, and Jul as a list of string values. Convert this to an RDD named monthsRdd. Then
create another RDD named monthsIndexed0Rdd using zipWithIndex() to create a
Pair RDD that automatically assigns a value to each element based on its position in
the list.
REMINDER:
The first element will be assigned a value of “0” using this function.
%pyspark
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
monthsRdd = sc.parallelize(months)
monthsIndexed0Rdd = monthsRdd.zipWithIndex()
print monthsIndexed0Rdd.collect()
b. Use map() to convert the value for each month to the actual month number and store
this in a new RDD named monthsIndexed1Rdd. For reference, Jan should have a
value of 1, Feb should have a value of 2, and so on. View the new RDD to confirm
success.
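One possible way to build monthsIndexed1Rdd (a sketch; your solution may differ):
%pyspark
monthsIndexed1Rdd = monthsIndexed0Rdd.map(lambda pair: (pair[0], pair[1] + 1))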
%pyspark
print monthsIndexed1Rdd.collect()
c. Create a new RDD named monthsIndexed2Rdd that performs the same operation on
monthsIndexed0Rdd as in the previous step but uses mapValues() instead of map()
to perform the operation. View the new RDD and confirm it looks identical to the output
of monthsIndexed1Rdd.
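A possible implementation using mapValues():
%pyspark
monthsIndexed2Rdd = monthsIndexed0Rdd.mapValues(lambda index: index + 1)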
%pyspark
print monthsIndexed2Rdd.collect()
NOTE:
No difference exists between the two previous lab steps from Spark’s perspective. The
mapValues function simply performs a map() and returns the key without modification,
while performing the function you define on the value.
d. Create a variable named quarters that contains the following seven values: 1, 1, 1,
2, 2, 2, and 3. Convert the variable into an RDD named quartersRdd. Then create
an RDD named monthsZipQuarters and use zip() to create a Pair RDD that assigns
each value from quartersRdd to a month in monthsRdd. Finally, view the output and
make sure that each month was assigned to the correct quarter in the final RDD.
%pyspark
quarters = (1, 1, 1, 2, 2, 2, 3)
quartersRdd = sc.parallelize(quarters)
monthsZipQuarters = monthsRdd.zip(quartersRdd)
print monthsZipQuarters.collect()
%pyspark
print monthsZipQuarters.keys().collect()
print monthsZipQuarters.values().collect()
print monthsZipQuarters.sortByKey().collect()
4. Count the number of times words appear in a Pair RDD and manipulate the
output.
a. Use the mappedRdd created in a previous step and create a new RDD named
reducedByKeyRdd that reduces the file so that each word appears only once but has a
value equal to the number of times it appeared in the original RDD. View the first five
elements of the new RDD to confirm successful operation.
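One possible way to build reducedByKeyRdd:
%pyspark
reducedByKeyRdd = mappedRdd.reduceByKey(lambda a, b: a + b)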
%pyspark
print reducedByKeyRdd.take(5)
b. Use map() to create a new RDD named flippedRdd that switches your keys and
values so that the current keys become the values, and the values become the keys.
View the first five elements of the new RDD to confirm successful operation.
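One possible way to build flippedRdd:
%pyspark
flippedRdd = reducedByKeyRdd.map(lambda pair: (pair[1], pair[0]))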
%pyspark
print flippedRdd.take(5)
c. Create a new RDD named orderedRdd that manipulates flippedRdd and arranges
the words in descending order by number of times they appear. View the first five
elements of the new RDD to confirm successful operation.
%pyspark
orderedRdd = flippedRdd.sortByKey(ascending=False)
print orderedRdd.take(5)
Result
You have successfully created and manipulated Pair RDDs using various functions.
Challenge Labs
The labs below work with Pair RDDs to perform real-world operations. In some cases, the solutions to
the lab utilize programming techniques not explicitly described in the course lecture. These
techniques, however, should be clear and easy to understand by carefully following the instructions. If
you have questions and are in an instructor-supported class, please ask for assistance as needed.
You may want to start by creating a new notebook named Pair RDD Challenge Labs, but this is up to
you.
Perform the following steps:
1. Determine the airlines with the greatest number of flights.
a. Go back to a terminal window that has used SSH to connect to the sandbox Docker
environment and change to the /home/zeppelin/spark/data directory if necessary. View
the contents of this directory and confirm the existence of three files: airports.csv,
plane-data.csv, and flights.csv.
# ls
b. Use head to view the first few lines of the flights.csv file.
# head flights.csv
Each column in the file can be interpreted using the guide below. The first comma-separated
value in each line (index number 0) represents the month, the second value represents the day
of the month, and so on. Of note for our purposes: the sixth value (index number 5) represents
the carrier for each flight.
c. Use Zeppelin to import this file into the /user/zeppelin folder in HDFS.
%sh
hdfs dfs -put /home/zeppelin/spark/data/flights.csv /user/zeppelin/flights.csv
QUESTION:
ANSWER:
When the tasks are performed in a Zeppelin notebook, the entire series of actions can be
exported and then imported and replayed on another system. This will be discussed in
more detail in a later lab exercise.
3. Use map() to create a key-value pair from only the elements in the sixth column
(index number 5) - which can be specified by appending [5] to the anonymous
function value – and assign each instance a value of 1.
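A possible implementation (the intermediate name flightsRdd is an assumption, and the file is assumed to be at /user/zeppelin/flights.csv):
%pyspark
flightsRdd = sc.textFile("/user/zeppelin/flights.csv")
carrierRdd = flightsRdd.map(lambda line: (line.split(",")[5], 1))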
%pyspark
print carrierRdd.take(5)
e. Perform a reduce and sort the results, then display the top three carrier codes by
number of flights based on this data.
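One possible implementation (a sketch; the course solution may differ):
%pyspark
carriersSorted = carrierRdd.reduceByKey(lambda a, b: a + b).map(lambda pair: (pair[1], pair[0])).sortByKey(ascending=False)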
%pyspark
print carriersSorted.take(3)
# head airports.csv
Each column in the file can be interpreted using the guide below. The first comma-separated
value in each line (index number 0) represents the airport code, the second value represents
the airport name, and so on. Of note for our purposes: the airport code (index number 0) and
the airport city (index number 2).
From the flights.csv file used earlier, columns 13 and 14 (index values 12 and 13) will be used in this
exercise.
b. Use Zeppelin to import the airports.csv file into the /user/zeppelin folder in HDFS.
%sh
hdfs dfs -put /home/zeppelin/spark/data/airports.csv /user/zeppelin/airports.csv
3. Use map() to pull out only the airport code and city elements in the first and third
columns (index numbers 0 and 2).
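A possible implementation (the intermediate name airportsRdd is an assumption):
%pyspark
airportsRdd = sc.textFile("/user/zeppelin/airports.csv")
cityRdd = airportsRdd.map(lambda line: (line.split(",")[0], line.split(",")[2]))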
%pyspark
print cityRdd.take(5)
3. Use map() to pull out only the origin and destination elements in the 13th and 14th
columns (index numbers 12 and 13).
NOTE:
Some of this code can be copied and pasted from a previous paragraph in the Zeppelin
notebook.
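One possible implementation:
%pyspark
flightOrigDestRdd = sc.textFile("/user/zeppelin/flights.csv").map(lambda line: (line.split(",")[12], line.split(",")[13]))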
%pyspark
print flightOrigDestRdd.take(5)
e. Use join() to join flightOrigDestRdd and cityRdd into a third RDD named
origJoinRdd.
This operation will result in an RDD that contains the origin code as the key, with a
value of (destination code, origin city). This is half of the operation needed to get origin
and destination cities.
%pyspark
origJoinRdd = flightOrigDestRdd.join(cityRdd)
print origJoinRdd.take(5)
This operation will result in an RDD that contains the destination code as the key, with
a value of (origin city, destination city).
%pyspark
destOrigJoinRdd = origJoinRdd.values().join(cityRdd)
print destOrigJoinRdd.take(5)
g. Create another RDD named citiesCleanedRdd that contains only the values of the
destOrigJoinRdd (in other words, just the origin and destination city names). View
the first five elements to confirm successful operation.
%pyspark
citiesCleanedRdd = destOrigJoinRdd.values()
print citiesCleanedRdd.take(5)
h. Use map() to convert the key-value pairs in citiesCleanedRdd into keys for a new
RDD named citiesKV, and give each key a value of 1. View the first five elements to
confirm successful operation.
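One possible implementation:
%pyspark
citiesKV = citiesCleanedRdd.map(lambda pair: (pair, 1))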
%pyspark
print citiesKV.take(5)
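One possible way to build citiesReducedSortedRdd, reducing by key and then sorting descending by count (a sketch; the course solution may differ):
%pyspark
citiesReducedSortedRdd = citiesKV.reduceByKey(lambda a, b: a + b).map(lambda pair: (pair[1], pair[0])).sortByKey(ascending=False)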
%pyspark
print citiesReducedSortedRdd.take(3)
NOTE:
The top three origin city / destination combinations are New York to Boston, Boston to
New York, and Chicago to New York.
3. Find the longest departure delays for any airline that experienced a delay of 15
minutes or more.
a. This exercise once again uses the flights.csv file. This time we use the unique carrier
code in column 6 (index value 5) and the departure delay value in minutes, which is in
column 12 (index value 11).
3. Use filter() to remove any lines for which the value of column 12 (index value 11)
is less than 15. Because the sc.textFile() operation reads in all values as strings,
you will need to cast the values in column 12 as integers prior to performing the
filter() evaluation.
4. Use map() to pull out only the carrier code and departure delay elements in the 6th
and 12th columns (index numbers 5 and 11).
For the sake of readability, here is another screenshot of the above code with lines wrapped so
that the code can be viewed in a larger font.
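If you need a reference while building delayRdd, a sketch consistent with steps 3 and 4 above (the intermediate names flightsRdd and delayedRdd are assumptions):
%pyspark
flightsRdd = sc.textFile("/user/zeppelin/flights.csv")
delayedRdd = flightsRdd.filter(lambda line: int(line.split(",")[11]) >= 15)
delayRdd = delayedRdd.map(lambda line: (line.split(",")[5], line.split(",")[11]))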
c. Create an RDD named delayMaxRdd that reduces the elements in delayRdd and
returns only the longest delay per airline. For this exercise, it is not necessary to sort
the values from largest to smallest.
NOTE:
The reduce operation will need to compare all values for the same key and only keep the
largest value in the final output.
The values in delayRdd are strings, so to compare the values they will first need to be
cast as integers, similar to the filter() operation performed in the first step of this
exercise.
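One possible implementation that keeps only the largest delay per carrier:
%pyspark
delayMaxRdd = delayRdd.reduceByKey(lambda a, b: a if int(a) > int(b) else b)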
%pyspark
print delayMaxRdd.take(5)
# head plane-data.csv
Note that in the screenshot above, this file contains the column header names, followed by the
column values. In this case, the first few records only have values for the first column, and the
rest of the values are blank.
To see what complete records should look like, take a look at the last few lines of the file.
# tail plane-data.csv
Each column in the file can be interpreted using the guide below. Note that there are nine possible
column values for each record (index 0 through 8).
b. Use Zeppelin to import the plane-data.csv file into the /user/zeppelin folder in
HDFS.
%sh
hdfs dfs -put /home/zeppelin/spark/data/plane-data.csv /user/zeppelin/plane-data.csv
c. Create an RDD named planeDataRdd from the plane-data.csv file. Before performing
any transformations, use count() to display the number of lines in the RDD.
%pyspark
planeDataRdd = sc.textFile("/user/zeppelin/plane-data.csv")
print planeDataRdd.count()
2. Split the lines into an array of individual elements using map(). (Hint: The elements
are comma-separated.)
3. Use filter() to remove any lines that do not have a length of exactly 9 elements.
4. Use count() to display the number of lines in the new RDD and confirm that the
data set contains fewer lines than before.
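One possible implementation of steps 2 and 3 (the count in step 4 is shown in the next paragraph):
%pyspark
cleanedPlaneDataRdd = planeDataRdd.map(lambda line: line.split(",")).filter(lambda fields: len(fields) == 9)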
%pyspark
print cleanedPlaneDataRdd.count()
BE AWARE:
This data contains additional challenges. The first row of the data contains column
headers, just like plane-data.csv did. However, in addition, in some cases the
description of the airline includes a comma that is not meant to separate values. For
example, the airline with code 09Q has a description of Swift Air, LLC. The comma is
part of the business name.
Good luck!
Lab Steps
Perform the following steps:
1. Use an HDFS directory as a streaming source.
a. Open a terminal window and SSH into sandbox.
# ssh sandbox
c. Start a new REPL specifying the local machine as the master and allocate two cores for
the streaming application.
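One way to start the REPL with these settings (your environment may differ):
# pyspark --master local[2]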
d. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
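The DStream setup commands appear in the course screenshots. A minimal sketch, assuming a five-second batch interval and /user/root/test/ as the monitored HDFS directory (both are assumptions), is:
>>> from pyspark.streaming import StreamingContext
>>> sscFive = StreamingContext(sc, 5)
>>> hdfsInputDS = sscFive.textFileStream("/user/root/test/")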
>>> hdfsInputDS.saveAsTextFiles("/user/root/test/stream/")
>>> hdfsInputDS.pprint()
j. Start the streaming application. Note that only new files will be streamed, so any files
that existed at application launch will not be streamed.
>>> sscFive.start()
k. Open a new terminal window, SSH to sandbox, and place the input file selfishgiant.txt
from /root/spark/data into the folder. Observe what happens a few seconds later in
the streaming terminal window.
# ssh sandbox
NOTE:
You are free to upload additional files to see more streaming take place if you want.
l. Once you observe data being streamed on-screen in the first terminal window, use the
second terminal window to list the contents of the /user/root/test/stream/
directory on HDFS.
m. In the first terminal window, stop the stream and exit the REPL. If the stream refreshes
while you are typing, that will not affect the input. Simply continue to type the
command and press enter.
sc.stop()
exit()
b. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
>>> inputDS.saveAsTextFiles("/user/root/test/stream/")
>>> inputDS.pprint()
h. Start the streaming application.
>>> sscFive.start()
NOTE:
An error will appear when the application starts because the application is waiting for an
input connection.
i. In the second terminal window use the netcat utility to create a connection to port
9999.
# nc -lkv 9999
j. Start typing words separated by spaces, hitting Enter occasionally to submit them. Observe what happens in the streaming terminal window a few seconds after hitting Enter.
k. Once you observe data being streamed on-screen in the first terminal window, use Ctrl
+ C (or Cmd + C if using a Mac) to exit netcat in the second terminal window.
m. In the first terminal window, stop the stream and exit the REPL.
sc.stop()
exit()
Result
You have created data streams from HDFS and TCP socket sources, observed the stream in real-time,
and observed text files created from those streams for long-term storage and future use.
Lab Steps
Perform the following steps:
1. Perform Spark Streaming transformations using flatMap().
a. Open a terminal, connect to the sandbox cluster using SSH, and start a new instance of
the REPL that is configured to use two CPU cores.
# ssh sandbox
3. Creates an instance of that class named sscFive with a five-second time window
4. Creates a socket text DStream named inputDS that listens to “sandbox” on port
9999
6. Creates a DStream named flatMapDS that uses flatMap() to break lines into
individual elements separated by spaces
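A minimal sketch of the setup described in the list above, assuming a five-second batch interval:
>>> from pyspark.streaming import StreamingContext
>>> sscFive = StreamingContext(sc, 5)
>>> inputDS = sscFive.socketTextStream("sandbox", 9999)
>>> flatMapDS = inputDS.flatMap(lambda line: line.split(" "))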
>>> sc.setLogLevel("ERROR")
>>> inputDS.saveAsTextFiles("/user/root/test/stream/")
>>> flatMapDS.pprint()
>>> sscFive.start()
NOTE:
You will see an error when it starts because it is waiting for an input connection.
c. Open a new terminal window, connect to the sandbox cluster, and connect to port
9999 using the netcat utility. Make sure both terminal windows are visible on-screen.
# ssh sandbox
# nc -lkv 9999
d. In the netcat terminal, start typing words separated by spaces. Hit the Enter key
occasionally to submit them to the stream. Observe how the words appear in the
streaming window.
e. In the streaming window, stop the stream and exit the REPL.
sc.stop()
exit()
f. In the netcat window, exit the socket by entering Ctrl + C (or CMD + C if using a Mac)
on your keyboard.
3. Creates an instance of that class named sscFive with a five-second time window
4. Creates a socket text DStream named inputDS that listens to “sandbox” on port
9999
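A minimal sketch of this setup, assuming a five-second batch interval and a standard word-count definition for wc (the word-count definition is an assumption):
>>> from pyspark.streaming import StreamingContext
>>> sscFive = StreamingContext(sc, 5)
>>> inputDS = sscFive.socketTextStream("sandbox", 9999)
>>> wc = inputDS.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)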
>>> inputDS.saveAsTextFiles("/user/root/test/stream/")
>>> wc.pprint()
>>> sscFive.start()
NOTE:
You will see an error when it starts because it is waiting for an input connection.
c. In the netcat window from the previous lab section, reconnect to port 9999 using the
netcat utility. Make sure both terminal windows are visible on-screen.
# nc -lkv 9999
d. In the netcat terminal, start typing words separated by spaces, making sure to repeat
some of the words as you type. Hit the Enter key occasionally to submit them to the
stream. Observe how the words appear in the streaming window.
e. In the streaming window, stop the stream and exit the REPL.
sc.stop()
exit()
f. In the netcat window, exit the socket by entering Ctrl + C (or CMD + C if using a Mac)
on your keyboard.
# cp /root/spark/data/data.txt /root/spark/data/stream1.txt
# cp /root/spark/data/data.txt /root/spark/data/stream2.txt
# ls /root/spark/data/stream*
You can view the contents of the file if you want. As a reminder, these files contain a single line
of text: “This is a test file”
b. In the streaming window, start a new instance of the REPL that is once again
configured to use two CPU cores.
3. Creates an instance of that class named sscFive with a five-second time window
4. Creates two text file DStreams named inputDS1 and inputDS2 that both listen to
the /user/root/test/ directory on HDFS.
5. Creates a DStream named combined that uses union() to combine the two
streams into a single DStream
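A minimal sketch of this setup, assuming a five-second batch interval:
>>> from pyspark.streaming import StreamingContext
>>> sscFive = StreamingContext(sc, 5)
>>> inputDS1 = sscFive.textFileStream("/user/root/test/")
>>> inputDS2 = sscFive.textFileStream("/user/root/test/")
>>> combined = inputDS1.union(inputDS2)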
>>> sc.setLogLevel("ERROR")
>>> combined.pprint()
>>> sscFive.start()
d. Go to the netcat terminal window (which we’ll refer to now as the input1 window) from
the previous lab section and type the command to upload the small_blocks.txt file from
the local /root/spark/data/ directory to the /user/root/test/ directory on
HDFS, but DO NOT PRESS THE ENTER KEY.
e. Open a third terminal window (we’ll refer to this as the input2 window), connect to the
sandbox cluster, and type the same command as in the step above, but once again DO
NOT PRESS THE ENTER KEY. Make sure both terminal windows are visible on-
screen.
# ssh sandbox
# ssh sandbox
f. Wait for a screen refresh in the streaming window, then immediately go to the input1
and input2 windows and press the Enter key.
Assuming you perform both actions within a 5-second collection window, the
streaming window should display the contents of files as a combined data stream, as
displayed in the screenshot below. The content of the text files (which in our case
should be the same line of text) should each print multiple times because both streams
were monitoring the same HDFS directory.
If your timing is off the first time, simply try again with a couple of additional copies that have
unique file names like streaming3.txt and streaming4.txt.
g. In the streaming window, stop the stream and exit the REPL.
sc.stop()
exit()
Result
You have successfully used several basic transformations on DStreams.
Lab Steps
Perform the following steps:
1. Create a streaming window using a TCP socket.
a. Start a new REPL specifying the local machine as the master and allocate two cores for
the streaming application.
b. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
>>> sscFive.checkpoint("/user/root/test/checkpoint/")
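The windowed DStream definition itself appears in the course screenshots. A sketch of what it could look like, assuming sscFive was created as in the earlier labs and assuming a 30-second window sliding every 10 seconds (the durations are assumptions), is:
>>> inputDS = sscFive.socketTextStream("sandbox", 9999).window(30, 10)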
>>> inputDS.pprint()
h. Start the streaming application.
>>> sscFive.start()
NOTE:
An error will appear when the application starts because the application is waiting for an
input connection.
i. In the second terminal window use the netcat utility to create a connection to port
9999.
# nc -lkv 9999
j. Start typing words separated by spaces, hitting Enter occasionally to submit them.
Observe what happens in the streaming terminal window a few seconds after hitting
Enter.
k. Once you observe data being streamed on-screen in the first terminal window, use Ctrl
+ C (or Cmd + C if using a Mac) to exit netcat in the second terminal window.
l. In the first terminal window, stop the stream and exit the REPL. If the stream refreshes
while you are typing, that will not affect the input. Simply continue to type the
command and press Enter.
sc.stop()
exit()
2. Create a streaming window that counts words in a DStream using a TCP socket.
a. Start a new REPL specifying the local machine as the master and allocate two cores for
the streaming application.
b. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
>>> sscFive.checkpoint("/user/root/test/checkpoint/")
QUESTION:
What do you think would happen if the flatMap function were removed from the line of
code above?
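The line of code this question refers to appears in the course screenshots. A sketch of what it could look like, assuming a 30-second window sliding every 10 seconds (the durations are assumptions), is:
>>> inputDS = sscFive.socketTextStream("sandbox", 9999).flatMap(lambda line: line.split(" ")).countByWindow(30, 10)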
>>> inputDS.pprint()
h. Start the streaming application.
>>> sscFive.start()
NOTE:
An error will appear when the application starts because the application is waiting for an
input connection.
i. In the second terminal window use the netcat utility to create a connection to port
9999.
# nc -lkv 9999
j. Start typing words separated by spaces, hitting Enter occasionally to submit them.
Observe what happens in the streaming terminal window a few seconds after hitting
Enter.
k. Once you observe data being streamed on-screen in the first terminal window, use Ctrl
+ C (or Cmd + C if using a Mac) to exit netcat in the second terminal window.
l. In the first terminal window, stop the stream and exit the REPL. If the stream refreshes
while you are typing, that will not affect the input. Simply continue to type the
command and press Enter.
sc.stop()
exit()
3. Create a streaming window that counts instances of words in a DStream using a TCP socket.
a. Start a new REPL specifying the local machine as the master and allocate two cores for
the streaming application.
b. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
>>> sscFive.checkpoint("/user/root/test/checkpoint/")
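A sketch of what the windowed word-count DStream could look like, again assuming a 30-second window sliding every 10 seconds (the durations are assumptions):
>>> inputDS = sscFive.socketTextStream("sandbox", 9999).flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)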
>>> inputDS.pprint()
h. Start the streaming application.
>>> sscFive.start()
NOTE:
An error will appear when the application starts because the application is waiting for an
input connection.
i. In the second terminal window use the netcat utility to create a connection to port
9999.
# nc -lkv 9999
j. Start typing words separated by spaces, hitting Enter occasionally to submit them. Make
sure to repeat words every so often between lines. Observe what happens in the
streaming terminal window a few seconds after hitting Enter.
k. Once you observe data being streamed on-screen in the first terminal window, use Ctrl
+ C (or Cmd + C if using a Mac) to exit netcat in the second terminal window.
l. In the first terminal window, stop the stream and exit the REPL. If the stream refreshes
while you are typing, that will not affect the input. Simply continue to type the
command and press Enter.
sc.stop()
exit()
Result
You have successfully performed various Spark Streaming Window Transformations.
Lab Steps
Perform the following steps:
1. Create and save DataFrames and tables.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
b. Click Create new note. Name this note Create and Save DataFrames.
NOTE:
Make sure to set the interpreter to spark-yarn-client as in previous labs.
c. At the top right click on the gear icon to change interpreter binding.
The first entry in each sub-list should be a two-letter code (GG and HH). The second
entry in each sub-list should be numeric values of 20,000 and 190,000 respectively.
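The creation paragraphs appear in the course screenshots. A sketch matching the description above (the use of createDataFrame() for dataframe1 is an assumption):
%pyspark
rddNoSchema = sc.parallelize([["GG", 20000], ["HH", 190000]])
dataframe1 = sqlContext.createDataFrame(rddNoSchema)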
%pyspark
print rddNoSchema.collect()
%pyspark
dataframe1.show()
f. Create an RDD named rddWithSchema that utilizes Row objects organized so that
each element has a schema value.
The first entry in each Row should be a two-letter code (AA and BB) that are assigned a
schema value of code. The second entry in each Row should be numeric values of
150,000 and 80,000 respectively that are assigned a schema value of value.
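A sketch matching the description above:
%pyspark
from pyspark.sql import Row
rddWithSchema = sc.parallelize([Row(code="AA", value=150000), Row(code="BB", value=80000)])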
%pyspark
print rddWithSchema.collect()
g. Use toDF() to convert this RDD to a new DataFrame named dataframe2. View the
DataFrame to confirm success.
%pyspark
dataframe2 = rddWithSchema.toDF()
dataframe2.show()
%pyspark
dataframe2.registerTempTable("table1temp")
sqlContext.sql("SHOW TABLES").show()
i. In the next paragraph, issue a Spark SQL command to SHOW TABLES. Does
table1temp show up? If so, why? If not, why not?
NOTE:
Your output may also contain tables created when you ran demos in previous labs.
%sql
SHOW TABLES
j. Issue a HiveQL CREATE TABLE command from within the DataFrames API and create a
permanent version of table1temp named table1hive. Use SHOW TABLES both from
the DataFrames API, and then in a new paragraph from Spark SQL, to confirm this table
is visible across contexts.
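One possible statement (a sketch; the exact HiveQL in the course screenshot may differ):
%pyspark
sqlContext.sql("CREATE TABLE table1hive AS SELECT * FROM table1temp")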
%pyspark
sqlContext.sql("SHOW TABLES").show()
%sql
SHOW TABLES
%sql
l. Convert this Hive table into a DataFrame named dataframe3. View the new
DataFrame to confirm success.
%pyspark
dataframe3 = sqlContext.table("table1hive")
dataframe3.show()
dataframe3.write.format("json").save("dfJSON1")
%sh
NOTE:
The JSON file is stored in several part-* files in the folder name you specified. If you
wanted to copy this file to your local file system for distribution outside the cluster, you
could use hdfs dfs -getmerge to combine it as a single file on your local file system.
n. View the combined contents of the files in the dfJSON1 folder on HDFS.
%sh
hdfs dfs -cat dfJSON1/part-*
NOTE:
The JSON format is not what you might typically see when looking at JSON files. For
DataFrame creation, each row of information must be self-contained, and thus the
formatting you see here is a requirement for converting JSON files to DataFrames. This
same content coded in more typical JSON fashion would error out upon attempting to
read it as a DataFrame.
o. Create a new DataFrame named dataframe4 from the contents of this folder on HDFS.
View the new DataFrame to confirm success.
%pyspark
dataframe4 = sqlContext.read.format("json").load("dfJSON1/*")
dataframe4.show()
Result
You have used several methods to create and save DataFrames and tables.
Lab Steps
Perform the following steps:
1. Manipulate DataFrames using the DataFrames API
NOTE:
This lab intentionally makes use of one or more functions not discussed in the student
book. The new functions are very similar in nature to functions already discussed in Core
RDD programming and should make sense to the student. In addition, some functions
are used in ways not discussed in the student book as well. This is to encourage
exploration and experimentation, in addition to learning new ways to do things.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
b. Click on Notebook and select Create new note on the drop down. Name this note Work
with DataFrames.
c. At the top right click on the gear icon to change interpreter binding.
d. Create two DataFrames named dataframeA and dataframeB from the Hive table
named table1hive created in the previous lab. Then use unionAll() to combine the
rows of these two tables into a new DataFrame named dataframeC. Then show the
contents of dataframeC to confirm success.
%pyspark
dataframeA = sqlContext.table("table1hive")
dataframeB = sqlContext.table("table1hive")
dataframeC = dataframeA.unionAll(dataframeB)
dataframeC.show()
e. Create a DataFrame named dataframeD that adds a column named quarterly that
contains the contents of the value column multiplied by three. View the new
DataFrame to confirm success.
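One possible implementation using withColumn():
%pyspark
dataframeD = dataframeC.withColumn("quarterly", dataframeC["value"] * 3)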
%pyspark
dataframeD.show()
f. Create a DataFrame named dataframeE that renames the value column to monthly.
View the new DataFrame to confirm success.
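One possible implementation using withColumnRenamed():
%pyspark
dataframeE = dataframeD.withColumnRenamed("value", "monthly")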
%pyspark
dataframeE.show()
g. Create a DataFrame named dataframeF that contains only those rows from
dataframeE where the quarterly value is greater than 300,000. View the new
DataFrame to confirm success.
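One possible implementation using filter():
%pyspark
dataframeF = dataframeE.filter(dataframeE["quarterly"] > 300000)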
%pyspark
dataframeF.show()
h. Create a new DataFrame named dataframeG that adds the rows of dataframeE to
dataframeF so that there are six rows total. View the new DataFrame to confirm
success.
%pyspark
dataframeG = dataframeE.unionAll(dataframeF)
dataframeG.show()
i. Use describe() on dataframeG without supplying a column name and show the
results.
QUESTION:
What happens?
%pyspark
dataframeG.describe().show()
ANSWER:
%pyspark
dataframeG.distinct().show()
k. Use drop() to create a new DataFrame named dataframeH that contains only the
code and quarterly columns. View the new DataFrame to confirm success.
QUESTION:
What other function described in the student book could you have used to accomplish
the same task? What would the code have been?
%pyspark
dataframeH = dataframeG.drop('monthly')
dataframeH.show()
ANSWER:
The same thing could have been accomplished using the following code:
dataframeH = dataframeG.select('code', 'quarterly')
l. Create a new DataFrame named dataframeI that contains each unique element in the
code column and a count of the number of times each code appears in dataframeH.
View the new DataFrame to confirm success.
%pyspark
dataframeI = dataframeH.groupBy("code").count()
dataframeI.show()
Result
You have successfully used the DataFrames API to manipulate DataFrames.
Lab Steps
Perform the following steps:
1. Create data visualizations from a file of banking data.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
NOTE:
Zeppelin's current main backend processing engine is Apache Spark.
%sh
NOTE:
This data is a cleaned subset of a publicly available machine learning dataset. The
original dataset can be found at the following link:
http://archive.ics.uci.edu/ml/machine-learning-databases/00222/
e. Use the bankdata3.orc file to create a DataFrame named bankdata, a temporary table
named banktemp, and a Hive table named bankdataperm.
%pyspark
bankdata = sqlContext.read.format("orc").load("bankdata3.orc")
bankdata.registerTempTable("banktemp")
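The Hive table creation appears in the course screenshot. One possible way to create it (the use of saveAsTable() is an assumption):
%pyspark
bankdata.write.saveAsTable("bankdataperm")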
f. Use SQL to show the tables available and confirm that bankdataperm is available.
%sql
show tables
g. Use SQL to select and display all rows and columns from bankdataperm.
%sql
select * from bankdataperm
h. Quickly browse through the five data visualizations available by default in Zeppelin. For
most of this lab, we will work with the bar chart view.
i. Go back to the bar chart view. Then, edit your SQL query so that it only shows data for
individuals over the age of 30. Run the query and note the change in the chart.
%sql
select * from bankdataperm where age > 30
j. Click on the settings link and notice that Zeppelin has selected the age column as the
key column and is showing the sum of the balances for all individuals in each age
bracket. Display the average balance instead of the sum of balances.
k. Click and drag the available marital field into the Groups category to modify the
visualization so that data is shown not only by age, but also grouped by marital status.
When you are finished, click the settings link again to close the pivot chart options.
l. It appears that we have a single outlier that is skewing the data
fairly significantly. We can easily see that the vast majority of average balances are well
below $5,000. Add a dynamic form to the SQL query that allows you to filter out data
where the maximum balance for any individual exceeds a certain threshold, but set the
default to 1,000,000 so that it doesn’t immediately modify the chart. Rerun the query
with this new code, then use this dynamic form to adjust the maximum balance to
$10,000 and $5,000 and note the effects on the visualization.
%sql
select * from bankdataperm where age > 30 and balance <= ${Maximum Balance=1000000}
QUESTIONS:
Why do you think changing the maximum balance from $10,000 to $5,000 had so little
effect on the chart?
What group (married, single, or divorced) had the most change based on changing the
maximum balance?
m. Create a URL that allows you to share this chart with others without giving them access
to the code or the Zeppelin note. Use the linked page to change the maximum balance
to $2,500, then return to your note and observe the effects the change had at the
source.
n. In the paragraph below this one, run the SQL command to read all data from
bankdataperm. Then adjust the width of the two paragraphs so that they both appear
on the same line.
o. We are now ready to prepare this note for sharing. Create a clone copy of this note
named Data Visualization Clone. Also export a copy of the note.
p. On the Data Visualization note we are going to share, hide the code for all paragraphs.
Then hide the output for every paragraph except for the two that are on the same line.
q. Next, convert this from the default view to report view. Now the URL to this note is
ready to share with your stakeholders.
r. Import the copy of this note you made earlier and name the new note Data Visualization
Imported. Confirm that the copy contains all original code and formatting.
Result
You have successfully created and manipulated Zeppelin visualizations, made them available for
collaboration, and used Zeppelin to create a shareable report.
Lab Steps
Perform the following steps:
1. Monitor a core RDD programming job.
a. Open the Firefox browser and access your Zeppelin notebook.
http://sandbox:9995/
b. From the home page, select the Application Monitoring Python Note. This note has
prebuilt code that we will run to generate Spark job activity.
c. At the top right click on the gear icon to change interpreter binding. Your administrator
has enabled an interpreter called “spark yarn-client” which is configured for the HDP
cluster you are using. Drag it to the top of the list of interpreters, and click the Save
button.
NOTE:
The first interpreter on the list is treated as the default interpreter. Scroll down to find the
Save button.
d. Now run the code by hitting Play button or by pressing Shift + Enter.
NOTE:
The below code is for reference purposes and has already been placed in the note.
%pyspark
months = ("Jan", "Feb", "March", "April", "May", "June", "July")
rddMonths = sc.parallelize(months)
zipWIrdd = rddMonths.zipWithIndex()
print zipWIrdd.collect()
quarters = (1,1,1,2,2,2,3)
rddQuarters = sc.parallelize(quarters)
ZiPrdd = rddMonths.zip(rddQuarters)
print ZiPrdd.collect()
MapValrdd = ZiPrdd.mapValues(lambda mark: (mark, 1));
print MapValrdd.collect()
print MapValrdd.keys().collect()
print MapValrdd.values().collect()
print MapValrdd.sortByKey().collect()
e. Open a new tab on the Firefox browser and enter the following URL to view the Spark
Application UI:
NOTE:
http://sandbox:4040/ will work only once the job is submitted.
NOTE:
The URL http://sandbox:4040/ has been redirected to http://sandbox:8088/proxy/application_ID. Port 8088 belongs to the YARN ResourceManager UI, which tracks the applications that run on YARN. Here our application is "Zeppelin application UI", as noted in the top-right corner of the window.
When you get to the Show Additional Metrics link, try reading about and selecting additional
metrics and view the information they provide. How might this be useful in troubleshooting
application performance problems?
# ssh sandbox
b. Start a new REPL specifying the local machine as the master and allocate two cores for
the streaming application.
c. Set the log level to ERROR to avoid screen clutter while running the streaming
application.
>>> sc.setLogLevel("ERROR")
>>> inputDS.pprint()
h. Start the streaming application.
>>> sscFive.start()
NOTE:
An error will appear when the application starts because the application is waiting for
an input connection.
i. In a second terminal window SSH to sandbox and use the netcat utility to create a
connection to port 9999.
# ssh sandbox
# nc -lkv 9999
j. Start typing words separated by spaces, hitting Enter occasionally to submit them.
Observe what happens in the streaming terminal window a few seconds after hitting
Enter.
k. Once you observe data being streamed on-screen in the first terminal window, use Ctrl
+ C (or Cmd + C if using a Mac) to exit netcat in the second terminal window.
l. Since this is a new SparkContext instance, a new Spark Applications UI should now be
available. Open a new FireFox tab and browse to the Streaming Application UI URL
from before, but replace port 4040 with 4041:
n. When you have located all of the required sections, go back to the first terminal
window, stop the stream and exit the REPL. If the stream refreshes while you are
typing, that will not affect the input. Simply continue to type the command and press
Enter.
sc.stop()
exit()
Result
You have successfully monitored Spark core programming and Spark Streaming jobs using the Spark
Application UI.
Lab Steps
Perform the following steps:
1. Practice using performance tuning techniques.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
c. At the top right click on the gear icon to change interpreter binding.
d. Create an RDD named rdd1 that contains a list of numbers one through nine, then
back down to one again (17 elements total) and set it to eight partitions. Use print to
confirm the RDD was created successfully.
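One possible implementation (a sketch; your list may be written differently):
%pyspark
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2, 1], 8)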
%pyspark
print rdd1.collect()
e. View the default parallelism settings for your environment, and then verify that rdd1
was partitioned with eight partitions instead of the default number.
%pyspark
print sc.defaultParallelism
%pyspark
print rdd1.getNumPartitions()
f. Create an RDD named rdd2 that is a copy of rdd1 but uses only four partitions. Verify
that rdd2 has only four partitions.
%pyspark
rdd2 = rdd1.coalesce(4)
%pyspark
print rdd2.getNumPartitions()
g. Create an RDD named rdd3 that is a copy of rdd2 but expands the number of
partitions from four to six. Verify that rdd3 has six partitions.
%pyspark
rdd3 = rdd2.repartition(6)
print rdd3.getNumPartitions()
h. Create an RDD named rdd4 that contains a larger set of data by combining rdd3,
rdd2, and rdd1. Then view this list of 51 numbers.
%pyspark
rdd4 = rdd3.union(rdd2.union(rdd1))
print rdd4.collect()
i. Create an RDD named rdd5 that turns this list into a Pair RDD using the existing
numbers as keys and assign each key a value of one. View rdd5 to confirm successful
operation.
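One possible implementation:
%pyspark
rdd5 = rdd4.map(lambda num: (num, 1))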
%pyspark
print rdd5.collect()
j. Create an RDD named rdd6 that uses partitionBy() to create eight hashed
partitions from rdd5. View rdd6 to confirm successful operation.
%pyspark
rdd6 = rdd5.partitionBy(8)
print rdd6.collect()
k. Cache rdd6 in memory so that it will be quickly available should we want to use the
hash partitioning in a future operation.
%pyspark
rdd6.cache()
l. Create a new RDD named rdd7 that reduces rdd6 by key. View the results, and pay
attention to the time it took to generate it.
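One possible implementation:
%pyspark
rdd7 = rdd6.reduceByKey(lambda a, b: a + b)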
%pyspark
print rdd7.collect()
m. Create a directory named checkperf in your HDFS home directory, then configure it
as your checkpoint directory for Spark applications.
%sh
hdfs dfs -mkdir checkperf
%pyspark
sc.setCheckpointDir("checkperf")
n. Checkpoint rdd6 so that future operations can use it as the starting point for lineage-
tracking purposes.
%pyspark
rdd6.checkpoint()
o. Open a terminal window and connect to sandbox using SSH. Switch to the zeppelin
user. Then view the contents of the checkperf directory and confirm that a checkpoint
file exists. Then exit the zeppelin user back to root.
# ssh sandbox
# su zeppelin
# hdfs dfs -ls -R checkperf
# exit
1. Create a variable named oddNums that contains a list of odd numbers 1-9.
3. Create a broadcast variable named filterOdd that contains the values in oddNums.
4. Print the results of a filter operation where only numbers that appear in the filterOdd
broadcast variable show up in the output.
%pyspark
print rdd1.collect()
filterOdd = sc.broadcast(oddNums)
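The paragraph above shows only part of the solution. A fuller sketch, assuming the odd numbers 1 through 9, is:
%pyspark
oddNums = [1, 3, 5, 7, 9]
filterOdd = sc.broadcast(oddNums)
print rdd1.filter(lambda num: num in filterOdd.value).collect()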
Result
You have used several of the performance tuning tools and practices discussed in the lesson.
Lab Steps
Perform the following steps:
1. Build and submit a Spark RDD application.
a. Open a terminal and use SSH to connect to sandbox:
# ssh sandbox
b. OPTIONAL:
If you have a favorite Linux text editor already, you may use it for the rest of the lab. If
you are not already familiar with a Linux text editor, we recommend that you download
and install nano – a small, simple to use editor that will be used for the commands and
screenshots in this lab.
yum -y install nano
# cd /root/spark/applications/python/templates/
# nano SparkRDD.py
d. The objective is to build an application based on this template and the comments
posted on this template. You may try to do this on your own, or use the solution steps
below:
e. Exit the text editor and save your changes (in nano, press Ctrl + X to exit and press
Y to save your changes).
NOTE:
This application will now use YARN as the resource manager, with two executors and 1 GB of memory per executor.
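The exact command appears in the course materials. A representative submit command consistent with this note (the script name is taken from the template above; the options shown are an assumption) is:
# spark-submit --master yarn-client --num-executors 2 --executor-memory 1g SparkRDD.py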
Monitor the submitted Job. Open a new tab on the Firefox browser and browse to:
http://sandbox:4040/
IMPORTANT:
The UI below will only be available while the job is running. If you are unable to see the
UI, run the application again and quickly switch to the provided link.
Result
You have successfully built and submitted a Spark application to a YARN cluster.
Lab Steps
Perform the following steps:
1. Import the note, read through it, and run code examples.
a. Open the Firefox browser and enter the following URL to view the Zeppelin UI.
http://sandbox:9995/
NOTE:
Zeppelin's current main backend processing engine is Apache Spark.
https://raw.githubusercontent.com/hortonworks-gallery/zeppelin-
notebooks/master/2BNDT63TY/note.json
Name this note Machine Learning Lab. It should appear in the list of available notes on the
Zeppelin home page.
NOTE:
If for some reason the URL is not working, your instructor should know the location of a
JSON copy of this note that can be imported instead of importing it from an Internet link.
d. Read through the note. A fair number of paragraphs are there for context and
instructions. When you come to the first paragraph that displays code, run the code in
that paragraph and view the results.
e. Continue down the note, reading the descriptions and explanations and running the
code as instructed, until you reach the end of the note.
Result
You have walked through a preconfigured Zeppelin note that contained multiple examples of machine
learning code.