ESDL Lab Manual
*****
Data Science and Big Data Analytics
Lab. 2 Introduction to R
Purpose
This lab introduces you to the use of the R statistical package within the Data Science and Big
Data Analytics environment. After completing the tasks in this lab, students should be able to:
a) Read data sets into R, save them, and examine the contents
Tasks students will complete in this lab include:
a) Invoke the R environment and examine the R workspace
b) Read tables created in Lab 1 into the R statistical package
c) Examine, manipulate and save data sets
d) Exit the R environment
Lab. Reference
a) R Commands Quick Reference
b) Surviving LINUX Quick Reference
*****
Lab. 3 Basic Statistics, Visualization, and Hypothesis Tests
Purpose
The lab introduces you to the analysis of data using the R statistical package within the Data
Science and Big Data Analytics environment.
After completing the tasks in this lab, students should be able to:
a) Perform summary (descriptive) statistics on the data sets
b) Create basic visualizations using R to support both investigation and exploration
of the data
c) Create plot visualizations of the data using a graphics package
d) Test a hypothesis about the data
Tasks students will complete in this lab include:
a) Reload data sets into the R statistical package
b) Perform summary statistics on the data
c) Remove outliers from the data
d) Plot the data using R
e) Plot the data using lattice and ggplot
f) Test a hypothesis about the data
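The summary-statistics and outlier-removal tasks above can be sketched as follows. This is a minimal illustration in Python rather than R, using only the standard library. The sample values, the function names summarize and remove_outliers, and the choice of the 1.5 x IQR rule for outliers are assumptions; the lab does not state which rule it uses.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def remove_outliers(values, k=1.5):
    """Keep only points inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sample with one obviously extreme value.
incomes = [42, 45, 47, 50, 52, 55, 58, 400]

stats = summarize(incomes)
cleaned = remove_outliers(incomes)

print(stats)
print(cleaned)
```

Comparing the mean before and after cleaning shows how strongly a single extreme value can pull it, which is one reason to inspect the summary before testing a hypothesis.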
Lab. Reference
a) R Commands - Quick Reference
b) Surviving LINUX - Quick Reference
*****
Lab. 5 Association Rules
Purpose
This lab is designed to investigate and practice Association Rules.
After completing the tasks in this lab, students should be able to:
a. Use R functions for Association Rule based models
Tasks
Tasks students will complete in this lab include:
a) Use the RStudio environment to code Association Rule models
b) Apply constraints in the Market Basket Analysis methods, such as minimum thresholds on
support and confidence measures, that can be used to select interesting rules from the set
of all possible rules
c) Use the R "arules" package to execute and inspect the models and the effect of the various
thresholds
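To make the support and confidence thresholds concrete, here is a minimal sketch in Python (standard library only) rather than R. It is not how "arules" is implemented. The baskets, the function names support and rules, and the threshold values are all hypothetical. Support of an itemset is the fraction of baskets containing it, and the confidence of a rule {lhs} -> {rhs} is support({lhs, rhs}) / support({lhs}).

```python
from itertools import combinations

# Hypothetical market baskets; the lab itself uses the Groceries data set.
transactions = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rules(baskets, min_support, min_confidence):
    """Enumerate one-item rules {lhs} -> {rhs} that pass both thresholds."""
    items = sorted(set().union(*baskets))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            pair_support = support({lhs, rhs}, baskets)
            if pair_support < min_support:
                continue  # too rare to be interesting
            confidence = pair_support / support({lhs}, baskets)
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found


selected = rules(transactions, min_support=0.4, min_confidence=0.7)
for lhs, rhs, s, c in selected:
    print(f"{{{lhs}}} -> {{{rhs}}}  support={s:.2f}  confidence={c:.2f}")
```

Raising either threshold shrinks the set of selected rules, which is the effect the lab asks you to inspect.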
Lab. Reference
The groceries data set - provided for arules by Michael Hahsler, Kurt Hornik and Thomas
Reutterer. http://rss.acs.unt.edu/Rdoc/library/arules/html/Groceries.html
1. Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic
data modeling for mining association rules.
*****
Lab. 6 Linear Regression
Purpose
This lab is designed to investigate and practice the Linear Regression method.
After completing the tasks in this lab, students should be able to:
a) Use R functions for Linear Regression (Ordinary Least Squares - OLS)
b) Predict the dependent variables based on the model
c) Investigate different statistical parameter tests that measure the effectiveness of the model
Tasks
Tasks students will complete in this lab include:
a) Use the RStudio environment to code OLS models
b) Review the methodology to validate the model and predict the dependent variable for a
set of given independent variables
c) Use R graphics functions to visualize the results generated with the model
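The fit, validate and predict steps above can be sketched for a single independent variable as follows. This is a minimal Python illustration (standard library only) of the closed-form OLS estimates, not the R workflow used in the lab. The data values and the function names fit_ols and r_squared are hypothetical.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations of one independent and one dependent variable.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
prediction = intercept + slope * 6  # predict y for a new x

print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r2:.4f}")
print(f"predicted y at x=6: {prediction:.2f}")
```

R-squared is only one of the effectiveness measures the lab refers to; coefficient tests and residual checks are not covered by this sketch.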
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
*****
Lab. 7 Logistic Regression
Purpose
This lab is designed to investigate and practice the Logistic Regression method.
After completing the tasks in this lab, students should be able to:
a. Use R functions for Logistic Regression (also known as Logit)
b. Predict the dependent variables based on the model
c. Investigate different statistical parameter tests that measure the effectiveness of the model
Tasks students will complete in this lab include:
a. Use the RStudio environment to code Logit models
b. Review the methodology to validate the model and predict the dependent variable for a
set of given independent variables
c. Use R graphics functions to visualize the results generated with the model
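As a rough illustration of what a Logit model estimates, the sketch below fits P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) for one independent variable. It is written in Python (standard library only) and uses plain gradient ascent on the log-likelihood, which is a simplification chosen for readability and is not claimed to be the fitting procedure used by R. The data, the function names, the learning rate and the iteration count are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(xs, ys, learning_rate=0.1, iterations=5000):
    """Estimate b0 and b1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        grad0 = sum(errors)
        grad1 = sum(e * x for e, x in zip(errors, xs))
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: hours studied and pass (1) / fail (0).
# The classes overlap, so the estimates stay finite.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logit(hours, passed)
p_low = sigmoid(b0 + b1 * 1)   # predicted probability for x = 1
p_high = sigmoid(b0 + b1 * 8)  # predicted probability for x = 8

print(f"b0={b0:.3f} b1={b1:.3f}")
print(f"P(pass | 1 hour)={p_low:.3f}  P(pass | 8 hours)={p_high:.3f}")
```

A positive b1 means the predicted probability rises with x; classifying with a 0.5 cutoff is a common convention, not something the lab prescribes.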
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
*****
Lab. 9 Decision Trees
Purpose
This lab is designed to investigate and practice Decision Tree (DT) models covered in the course
work.
After completing the tasks in this lab, students should be able to:
a. Use R functions for Decision Tree models
b. Predict the outcome of an attribute based on the model
Tasks students will complete in this lab include:
a. Use the RStudio environment to code Decision Tree Models
b. Build a Decision Tree Model based on data whose schema is composed of attributes
c. Predict the outcome of one attribute based on the model
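The core step in building a Decision Tree, choosing the attribute test that best separates the outcome, can be sketched as below. This Python example (standard library only) builds a single-level tree using Gini impurity as the split criterion. The criterion, the data rows and all function names are assumptions for illustration, since the lab does not name the R function or criterion it uses.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, target):
    """Return (attribute, value, weighted_gini) for the best test attr == value."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        for value in sorted({row[attr] for row in rows}):
            left = [r[target] for r in rows if r[attr] == value]
            right = [r[target] for r in rows if r[attr] != value]
            if not left or not right:
                continue  # the test does not split the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (attr, value, score)
    return best


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical records: the schema is two attributes plus the outcome "play".
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

attr, value, score = best_split(rows, "play")
if_equal = majority([r["play"] for r in rows if r[attr] == value])
otherwise = majority([r["play"] for r in rows if r[attr] != value])


def predict(row):
    """Predict the outcome with the one-level tree found above."""
    return if_equal if row[attr] == value else otherwise


print(f"split on {attr} == {value!r} (weighted Gini {score:.2f})")
print(predict({"outlook": "overcast", "windy": "yes"}))
```

A full tree repeats the same search inside each branch until the branches are pure or a stopping rule is met.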
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
*****
Lab. 11 Hadoop, HDFS, MapReduce and Pig
Purpose
This lab introduces the Hadoop and MapReduce environment that you will be working on for the
next lab.
After completing the tasks in this lab, students should be able to:
a. Get help on the various Hadoop commands
b. Observe a MapReduce job in action
c. Query various Hadoop servers regarding status
d. Understand and execute "Pig" statements
Tasks students will complete in this lab include:
a. Run the hadoop and hadoop fs commands and collect help information
b. Run a shell script to perform a word count activity
c. Run a MapReduce job to produce similar output
d. Investigate the UI for MapReduce/HDFS components to track system behavior
e. Run "Pig" statements to execute the same tasks done with MapReduce
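The word count activity follows the usual map, shuffle and reduce pattern. The sketch below imitates that pattern in a single Python process (standard library only) so the data flow is easy to follow. It does not use Hadoop, and the input lines and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


# Hypothetical input; in the lab the input is a file stored in HDFS.
lines = ["the quick brown fox", "the lazy dog", "The fox"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(counts)
```

On a cluster the mappers and reducers run in parallel on different nodes, but each one sees the same kind of input and produces the same kind of output as the functions here.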
Lab. Reference
References used in this lab are located in your Student Resource Guide. See the Guide for:
a. Hadoop Commands
b. HDFS Commands
*****
Lab. 13 Final Lab Exercise on Big Data Analytics
Purpose
This lab allows students to apply what they have learned from the analytical methods and tools to
a big data problem using the Analytics Lab Environment.
Tasks students will complete in this lab include:
a. Explore the big data set provided and prepare the data for analysis
b. Assess data quality, outliers and training sets
c. Conduct model selection, code, execute and score the model
d. Use R and PSQL statements during your analysis of big data
e. Create a narrative summary of your findings, using the methods covered earlier in this
module
Lab. Reference
The following files are pre-loaded in this lab:
Analyst.ppt    Analyst presentation template
Sponsor.ppt    Sponsor presentation template
*.asc          Encrypted files with suggested code for the solution
To decrypt these files, run the following command at the $ prompt in the FINAL_LAB directory:
gpg -o *.* -d *.*.asc
(*.* represents the filename with its extension)
You will be prompted for a passphrase, which your instructor will provide.
*****