ESDL Lab Manual

This document outlines 13 labs focused on developing employability skills in data science and big data analytics. The labs cover an introduction to the data environment and tools such as RStudio; basic statistics and visualization in R; analytical methods including K-means clustering, association rules, linear regression, logistic regression, naive Bayesian classification, decision trees, and time series analysis with ARIMA; working with Hadoop, HDFS, MapReduce and Pig; in-database analytics; and a final big data exercise. Each lab has learning objectives and specific tasks for students to complete to gain experience with these analytical techniques and systems.

Uploaded by

anbhute3484

314448 : Employability Skill Development Laboratory

Subject: DATA SCIENCE AND BIG DATA ANALYTICS
Laboratory Assignments
Open-source software used:
1. RStudio
2. Greenplum Hadoop
3. WinSCP
4. pgAdmin
5. PuTTY
6. PostgreSQL
7. VMware
Lab. 1 Introduction to Data Environment
Purpose
The first lab introduces the Analytics Lab Environment you will be working in throughout the
course.
After completing the tasks in this lab, students should be able to:
a. Authenticate and access the Virtual Machine (VM) assigned to them for all lab exercises
b. Use SQL and meta-commands in PSQL to navigate through the data sets
c. Create subsets of the data, using table joins and filters, for analysis in subsequent lab
exercises
Tasks
Tasks performed in this lab exercise include:
a. Exploring databases and data sets
b. Using PSQL statements and meta-commands
c. Creating subsets of data for use in subsequent lab exercises
Lab. Reference
a. PSQL Commands Quick Reference
b. PSQL Meta Commands Quick Reference
c. Surviving LINUX Quick Reference
d. R Quick Reference
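
Creating a subset with a table join and a filter (objective c above) can be sketched as follows. The lab itself uses PSQL against the course database; this Python example substitutes an in-memory SQLite database so that it is self-contained. The table and column names (regions, households, high_income, income) and the data values are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the lab's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE households (id INTEGER PRIMARY KEY, region_id INTEGER, income REAL);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO households VALUES
        (1, 1, 42000), (2, 1, 58000), (3, 2, 31000), (4, 2, 77000);
""")

# Build a subset table with a join and a filter, for reuse in later steps.
conn.execute("""
    CREATE TABLE high_income AS
    SELECT h.id, r.name AS region, h.income
    FROM households AS h
    JOIN regions AS r ON r.region_id = h.region_id
    WHERE h.income > 50000
""")

subset = conn.execute(
    "SELECT id, region, income FROM high_income ORDER BY id"
).fetchall()
print(subset)
```

Storing the subset as its own table mirrors the lab's goal of preparing data once and reusing it in subsequent exercises.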

*****
Lab. 2 Introduction to R
Purpose
This lab introduces you to the use of the R statistical package within the Data Science and Big
Data Analytics environment.
After completing the tasks in this lab, students should be able to:
a) Read data sets into R, save them, and examine the contents
Tasks students will complete in this lab include:
a) Invoke the R environment and examine the R workspace
b) Read tables created in Lab 1 into the R statistical package
c) Examine, manipulate and save data sets
d) Exit the R environment
Lab. Reference
a) R Commands Quick Reference
b) Surviving LINUX Quick Reference

*****
Lab. 3 Basic Statistics, Visualization, and Hypothesis Tests
Purpose
The lab introduces you to the analysis of data using the R statistical package within the Data
Science and Big Data Analytics environment.
After completing the tasks in this lab, students should be able to:
a) Perform summary (descriptive) statistics on the data sets
b) Create basic visualizations using R to support both investigation and exploration of the
data
c) Create plot visualizations of the data using a graphics package
d) Test a hypothesis about the data
Tasks students will complete in this lab include:
a) Reload data sets into the R statistical package
b) Perform summary statistics on the data
c) Remove outliers from the data
d) Plot the data using R
e) Plot the data using lattice and ggplot
f) Test a hypothesis about the data
Lab. Reference
a) R Commands - Quick Reference
b) Surviving LINUX - Quick Reference
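
The summary statistics, outlier removal and hypothesis test tasks above can be illustrated with a short sketch. The lab performs these steps in R; this Python version uses only the standard library. The sample values, the 1.5 * IQR outlier rule and the hypothesized mean mu0 are all assumptions made for illustration.

```python
import math
import statistics

# Hypothetical sample; the value 95 is an obvious outlier.
values = [12, 14, 15, 15, 16, 17, 18, 19, 95]

# Summary (descriptive) statistics on the raw data.
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Remove outliers with the 1.5 * IQR rule (one common convention, assumed here).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
cleaned = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]

# One-sample t statistic for H0: population mean == mu0 (mu0 is hypothetical).
mu0 = 15
n = len(cleaned)
t_stat = (statistics.mean(cleaned) - mu0) / (statistics.stdev(cleaned) / math.sqrt(n))

print(summary)
print(cleaned)
print(round(t_stat, 3))
```

To complete the test, the t statistic is compared against a t distribution with n - 1 degrees of freedom; the standard library has no t distribution, so that step is left out of the sketch.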

*****
Lab. 4 K-means Clustering


Purpose
This lab is designed to investigate and practice K-means Clustering.
After completing the tasks in this lab, you should be able to:
a) Use R functions to create K-means Clustering models
b) Use an ODBC connection to the database to execute SQL statements and load data sets from
the database into an R environment
c) Visualize the effectiveness of the K-means Clustering algorithm using graphic capabilities
in R
d) Use MADlib functions for K-means clustering
Tasks students will complete in this lab include:
a) Use the RStudio environment to code K-means Clustering models
b) Use the ODBC connection in the R environment to create the average household income from
the census database as test data for K-means Clustering
c) Use R graphics functions to visualize the effectiveness of the K-means Clustering
algorithm
d) Use MADlib functions for K-means clustering
Lab. Reference
http://www.statmethods.net/advstats/cluster.html (originally from Everitt & Hothorn).
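
As a rough illustration of what a K-means model does, the sketch below implements the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is not the R or MADlib implementation used in the lab. The sample points and the deterministic choice of starting centroids are assumptions made so that the example is reproducible.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Cluster a list of coordinate tuples; returns (centroids, labels)."""
    # Deterministic start: k points spread evenly through the input (assumption).
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)

# Within-cluster sum of squares: a simple measure of clustering effectiveness.
wss = sum(dist2(p, centroids[label]) for p, label in zip(points, labels))
print(centroids, labels, round(wss, 3))
```

Plotting the within-cluster sum of squares for several values of k is one common way to visualize how effective the clustering is.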

*****
Lab. 5 Association Rules
Purpose
This lab is designed to investigate and practice Association Rules.
After completing the tasks in this lab, students should be able to:
a) Use R functions for Association Rule based models
Tasks
Tasks students will complete in this lab include:
a) Use the RStudio environment to code Association Rule models
b) Apply constraints in the Market Basket Analysis methods, such as minimum thresholds on
support and confidence measures, that can be used to select interesting rules from the set
of all possible rules
c) Use the R package "arules" to execute and inspect the models and the effect of the various
thresholds
Lab. Reference
The Groceries data set - provided for arules by Michael Hahsler, Kurt Hornik and Thomas
Reutterer. http://rss.acs.unt.edu/Rdoc/library/arules/html/Groceries.html
1. Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic
data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt,
A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge
Engineering, Studies in Classification, Data Analysis, and Knowledge Organization,
pages 598-605. Springer-Verlag.
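
The support and confidence thresholds mentioned in the tasks can be made concrete with a small sketch. It is written in plain Python rather than with the arules package, it only considers rules with one item on each side, and the five transactions are hypothetical (they are not taken from the Groceries data set).

```python
from itertools import permutations

# Hypothetical market baskets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def find_rules(min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    found = []
    for lhs, rhs in permutations(items, 2):
        rule_support = support({lhs, rhs})
        if rule_support == 0:
            continue
        confidence = rule_support / support({lhs})
        if rule_support >= min_support and confidence >= min_confidence:
            found.append((lhs, rhs, rule_support, confidence))
    return found


# Raising the support threshold prunes rules that rest on few transactions.
for threshold in (0.3, 0.5):
    print(threshold, [(lhs, rhs) for lhs, rhs, _, _ in find_rules(threshold, 0.8)])
```

Here confidence(lhs -> rhs) is support of both items divided by support of lhs, so a rule can have high confidence while still being rare; this is why both thresholds are applied together.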

*****
Lab. 6 Linear Regression
Purpose
This lab is designed to investigate and practice the Linear Regression method.
After completing the tasks in this lab, students should be able to:
a) Use R functions for Linear Regression (Ordinary Least Squares - OLS)
b) Predict the dependent variables based on the model
c) Investigate different statistical parameter tests that measure the effectiveness of the model
Tasks
Tasks students will complete in this lab include:
a) Use the RStudio environment to code OLS models
b) Review the methodology to validate the model and predict the dependent variable for a
set of given independent variables
c) Use R graphics functions to visualize the results generated with the model
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
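
For a single independent variable, the OLS fit, a prediction and one measure of model effectiveness (R squared) can be sketched as below. This is a plain Python illustration using the closed-form least-squares formulas, not the R code used in the lab, and the data values are hypothetical.

```python
def fit_ols(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical observations of an independent (x) and dependent (y) variable.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_ols(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
prediction = intercept + slope * 6  # predict y for a new x value

print(round(slope, 3), round(intercept, 3), round(r2, 4), round(prediction, 2))
```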

*****
Lab. 7 Logistic Regression
Purpose
This lab is designed to investigate and practice the Logistic Regression method.
After completing the tasks in this lab, students should be able to:
a. Use R functions for Logistic Regression (also known as Logit)
b. Predict the dependent variables based on the model
c. Investigate different statistical parameter tests that measure the effectiveness of the model
Tasks students will complete in this lab include:
a. Use the RStudio environment to code Logit models
b. Review the methodology to validate the model and predict the dependent variable for a
set of given independent variables
c. Use R graphics functions to visualize the results generated with the model
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
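
A Logit model with one predictor can be sketched as follows. This Python example fits the model by plain gradient descent, which is an assumption made to keep the code short; it is not the fitting procedure of the R functions used in the lab. The data (hours and passed) and the learning rate are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: hours of preparation (x) and a binary outcome (y).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

# Center the predictor so that plain gradient descent converges quickly.
mean_hours = sum(hours) / len(hours)
xs = [h - mean_hours for h in hours]

w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    # Gradient of the average log-loss with respect to w and b.
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, passed)]
    w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    b -= learning_rate * sum(errors) / len(xs)


def predict_proba(h):
    """Predicted probability of the positive outcome for h hours."""
    return sigmoid(w * (h - mean_hours) + b)


predictions = [1 if predict_proba(h) > 0.5 else 0 for h in hours]
print(predictions, round(predict_proba(2.75), 3))
```

This toy data set is perfectly separable, so the weight keeps growing as the number of iterations increases; real data sets usually overlap and give finite coefficients.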
*****
Lab. 8 Naive Bayesian Classifier


Purpose
This lab is designed to investigate and practice the Naive Bayesian Classifier analytic technique.
After completing the tasks in this lab, you should be able to:
a. Use R functions for Naive Bayesian Classification
b. Apply the requirements for generating appropriate training data
c. Validate the effectiveness of the Naive Bayesian Classifier with the big data
Tasks students will complete in this lab include:
a. Use the RStudio environment to code the Naive Bayesian Classifier
b. Use the ODBC connection to the "census" database to create a training data set for the Naive
Bayesian Classifier from the big data
c. Use the Naive Bayesian Classifier program and evaluate how well it predicts the results
using the training data, and then compare the results with the original data
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference
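
The train-then-evaluate workflow described in the tasks can be sketched with a small categorical Naive Bayes classifier. This is a plain Python illustration, not the R code used in the lab. The training rows, the feature names (education, employment) and the use of add-one (Laplace) smoothing are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (features, income label).
training = [
    ({"education": "college", "employment": "full"}, "high"),
    ({"education": "college", "employment": "full"}, "high"),
    ({"education": "college", "employment": "part"}, "low"),
    ({"education": "school", "employment": "full"}, "low"),
    ({"education": "school", "employment": "part"}, "low"),
    ({"education": "college", "employment": "full"}, "high"),
]


def train(rows):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature, label) -> value -> count
    values = defaultdict(set)            # feature -> set of observed values
    for features, label in rows:
        for name, value in features.items():
            value_counts[(name, label)][value] += 1
            values[name].add(value)
    return class_counts, value_counts, values


def predict(model, features):
    """Return the label with the highest (unnormalized) posterior score."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(label)
        for name, value in features.items():
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (value_counts[(name, label)][value] + 1) / (n + len(values[name]))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
# Evaluate on the training data and compare with the original labels.
correct = sum(predict(model, features) == label for features, label in training)
accuracy = correct / len(training)
print(accuracy, predict(model, {"education": "school", "employment": "part"}))
```

Accuracy on the training data is usually optimistic; holding out part of the data for testing gives a fairer estimate.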

*****
Lab. 9 Decision Trees
Purpose
This lab is designed to investigate and practice Decision Tree (DT) models covered in the course
work.
After completing the tasks in this lab, students should be able to:
a. Use R functions for Decision Tree models
b. Predict the outcome of an attribute based on the model
Tasks students will complete in this lab include:
a. Use the RStudio environment to code Decision Tree Models
b. Build a Decision Tree Model based on data whose schema is composed of attributes
c. Predict the outcome of one attribute based on the model
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference

*****
Lab. 10 Time Series Analysis with ARIMA


Purpose
This lab is designed to investigate and practice Time Series Analysis with ARIMA models
(Box-Jenkins methodology).
After completing the tasks in this lab, students should be able to:
a. Use R functions for ARIMA models
b. Apply the requirements for generating appropriate training data
c. Validate the effectiveness of the ARIMA models
Tasks students will complete in this lab include:
a. Use the RStudio environment to code ARIMA models
b. Use the ODBC connection to the database to create the weekly sales data from the retail
database
c. Prepare the data (sorting and rendering the data as a Time series)
d. Generate a model and evaluate how well it predicts the results and compare the results
with original data
Lab. Reference
a. R Commands - Quick Reference
b. Surviving LINUX - Quick Reference

*****
Lab. 11 Hadoop, HDFS, MapReduce and Pig
Purpose
This lab introduces the Hadoop and MapReduce environment that you will be working on for the
next lab.
After completing the tasks in this lab, students should be able to:
a. Get help on the various Hadoop commands
b. Observe a MapReduce job in action
c. Query various Hadoop servers regarding status
d. Understand and execute "Pig" statements
Tasks students will complete in this lab include:
a. Run the hadoop and hadoop fs commands and collect help information
b. Run a shell script to perform a word count activity
c. Run a MapReduce job to produce similar output
d. Investigate the UI for MapReduce/HDFS components to track system behavior
e. Run "Pig" statements to execute the same tasks done with MapReduce
Lab. Reference
References used in this lab are located in your Student Resource Guide. See the Guide for:
a. Hadoop Commands
b. HDFS Commands
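
The word count activity follows the map, shuffle and reduce pattern, which can be sketched in a single process as below. This Python example only imitates the phases that Hadoop distributes across a cluster; it does not use the Hadoop or Pig APIs, and the input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


# Hypothetical input split into lines.
lines = ["the quick brown fox", "the lazy dog", "The fox"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real job, the map and reduce functions run on different nodes and the shuffle moves data between them, which is what the MapReduce/HDFS UI in task d lets you observe.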
*****
Lab. 12 In-database Analytics


Purpose
This lab is designed to familiarize you and give you practice with the in-database analytics
methods covered in lessons three and four of Module 5.
After completing the tasks in this lab, students should be able to:
a. Use window functions
b. Implement user-defined aggregates and user-defined functions
c. Use ordered aggregates
d. Use Regular Expressions (Regex) in SQL for text filtering
e. Use MADlib functions and plot results from MADlib function outputs
Tasks students will complete in this lab include:
a. Process Clickstream analysis data using window functions, user-defined functions,
user-defined aggregates and regular expressions
b. Compute median household income using ordered aggregates
c. Use MADlib functions for logistic regression and direct output to plot the results
Lab. Reference
Student Resource Guide; http://doc.madlib.net/v0.2beta/group__grp__logreg.html

*****
Lab. 13 Final Lab Exercise on Big Data Analytics
Purpose
This lab allows students to apply what they have learned from the analytical methods and tools to
a big data problem using the Analytics Lab Environment.
Tasks students will complete in this lab include:
a. Explore the big data set provided and prepare the data for analysis
b. Assess data quality, outliers and training sets
c. Conduct model selection, code, execute and score the model
d. Use R and PSQL statements during your analysis of big data
e. Create a narrative summary of your findings, using the methods covered earlier in this
module
Lab. Reference
The following files are pre-loaded in this lab:
Analyst.ppt    Analyst presentation template
Sponsor.ppt    Sponsor presentation template
*.asc          Encrypted files with suggested code for the solution
To decrypt these files, run the following command at the $ prompt in the FINAL_LAB directory:
gpg -o *.* -d *.*.asc
(*.* stands for the file name with its extension)
You will be prompted for a passphrase. Your instructor will provide the passphrase.

*****