Data Analytics Lab Manual
Academic year 2018-19
Assignment No: 1
Title: Study of Iris Flower Data Set
Problem Statement:
Download the Iris flower dataset or any other dataset into a DataFrame (e.g.
https://archive.ics.uci.edu/ml/datasets/Iris). Use Python/R and perform the following:
How many features are there and what are their types (e.g., numeric, nominal)?
Compute and display summary statistics for each feature available in the dataset (e.g.
minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
Data Visualization: Create a histogram for each feature in the dataset to illustrate the feature
distributions. Plot each histogram.
Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a
single plot. Compare distributions and identify outliers.
Theory:
This is perhaps the best known database to be found in the pattern recognition literature. The data set
contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is
linearly separable from the other 2; the latter are NOT linearly separable from each other.
The dataset is known to contain two errors. The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa",
where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa",
where the errors are in the second and third features.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
• Pandas:
Pandas provides three main data structures -
- Series (1-dimensional)
- DataFrame (2-dimensional)
- Panel (3-dimensional)
As an example, consider a table with the data of a sales team of an organization and their overall
performance ratings. The data is represented in rows and columns: each column represents an
attribute and each row represents a person.
pandas.DataFrame
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
a b c
0 1 2 NaN
1 5 10 20.0
• matplotlib
- matplotlib.pyplot is a collection of command style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates
a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels,
etc.
- In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (please note that "axes" here and in most places in the documentation refers to
the axes part of a figure and not the strict mathematical term for more than one axis).
Example 1:
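A minimal sketch of pyplot usage (an assumed illustration, not from the original manual): it plots a line on the current axes and decorates the plot with labels.
import matplotlib.pyplot as plt
# plot y-values against x-values on the current axes
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
# decorate the plot with labels
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A simple pyplot figure')
# display the current figure
plt.show()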
• Sklearn
Example 1:
# import `datasets` from `sklearn`
from sklearn import datasets
# load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
• How to install?
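The packages used in this assignment can typically be installed with pip (the exact command depends on your Python installation):
pip install pandas matplotlib scikit-learn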
How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of numbers, especially large sets of numbers, by
calculating the center values using mean, mode and median. Use the ranges and
standard deviations of the sets to examine the variability of the data.
Calculating Mean
The mean identifies the average value of the set of numbers. For example,
consider the data set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of all
terms)÷(how many terms or values in the set).
Adding the Data Set
First add all the values in the data set: 20+24+25+36+25+22+23 = 175. Then divide by the
number of data points in the set. This set has seven values, so divide by 7.
Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum
of the values (175) divided by the number of data points (7). Since 175÷7=25, the
mean of this data set equals 25. Not all mean values will equal a whole number.
Calculating Range
Range shows the mathematical distance between the lowest and highest values in the
data set. Range measures the variability of the data set. A wide range indicates greater
variability in the data, or perhaps a single outlier far from the rest of the data. Outliers
may skew, or shift, the mean value enough to impact data analysis.
Identifying Low and High Values
In the sample group, the lowest value is 20 and the highest value is 36.
Calculating Range
To calculate the range, subtract the lowest value from the highest value. Since
36-20 = 16, the range equals 16.
Calculating Standard Deviation
Standard deviation measures the variability of the data set. Like range, a smaller
standard deviation indicates less variability.
Formula
Finding the standard deviation requires squaring the difference between each data
point and the mean, summing all the squares [∑(x-µ)²], dividing that sum by one
less than the number of values (N-1), and finally calculating the square root of the
quotient: s = √( ∑(x-µ)² ÷ (N-1) ).
Mathematically, start with calculating the mean.
Calculating the Mean
Calculate the mean by adding all the data point values, then dividing by the number
of data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum,
175, by the number of data points, 7, or 175÷7=25. The mean equals 25.
Squaring the Difference
Next, subtract the mean from each data point, then square each difference. The
formula looks like this: ∑(x-µ)², where ∑ means sum, x represents each data set value
and µ represents the mean value.
Continuing with the example set, the values become:
20-25 = -5 and (-5)² = 25; 24-25 = -1 and (-1)² = 1; 25-25 = 0 and 0² = 0; 36-25 = 11 and
11² = 121; 25-25 = 0 and 0² = 0; 22-25 = -3 and (-3)² = 9; and 23-25 = -2 and (-2)² = 4.
Adding the Squared Differences
Add the squared differences: 25+1+0+121+0+9+4 = 160. Divide the sum by one less than
the number of data points (7-1 = 6): 160÷6 ≈ 26.67. The standard deviation is the square
root of this quotient: √26.67 ≈ 5.16. Outliers, such as the value 36 in this set,
may represent erroneous data or may suggest unforeseen circumstances and should
be carefully considered when interpreting data.
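As a quick check, these values can be reproduced in Python with the built-in statistics module (a small sketch using the example data set above):
import statistics
data = [20, 24, 25, 36, 25, 22, 23]
print(statistics.mean(data))     # mean: 25
print(max(data) - min(data))     # range: 16
print(statistics.stdev(data))    # sample standard deviation (N-1): about 5.16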
Import packages
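A minimal sketch of the complete assignment is given below. It assumes the UCI file iris.data has been downloaded locally and has no header row, so column names are supplied by hand; adapt the path and names to your copy of the dataset.
import pandas as pd
import matplotlib.pyplot as plt
# load the Iris dataset; the UCI file has no header row
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('iris.data', names=cols)
# features and their types: four numeric features, one nominal class
print(df.dtypes)
# summary statistics: count, mean, std, min, percentiles, max; variance separately
print(df.describe())
print(df.var(numeric_only=True))
# histogram for each feature, to illustrate the feature distributions
df.hist()
plt.show()
# all boxplots combined into a single plot, to compare distributions and spot outliers
df.boxplot()
plt.show()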
Sample Output
Conclusion:
Thus we have learnt and implemented data extraction, summary statistics and visualization
(a histogram and a box plot for each feature), and compared the distributions to identify outliers.
Assignment No: 2
Problem Statement:
Download the Pima Indians Diabetes dataset. Use the Naive Bayes algorithm for classification:
- Load the data from CSV file and split it into training and test datasets.
- Summarize the properties in the training dataset so that we can calculate probabilities and
make predictions.
- Classify samples from a test dataset and a summarized training dataset.
Theory:
Dataset
The dataset includes data from 768 women with the following 8 characteristics:
- Number of times pregnant
- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skin fold thickness
- 2-hour serum insulin
- Body mass index
- Diabetes pedigree function
- Age
The last column of the dataset indicates if the person has been diagnosed with diabetes (1) or not (0).
The Problem
The type of dataset and problem is a classic supervised binary classification. Given a number of
elements all with certain characteristics (features), we want to build a machine learning model to
identify people affected by type 2 diabetes.
To solve the problem we will have to analyze the data, do any required transformation and
normalization, apply a machine learning algorithm, train a model, check the performance of the
trained model and iterate with other algorithms until we find the most performant for our type of
dataset.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem
with the "naive" assumption of independence between every pair of features.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c). Look at the equation below:
P(c|x) = P(x|c) × P(c) / P(x)
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Let’s understand it using an example. Below I have a training data set of weather and the
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on the weather condition. Let’s follow the below steps to
perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g. Overcast probability = 0.29 and
probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.
Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64.
Now, P(Yes|Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which has the higher probability.
Naive Bayes uses a similar method to predict the probabilities of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
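The same calculation can be verified with a few lines of Python, using the probabilities read off the likelihood table:
# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9    # likelihood
p_yes = 9 / 14               # prior probability of the class
p_sunny = 5 / 14             # prior probability of the predictor
print(p_sunny_given_yes * p_yes / p_sunny)   # about 0.60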
1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: summarize the properties in the training dataset so that we can calculate
probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training
dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the
percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone
implementation of the Naive Bayes algorithm; a minimal sketch follows below.
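The sketch below covers these steps using scikit-learn's GaussianNB. It assumes PimaIndiansDiabetes.csv has no header row and that the last column is the 0/1 diabetes outcome; a from-scratch implementation would follow the same six steps.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
# 1. Handle Data: load the CSV file and split it into training and test sets
df = pd.read_csv('PimaIndiansDiabetes.csv', header=None)
X = df.iloc[:, :-1]   # the 8 feature columns
y = df.iloc[:, -1]    # diagnosed with diabetes (1) or not (0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# 2-4. Summarize Data and Make Predictions: GaussianNB summarizes each feature
# per class by its mean and variance, then applies Bayes' theorem to new samples
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# 5. Evaluate Accuracy: percentage correct out of all predictions made,
# plus the confusion matrix required in the Output section
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))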
Applications:
Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast.
Thus, it can be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here we can predict the probabilities of multiple classes of the target variable.
Input:
Structured Dataset : PimaIndiansDiabetes Dataset
File: PimaIndiansDiabetes.csv
Output:
1. Dataset split into training and test sets according to the split ratio.
2. Conditional probability of each feature.
3. Visualization of the performance of the algorithm with a confusion matrix.
Assignment No: 3
Problem Statement:
Use the trip history dataset from a bike-sharing service in the United States. The data is
provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. Predict the class of a user.
Problem Definition:
Theory:
Data Set Information
Bike sharing systems are a new generation of traditional bike rentals in which the whole process,
from membership to rental and return, has become automatic. Through these systems, a user is able
to easily rent a bike at a particular position and return it at another position.
Attribute Information:
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
dteday : date
hr : hour (0 to 23)
temp : Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min),
with t_min = -8 and t_max = +39 (only in hourly scale)
atemp : Normalized feeling temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min),
with t_min = -16 and t_max = +50 (only in hourly scale)
cnt: count of total rental bikes including both casual and registered
Classification Problem: Prediction of the biker's class. Label: Member / Casual.
Classification predicts categorical class labels (discrete or nominal): it classifies data (constructs a
model) based on the training set and the values (class labels) of a classifying attribute, and uses the
model to classify new data.
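A minimal sketch of such a classifier is given below. The file name and the column names ('Duration', 'Start date', 'Member type') are assumptions about the quarterly trip-history CSV; adjust them to the actual headers of the downloaded files.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# load one quarterly trip-history file (file and column names are assumed)
trips = pd.read_csv('2010-Q4-Trips.csv', parse_dates=['Start date'])
# derive simple numeric features from each trip record
trips['start_hour'] = trips['Start date'].dt.hour
trips['weekday'] = trips['Start date'].dt.dayofweek
X = trips[['Duration', 'start_hour', 'weekday']]
y = trips['Member type']   # class label: Member / Casual
# train a decision tree and evaluate it on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))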
Conclusion:
Thus we have used the trip history dataset and learnt to predict the class of a user.
Assignment No: 4
Problem Statement:
Write a Hadoop program that counts the number of occurrences of each word in a text file.
Problem Definition:
Theory:
What is MapReduce ?
Firstly, the map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
Under the MapReduce model, the data processing primitives are called mappers and
reducers.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
• Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
Terminologies
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data is present in advance before any processing takes place.
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
TaskTracker - Tracks the tasks and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows creating and
running MapReduce jobs with any executable or script (for example, a Python script) as the mapper
and/or the reducer.
Steps:
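A minimal word-count sketch with Hadoop Streaming is outlined below: write a mapper script, write a reducer script, then submit the job. The scripts are Python; the streaming jar location is an assumption and varies with the Hadoop installation.
mapper.py:
#!/usr/bin/env python
# mapper.py - emit each word of the input with a count of 1
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
reducer.py:
#!/usr/bin/env python
# reducer.py - the shuffle stage delivers lines sorted by key,
# so all counts for a given word arrive one after another
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Run the job (adjust the jar path and HDFS paths to your installation):
hadoop jar hadoop-streaming.jar -input /user/input/text.txt -output /user/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py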
Conclusion:
Thus we have learnt the Mapper and Reducer concepts and implemented a Hadoop program that
counts the number of occurrences of each word in a text file.