Data Analytics Lab Manual
Academic year 2018-19
Assignment No: 1
Title: Study of Iris Flower Data Set
Problem Statement:
Download the Iris flower dataset or any other dataset into a DataFrame (e.g.
https://archive.ics.uci.edu/ml/datasets/Iris). Use Python/R and perform the following:
How many features are there and what are their types (e.g., numeric, nominal)?
Compute and display summary statistics for each feature available in the dataset (e.g.
minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
Data Visualization: Create a histogram for each feature in the dataset to illustrate the feature
distributions. Plot each histogram.
Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a
single plot. Compare distributions and identify outliers.
Theory:
This is perhaps the best known database to be found in the pattern recognition literature. The data set
contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is
linearly separable from the other 2; the latter are NOT linearly separable from each other.
The dataset is known to contain two errors. The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa",
where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa",
where the errors are in the second and third features.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
• Pandas:
Pandas provides three main data structures -
- Series (1-dimensional)
- DataFrame (2-dimensional)
- Panel (3-dimensional)
As an example, consider a table with the data of a sales team of an organization and their overall
performance ratings. The data is represented in rows and columns: each column represents an
attribute and each row represents a person.
pandas.DataFrame
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
a b c
0 1 2 NaN
1 5 10 20.0
• matplotlib
- matplotlib.pyplot is a collection of command style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates
a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels,
etc.
- In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (please note that "axes" here and in most places in the documentation refers to
the axes part of a figure and not the strict mathematical term for more than one axis).
Example 1:
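A minimal sketch of pyplot usage (an assumed illustration, not from the original manual): it plots a line on the current axes and decorates the plot with labels.
import matplotlib.pyplot as plt
# plot y-values against x-values on the current axes
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
# decorate the plot with labels
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A simple pyplot figure')
# display the current figure
plt.show()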
• Sklearn
Example 1:
# import `datasets` from `sklearn`
from sklearn import datasets
# load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
• How to install?
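The packages used in this assignment can typically be installed with pip (the exact command depends on your Python installation):
pip install pandas matplotlib scikit-learn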
How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of numbers, especially large sets of numbers, by
calculating the center values using mean, mode and median. Use the ranges and
standard deviations of the sets to examine the variability of the data.
Calculating Mean
The mean identifies the average value of the set of numbers. For example,
consider the data set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of all
terms)÷(how many terms or values in the set).
Adding the Data Set
First add all the values in the data set: 20+24+25+36+25+22+23 = 175. Then divide by the
number of data points in the set. This set has seven values, so divide by 7.
Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum
of the values (175) divided by the number of data points (7). Since 175÷7=25, the
mean of this data set equals 25. Not all mean values will equal a whole number.
Calculating Range
Range shows the mathematical distance between the lowest and highest values in the
data set. Range measures the variability of the data set. A wide range indicates greater
variability in the data, or perhaps a single outlier far from the rest of the data. Outliers
may skew, or shift, the mean value enough to impact data analysis.
Identifying Low and High Values
In the sample group, the lowest value is 20 and the highest value is 36.
Calculating Range
To calculate the range, subtract the lowest value from the highest value. Since
36-20 = 16, the range equals 16.
Calculating Standard Deviation
Standard deviation measures the variability of the data set. Like range, a smaller
standard deviation indicates less variability.
Formula
Finding the standard deviation requires squaring the difference between each data
point and the mean, summing all the squares [∑(x-µ)²], dividing that sum by one
less than the number of values (N-1), and finally calculating the square root of the
quotient: s = √( ∑(x-µ)² ÷ (N-1) ).
Mathematically, start with calculating the mean.
Calculating the Mean
Calculate the mean by adding all the data point values, then dividing by the number
of data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum,
175, by the number of data points, 7, or 175÷7=25. The mean equals 25.
Squaring the Difference
Next, subtract the mean from each data point, then square each difference. The
formula looks like this: ∑(x-µ)², where ∑ means sum, x represents each data set value
and µ represents the mean value.
Continuing with the example set, the values become:
20-25 = -5 and (-5)² = 25; 24-25 = -1 and (-1)² = 1; 25-25 = 0 and 0² = 0; 36-25 = 11 and
11² = 121; 25-25 = 0 and 0² = 0; 22-25 = -3 and (-3)² = 9; and 23-25 = -2 and (-2)² = 4.
Adding the Squared Differences
Add the squared differences: 25+1+0+121+0+9+4 = 160. Divide the sum by one less than
the number of data points (7-1 = 6): 160÷6 ≈ 26.67. The standard deviation is the square
root of this quotient: √26.67 ≈ 5.16. Outliers, such as the value 36 in this set,
may represent erroneous data or may suggest unforeseen circumstances and should
be carefully considered when interpreting data.
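As a quick check, these values can be reproduced in Python with the built-in statistics module (a small sketch using the example data set above):
import statistics
data = [20, 24, 25, 36, 25, 22, 23]
print(statistics.mean(data))     # mean: 25
print(max(data) - min(data))     # range: 16
print(statistics.stdev(data))    # sample standard deviation (N-1): about 5.16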
Import packages
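A minimal sketch of the complete assignment is given below. It assumes the UCI file iris.data has been downloaded locally and has no header row, so column names are supplied by hand; adapt the path and names to your copy of the dataset.
import pandas as pd
import matplotlib.pyplot as plt
# load the Iris dataset; the UCI file has no header row
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('iris.data', names=cols)
# features and their types: four numeric features, one nominal class
print(df.dtypes)
# summary statistics: count, mean, std, min, percentiles, max; variance separately
print(df.describe())
print(df.var(numeric_only=True))
# histogram for each feature, to illustrate the feature distributions
df.hist()
plt.show()
# all boxplots combined into a single plot, to compare distributions and spot outliers
df.boxplot()
plt.show()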
Sample Output
Conclusion:
Thus we have learnt and implemented data extraction, summary statistics and visualization
(a histogram and a box plot for each feature), and compared the distributions to identify outliers.
Assignment No: 2
Problem Statement:
Download the Pima Indians Diabetes dataset. Use the Naive Bayes algorithm for classification:
- Load the data from CSV file and split it into training and test datasets.
- Summarize the properties in the training dataset so that we can calculate probabilities and
make predictions.
- Classify samples from a test dataset and a summarized training dataset.
Theory:
Dataset
The dataset includes data from 768 women with the following 8 characteristics:
- Number of times pregnant
- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skin fold thickness
- 2-hour serum insulin
- Body mass index
- Diabetes pedigree function
- Age
The last column of the dataset indicates if the person has been diagnosed with diabetes (1) or not (0).
The Problem
The type of dataset and problem is a classic supervised binary classification. Given a number of
elements all with certain characteristics (features), we want to build a machine learning model to
identify people affected by type 2 diabetes.
To solve the problem we will have to analyze the data, do any required transformation and
normalization, apply a machine learning algorithm, train a model, check the performance of the
trained model and iterate with other algorithms until we find the most performant for our type of
dataset.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem
with the "naive" assumption of independence between every pair of features.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c). Look at the equation below:
P(c|x) = P(x|c) × P(c) / P(x)
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Let’s understand it using an example. Below I have a training data set of weather and the
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on the weather condition. Let’s follow the below steps to
perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g. Overcast probability = 0.29 and
probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.
Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64.
Now, P(Yes|Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which has the higher probability.
Naive Bayes uses a similar method to predict the probabilities of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
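The same calculation can be verified with a few lines of Python, using the probabilities read off the likelihood table:
# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9    # likelihood
p_yes = 9 / 14               # prior probability of the class
p_sunny = 5 / 14             # prior probability of the predictor
print(p_sunny_given_yes * p_yes / p_sunny)   # about 0.60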
1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: summarize the properties in the training dataset so that we can calculate
probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training
dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the
percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone
implementation of the Naive Bayes algorithm; a minimal sketch follows below.
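The sketch below covers these steps using scikit-learn's GaussianNB. It assumes PimaIndiansDiabetes.csv has no header row and that the last column is the 0/1 diabetes outcome; a from-scratch implementation would follow the same six steps.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
# 1. Handle Data: load the CSV file and split it into training and test sets
df = pd.read_csv('PimaIndiansDiabetes.csv', header=None)
X = df.iloc[:, :-1]   # the 8 feature columns
y = df.iloc[:, -1]    # diagnosed with diabetes (1) or not (0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# 2-4. Summarize Data and Make Predictions: GaussianNB summarizes each feature
# per class by its mean and variance, then applies Bayes' theorem to new samples
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# 5. Evaluate Accuracy: percentage correct out of all predictions made,
# plus the confusion matrix required in the Output section
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))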
Applications:
Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast.
Thus, it can be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here we can predict the probabilities of multiple classes of the target variable.
Input:
Structured Dataset : PimaIndiansDiabetes Dataset
File: PimaIndiansDiabetes.csv
Output:
1. Dataset split into training and test sets according to the split ratio.
2. Conditional probability of each feature.
3. Visualization of the performance of the algorithm with a confusion matrix.
Assignment No: 3
Problem Statement:
Use the trip history dataset from a bike-sharing service in the United States. The data is
provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. Predict the class of a user.
Problem Definition:
Theory:
Data Set Information
Bike sharing systems are a new generation of traditional bike rentals in which the whole process,
from membership to rental and return, has become automatic. Through these systems, a user is able
to easily rent a bike at a particular position and return it at another position.
Attribute Information:
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
dteday : date
hr : hour (0 to 23)
temp : Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min),
with t_min = -8 and t_max = +39 (only in hourly scale)
atemp : Normalized feeling temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min),
with t_min = -16 and t_max = +50 (only in hourly scale)
cnt: count of total rental bikes including both casual and registered
Classification Problem: Prediction of the biker's class. Label: Member / Casual.
Classification predicts categorical class labels (discrete or nominal): it classifies data (constructs a
model) based on the training set and the values (class labels) of a classifying attribute, and uses the
model to classify new data.
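A minimal sketch of such a classifier is given below. The file name and the column names ('Duration', 'Start date', 'Member type') are assumptions about the quarterly trip-history CSV; adjust them to the actual headers of the downloaded files.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# load one quarterly trip-history file (file and column names are assumed)
trips = pd.read_csv('2010-Q4-Trips.csv', parse_dates=['Start date'])
# derive simple numeric features from each trip record
trips['start_hour'] = trips['Start date'].dt.hour
trips['weekday'] = trips['Start date'].dt.dayofweek
X = trips[['Duration', 'start_hour', 'weekday']]
y = trips['Member type']   # class label: Member / Casual
# train a decision tree and evaluate it on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))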
Conclusion:
Thus we have used the trip history dataset and learnt to predict the class of a user.
Assignment No: 4
Problem Statement:
Write a Hadoop program that counts the number of occurrences of each word in a text file.
Problem Definition:
Theory:
What is MapReduce ?
Firstly, the map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
Under the MapReduce model, the data processing primitives are called mappers and
reducers.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
• Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
Terminologies
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data is present in advance before any processing takes place.
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
TaskTracker - Tracks the tasks and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows creating and
running MapReduce jobs with any executable or script (for example, a Python script) as the mapper
and/or the reducer.
Steps:
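A minimal word-count sketch with Hadoop Streaming is outlined below: write a mapper script, write a reducer script, then submit the job. The scripts are Python; the streaming jar location is an assumption and varies with the Hadoop installation.
mapper.py:
#!/usr/bin/env python
# mapper.py - emit each word of the input with a count of 1
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
reducer.py:
#!/usr/bin/env python
# reducer.py - the shuffle stage delivers lines sorted by key,
# so all counts for a given word arrive one after another
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Run the job (adjust the jar path and HDFS paths to your installation):
hadoop jar hadoop-streaming.jar -input /user/input/text.txt -output /user/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py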
Conclusion:
Thus we have learnt the Mapper and Reducer concepts and implemented a Hadoop program that
counts the number of occurrences of each word in a text file.