SL-III Lab Manual
Patil Pratishthan’s
DR. D. Y. PATIL COLLEGE OF ENGINEERING, AKURDI PUNE
Prepared By:
Mr. Pratik Chopade
Mrs. Anagha Jawalkar
Course Code: 317534
Course Name: Software Laboratory III
Teaching Scheme (Hrs./Week): 04
Credits: 02
Course Objectives:
To understand the principles of Data Science for the analysis of real-time problems.
To develop an in-depth understanding and implementation of the key technologies in Data Science.
To analyze and demonstrate knowledge of statistical data analysis techniques for decision-making.
To gain practical, hands-on experience with statistical programming languages and Big Data tools.
Course Outcomes:
CO1: Apply principles of Data Science for the analysis of real time problems
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
CO4: Demonstrate text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
The instructor is expected to frame the assignments by understanding the prerequisites, technological
aspects, utility and recent trends related to the topic. The assignment framing policy needs to address the
average student and include an element to attract and promote the intelligent students. Use of open
source software is encouraged. Based on the concepts learned, the instructor may also set one assignment or mini-
project suitable to the respective branch beyond the scope of the syllabus.
A set of suggested assignments is provided in Groups A and B. Each student must perform 13 assignments
(10 from Group A, 3 from Group B) and 2 mini projects from Group C.
Table of Contents
Sr. No | Title of Experiment | CO Mapping
Group A
1. Data Wrangling I: Perform the following operations using Python on any open source dataset (e.g., data.csv) CO1
1. Import all the required Python Libraries.
2. Locate open source data from the web (e.g.,https://www.kaggle.com).
Provide a clear description of the data and its source (i.e., URL of the
web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using pandas
isnull(), describe() function to get some initial statistics. Provide variable
descriptions. Types of variables etc. Check the dimensions of the data
frame.
5. Data Formatting and Data Normalization: Summarize the types of
variables by checking the data types (i.e., character, numeric, integer,
factor, and logical) of the variables in the data set. If variables are not in
the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
2. Data Wrangling II CO1
Create an “Academic performance” dataset of students and perform the
following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there
are missing values and/or inconsistencies, use any of the suitable
techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any
of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The
purpose of this transformation should be one of the following
reasons: to change the scale for better understanding of the variable,
to convert a non-linear relation into a linear one, or to decrease the
skewness and convert the distribution into a normal distribution.
3. Descriptive Statistics - Measures of Central Tendency and variability CO2
Perform the following operations on any open source dataset (e.g.,
data.csv)
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error
rate, Precision, Recall on the given dataset.
7 Text Analytics CO4
8 Data Visualization I CO5
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and
contains information about the passengers who boarded the
unfortunate Titanic ship. Use the Seaborn library to see if we can find
any patterns in the data.
2. Write a code to check how the price of the ticket (column name:
'fare') for each passenger is distributed by plotting a histogram.
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a
box plot for distribution of age with respect to each gender along with
the information about whether they survived or not. (Column names :
'sex' and 'age')
Download the Iris flower dataset or any other dataset into a DataFrame.
(e.g., https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and
give the inference as:
1. List down the features and their types (e.g., numeric, nominal)
available in the dataset.
Group B
11 Write a code in JAVA for a simple WordCount application that counts CO6
the number of occurrences of each word in a given input set using the
Hadoop MapReduce framework on local-standalone set-up.
12 Design a distributed application using MapReduce which processes a CO6
log file of a system.
Lab Assignment 1
PROBLEM STATEMENT:
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using pandas isnull(), describe() function to get
some initial statistics. Provide variable descriptions. Types of variables etc. Check the dimensions of the
data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the
correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
THEORY:
Wrangling the data is crucial; it is considered the backbone of the entire analysis. The main purpose
of data wrangling is to make raw data usable, in other words, to get the data into shape. On average, data
scientists spend about 75% of their time wrangling data, which is not a surprise at all. The important needs of
data wrangling include:
- It makes sense of the resultant dataset, as it gathers data that acts as a preparation stage for the data mining process.
- It helps to make concrete decisions by cleaning and structuring raw data into the required format.
- To create a transparent and efficient system for data management, the best solution is to have all data in a centralized location so it can be used to improve compliance.
- Wrangling the data helps make decisions promptly and helps the wrangler clean, enrich, and transform the data into a clear picture.
1. DISCOVERING:
Discovering is the step in which the data is understood more deeply. It is a way of learning what the data contains
and how it can be used, which brings out the best approach for the analytic exploration.
2. STRUCTURING:
Raw data usually arrives without any structure, because it comes from many sources in different shapes and sizes.
The data must be organized in such a manner that the analyst can use it in the analysis.
3. CLEANING:
High-quality analysis happens here: every piece of data is checked carefully, and redundancies that do not fit the
analysis are removed. Records containing null values are changed either to an empty string or to zero, and the
formatting is standardized to make the data of higher quality. The goal of data cleaning (or remediation) is to
ensure that the data taken forward for final analysis cannot be influenced by such defects.
4. ENRICHING:
Enriching means adding meaning to the data. In this step, new data is derived from the data that already exists
after cleaning and formatting. This is where you strategize about the data in hand and make sure that what you
have is the best-enriched data. Common ways to obtain refined data are to downsample, upsample, and augment the data.
5. VALIDATING:
Data quality rules are used to analyze and evaluate the quality of a specific dataset. After processing the data,
its quality and consistency are verified, which also establishes a strong basis for addressing security issues.
Validation is conducted along multiple dimensions and must adhere to syntactic constraints.
6. PUBLISHING:
Publishing is the final part of data wrangling and serves the sole purpose of the entire process. Analysts prepare
the wrangled data for use further down the line, which is its purpose after all. The finalized data must match the
format required by its eventual target. The prepared data can now be used for analytics.
Pandas is an open-source Python library mainly used for data analysis. Data wrangling with pandas covers the
following functionalities:
- Data exploration: visualization of the data to analyze and understand it.
- Dealing with missing values: missing values are a common issue when working with large datasets, and care must be taken to handle them. They can be replaced by the mean or mode, or simply labelled as NaN values.
- Reshaping data: the data is modified and manipulated according to the requirements, either by changing how pre-existing data is addressed or by restructuring it.
- Filtering data: unwanted rows and columns are filtered out, which brings the data into a more compact format.
- Others: after turning the raw data into an efficient dataset, it is brought into use for data visualization, data analysis, model training, etc.
Data preprocessing is carried out to remove the causes of the unformatted real-world data discussed above. First
of all, let us explain how missing data can be handled during data preparation. Three different approaches can be used:
- Ignoring the missing record: this is the simplest and most efficient method for handling missing data. However, it should not be used when the number of missing values is immense or when the pattern of missingness is related to the unrecognized root cause of the problem statement.
- Filling the missing values manually: this is one of the best-chosen methods of the data preparation process. Its limitation is that when the dataset is large and the missing values are significant, this approach is not efficient because it becomes a time-consuming task.
- Filling using computed values: the missing values can also be filled by computing the mean, mode, or median of the observed values. Another method is to predict the missing values using machine learning or deep learning tools and algorithms. One drawback of this approach is that it can introduce bias into the data, because the computed values are not exact with respect to the observed values.
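As a brief illustration, the following sketch shows the three strategies above on a small, made-up pandas DataFrame (the column names and values are purely illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [22, np.nan, 25, 31, np.nan],
                   "city": ["Pune", "Mumbai", None, "Pune", "Nashik"]})

print(df.isnull().sum())                 # count missing values per column

dropped = df.dropna()                    # 1. ignore (drop) records with missing values

filled_manual = df.copy()
filled_manual.loc[1, "age"] = 28         # 2. fill a specific missing value manually

filled_computed = df.copy()              # 3. fill using computed values
filled_computed["age"] = filled_computed["age"].fillna(filled_computed["age"].mean())
filled_computed["city"] = filled_computed["city"].fillna(filled_computed["city"].mode()[0])
print(filled_computed)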
Data Formatting
We should make sure that every column is assigned to the correct data type. This can be checked through the
property dtypes.
Tweet Id object
Tweet URL object
Tweet Posted Time (UTC) object
Tweet Content object
Tweet Type object
Client object
Retweets Received int64
Likes Received int64
Tweet Location object
Tweet Language object
User Id object
Name object
Username object
User Bio object
Verified or Non-Verified object
Profile URL object
Protected or Non-protected object
User Followers int64
User Following int64
User Account Creation Date object
Impressions int64
dtype: object
We can convert the column Tweet Location to string by using the function astype() as follows:
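A minimal sketch of that conversion, assuming the Twitter data shown above has been loaded into a DataFrame named df:

df["Tweet Location"] = df["Tweet Location"].astype(str)   # values become Python strings (dtype remains 'object')
print(df["Tweet Location"].dtype)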
Data Normalization with Pandas
Data normalization is a typical practice in machine learning which consists of transforming numeric columns to a
standard scale. In machine learning, some feature values differ from others by orders of magnitude, and the
features with higher values will dominate the learning process.
Data normalization involves adjusting values measured on different scales to a common scale. Normalization
applies only to columns containing numeric values. Two common normalization methods are:
- Min-Max scaling: x' = (x - min) / (max - min), which rescales the values to the range [0, 1]
- Z-score normalization: z = (x - mean) / standard deviation
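A small sketch of both methods applied by hand to one numeric column (the column name "Fare" and its values are illustrative):

import pandas as pd

df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10, 12.47]})

# Min-Max scaling: x' = (x - min) / (max - min), result lies in [0, 1]
df["Fare_minmax"] = (df["Fare"] - df["Fare"].min()) / (df["Fare"].max() - df["Fare"].min())

# Z-score normalization: z = (x - mean) / std
df["Fare_zscore"] = (df["Fare"] - df["Fare"].mean()) / df["Fare"].std()

print(df)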
When we look at categorical data, the first question that arises is how to handle it, because machine learning
algorithms work best with numeric values. We cannot build predictive models directly on text data, so to build
predictive models we have to convert categorical data into numeric form.
Method 1: Replacing the values
Replacing is one of the methods to convert categorical values into numeric ones. For example, we can take a
dataset of people's salaries based on their level of education. Education level is an ordinal categorical variable,
so we can map the education levels to numeric codes, e.g. with DataFrame.replace().
Method 2: One Hot Encoding
Replacing the values is not the most efficient way to convert them. Pandas provides a method called get_dummies
which returns dummy-variable columns. One hot encoding is the most widespread approach, and it works very well
unless your categorical variable takes on a large number of values. One hot encoding creates new (binary) columns
indicating the presence of each possible value from the original data. It uses the get_dummies() method, as shown in the sketch below.
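A short sketch of get_dummies() on a made-up categorical column (the column name and values are illustrative):

import pandas as pd

df = pd.DataFrame({"fuel_type": ["gas", "diesel", "gas", "gas"]})
dummies = pd.get_dummies(df["fuel_type"], prefix="fuel")   # one binary column per category
df = pd.concat([df, dummies], axis=1)
print(df)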
Method 3: Label Encoding
Label encoding refers to converting the labels into numeric form so as to make them machine-readable. Machine
learning algorithms can then decide in a better way how those labels should be treated. It is an important
preprocessing step for structured datasets in supervised learning.
Example:
Suppose we have a column Height in some dataset with the values tall, medium, and short. After applying label
encoding, the Height column is converted into numeric codes, where 0 is the label for tall, 1 is the label for
medium, and 2 is the label for short.
Example:
from sklearn import preprocessing
# the dataset is assumed to be loaded into a DataFrame df before encoding
label_encoder = preprocessing.LabelEncoder()
df['Height'] = label_encoder.fit_transform(df['Height'])
df['Height'].unique()
Procedure-
STEP 4:
# describe() function to get some initial statistics
df.describe()
# check the dimensions of the data frame
df.shape
# total number of elements in the dataframe
df.size
STEP 5: DATA FORMATTING
df.dtypes
df["column_name"].astype("new_type")
df = df.astype({"engine-location": "category", "horsepower": "int64"})
PROGRAM :
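The program below is a minimal sketch following the procedure above; it assumes a CSV file named data.csv that contains the columns engine-location and horsepower used in Step 5 (the file and column names are assumptions):

import pandas as pd

# Steps 1-3: import libraries and load an open-source dataset into a DataFrame
df = pd.read_csv("data.csv")

# Step 4: preprocessing - missing values, initial statistics, dimensions
print(df.isnull().sum())
print(df.describe())
print(df.shape, df.size)

# Step 5: data formatting - check types and apply conversions where needed
print(df.dtypes)
df = df.astype({"engine-location": "category", "horsepower": "int64"})

# Step 5 (contd.): min-max normalization of a numeric column
df["horsepower_norm"] = (df["horsepower"] - df["horsepower"].min()) / \
                        (df["horsepower"].max() - df["horsepower"].min())

# Step 6: turn a categorical variable into quantitative (dummy) variables
df = pd.concat([df, pd.get_dummies(df["engine-location"], prefix="engine")], axis=1)
print(df.head())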
CONCLUSION:
We have understood how important data wrangling is and that, by using different techniques, optimized results
can be obtained. Hence, the data should be wrangled before it is processed for analysis.
Lab Assignment 2
PROBLEM STATEMENT:
Create an “Academic performance” dataset of students and perform the following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with
them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be
one of the following reasons: to change the scale for better understanding of the variable, to convert a non-
linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal
distribution.
THEORY:
1. DataFrame.isnull():
Syntax: pandas.isnull("DataFrame Name") or DataFrame.isnull()
Parameters: object to check for null values
Return Type: DataFrame of Boolean values, which are True for NaN values
DataFrame.notnull():
Syntax: pandas.notnull("DataFrame Name") or DataFrame.notnull()
Parameters: object to check for null values
Return Type: DataFrame of Boolean values, which are False for NaN values
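A small sketch of both functions on a made-up "Academic performance" DataFrame (column names and values are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"roll_no": [1, 2, 3, 4],
                   "math_score": [78, np.nan, 91, 64],
                   "attendance": [92, 85, np.nan, 70]})

print(df.isnull())          # True where a value is NaN
print(df.notnull())         # False where a value is NaN
print(df.isnull().sum())    # missing-value count per column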
2. DataFrame.replace() is used to replace a string, regex, list, dictionary, series, number, etc.
in a DataFrame. This is a very rich function, as it has many variations. The most powerful thing about
this function is that it can work with Python regex (regular expressions).
Syntax: DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False,
method='pad', axis=None)
Parameters:
to_replace : [str, regex, list, dict, Series, numeric, or None] pattern that we are trying to replace in
dataframe.
value : Value to use to fill holes (e.g. 0), alternately a dict of values specifying which value to use for each
column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : If True, perform the replacement in place. Note: this will modify any other views on this object (e.g. a column from a
DataFrame). Returns the caller if this is True.
limit : Maximum size gap to forward or backward fill
regex : Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace
must be a string. Otherwise, to_replace must be None because this parameter will be interpreted as a regular
expression or a list, dict, or array of regular expressions.
method : Method to use for replacement when to_replace is a list.
Returns: filled : NDFrame
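A brief sketch of replace() used to clean an inconsistency and a sentinel value (the values shown are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "b", "A"], "marks": [88, 72, -1, 95]})
df["grade"] = df["grade"].replace("b", "B")     # inconsistent label -> consistent label
df["marks"] = df["marks"].replace(-1, np.nan)   # sentinel value -> missing value
print(df)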
2. Z-score: The Z-score is also called a standard score. This value helps us understand how far a data point is
from the mean. After setting up a threshold value, the z-scores of data points can be used to define the outliers.
Z-score = (data_point - mean) / std. deviation
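A minimal sketch of z-score based outlier detection on a small series (the values and the threshold of 2 are illustrative choices; a threshold of 2 or 3 is common):

import numpy as np
import pandas as pd

scores = pd.Series([65, 70, 72, 68, 71, 69, 150])   # 150 is far from the rest
z = (scores - scores.mean()) / scores.std()
outliers = scores[np.abs(z) > 2]                     # flag points beyond the threshold
print(outliers)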
3. IQR (Inter Quartile Range)
The Inter Quartile Range approach to finding outliers is the most commonly used and most trusted approach in the
research field.
IQR = Quartile3 - Quartile1
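A minimal sketch of the IQR rule with 1.5 * IQR fences (the values are illustrative):

import pandas as pd

marks = pd.Series([35, 42, 47, 51, 55, 58, 60, 98])
q1, q3 = marks.quantile(0.25), marks.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]    # points outside the fences
cleaned = marks[(marks >= lower) & (marks <= upper)]   # one way to "deal with" them
print(outliers)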
Data transformation:
Data transformation is the process of converting raw data into a format or structure that is more suitable for
model building and for data discovery in general. Data transformation predominantly deals with normalizing (also
known as scaling) the data, handling skewness, and aggregating attributes.
Min-Max Scaler (normalization):
MinMaxScaler() is applied when the dataset is not distorted. It normalizes the data into a range between 0 and 1
based on the formula:
x' = (x - min) / (max - min)
Robust Scaling:
RobustScaler() is more suitable for datasets with skewed distributions and outliers, because it transforms the data
based on the median and quantiles, specifically
x' = (x - median) / inter-quartile range.
Z-score normalization:
In Z-score normalization, we perform the following mathematical transformation:
x' = (x - mean) / standard deviation
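A short sketch of the three scalers above using scikit-learn (the feature values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[35.0], [47.0], [55.0], [60.0], [98.0]])

print(MinMaxScaler().fit_transform(x))    # rescales to [0, 1]
print(RobustScaler().fit_transform(x))    # uses median and IQR, robust to outliers
print(StandardScaler().fit_transform(x))  # z-score: (x - mean) / std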
Skewness of data:
Skewness describes the shape of the distribution of values relative to a normal distribution; it can be computed
with the pandas skew() method. If the skewness value lies above +1 or below -1, the data is highly skewed; if it
lies between -1 and -0.5 or between +0.5 and +1, it is moderately skewed; and if the value is close to 0, the data
is roughly symmetric. Along with the skewness level, we should know whether the data is positively skewed or
negatively skewed.
Positively skewed data:
If the tail is on the right, the data is right-skewed, also called positively skewed data. Common transformations
of such data include the square root, cube root, and logarithm.
a. Cube root transformation:
The cube root transformation involves converting x to x^(1/3). This is a fairly strong transformation with a
substantial effect on the distribution shape, but it is weaker than the logarithm. It can also be applied to
negative and zero values.
b. Square root transformation:
This is applied to positive values only; hence, observe the values of the column before applying it.
c. Logarithm transformation:
The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2 of x, is a strong
transformation and can be used to reduce right skewness.
Negatively skewed data:
If the tail is to the left of the data, it is called left-skewed data, also called negatively skewed data.
Common transformations include the square, cube root, and logarithm.
a. Square transformation:
The square, x to x², has a moderate effect on the distribution shape and can be used to reduce left skewness.
Another method of handling skewness is finding outliers and possibly removing them.
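A small sketch that measures skewness and applies the transformations named above to a right-skewed series (the values are illustrative):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 4, 5, 7, 12, 40])   # right (positively) skewed
print("skewness before:", s.skew())

log_t = np.log(s)          # logarithm (positive values only)
sqrt_t = np.sqrt(s)        # square root (non-negative values only)
cbrt_t = np.cbrt(s)        # cube root (works for negative and zero values too)

print("after log:", log_t.skew())
print("after sqrt:", sqrt_t.skew())
print("after cube root:", cbrt_t.skew())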
(Figure: a moderately negatively skewed variable, i.e., one that does not follow a normal distribution.)
Checking the distribution of variables using a Q-Q plot
A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of
quantiles came from the same distribution, we should see the points forming a roughly straight line.
That is, if the data falls in a straight line then the variable follows normal distribution otherwise not.
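A minimal sketch of a Q-Q plot against the normal distribution using scipy and matplotlib (the sample data is generated, not taken from the assignment dataset):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(loc=70, scale=10, size=200)    # sample data
stats.probplot(data, dist="norm", plot=plt)             # points near the line => approximately normal
plt.title("Q-Q plot")
plt.show()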
CONCLUSION:
We have learned about data transformation and outliers: techniques to detect and remove outliers, the normal
distribution, scaling, and techniques to transform data.
Lab Assignment 3
PROBLEM STATEMENT:
Descriptive Statistics - Measures of Central Tendency and Variability. Perform the following operations on any
open source dataset (e.g., data.csv)
1.Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age,
income etc.) with numeric variables grouped by one of the qualitative (categorical) variables. For example,
if your categorical variable is age groups and quantitative variable is income, then provide summary
statistics of income grouped by the age groups. Create a list that contains a numeric value for each response
to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation
etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
THEORY:
What is Statistics?
Statistics is the science of collecting data and analyzing them to infer proportions (sample) that are
representative of the population. In other words, statistics is interpreting data in order to make predictions for
the population.
DESCRIPTIVE STATISTICS: Descriptive Statistics is a statistics or a measure that describes the data.
INFERENTIAL STATISTICS: Using a random sample of data taken from a population to describe
and make inferences about the population is called Inferential Statistics.
Descriptive Statistics
Descriptive Statistics is summarizing the data at hand through certain numbers like mean, median etc. so as to
make the understanding of the data easier. It does not involve any generalization or inference beyond what is
available. This means that the descriptive statistics are just the representation of the data (sample) available
and not based on any theory of probability.
Measures of Central Tendency
A measure of central tendency is a one-number summary of the data that typically describes the center of the
data. This one-number summary is of three types.
1. Mean: Mean is defined as the ratio of the sum of all the observations in the data to the total number of
observations. This is also known as Average. Thus mean is a number around which the entire data set
is spread.
2. Median: Median is the point which divides the entire data into two equal halves. One-half of the data
is less than the median, and the other half is greater than the same. Median is calculated by first
arranging the data in either ascending or descending order.
If the number of observations is odd, median is given by the middle observation in the sorted form.
If the number of observations is even, median is given by the mean of the two middle observations in
the sorted form.
An important point to note is that the order of the data (ascending or descending) does not affect the median.
3. Mode: Mode is the number which has the maximum frequency in the entire data set, or in other words,
mode is the number that appears the maximum number of times. A dataset can have one or more than one mode.
Functions & Description: To calculate the mean, standard deviation, median, maximum, and minimum we can apply
the corresponding pandas functions mean(), std(), median(), max(), and min().
We can use the describe function to generate the statistics above and apply it to multiple columns
simultaneously. It also provides the lower, median and upper percentiles.
Suppose we wanted to know the average runtime for each genre. We can use the ‘groupby()’ method to
calculate these statistics:
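As a small illustration (the movies DataFrame and its 'genre' and 'runtime' columns are made up, not part of the assignment dataset):

import pandas as pd

movies = pd.DataFrame({"genre": ["Drama", "Comedy", "Drama", "Comedy"],
                       "runtime": [121, 95, 138, 102]})
print(movies.groupby("genre")["runtime"].mean())   # average runtime per genre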
The groupby() method supports this type of operation. More generally, it fits the split-apply-combine pattern:
split the data into groups, apply a function to each group independently, and combine the results into a data
structure. The apply and combine steps are typically done together in pandas.
Example:-
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Output:
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
Grouping can be done by multiple columns at the same time. Provide the column names as a list to
the groupby() method.
Procedure-
STEPS:
PROGRAM: To provide summary statistics (mean, median, minimum, maximum, standard deviation)
for a dataset (age, income etc.) with numeric variables grouped by one of the qualitative (categorical)
variables
import pandas as pd
df = pd.read_csv("C:/Users/Admin/Downloads/adult.csv")
print(df)
df.groupby("gender")["age"].describe()
df.groupby("marital-status")["age"].mean()
df.groupby("marital-status")["age"].median()
df.groupby(["gender","marital-status"])["age"].std()
df.groupby("income")["age"].mean()
df.groupby(["income","gender"])["age"].mean()
df.groupby("marital-status")["marital-status"].count()
#The value_counts() method counts the number of records for each category in a column.
df["marital-status"].value_counts()
Write a Python program to display some basic statistical details like percentile, mean, standard
deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
import pandas as pd
d = pd.read_csv("C:/Users/Admin/Downloads/Iris.csv")
# boolean filters for each species
setosa = d["Species"] == "Iris-setosa"
versicolor = d["Species"] == "Iris-versicolor"
virginica = d["Species"] == "Iris-virginica"
print('Iris-setosa')
print(d[setosa].describe())
print('\nIris-versicolor')
print(d[versicolor].describe())
print('\nIris-virginica')
print(d[virginica].describe())
import pandas as pd
d = pd.read_csv("C:/Users/Admin/Downloads/Iris.csv")
#Species
d.groupby(["Species"])["SepalLengthCm"].mean()
d.groupby(["Species"])["SepalLengthCm"].std()
d.groupby(["Species"])["SepalLengthCm"].describe()
d.groupby(["Species"])["SepalLengthCm"].quantile(q=0.75)
d.groupby(["Species"])["SepalLengthCm"].quantile(q=0.25)
a=d.groupby(["Species"])["SepalLengthCm"].mean()
print(a)
b=d.groupby(["Species"])["SepalLengthCm"].median()
print(b)
summary = [a, b]   # avoid shadowing the built-in name list
print(summary)
CONCLUSION:
To summarize, here we discussed how to generate summary statistics using the Pandas library. Here, we
discussed how to use pandas methods to generate mean, median, max, min and standard deviation. We also
saw the describe() method which allows us to generate percentiles, in addition to the mean, median, max, min
and standard deviation, for any numerical column. Finally, we showed how to generate aggregate statistics
for categorical columns.
Lab Assignment 4
PROBLEM STATEMENT:
Create a Linear Regression Model using Python/R to predict home prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various
houses in Boston through different parameters. There are 506 samples and 14 feature variables in this dataset.
THEORY:
Machine Learning is a part of Artificial Intelligence (AI) in which a model learns from data and can
predict the outcome. Machine Learning is the study of statistical computer algorithms that improve
automatically from data, unlike traditional computer algorithms, which rely on explicit instructions from human beings.
Types of Machine Learning Algorithms
Supervised Machine Learning
In Supervised Learning, we will have both the independent variable (predictors) and the dependent variable
(response). Our model will be trained using both independent and dependent variables. So we can predict the
outcome when the test data is given to the model. Here, using the output our model can measure its accuracy
and can learn over time. In supervised learning, we will solve both Regression and Classification problems.
Unsupervised Machine Learning
In Unsupervised Learning, our model won't be provided with an output variable to train on. So we can't use the
model to predict the outcome as in Supervised Learning. These algorithms are used to analyze the data and
find the hidden patterns in it. Clustering and association algorithms are part of unsupervised learning.
Reinforcement Learning
Reinforcement learning is the training of machine learning models that make decisions sequentially. In
simple words, the output of the model will depend on the present input, and the next input will depend on the
previous output of the model.
What is Regression?
Regression analysis is a statistical method that helps us understand the relationship between a dependent
variable and one or more independent variables.
Dependent Variable
This is the Main Factor that we are trying to predict.
Independent Variable
These are the variables that have a relationship with the dependent variable.
The simple linear regression equation is y = b1*x + b0 (i.e., y = mx + c), where y is the dependent variable, x is
the independent variable, and b1 (m) and b0 (c) are the slope and y-intercept respectively.
The slope (m) tells us, for one unit of increase in x, how many units y increases. The steeper the line, the higher
the slope; a less steep line has a lower slope.
The constant (c) is the value of y when x is zero.
How the Model will Select the Best Fit Line?
First, our model will try a bunch of different straight lines from that it finds the optimal line that predicts our
data points well.
For finding the best fit line, our model uses a cost function. In machine learning, every algorithm has a cost
function, and in simple linear regression the goal of the algorithm is to minimize the cost function. In linear
regression (LR) there are many cost functions, but the most commonly used one is MSE (Mean Squared Error), also
known as the least squares method.
MSE = (1/n) * Σ (Yi - Ŷi)²
where Yi is the actual value, Ŷi is the predicted value, and n is the number of records.
(Yi - Ŷi) is the loss for a single point. People often use the words loss and cost function interchangeably, but
they are different, and the terms are squared so that negative errors do not cancel out positive ones.
Loss Function
It is a calculation of loss for single training data.
Cost Function
It is a calculation of average loss over the entire dataset.
Graphically, the blue data points represent the actual values from the training data, and the red line represents
the predicted value for each actual point. The difference between an actual value and its predicted value is the
random error, and the model tries to minimize this error between the actual and predicted values, because in the
real world we need a model that predicts well. So our model computes the loss between all the actual and
predicted values and selects the line whose average error over all points is the lowest.
Steps
1. Our model will fit all possible lines and find an overall average error between the actual and predicted
values for each line respectively.
2. Selects the line which has the lowest overall error. And that will be the best fit line.
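A minimal sketch of how such a model could be fitted and evaluated with scikit-learn, assuming the Boston Housing data is available locally as BostonHousing.csv with the target column MEDV (file name, column name and split ratio are assumptions); the print statements that follow then report the computed rmse and r2:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("BostonHousing.csv")          # assumed local copy of the dataset
X = df.drop(columns=["MEDV"])                  # the 13 feature variables
y = df["MEDV"]                                 # median home value (target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)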
print(" ")
print("Root Mean Squared Error: {}".format(rmse))
print("R^2: {}".format(r2))
print("\n")
Output:-
CONCLUSION:
We studied & applied the concepts of linear regression on the Boston housing dataset. Also we calculated the
accuracy of the model.
Lab Assignment 5
PROBLEM STATEMENT:
1. Implement logistic regression using Python to perform classification on a social media advertisements dataset.
2. Compute the confusion matrix to find TP, FP, TN, FN, accuracy, error rate, precision, and recall for the model.
THEORY:
What is Logistic Regression?
Logistic Regression: Classification techniques are an essential part of machine learning and data
mining applications. Approximately 70% of problems in Data Science are classification problems.
There are lots of classification problems that are available, but logistic regression is common and is a
useful regression method for solving the binary classification problem.
Another category of classification is Multinomial classification, which handles the issues where
multiple classes are present in the target variable. For example, the IRIS dataset is a very famous
example of multi-class classification. Other examples are classifying article/blog/document categories.
Logistic Regression can be used for various classification problems such as spam detection, diabetes
prediction, whether a given customer will purchase a particular product or churn to another
competitor, whether the user will click on a given advertisement link or not, and many more examples
are in the bucket.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms for
two-class classification. It is easy to implement and can be used as the baseline for any binary
classification problem. Its basic fundamental concepts are also constructive in deep learning.
Logistic regression describes and estimates the relationship between one dependent binary variable
and independent variables. Logistic regression is a statistical method for predicting binary classes. The
outcome or target variable is dichotomous in nature. Dichotomous means there are only two possible
classes. For example, it can be used for cancer detection problems. It computes the probability of an
event occurring.
It is a special case of linear regression where the target variable is categorical in nature. It uses a log of
odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary
event utilising a logit function.
Linear Regression Equation:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
Sigmoid Function:
p = 1 / (1 + e^(-y)), which maps any real-valued input to a probability between 0 and 1.
Differentiate between Linear and Logistic Regression: Linear regression gives you a continuous output,
but logistic regression provides a discrete (categorical) output. Examples of continuous output are house price
and stock price. Examples of discrete output are predicting whether a patient has cancer or not, and
predicting whether a customer will churn. Linear regression is estimated using Ordinary Least
Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE)
approach.
Sigmoid Function: The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that
can take any real-valued number and map it into a value between 0 and 1. If the curve goes to positive
infinity, the predicted y becomes 1, and if the curve goes to negative infinity, the predicted y becomes
0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and
if it is less than 0.5, we can classify it as 0 or NO. The output cannot go beyond this 0-1 range. For example,
if the output is 0.75, we can say in terms of probability that there is a 75 percent chance that the patient
will suffer from cancer.
Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or Not
Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories such as
predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories such as
restaurant or product rating from 1 to 5.
The two limitations of using a linear regression model for classification problems are:
the predicted value may exceed the range (0,1)
error rate increases if the data has outliers
There definitely is a need for Logistic regression here.
Basic layout of a Confusion Matrix
We can obtain four different combinations from the predicted and actual values of a classifier:
Confusion Matrix
True Positive: The number of times the actual positive values are equal to the predicted positive values. You
predicted a positive value, and it is actually positive.
False Positive: The number of times our model wrongly predicts positive values for actual negatives. You
predicted a positive value, but it is actually negative.
True Negative: The number of times the actual negative values are equal to the predicted negative values. You
predicted a negative value, and it is actually negative.
False Negative: The number of times our model wrongly predicts negative values for actual positives. You
predicted a negative value, but it is actually positive.
Accuracy: Accuracy is used to find the portion of correctly classified values. It tells us how often our
classifier is right. It is the sum of all true values divided by the total number of values:
Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision: Precision is used to calculate the model's ability to classify positive values correctly. It is
the true positives divided by the total number of predicted positive values: Precision = TP / (TP + FP).
Recall: Recall is used to calculate the model's ability to predict positive values ("How often does the model
predict the correct positive values?"). It is the true positives divided by the total number of actual
positive values: Recall = TP / (TP + FN).
F1-Score: It is the harmonic mean of recall and precision: F1 = 2 * (Precision * Recall) / (Precision + Recall).
It is useful when you need to take both precision and recall into account.
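A minimal sketch of logistic regression plus these metrics with scikit-learn, assuming a social network advertisements CSV with Age, EstimatedSalary and Purchased columns (the file and column names are assumptions):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Social_Network_Ads.csv")       # assumed file name
X = df[["Age", "EstimatedSalary"]]               # assumed feature columns
y = df["Purchased"]                              # assumed binary target (0/1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))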
CONCLUSION:
In this way we have performed data analysis using logistic regression on the social media advertisements data and
evaluated the performance of the model.
Lab Assignment 6
PROBLEM STATEMENT:
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given
dataset.
THEORY:
Naive Bayes algorithm
In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for the
classification task. Naïve Bayes classification is based on applying Bayes’ theorem with strong independence
assumption between the features. Naïve Bayes classification produces good results when we use it for textual
data analysis such as Natural Language Processing.
Naïve Bayes models are also known as simple Bayes or independent Bayes. All these names refer to
the application of Bayes’ theorem in the classifier’s decision rule. Naïve Bayes classifier applies the Bayes’
theorem in practice. This classifier brings the power of Bayes’ theorem to machine learning.
Naïve Bayes Classifier uses the Bayes’ theorem to predict membership probabilities for each class
such as the probability that given record or data point belongs to a particular class. The class with the highest
probability is considered as the most likely class. This is also known as the Maximum A Posteriori (MAP).
MAP(A) = max(P(A | B)) = max(P(B | A) * P(A) / P(B))
Here, P(B) is the evidence probability. It is used to normalize the result and remains the same for every class,
so removing it does not affect which class has the highest probability:
MAP(A) = max(P(B | A) * P(A))
Naïve Bayes Classifier assumes that all the features are unrelated to each other. Presence or absence of a
feature does not influence the presence or absence of any other feature.
3. Types of Naive Bayes algorithm
There are 3 types of Naïve Bayes algorithm, listed below.
Gaussian Naïve Bayes: When we have continuous attribute values, we assume that the values associated with
each class are distributed according to a Gaussian (normal) distribution. For example, suppose the training
data contains a continuous attribute x. We first segment the data by class and then compute the mean and
variance of x in each class. Let µi be the mean of the values and σi² be the variance of the values associated
with the i-th class. Then, for an observed value xi, the probability of xi given that class can be computed as:
P(xi | class i) = (1 / sqrt(2π σi²)) * exp(-(xi - µi)² / (2 σi²))
Multinomial Naïve Bayes: With a Multinomial Naïve Bayes model, samples (feature vectors) represent the frequencies
with which certain events have been generated by a multinomial distribution (p1, ..., pn), where pi is the
probability that event i occurs. The Multinomial Naïve Bayes algorithm is preferred for data that is multinomially
distributed. It is one of the standard algorithms used in text categorization (classification).
Bernoulli Naïve Bayes: In the multivariate Bernoulli event model, features are independent Boolean (binary)
variables describing the inputs. Like the multinomial model, this model is popular for document classification
tasks where binary term-occurrence features are used rather than term frequencies.
Naïve Bayes is one of the most straightforward and fastest classification algorithms. It is very well suited for
large volumes of data. It is successfully used in various applications such as:
1. Spam filtering
2. Text classification
3. Sentiment analysis
4. Recommender systems
What is a Confusion Matrix?
We can obtain four different combinations from the predicted and actual values of a classifier:
Confusion Matrix
True Positive: The number of times the actual positive values are equal to the predicted positive values. You
predicted a positive value, and it is actually positive.
False Positive: The number of times our model wrongly predicts positive values for actual negatives. You
predicted a positive value, but it is actually negative.
True Negative: The number of times the actual negative values are equal to the predicted negative values. You
predicted a negative value, and it is actually negative.
False Negative: The number of times our model wrongly predicts negative values for actual positives. You
predicted a negative value, but it is actually positive.
Accuracy: Accuracy is used to find the portion of correctly classified values. It tells us how often our
classifier is right. It is the sum of all true values divided by the total number of values:
Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision: Precision is used to calculate the model's ability to classify positive values correctly. It is
the true positives divided by the total number of predicted positive values: Precision = TP / (TP + FP).
Recall: Recall is used to calculate the model's ability to predict positive values ("How often does the model
predict the correct positive values?"). It is the true positives divided by the total number of actual
positive values: Recall = TP / (TP + FN).
F1-Score: It is the harmonic mean of recall and precision: F1 = 2 * (Precision * Recall) / (Precision + Recall).
It is useful when you need to take both precision and recall into account.
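A minimal sketch of Gaussian Naive Bayes with a confusion matrix, using scikit-learn's bundled copy of the Iris data instead of iris.csv (the split ratio and random state are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))          # 3x3 matrix for the three species
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))     # per-class precision and recall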
CONCLUSION:
In this way we have learned and performed data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Lab Assignment 7
PROBLEM STATEMENT:
1. Extract Sample document and apply following document preprocessing methods: Tokenization, POS
Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of documents by calculating Term Frequency and Inverse Document Frequency.
THEORY:
Basic concepts of Text Analytics
One of the most frequent forms of day-to-day communication is text. In our everyday
routine, we chat, message, tweet, share status updates, email, write blogs, and offer opinions and criticism. All
of these actions produce a substantial amount of unstructured text. It is critical to
examine these huge amounts of data from the online world and social media to determine people's opinions.
Text mining is also referred to as text analytics. Text mining is a process of exploring sizable textual
data and finding patterns. Text mining processes the text itself, while NLP processes the
underlying metadata. Finding frequency counts of words, the length of sentences, and the presence/absence of
specific words is known as text mining. Natural language processing is one of the components of text
mining. NLP helps identify sentiment, finding entities in the sentence, and category of blog/article.
Text mining is preprocessed data for text analytics. In Text Analytics, statistical and machine learning
algorithms are used to classify information.
NLTK(natural language toolkit) is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces and lexical resources such as WordNet, along with a suite
of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic
reasoning, and many more. Analysing movie reviews is one of the classic examples used to demonstrate a
simple NLP bag-of-words model.
Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into
smaller chunks such as words or sentences is called Tokenization. Token is a single entity that is the
building blocks for a sentence or paragraph.
Sentence tokenization : split a paragraph into list of sentences using sent_tokenize() method
Stop words removal:
Stop words are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an,
the, etc. To remove stop words with NLTK, you create a list of stopwords and filter your list of tokens
against these words.
Stemming is a normalization technique where lists of tokenized words are converted into shortened
root words to remove redundancy. Stemming is the process of reducing inflected (or sometimes
derived) words to their word stem, base or root form. A computer program that stems words may be
called a stemmer. E.g. a stemmer reduces the words fishing, fished, and fisher to the stem fish.
The stem need not be a word; for example, the Porter algorithm reduces argue, argued, argues,
arguing, and argus to the stem argu.
Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on its
meaning and context. Lemmatization usually refers to the morphological analysis of words, which
aims to remove inflectional endings. It helps in returning the base or dictionary form of a word known
as the lemma. Eg. Lemma for studies is study
Lemmatization Vs Stemming
Stemming algorithm works by cutting the suffix from the word. In a broader sense cuts either the beginning or
end of the word. On the contrary, Lemmatization is a more powerful operation, and it takes into consideration
morphological analysis of the words. It returns the lemma which is the base form of all its inflectional forms.
In-depth linguistic knowledge is required to create dictionaries and look for the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent operation where the proper form will be
looked in the dictionary. Hence, lemmatization helps in forming better machine learning features.
POS Tagging
POS (Parts of Speech) tagging gives grammatical information about the words of a sentence by assigning a specific
tag (DT, NN, JJ, RB, VB, PRP, etc.) for the corresponding part of speech (determiner, noun, adjective, adverb,
verb, personal pronoun, etc.) to each word. A word can have more than one POS depending on the context in which it
is used. POS tags are used in statistical NLP tasks; they distinguish the sense of a word, which is very helpful
in text realization, and help infer semantic information from text for sentiment analysis.
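A minimal sketch of the preprocessing steps above with NLTK on a sample sentence (the text is illustrative, and the exact nltk.download resource names can vary slightly between NLTK versions):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt"); nltk.download("stopwords")
nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

text = "Data science students are studying text analytics. Tokenization is the first step."

sentences = sent_tokenize(text)                  # sentence tokenization
tokens = word_tokenize(text)                     # word tokenization

stop_words = set(stopwords.words("english"))
filtered = [w for w in tokens if w.lower() not in stop_words and w.isalpha()]

stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in filtered]      # e.g. "studying" -> "studi"

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in filtered]   # e.g. "students" -> "student"

pos_tags = nltk.pos_tag(tokens)                  # list of (word, tag) pairs

print(sentences, tokens, filtered, stems, lemmas, pos_tags, sep="\n")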
Term frequency-inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how
important a word is to a document in a collection or corpus.
Term Frequency (TF) is a measure of the frequency of a word (w) in a document (d). TF is defined as the ratio
of a word's occurrences in a document to the total number of words in that document:
TF(w, d) = (number of occurrences of w in d) / (total number of words in d)
The denominator normalizes the value, since the corpus documents are of different lengths.
Inverse Document Frequency (IDF)
IDF is a measure of the importance of a word. Term frequency (TF) does not consider importance: some words
such as 'of', 'and', etc. can be very frequent yet of little significance. IDF weights each word based on its
frequency in the corpus D:
IDF(w) = log(total number of documents in D / number of documents containing w)
TF-IDF is the product of TF and IDF. It gives more weight to a word that is rare in the corpus (across all the
documents) and more importance to a word that is frequent within the document.
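A minimal sketch of a TF-IDF representation using scikit-learn's TfidfVectorizer on a few illustrative documents:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox jumps over the lazy dog",
        "the dog sleeps all day",
        "the fox is quick and clever"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)           # sparse matrix: documents x terms

table = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(table.round(3))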
CONCLUSION:
We have performed the text analytics experiment using document preprocessing and the TF-IDF algorithm.
Lab Assignment 8
PROBLEM STATEMENT:
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the
passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any
patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by
plotting a histogram.
THEORY:
Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts
of information and make data-driven decisions.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for
drawing attractive and informative statistical graphics. Seaborn helps you explore and understand your data.
Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the
necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented,
declarative API lets you focus on what the different elements of your plots mean, rather than on the details of
how to draw them.
Behind the scenes, seaborn uses matplotlib to draw its plots. For interactive work, it’s recommended to use a
Jupyter/IPython interface in matplotlib mode, or else you’ll have to call matplotlib.pyplot.show() when you
want to see the plot.
Now, let’s perform the operations in the problem statement on our data set.
Assign a variable to x to plot a univariate distribution along the x axis:
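A minimal sketch, assuming seaborn's built-in copy of the dataset is used:

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
sns.histplot(data=titanic, x="fare")   # distribution of ticket prices
plt.show()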
Check how well the histogram represents the data by specifying a different bin width:
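A minimal sketch of the same plot with an explicit bin width (the value 10 is an arbitrary choice):

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
sns.histplot(data=titanic, x="fare", binwidth=10)
plt.show()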
CONCLUSION:
We have successfully implemented operations of the ‘seaborn’ library on the ‘titanic’ dataset, and explored
some patterns in the data. We have also successfully plotted a histogram to see the ticket price distribution
Lab Assignment 9
PROBLEM STATEMENT:
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with
respect to each gender along with the information about whether they survived or not. (Column names : 'sex'
and 'age')
THEORY:
Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts
of information and make data-driven decisions.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for
drawing attractive and informative statistical graphics. Seaborn helps you explore and understand your data.
Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the
necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented,
declarative API lets you focus on what the different elements of your plots mean, rather than on the details of
how to draw them.
Behind the scenes, seaborn uses matplotlib to draw its plots. For interactive work, it’s recommended to use a
Jupyter/IPython interface in matplotlib mode, or else you’ll have to call matplotlib.pyplot.show() when you
want to see the plot.
Now, let’s perform the operations in the problem statement on our data set.
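A minimal sketch of the required box plot, using seaborn's built-in copy of the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
sns.boxplot(data=titanic, x="sex", y="age", hue="survived")   # age distribution per gender, split by survival
plt.show()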
Inferences -
Let's try to understand the box plot for females. The first quartile starts at around 5 and ends at 22, which means
that 25% of the passengers are aged between 5 and 22. The second quartile starts at around 23 and ends at
around 32, which means that 25% of the passengers are aged between 23 and 32. Similarly, the third quartile
starts and ends between 34 and 42, hence 25% of the passengers are aged within this range, and finally the fourth
or last quartile starts at 43 and ends around 65.
Passengers that do not fall within any of the quartile whiskers are called outliers and are represented by dots on
the box plot.
Now in addition to the information about the age of each gender, you can also see the distribution of the
passengers who survived. For instance, you can see that among the male passengers, on average, younger people
survived more than the older ones. Similarly, you can see that the variation among the ages of the female
passengers who did not survive is much greater than that of the surviving female passengers.
CONCLUSION:
We have successfully implemented operations of the ‘seaborn’ library on the ‘titanic’ dataset and explored
some patterns in the data. We have also successfully plotted a box plot of the age distribution for each gender
along with the survival information.
Lab Assignment 10
PROBLEM STATEMENT:
Download the Iris flower dataset or any other dataset into a DataFrame.(e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
THEORY:
Histogram:-
The pandas DataFrame.hist() function is useful for understanding the distribution of numeric variables. It splits
the values of the numeric variables into bins and plots a histogram for each numeric column of a given DataFrame.
The distribution of the data is represented by the histogram. When DataFrame.hist() is used, it automatically
calls matplotlib.pyplot.hist() on each series in the DataFrame.
Box Plot :-
A box plot is the visual representation of groups of numerical data through their quartiles. A boxplot is also
used to detect outliers in a dataset. It captures the summary of the data efficiently with a simple box and
whiskers and allows us to compare easily across groups. A boxplot summarizes sample data using the 25th,
50th and 75th percentiles, also known as the lower quartile, median and upper quartile.
A box plot consists of 5 things:
Minimum
First quartile (25th percentile)
Median (50th percentile)
Third quartile (75th percentile)
Maximum
Syntax :
seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None,
orient=None, color=None, palette=None, saturation=0.75, width=0.8, dodge=True, fliersize=5,
linewidth=None, whis=1.5, notch=False, ax=None, **kwargs)
Parameters:
x = feature of dataset
y = feature of dataset
hue = feature of dataset
data = dataframe or full dataset
color = color name
Identify outliers:-
Detect and Remove the Outliers using Python
An Outlier is a data-item/object that deviates significantly from the rest of the (so-called normal)objects. They
can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier
mining. There are many ways to detect outliers, and the removal process is the same as removing a data item
from a pandas DataFrame. A pandas DataFrame is used here for a more realistic approach, since in real-world
projects the outliers detected during the data analysis step must be handled; the same approach can be used on
lists and series-type objects.
Outliers can be detected using the following approaches:
1. Visualization (e.g., box plots)
2. Z-score: The Z-score is also called a standard score. This value helps us understand how far a data point is
from the mean. After setting up a threshold value, the z-scores of data points can be used to define the outliers.
3. IQR: The Inter Quartile Range approach to finding outliers is the most commonly used and most trusted approach
used in the research field.
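A minimal sketch covering the assignment's steps, using seaborn's bundled copy of the Iris data (its column names, e.g. sepal_width and species, differ slightly from the UCI CSV):

import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

print(iris.dtypes)                     # 1. numeric features plus the nominal 'species' column

iris.hist(figsize=(8, 6))              # 2. histogram for each numeric feature
plt.show()

sns.boxplot(data=iris, x="species", y="sepal_width")   # 3. box plot per species
plt.show()

# 4. identify outliers in 'sepal_width' with the IQR rule
q1, q3 = iris["sepal_width"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = iris[(iris["sepal_width"] < q1 - 1.5 * iqr) |
                (iris["sepal_width"] > q3 + 1.5 * iqr)]
print(outliers)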
CONCLUSION:
We have successfully implemented operations on the ‘iris’ dataset, and we have plotted histograms and box plots
and identified outliers.