Data Analytics Unit 3 Notes
(Professional Elective - I)
Subject Code: CS513PE
NOTES MATERIAL
UNIT 3
Faculty:
Kailash Sinha
DEPARTMENT OF CSE
SHREE DATTHA INSTITUTE OF ENGINEERING
AND SCIENCE
UNIT - III
Linear & Logistic Regression
Syllabus
Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable
Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.
Topics:
1. Regression – Concepts
2. Blue property assumptions
3. Least Square Estimation
4. Variable Rationalization
5. Model Building etc.
6. Logistic Regression - Model Theory
7. Model fit Statistics
8. Model Construction
9. Analytics applications to various Business Domains
Unit-3 Objectives:
1. To explore the Concept of Regression
2. To learn the Linear Regression
3. To explore Blue Property Assumptions
4. To Learn the Logistic Regression
5. To understand the Model Theory and Applications
Unit-3 Outcomes:
After completion of this course, students will be able to:
1. Describe the Concept of Regression
2. Demonstrate Linear Regression
3. Analyze the Blue Property Assumptions
4. Explore Logistic Regression
5. Describe the Model Theory and Applications
Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of the average
value of one variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
“Regression Analysis is a statistical process for estimating the relationships between the dependent variables (criterion variables / response variables) and one or more independent variables (predictor variables).”
Regression describes how an independent variable is numerically related to the
dependent variable.
Regression can be used for prediction, estimation and hypothesis testing, and modeling
causal relationships.
For simple linear regression, the coefficient estimates are:

\[ B_1 = \frac{\sum_{i=1}^{n}\left(x_i - \text{mean}(x)\right)\left(y_i - \text{mean}(y)\right)}{\sum_{i=1}^{n}\left(x_i - \text{mean}(x)\right)^2} \]

\[ B_0 = \text{mean}(y) - B_1 \times \text{mean}(x) \]
If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Let us consider the following example, for the equation y = 2*x + 3.
x     y     xi−mean(x)   yi−mean(y)   (xi−mean(x))*(yi−mean(y))   (xi−mean(x))²
-3    -3    -4.4         -8.8         38.72                        19.36
-1     1    -2.4         -4.8         11.52                         5.76
 2     7     0.6          1.2          0.72                         0.36
 4    11     2.6          5.2         13.52                         6.76
 5    13     3.6          7.2         25.92                        12.96
mean(x) = 1.4, mean(y) = 5.8          Sum = 90.4                   Sum = 45.2
Applying the above formulas: B1 = 90.4 / 45.2 = 2 and B0 = 5.8 − (2 × 1.4) = 3, which recovers the original equation y = 2*x + 3.
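The same calculation can be reproduced in R. This is a minimal sketch (not part of the original worked example) that computes B1 and B0 directly from the formulas above:

# Reproduce the hand calculation for y = 2*x + 3
x <- c(-3, -1, 2, 4, 5)
y <- c(-3, 1, 7, 11, 13)
B1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 90.4 / 45.2 = 2
B0 <- mean(y) - B1 * mean(x)                                     # 5.8 - 2 * 1.4 = 3
B1; B0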
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
        0.4          0.8
> abline(relxy,col="Blue")
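As a follow-up in the same R session, the fitted coefficients can be inspected and used for prediction; predict() is base R, and the new value x = 6 is just an illustrative choice:

> coef(relxy)                        # intercept 0.4, slope 0.8
> predict(relxy, data.frame(x = 6))  # 0.4 + 0.8*6 = 5.2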
The error of the model can be estimated with the Root Mean Squared Error (RMSE):

\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(p_i - y_i\right)^2}{n}} \]

where p is the predicted value, y is the actual value, i is the index of a specific instance, and n is the number of predictions, because we must calculate the error across all predicted values.
Estimating the error for y = 0.8*x + 0.4 (mean(x) = 3):

x    y (actual)   p (predicted)   p − y   (p − y)²
1    1            1.2              0.2    0.04
2    3            2.0             −1.0    1.00
4    3            3.6              0.6    0.36
3    2            2.8              0.8    0.64
5    5            4.4             −0.6    0.36

s = sum of (p − y)² = 2.4
s/n = 2.4 / 5 = 0.48
RMSE = sqrt(0.48) ≈ 0.693
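Assuming the x, y, and relxy objects from the R session above, the same RMSE can be computed in a few lines (a sketch, not part of the original notes):

> p <- predict(relxy)    # predicted values: 1.2, 2.0, 3.6, 2.8, 4.4
> sqrt(mean((p - y)^2))  # RMSE = sqrt(0.48), approximately 0.693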
Homoscedasticity vs Heteroscedasticity:
Homoscedasticity means the residuals have roughly constant variance at every level of the independent variable(s), one of the BLUE assumptions; heteroscedasticity means the variance of the residuals changes across levels.
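A quick visual check in R, reusing the relxy model from above (a sketch, not from the original notes): plot the residuals against the fitted values; an even band of points suggests homoscedasticity, while a funnel shape suggests heteroscedasticity.

> plot(fitted(relxy), resid(relxy), xlab = "Fitted values", ylab = "Residuals")
> abline(h = 0, col = "red")  # residuals should scatter evenly around this line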
1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (the one with the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here, all the attributes are considered in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
3. Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection (see the sketch after this list).
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure whose nodes denote tests on attributes. Each branch corresponds to an outcome of a test, and each leaf node represents a class prediction. Attributes that are not part of the tree are considered irrelevant and hence discarded.
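As referenced above, here is a minimal sketch of stepwise selection in R using the built-in step() function on the mtcars data set. The choice of model is illustrative, and note one assumption: step() selects attributes by AIC rather than by the p-value criterion described above.

full  <- lm(mpg ~ disp + hp + wt, data = mtcars)  # all candidate attributes
empty <- lm(mpg ~ 1, data = mtcars)               # intercept-only model

step(empty, scope = formula(full), direction = "forward")  # forward selection
step(full, direction = "backward")                         # backward elimination
step(empty, scope = formula(full), direction = "both")     # combined stepwise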
Model Building:
The following steps are involved in building a predictive model:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modeling
6. Model Deployment
1. Problem Definition
The first step in constructing a model is to
understand the industrial problem in a more comprehensive way. To identify the purpose of
the problem and the prediction target, we must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
Hypothesis generation is an educated-guess approach through which we derive some essential data parameters that have a significant correlation with the prediction target.
Your hypothesis research must be in-depth, taking the perspective of all stakeholders into account. We search for every suitable factor that can influence the outcome.
Hypothesis generation focuses on what you can create rather than on what is available in the dataset.
3. Data Collection
Data collection is gathering data from relevant sources regarding the analytical problem; we then extract meaningful insights from the data for prediction.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying
any algorithmic model to data, we have to explore it first.
By inspecting the data, we get to understand the explicit and hidden trends in data. We find
the relation between data features and the target variable.
Usually, a data scientist invests 60–70% of project time in data exploration alone.
There are several sub steps involved in data exploration:
o Feature Identification:
You need to analyze which data features are available and which ones are
not.
Identify independent and target variables.
Identify data types and categories of these variables.
o Univariate Analysis:
We inspect each variable one by one. This kind of analysis depends on whether the variable type is categorical or continuous.
Continuous variable: We mainly look for statistical trends like mean,
median, standard deviation, skewness, and many more in the dataset.
Categorical variable: We use a frequency table to understand the
spread of data for each category. We can measure the counts and
frequency of occurrence of values.
o Bi-variate / Multi-variate Analysis:
This analysis helps to discover the relation between two or more variables.
We can compute the correlation in the case of continuous variables; in the case of categorical variables, we look for association and dissociation between them.
o Filling Null Values:
Usually, the dataset contains null values, which lower the potential of the model. For a continuous variable, we fill these null values using the mean or median of that specific column. For the null values present in a categorical column, we replace them with the most frequently occurring categorical value. Remember, don't delete those rows, because you may lose information. (A sketch of null-value filling in R follows.)
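As noted above, here is a hedged sketch of null-value filling in R; the data frame df and its columns age and city are hypothetical names used only for illustration:

df <- data.frame(age  = c(25, NA, 40, 35, NA),
                 city = c("Hyd", "Pune", NA, "Hyd", "Hyd"))

# Continuous column: replace NA with the column mean (the median also works)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Categorical column: replace NA with the most frequent category
df$city[is.na(df$city)] <- names(which.max(table(df$city)))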
5. Predictive Modeling
Predictive modeling is a mathematical approach to create a statistical model to forecast
future behavior based on input test data.
Steps involved in predictive modeling:
Algorithm Selection:
o When we have a structured dataset and want to estimate a continuous or categorical outcome, we use supervised machine learning methodologies like regression and classification techniques. When we have unstructured data and want to predict the clusters of items to which a particular input test sample belongs, we use unsupervised algorithms. In practice, a data scientist applies multiple algorithms to get a more accurate model.
Train Model:
o After choosing the algorithm and getting the data ready, we train our model on the input data using the preferred algorithm. Training determines the correspondence between the independent variables and the prediction targets.
Model Prediction:
o We make predictions by giving the input test data to the trained model. We measure the accuracy using a cross-validation strategy or an ROC curve, which work well for assessing model output on test data. (A minimal train/predict sketch follows.)
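The train/predict cycle can be sketched in R as follows; the use of the built-in mtcars data and a 75/25 split is an illustrative assumption, not the notes' prescribed workflow:

set.seed(1)                                        # reproducible split
idx   <- sample(seq_len(nrow(mtcars)), size = 24)  # ~75% of rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

model <- lm(mpg ~ disp + hp + wt, data = train)  # train the model
preds <- predict(model, newdata = test)          # predict on unseen test data
sqrt(mean((preds - test$mpg)^2))                 # test-set RMSE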
6. Model Deployment
There is nothing better than deploying the model in a real-time environment. It helps us gain analytical insights into the decision-making procedure. You constantly need to update the model with additional features for customer satisfaction.
To predict business decisions, plan market strategies, and create personalized customer experiences, we integrate the machine learning model into the existing production domain.
When you go through the Amazon website, you notice product recommendations based entirely on your interests. You can see the increase in customer involvement that these services create. That's how a deployed model changes the mindset of the customer and convinces them to purchase the product.
Logistic Regression:
Model Theory, Model fit Statistics, Model Construction
Introduction:
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc.; but instead of giving an exact value of 0 or 1, it gives probabilistic values that lie between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of an outcome, such as whether cells are cancerous or whether a mouse is obese based on its weight.
Logistic regression uses the concept of predictive modeling just as regression does; therefore, it is called logistic regression. But because it is used to classify samples, it falls under the classification algorithms.
In logistic regression, we use the concept of a threshold value, which defines the boundary between probabilities classified as 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "Low", "Medium", or "High".
Definition: Multi-collinearity:
Multicollinearity is a statistical phenomenon in which multiple independent variables are highly correlated with each other.
Multicollinearity, also called collinearity, is an undesired situation for any statistical regression model, since it diminishes the reliability of the model itself.
If two or more independent variables are highly correlated, the results obtained from the regression will be distorted, because the independent variables are actually dependent on each other.
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:
Logistic regression uses a more complex cost function; this cost function is based on the ‘sigmoid function’, also known as the ‘logistic function’, instead of a linear function.
The hypothesis of logistic regression constrains the output to values between 0 and 1. Linear functions fail to represent it, as they can produce values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.
\[ z = \text{sigmoid}(y) = \sigma(y) = \frac{1}{1 + e^{-y}} \]
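A small R sketch (not in the original notes) makes this squashing behaviour concrete; the function name sigmoid is our own:

sigmoid <- function(y) 1 / (1 + exp(-y))
sigmoid(c(-10, -1, 0, 1, 10))  # approx. 0.00005, 0.269, 0.500, 0.731, 0.99995
curve(sigmoid, from = -6, to = 6, col = "blue")  # the "S"-shaped logistic curve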
Hypothesis Representation
When using linear regression, we used a formula for the line equation:

\[ y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \]

In the above equation, y is the response variable; x1, x2, ..., xn are the predictor variables; and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
> rm(list=ls())
> attach(mtcars) #attaching a data set into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
> model <- lm(mpg ~ disp + hp + wt, data = input) #fit the multiple linear regression model (call reconstructed to match the coefficients below; the original line was lost at a page break)
> model
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891

The fitted model is therefore: mpg = 37.1055 − 0.000937*disp − 0.031157*hp − 3.8009*wt.
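The lm() call above fits a linear model; for logistic regression on a categorical outcome, base R's glm() with family = binomial is the standard tool. The model below, predicting the transmission type am (0 = automatic, 1 = manual) from wt and hp, is an illustrative choice rather than one prescribed by the notes:

> logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
> probs <- predict(logit_model, type = "response")  # probabilities in (0, 1)
> preds <- ifelse(probs > 0.5, 1, 0)                # apply a 0.5 threshold
> table(actual = mtcars$am, predicted = preds)      # confusion matrix

The resulting table is a confusion matrix, which is the subject of the next section.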
A confusion matrix summarizes the four possible outcomes of a binary classifier:
True Positive
True Negative
False Positive – Type 1 Error
False Negative – Type 2 Error
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
The predicted value matches the actual value
The actual value was positive and the model predicted a positive value
True Negative (TN)
The predicted value matches the actual value
The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
The predicted value was falsely predicted
The actual value was negative but the model predicted a positive value
Also known as the Type 1 error
False Negative (FN) – Type 2 error
The predicted value was falsely predicted
The actual value was positive but the model predicted a negative value
Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics Accuracy, Precision, Recall & F1-Score.
Accuracy:
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.
Accuracy alone does not establish that a model is best; it is dependable only when you have symmetric datasets where the numbers of false positives and false negatives are almost the same.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
It tells us how many of the predicted positive cases actually turned out to be positive.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Precision is a useful metric in cases where False Positive is a higher concern
than False Negatives.
Recall (Sensitivity):
Recall is the ratio of correctly predicted positive observations to all the observations in the actual positive class.

\[ \text{Recall} = \frac{TP}{TP + FN} \]
Recall is a useful metric in cases where False Negative trumps False Positive.
Recall is important in medical cases where it doesn’t matter whether we raise a
false alarm but the actual positive cases should not go undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea about these
two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.
\[ F_1\text{ Score} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
F1 is usually more useful than accuracy, especially if you have an uneven class
distribution.
Accuracy works best if false positives and false negatives have similar cost.
If the cost of false positives and false negatives are very different, it’s better to
look at both Precision and Recall.
But there is a catch here: the interpretability of the F1-score is poor, meaning we don't know what our classifier is maximizing, precision or recall. So we use it in combination with other evaluation metrics that give us a complete picture of the result.
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the following confusion matrix: TP = 560, FP = 60, FN = 50, TN = 330.
Precision:
Precision tells us how many of the predicted positive cases actually turned out to be positive. This determines whether our model is reliable or not. We can easily calculate Precision and Recall for our model by plugging the values into the above equations:

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{560}{560 + 60} \approx 0.903 \]

Recall:
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{560}{560 + 50} \approx 0.918 \]
F1-Score:

\[ F_1\text{ Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{0.903 \times 0.918}{0.903 + 0.918} = 2 \times \frac{0.829}{1.821} \approx 0.910 \]
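A short R sketch re-checks this worked example; TN = 330 is inferred from the 1000 total points (1000 − 560 − 60 − 50) rather than stated explicitly in the notes:

TP <- 560; FP <- 60; FN <- 50; TN <- 330
(TP + TN) / (TP + FP + TN + FN)                # Accuracy = 0.89
precision <- TP / (TP + FP); precision         # approx. 0.903
recall    <- TP / (TP + FN); recall            # approx. 0.918
2 * precision * recall / (precision + recall)  # F1-Score approx. 0.910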
AUC (Area Under the ROC Curve) is a measure of the degree of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and those without it.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is
on the x-axis.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of
a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as:

\[ TPR = \frac{TP}{TP + FN} \]

False Positive Rate (FPR) is defined as:

\[ FPR = \frac{FP}{FP + TN} \]
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
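In R, an ROC curve and its AUC can be sketched with the pROC package (an assumption: it must be installed first with install.packages("pROC")); this reuses the logistic model and predicted probabilities from the mtcars glm() example above:

library(pROC)
roc_obj <- roc(mtcars$am, probs)  # actual labels vs. predicted probabilities
plot(roc_obj)                     # the ROC curve (sensitivity vs. specificity)
auc(roc_obj)                      # area under the curve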
Analytics Applications to Various Business Domains:
Business analysts work with data to help stakeholders understand the things that affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risks, and supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions, and many more, by measuring marketing and advertising metrics, identifying consumer behaviour and the target audience, and analyzing market trends.
TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC
Here, we add the constant term b0 by setting x0 = 1, which gives us K+1 parameters:

\[ \ln\frac{P}{1-P} = \sum_{k=0}^{K} b_k x_k \]

The left-hand side of the above equation is called the logit of P (hence the name logistic regression).
The right-hand side of the top equation is the sigmoid of z, which maps the real line to the interval (0, 1) and is approximately linear near the origin. A useful fact about P(z) is that the derivative P'(z) = P(z)(1 − P(z)). Here's the derivation:

\[ P'(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = P(z)\left(1 - P(z)\right) \]

Later, we will want to take the gradient of P with respect to the set of coefficients b, rather than z. In that case, P'(z) = P(z)(1 − P(z))z', where z' is the gradient of z taken with respect to b.