
Data Analytics and

Visualization
Course Code- CSC601
Module II
Regression Models

By Bhavika Gharat
Module II- Regression Models (8Hr, CO2)
2.1- Introduction to simple Linear Regression: The Regression Equation,
Fitted Values and Residuals, Least Squares
Introduction to Multiple Linear Regression: Assessing the Model, Cross-Validation,
Model Selection and Stepwise Regression, Prediction Using Regression
2.2- Logistic Regression: Logistic Response function and logit, Logistic
Regression and GLM, Generalized Linear model, Predicted values from Logistic
Regression, Interpreting the coefficients and odds ratios, Linear and Logistic
Regression: Similarities and Differences, Assessing the Models.
(Note- Numericals with Theory)
Terminology of Regression Analysis
-Dependent and Independent Variables

-Outliers

-Multicollinearity

-Underfitting and Overfitting


Terminology of Regression Analysis
• Dependent Variable: The main factor in regression analysis that we want to predict or
understand is called the dependent variable. It is also called the target variable.
• Independent Variable: The factors that affect the dependent variable, or that are used to
predict its values, are called independent variables, also called predictors.
• Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may distort the result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other, the
condition is called multicollinearity. It should not be present in the dataset,
because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the training dataset but not
with the test dataset, the problem is called overfitting. If our algorithm does not perform
well even on the training dataset, the problem is called underfitting.
Introduction
Regression analysis is a statistical technique for measuring the
relationship between variables. It estimates the value of the
dependent variable from the values of one or more independent variables. The
main uses of regression analysis are to determine the strength of
predictors and to forecast an effect or a trend.
For example, a gym supplement company can use regression
analysis techniques to determine how prices and advertisements can
affect the sales of its supplements.
Regression
● We plot a graph between the variables that best fits the given data points; using this
plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the data points on the
target-predictor graph in such a way that the vertical distance between the
data points and the regression line is minimum." The distance between the data points
and the line tells whether the model has captured a strong relationship or not.
Types of Regression Model

Regression models can be classified in two ways:

1. Simple vs. Multiple Regression
2. Linear vs. Non-linear Regression

Linear regression itself may be simple (one predictor, Simple Linear Regression) or
involve several predictors (Multiple Linear Regression).
1. Linear Regression-
Linear regression is one of the simplest predictive modelling algorithms; in its simple form, it predicts the
value of the dependent variable from just one independent variable. For example, predicting salary
from qualification, or predicting height from age.
Simple Linear Regression

This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple linear
regression is:

Y = a + bX

where:
Y = the response (dependent) variable,
X = the predictor (independent) variable,
b = the slope of the regression line (estimated from the data),
a = the y-intercept of the line (estimated from the data).
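As a minimal illustration, the least-squares estimates b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄ can be computed directly with NumPy. The data values below are made up for demonstration only:

```python
import numpy as np

# Hypothetical data: predictor x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.5, 5.0, 6.5, 7.0])

# Least-squares estimates for Y = a + bX
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("fitted value at x = 6:", a + b * 6)
```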
Uses of Linear Regression-
Linear regression is often used in business, government, and other scenarios. Some common
practical applications of linear regression in the real world include the following:
● Real estate: A simple linear regression analysis can be used to model residential home
prices as a function of the home's living area. Such a model helps set or evaluate the list
price of a home on the market. The model could be further improved by including other
input variables such as the number of bathrooms, number of bedrooms, lot size, school district
rankings, crime statistics, and property taxes.
● Demand forecasting: Businesses and governments can use linear regression models to
predict demand for goods and services. For example, restaurant chains can appropriately
prepare for the predicted type and quantity of food that customers will consume based upon
the weather, the day of the week, whether an item is offered as a special, the time of day,
and the reservation volume. Similar models can be built to predict retail sales, emergency
room visits, and ambulance dispatches.
● Medical: A linear regression model can be used to analyze the effect of a proposed
radiation treatment on reducing tumor sizes. Input variables might include duration of a
single radiation treatment, frequency of radiation treatment, and patient attributes such as
age or weight.
Coefficient of Regression

Theorem 1
The correlation coefficient is the geometric mean of the two regression coefficients, i.e.

r² = b_yx · b_xy, so r = ±√(b_yx · b_xy)

Remark
If the regression coefficients are positive, we take r positive; if the regression
coefficients are negative, we take r negative.
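For instance (illustrative numbers): if b_yx = 0.85 and b_xy = 0.45, both coefficients are positive, so r = +√(0.85 × 0.45) = +√0.3825 ≈ 0.62.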
University Questions and Examples
[Worked numerical problems, including finding the value of y at x = 12 from the fitted line, appeared here as images in the original slides.]
MULTIPLE LINEAR REGRESSION MODEL
In multiple linear regression, f is linear, i.e.,
Y = β0 + β1X1 + β2X2 + … + βkXk
Suppose Y depends on two independent variables X1 and X2:
Y = β0 + β1X1 + β2X2
To estimate the coefficients β0, β1, β2, we apply the least-squares method to minimise the
sum of squared residuals Σ(Y − β0 − β1X1 − β2X2)².
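A minimal sketch of this least-squares estimation using NumPy; the data values below are hypothetical:

```python
import numpy as np

# Hypothetical data: Y depends on two predictors X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([6.1, 7.9, 12.2, 13.8, 18.0])

# Design matrix with a leading column of ones for the intercept β0
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution minimising Σ(Y − β0 − β1·X1 − β2·X2)²
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("β0, β1, β2 =", beta)
```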
Cross-validation

Cross-validation is a technique for validating model efficiency by training the model on a

subset of the input data and testing it on a previously unseen subset. It indicates
how well a statistical model generalises to an independent dataset.

The basic steps of cross-validation are :


● (i) Reserve a subset of the dataset as a validation set.
● (ii) Provide the training to the model using the training dataset.
● (iii) Evaluate model performance using the validation set.
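These three steps can be sketched with scikit-learn. This is an illustrative sketch only; the synthetic dataset and the 25% validation split are assumptions made for demonstration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# (i) Reserve a subset of the dataset as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# (ii) Train the model using the training dataset
model = LinearRegression().fit(X_train, y_train)

# (iii) Evaluate model performance on the held-out validation set (R² score here)
print("validation R²:", model.score(X_val, y_val))
```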
Methods used for Cross-Validation

The common methods used for cross-validation are :


(1) Validation set approach
(2) Leave-p-out cross-validation
(3) Leave-one-out cross-validation
(4) K-fold cross-validation
(5) Stratified K-fold cross-validation

Among these, K-fold cross-validation is easy to understand, and the output is less
biased than other methods.
K-Fold Cross-Validation

K-fold cross-validation helps us build a model that generalises well.
It lets the same data serve for training, testing and validation in turn,
which is valuable when the volume of data is a challenge.
The dataset X is divided randomly into K equal-sized parts, Xi, i = 1, 2, …, K.
To generate each pair, we keep one of the K parts out as validation set and combine the
remaining (K – 1) parts to form the training set.

Doing this K times, each time leaving out another one of the K parts out, we get K pairs :
V1 = X1, T1 = X2 ∪ X3 ∪ … ∪ XK
V2 = X2, T2 = X1 ∪ X3 ∪ … ∪ XK
⋮
VK = XK, TK = X1 ∪ X2 ∪ … ∪ XK−1
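A small sketch of how the K pairs (Vi, Ti) are generated, here using scikit-learn's KFold with K = 5 on a toy dataset (the data are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # V_i = the held-out part; T_i = the union of the other K−1 parts
    print(f"pair {i}: train indices {train_idx}, validation indices {val_idx}")
```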
Life Cycle of K-fold Cross-Validation

∙ Let K be a chosen value. If K = 5, we split the given dataset into 5
folds and run train and test 5 times.
∙ During each run, one fold is held out for testing and the rest are used for training,
iterating through all the folds. [A pictorial representation of this fold flow appeared here in the original slides.]
Model Selection

∙ Model selection is the task of selecting a model from among various candidates.
∙ The choice is made on the basis of a performance criterion, to pick the best candidate
in the context of learning.
∙ It is the selection of a statistical model from a set of candidate models, given data.
∙ In the simplest cases, a pre-existing set of data is considered.
∙ Model selection also refers to the problem of selecting a few representative models
from a large set of computational models for the purpose of decision making or
optimisation under uncertainty.
Principle of Model Selection

∙ Model selection is a fundamental task of scientific inquiry.


∙ The principle that explains a series of observations is linked directly to a mathematical
model that predicts those observations.
∙ The mathematical approach decides among a set of candidate models; this set must
be chosen by the researcher.
∙ A model selection technique is supposed to balance goodness of fit with simplicity. More
complex models are better able to adapt their shape to fit the data.
● Goodness of fit is generally determined using a 'likelihood ratio' approach, which
leads to a 'chi-squared' test.
∙ A standard example of model selection is that of curve fitting. We must select a curve
that describes the function that generated the points.
∙ Selection of a curve depends on the given set of points and other background
knowledge.
Two Directions of Model Selection

∙ There are two main objectives in inference and learning from data.
∙ One is scientific discovery, also called statistical inference: understanding the
underlying data-generating mechanism and interpreting the nature of the data.
∙ The other objective is predicting future or unseen observations, also called
statistical prediction.
∙ Generally, data scientists are interested in both directions.
∙ Along with two different objectives, model selection can also have two directions: (i)
model selection for inference and (ii) model selection for prediction.
Methods of Choosing the Set of Candidate Models

Methods that assist in choosing the candidate models:


(i) Data transformation (statistics)
(ii) Exploratory data analysis
(iii) Model specification
(iv) The scientific method

For model selection there are three main approaches:


(1) Optimisation of some selection criterion,
(2) Tests of hypotheses, and
(3) Ad hoc methods.
Model Selection Used For

∙ In statistics, model selection is a process researchers use to compare the relative value
of different statistical models and determine which one is the best fit for the observed
data.
∙ The Akaike information criterion (AIC) is one of the most common methods of model
selection.
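As an illustrative sketch (not a prescribed course procedure), two candidate OLS models can be compared by AIC using statsmodels, whose fitted results expose an `aic` attribute; the synthetic data below are assumptions:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y truly depends on x1 only; x2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 + 3.0 * x1 + rng.normal(scale=1.0, size=100)

# Candidate model 1: y ~ x1;  candidate model 2: y ~ x1 + x2
m1 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Lower AIC indicates a better trade-off between fit and complexity
print("AIC model 1:", m1.aic, " AIC model 2:", m2.aic)
```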
Stepwise Regression

∙ Stepwise Regression is the step by step iterative construction of a ‘regression’ model.


It involves the selection of independent variables to be used in a final model.
∙ It involves adding or removing explanatory variables and testing for statistical
significance after each iteration.
∙ The statistical software packages make stepwise regression possible.
Features of Stepwise Regression

(i) Stepwise regression is a method that iteratively examines the statistical


significance of each independent variable in a linear regression model.
(ii) The forward selection approach starts with nothing and adds each new
variable incrementally, testing for statistical significance.
(iii) The backward elimination method begins with a full model loaded with several
variables and then removes one variable to test its importance relative to overall
results.
(iv) Stepwise regression is an approach that fits data into a model to achieve the
desired result.
Types of Stepwise Regression

(1) Forward Selection


It begins with no variables in the model. It tests each variable as it is added to the model.
And those that are statistically significant are kept.
The process is repeated till the results are optimal.
(2) Backward elimination
This begins with a full set of independent variables, then deletes one at a time,
testing whether the removed variable was statistically significant.
(3) Bidirectional Elimination
This is a combination of the first two methods; it tests which variables should be included
or excluded. (A sketch of forward selection follows below.)
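A hedged sketch of forward selection using p-values with statsmodels; the 0.05 threshold, the `forward_selection` helper name, and the synthetic data are illustrative choices, not a standard library API:

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    """Greedy forward selection: at each step, add the predictor with the
    smallest p-value; stop when no remaining predictor is significant."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals[j] = model.pvalues[-1]   # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] < threshold:
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return selected

# Synthetic data: y depends on columns 0 and 2 only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=200)
print("selected columns:", forward_selection(X, y))
```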
Limitations of Stepwise Regression

∙ Regression analysis, both linear and multivariate, is widely used in the economics and
investment world.
∙ A simple linear regression might look at price-to-earnings ratios and stock returns
to determine whether stocks with low P/E ratios offer higher returns.
∙ The problem with this approach is that market conditions often change, and
relationships that held in the past need not hold true in the present or the future.
Stepwise Regression Formula

If we standardise each dependent and independent variable, that is, subtract the
mean and divide by the standard deviation of the variable, we get the standardised
regression coefficients.
The formula is:

b_j,std = b_j · (S_xj / S_y)

where S_y and S_xj are the standard deviations of the dependent variable and the
corresponding j-th independent variable.
When specified variables are added to, or deleted from, the model, the resulting
percentage change in the root mean square error (RMSE) can be used as a
selection criterion; this value is used by the MinMSE method.
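A tiny numeric illustration of the standardised-coefficient formula above; all values are made up:

```python
# Hypothetical values for the j-th predictor
b_j  = 4.2    # raw regression coefficient
s_xj = 1.5    # standard deviation of the predictor
s_y  = 10.0   # standard deviation of the dependent variable

# Standardised coefficient: b_j,std = b_j * (S_xj / S_y)
b_j_std = b_j * (s_xj / s_y)
print("standardised coefficient:", b_j_std)   # 0.63
```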
Introduction to Logistic Regression
● What are the differences between supervised learning, unsupervised learning &
reinforcement learning?
1. Supervised Learning - Learning where data is labeled, and the motivation is to classify
something or predict a value. Example: Detecting fraudulent transactions from a list of credit
card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the motivation is to find
patterns in given data. In this case, you are asking the machine learning model to process
the data from which you can then draw conclusions. Example: Customer segmentation
based on spend data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to how humans
learn. The motivation is to find optimal policy of how to act in a given environment. The
machine learning model examines all possible actions, makes a policy that maximizes
benefit, and implements the policy (trial). If there are errors from the initial policy, apply
reinforcements back into the algorithm and continue to do this until you reach the optimal
policy. Example: Personalized recommendations on streaming platforms like YouTube.
Logistic Regression-
● Logistic regression is a supervised machine learning algorithm used for classification
tasks where the goal is to predict the probability that an instance belongs to a given class or
not. Logistic regression is a statistical algorithm which analyzes the relationship between two
data factors.
● Logistic regression is used for binary classification where we use sigmoid function,
that takes input as independent variables and produces a probability value between
0 and 1.
● For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic
function for an input is greater than 0.5 (the threshold value), it belongs to Class 1;
otherwise it belongs to Class 0. It is referred to as regression because it is an extension of
linear regression, but it is mainly used for classification problems.
● Key Points:
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc.; but instead of giving an exact
value of 0 or 1, it gives probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
Terminologies involved in Logistic Regression
• Independent variables: The input characteristics or predictor variables used to predict the dependent variable.
• Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
• Logistic function: The formula used to represent how the independent and dependent variables relate to one another.
The logistic function transforms the input variables into a probability value between 0 and 1, which represents the
likelihood of the dependent variable being 1 or 0.
• Odds: The ratio of something occurring to something not occurring. It differs from probability, which is the ratio of
something occurring to everything that could possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression,
the log-odds of the dependent variable are modeled as a linear combination of the independent variables and the
intercept.
• Coefficient: The logistic regression model's estimated parameters, which show how the independent and dependent
variables relate to one another.
• Intercept: A constant term in the logistic regression model, representing the log-odds when all independent
variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which
chooses the coefficient values that maximise the likelihood of the observed data.
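To make the odds-ratio and log-odds ideas concrete, here is a small sketch with hypothetical fitted coefficients (the intercept −1.5 and slope 0.8 are invented for illustration):

```python
import math

# Hypothetical fitted logistic model: logit(p) = -1.5 + 0.8*x
intercept, coef = -1.5, 0.8

# The odds ratio for a one-unit increase in x is exp(coefficient)
odds_ratio = math.exp(coef)
print(f"odds ratio per unit of x: {odds_ratio:.3f}")   # ≈ 2.226

# Predicted probability at x = 2 via the logistic response function
z = intercept + coef * 2
p = 1 / (1 + math.exp(-z))
print(f"P(y=1 | x=2) = {p:.3f}")                       # ≈ 0.525
```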
Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1. The value of the
logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the “S” form.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the
boundary between predicting 0 or 1: values above the threshold tend towards 1, and
values below the threshold tend towards 0.
σ(z) = 1 / (1 + e^(−z))

where z is the input to the sigmoid function and e is Euler's number (e ≈ 2.718).
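A minimal sketch of the sigmoid function, showing its S-shaped mapping of any real input into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> near 0, zero -> 0.5, large positive z -> near 1
for z in (-6, -2, 0, 2, 6):
    print(z, round(float(sigmoid(z)), 4))
```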
Types of Logistic Regression
1. Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
Assumptions of Logistic Regression
We will explore the assumptions of logistic regression as understanding these assumptions is
important to ensure an appropriate application of the model. The assumptions
include:
1. Independent observations: Each observation is independent of the others, meaning there is no
correlation between observations.
2. Binary dependent variables: It takes the assumption that the dependent variable must be
binary or dichotomous, meaning it can take only two values. For more than two categories
SoftMax functions are used.
3. Linearity relationship between independent variables and log odds: The relationship between
the independent variables and the log odds of the dependent variable should be linear.
4. No outliers: There should be no outliers in the dataset.
5. Large sample size: The sample size is sufficiently large.
Working of Logistic Regression Algorithm:

Logistic Regression measures the relationship between the categorical dependent
variable (which we want to predict) and one or more independent variables (i.e.,
features) by estimating probabilities using a logistic/sigmoid function.

Here x1, x2, x3 and x4 are the input features. The model estimates the probability
of an event, and this output (probability) goes to the sigmoid function as input;
the sigmoid function then predicts the output as 0 or 1. That is, these probabilities
must be transformed into binary values in order to actually make a prediction.
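A compact end-to-end sketch with scikit-learn: fit a logistic regression, obtain probabilities via the sigmoid, and threshold at 0.5. The synthetic features and labels are assumptions made for demonstration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # features x1..x4
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # hypothetical 0/1 labels

clf = LogisticRegression().fit(X, y)

# predict_proba gives P(class 1); thresholding at 0.5 yields the 0/1 prediction
proba = clf.predict_proba(X[:3])[:, 1]
print("probabilities:", proba.round(3))
print("predicted classes:", (proba > 0.5).astype(int))
```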
Mathematics behind logistic regression
● Probability always ranges between 0 (does not happen) and 1 (happens).
Using a Covid-19 example, in the case of binary classification the probabilities of
testing positive and not testing positive will sum to 1.
● We use logistic function or sigmoid function to calculate probability in logistic
regression. The logistic function is a simple S-shaped curve used to convert
data into a value between 0 and 1.
Training data assumptions for logistic regression
● Training data that satisfies the below assumptions is usually a good fit for
logistic regression.
• The predicted outcome is strictly binary or dichotomous. (This applies to
binary logistic regression).
• The factors, or the independent variables, that influence the outcome are
independent of each other. In other words there is little or no multicollinearity
among the independent variables.
• The independent variables can be linearly related to the log odds.
• Fairly large sample sizes.
● If your training data does not satisfy the above assumptions, logistic
regression may not work for your use case.
Where to use logistic regression?
● Logistic regression is used to solve classification problems, and the most common
use case is binary logistic regression, where the outcome is binary (yes or no). In
the real world, you can see logistic regression applied across multiple areas and
fields.
• In health care, logistic regression can be used to predict if a tumor is likely to be
benign or malignant.
• In the financial industry, logistic regression can be used to predict if a transaction is
fraudulent or not.
• In marketing, logistic regression can be used to predict if a targeted audience will
respond or not.
● Are there other use cases for logistic regression aside from binary logistic
regression? Yes. There are two other types of logistic regression that depend on the
number of predicted outcomes.
Applications of Logistic Regression :

1. Image Segmentation and Categorization.


2. Geographic Image Processing.
3. Handwriting recognition.
4. Healthcare: Analyzing a group of over a million people for myocardial
infarction within a period of 10 years is an application area of logistic
regression.
5. Diseases diagnostics
6. Spam or no spam in emails.
7. Emergency detection.
Linear Regression vs. Logistic Regression

| Linear Regression | Logistic Regression |
| --- | --- |
| Used to predict the continuous dependent variable using a given set of independent variables. | Used to predict the categorical dependent variable using a given set of independent variables. |
| Used for solving regression problems. | Used for solving classification problems. |
| We predict the value of continuous variables. | We predict the values of categorical variables. |
| We find the best-fit line. | We find the S-curve. |
| The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc. |
| A linear relationship between the dependent and independent variables is required. | A linear relationship is not required. |
| There may be collinearity between the independent variables. | There should not be collinearity between the independent variables. |
