Data Analytics and Visualization
Course Code- CSC601
Module II
Regression Models
By Bhavika Gharat
Module II- Regression Models (8Hr, CO2)
2.1- Introduction to Simple Linear Regression: The Regression Equation,
Fitted Values and Residuals, Least Squares
Introduction to Multiple Linear Regression: Assessing the Model, Cross-Validation,
Model Selection and Stepwise Regression, Prediction Using Regression
2.2- Logistic Regression: Logistic Response Function and Logit, Logistic
Regression and GLM, Generalized Linear Model, Predicted Values from Logistic
Regression, Interpreting the Coefficients and Odds Ratios, Linear and Logistic
Regression: Similarities and Differences, Assessing the Models.
(Note- Numericals with Theory)
Terminology of Regression Analysis
-Dependent and Independent Variables
-Outliers
-Multicollinearity
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple linear
regression is:
Y = a + bX
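As a sketch, the coefficients a and b can be estimated by least squares; the data points below are made up purely for illustration:

```python
# Least-squares fit of Y = a + bX for a small illustrative dataset
# (the data points are invented for demonstration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
# intercept a = y_bar - b * x_bar
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))  # 2.2 0.6
```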
Theorem 1
The correlation coefficient is the geometric mean of the regression coefficients, i.e.
r² = byx ⋅ bxy
∴ r = ±√(byx ⋅ bxy)
Remark
If the regression coefficients are positive, we take r as positive; if the
regression coefficients are negative, we take r as negative.
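A quick numerical check of Theorem 1, using made-up regression coefficients byx and bxy:

```python
import math

# Theorem 1 in action: r = +/- sqrt(byx * bxy), with the sign taken
# from the regression coefficients (illustrative values, not real data).
byx, bxy = 0.8, 0.45   # both positive, so r is taken as positive
r = math.copysign(math.sqrt(byx * bxy), byx)
print(round(r, 2))  # 0.6
```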
Among the common cross-validation methods, K-fold cross-validation is easy to
understand, and its output is less biased than that of other methods.
K-Fold Cross-Validation
K-fold cross-validation helps us build a model that generalises well.
To perform K-fold cross-validation, we split the dataset into training, testing
and validation sets, with the challenge being the volume of the data.
The dataset X is divided randomly into K equal-sized parts, Xi, i = 1, 2, …, K.
To generate each pair, we keep one of the K parts out as validation set and combine the
remaining (K – 1) parts to form the training set.
Doing this K times, each time leaving out a different one of the K parts, we get K pairs:
V1 = X1, T1 = X2 ∪ X3 ∪ ... ∪ XK
V2 = X2, T2 = X1 ∪ X3 ∪ ... ∪ XK
⋮
VK = XK, TK = X1 ∪ X2 ∪ ... ∪ XK – 1
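The construction of the K validation/training pairs above can be sketched in plain Python (assuming the dataset has already been shuffled):

```python
# Forming the K validation/training pairs (Vi, Ti) described above.
# Pure-Python sketch; assumes the dataset is already shuffled.
def k_fold_pairs(data, k):
    folds = [data[i::k] for i in range(k)]           # K roughly equal parts Xi
    pairs = []
    for i in range(k):
        validation = folds[i]                        # Vi = Xi
        training = [x for j, fold in enumerate(folds)  # Ti = union of the rest
                    for x in fold if j != i]
        pairs.append((validation, training))
    return pairs

dataset = list(range(10))
for v, t in k_fold_pairs(dataset, 5):
    print(v, t)
```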
Life Cycle of K-fold Cross-Validation
∙ Let us take a general K-value. If K = 5, we split the given dataset into 5 folds and run
Train and Test five times.
∙ During each run, one fold is used for testing and the rest for training; as the iterations
proceed, every fold serves as the test fold exactly once.
Model Selection
∙ Model selection is the task of selecting a model from among various candidates.
∙ The best candidate is chosen on the basis of a performance criterion in the context of
learning.
∙ It is the selection of a statistical model from a set of candidate models, given data.
∙ In the simplest cases, a pre-existing set of data is considered.
∙ Model selection also refers to the problem of selecting a few representative models
from a large set of computational models for the purpose of decision making or
optimisation under uncertainty.
Principle of Model Selection
∙ There are two main objectives in inference and learning from data.
∙ One is scientific discovery, also called statistical inference: understanding the
underlying data-generating mechanism and interpreting the nature of the data.
∙ The other objective is predicting future or unseen observations, also called
statistical prediction.
∙ Generally, data scientists are interested in both directions.
∙ Along with two different objectives, model selection can also have two directions: (i)
model selection for inference and (ii) model selection for prediction.
Methods of Choosing the Set of Candidate Models
∙ In statistics, model selection is a process researchers use to compare the relative value
of different statistical models and determine which one is the best fit for the observed
data.
∙ The Akaike information criterion (AIC) is one of the most common methods of model
selection.
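As a hedged sketch of how the AIC is used to compare candidates: for least-squares fits, AIC can be computed (up to an additive constant) as n·ln(RSS/n) + 2k, where k is the number of estimated parameters; the RSS values below are invented for illustration:

```python
import math

# Comparing two candidate models with the Akaike information criterion.
# For least-squares fits, AIC = n*ln(RSS/n) + 2k up to an additive
# constant; the residual sums of squares below are made up.
def aic(n, rss, k):
    return n * math.log(rss / n) + 2 * k

n = 50
aic_simple  = aic(n, rss=120.0, k=2)  # one predictor + intercept
aic_complex = aic(n, rss=110.0, k=5)  # four predictors + intercept

# The model with the lower AIC is preferred.
best = "simple" if aic_simple < aic_complex else "complex"
print(best)  # simple
```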
Stepwise Regression
∙ Regression analysis, both linear and multivariate, is widely used in the economics and
investment world.
∙ A simple linear regression might look at price-to-earnings ratios and stock returns
to determine whether stocks with low P/E ratios offer high returns.
∙ The problem with this approach is that market conditions often change, and the
relationships in the past need not hold true in the present or the future.
Stepwise Regression Formula
Let us standardise each dependent and independent variable, that is, subtract the
mean and divide by the standard deviation of the variable; we then obtain the
standardised regression coefficients.
We mention the formula :
bj⋅std = bj (Sxj / Sy)
Where Sy and Sxj are the standard deviations for the dependent variable and the
corresponding jth independent variable.
The percentage change in the root mean square error when the specified variable is
added to, or deleted from, the model is the quantity used by the MinMSE method.
This percentage change in Root Mean Square Error (RMSE) is calculated as:
% change in RMSE = 100 × (RMSEnew − RMSEold) / RMSEold
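A small sketch of both formulas, with invented data (the variable names b_j, s_xj and s_y mirror bj, Sxj and Sy above):

```python
import statistics

# Standardised coefficient b_j_std = b_j * (S_xj / S_y), plus the
# percentage change in RMSE when a variable enters the model.
# All numbers here are illustrative.
b_j = 2.5
s_xj = statistics.stdev([1, 2, 3, 4, 5])       # S_xj of predictor j
s_y  = statistics.stdev([10, 14, 15, 19, 22])  # S_y of the response

b_j_std = b_j * (s_xj / s_y)

rmse_before, rmse_after = 4.0, 3.0
pct_change = 100 * (rmse_after - rmse_before) / rmse_before
print(round(b_j_std, 3), pct_change)  # pct_change is -25.0
```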
Introduction to Logistic Regression
● What are the differences between supervised learning, unsupervised learning &
reinforcement learning?
1. Supervised Learning - Learning where data is labeled, and the motivation is to classify
something or predict a value. Example: Detecting fraudulent transactions from a list of credit
card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the motivation is to find
patterns in given data. In this case, you are asking the machine learning model to process
the data from which you can then draw conclusions. Example: Customer segmentation
based on spend data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to how humans
learn. The motivation is to find optimal policy of how to act in a given environment. The
machine learning model examines all possible actions, makes a policy that maximizes
benefit, and implements the policy (trial). If there are errors from the initial policy, apply
reinforcements back into the algorithm and continue to do this until you reach the optimal
policy. Example: Personalized recommendations on streaming platforms like YouTube.
Logistic Regression-
● Logistic regression is a supervised machine learning algorithm used for classification
tasks, where the goal is to predict the probability that an instance belongs to a given
class. It is a statistical algorithm which analyzes the relationship between two
data factors.
● Logistic regression is used for binary classification, where we use the sigmoid
function, which takes the independent variables as input and produces a probability
value between 0 and 1.
● For example, suppose we have two classes, Class 0 and Class 1: if the value of the
logistic function for an input is greater than 0.5 (the threshold value), the input
belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression
because it is an extension of linear regression, but it is mainly used for
classification problems.
● Key Points:
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
Terminologies involved in Logistic Regression
• Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
• Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
• Logistic function: The formula used to represent how the independent and dependent variables relate to one another.
The logistic function transforms the input variables into a probability value between 0 and 1, which represents the
likelihood of the dependent variable being 1 or 0.
• Odds: It is the ratio of something occurring to something not occurring. it is different from probability as the
probability is the ratio of something occurring to everything that could possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression,
the log odds of the dependent variable are modeled as a linear combination of the independent variables and the
intercept.
• Coefficient: The logistic regression model’s estimated parameters, show how the independent and dependent variables
relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log odds when all independent
variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which
chooses the coefficients that maximize the likelihood of the observed data.
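The odds and log-odds definitions above can be checked with a quick computation (the probability p here is an arbitrary example):

```python
import math

# Odds and log-odds (logit) for a probability p; p is an example value.
p = 0.8
odds = p / (1 - p)          # occurring vs not occurring
log_odds = math.log(odds)   # the logit

print(round(odds, 4), round(log_odds, 4))  # 4.0 1.3863
```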
Logistic Function – Sigmoid Function
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1. The value of the
logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the “S” form.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the
boundary between the classes 0 and 1: values above the threshold tend towards 1, and
values below the threshold tend towards 0.
σ(z) = 1 / (1 + e^(−z))
where z is the input to the sigmoid function and e is Euler's number, e ≈ 2.718.
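A minimal sketch of the sigmoid function and the 0.5 threshold rule described above:

```python
import math

# The sigmoid (logistic) function and the 0.5 threshold rule.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in (-2, 0, 2):
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0   # threshold at 0.5
    print(z, round(p, 3), label)
```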
Types of Logistic Regression
1. Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
Assumptions of Logistic Regression
We will explore the assumptions of logistic regression, as understanding them is
important to ensure we are applying the model appropriately. The assumptions
include:
1. Independent observations: Each observation is independent of the others, meaning there
is no correlation between observations.
2. Binary dependent variable: The model assumes that the dependent variable is binary or
dichotomous, meaning it can take only two values. For more than two categories, the
softmax function is used.
3. Linearity relationship between independent variables and log odds: The relationship between
the independent variables and the log odds of the dependent variable should be linear.
4. No outliers: There should be no outliers in the dataset.
5. Large sample size: The sample size should be sufficiently large.
Linear Regression vs Logistic Regression:
Linear Regression:
- Used for solving regression problems.
- Predicts the value of continuous variables.
- Output must be a continuous value, such as price, age, etc.
- There may be collinearity between the independent variables.
Logistic Regression:
- Used for solving classification problems.
- Predicts the values of categorical variables.
- Output must be a categorical value, such as 0 or 1, Yes or No, etc.
- There should not be collinearity between the independent variables.