Module 1 Notes
We can analyze data and perform data modeling using regression analysis. Here, we
fit a line or curve to the data points such that the differences between the data
points and the fitted line or curve (the residuals) are minimized.
There are several algorithms we can use to train a regression model to make
predictions with continuous values.
There are various types of regression models for making predictions. These
techniques are mostly distinguished by three prime attributes: the number of
independent variables, the type of dependent variable, and the shape of the
regression line.
1) Simple Linear Regression
Linear regression is the most basic form of regression algorithm in machine learning.
The model has a single independent variable that holds a linear relationship with the
dependent variable. When the number of independent variables increases, it is called
a multiple linear regression model.
y = mx + c + e
where m is the slope of the line, c is an intercept, and e represents the error in the
model.
The best-fit line is determined by varying the values of m and c over different
combinations. The difference between an observed value and the predicted value is
called the predictor error. The values of m and c are selected so as to minimize the
predictor error.
1. Note that a simple linear regression model is susceptible to outliers; hence,
it should not be used in the case of big-sized data.
2. There should be a linear relationship between independent and dependent
variables.
3. There is only one independent variable and one dependent variable.
4. The type of regression line: a best fit straight line.
Simple linear regression allows a data scientist or data analyst to make predictions
about one variable by training the model on another variable. In a similar way, a
multiple regression model extends this to more than one independent variable.
Simple linear regression uses the following linear function to predict the value of a
target variable y from an independent variable x1:
y = b0 + b1x1
After fitting this linear equation to the observed data, we obtain the parameters b0
and b1 that best fit the data by minimizing the squared error.
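As a rough sketch (not part of the original notes), the fit can be computed with NumPy's least-squares solver; the toy data here is invented for illustration:

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Design matrix with a column of ones so the intercept b0 is estimated too.
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes the squared predictor error.
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # should come out close to 1 and 2
```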
3) Polynomial Regression
In polynomial regression, the power of the independent variable is greater than 1.
For example, the equation below represents a polynomial equation:
y = a + bx^2
In this regression technique, the best-fit line is not a straight line. It is rather a curve
that fits the data points.
1. Fitting a higher-degree polynomial to get a lower error can result in
overfitting. Plot the relationships to see the fit, and make sure that the
curve fits the nature of the problem. A sketch of how plotting can help
follows below.
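This is a minimal sketch, assuming NumPy and Matplotlib are available; the data and the degrees chosen are illustrative assumptions. Comparing fits of several degrees visually exposes how the high-degree curve chases the noise:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)  # the true curve is quadratic

xs = np.linspace(-3, 3, 200)
for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)       # fit a polynomial of the given degree
    plt.plot(xs, np.polyval(coeffs, xs), label=f"degree {degree}")

plt.scatter(x, y, color="black", s=15, label="data")
plt.legend()
plt.show()  # the degree-10 curve wiggles through the noise: a visual sign of overfitting
```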
4) Logistic Regression
The logistic function is used in Logistic Regression to create a relation between the
target variable and the independent variables. The equation below denotes logistic
regression, where p is the predicted probability of the positive class:
p = 1 / (1 + e^{-(b0 + b1x1)})
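As an illustrative sketch (fabricated data; scikit-learn assumed available), logistic regression can be fitted and used to produce class probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, and class 1 becomes more likely as x grows.
rng = np.random.default_rng(2)
x = rng.uniform(-4, 4, size=(200, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x[:, 0])))   # the logistic function
y = (rng.uniform(size=200) < p).astype(int)

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimates of b0 and b1
print(model.predict_proba([[1.0]]))    # probability of each class at x = 1
```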
5) Ridge Regression
Ridge Regression is another type of regression in machine learning and is usually used
when there is a high correlation between the independent variables. Under high
collinearity, the least-squares estimates remain unbiased, but their variances become
very large, which makes the model unstable. Therefore, we introduce a small amount
of bias through a penalty matrix in the equation of Ridge Regression. It is a powerful
regression method where the model is less susceptible to overfitting.
Below is the equation used to denote Ridge Regression, where λ (lambda) resolves the
multicollinearity issue:
β = (X^{T}X + λI)^{-1}X^{T}y
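This closed form can be evaluated directly with NumPy. The following is a sketch with invented, deliberately collinear data; λ = 1 is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # make two columns highly correlated
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=n)

lam = 1.0  # λ controls the strength of the penalty (assumed value)
# β = (XᵀX + λI)⁻¹ Xᵀy, the ridge regression estimate
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta)
```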
6) Lasso Regression
Lasso Regression performs regularization along with feature selection. It penalizes the
absolute size of the regression coefficients. This results in coefficient values shrinking
toward zero, a property that differs from Ridge Regression.
Therefore, we get feature selection in Lasso Regression: only the required parameters
are kept, and the rest are set to zero. This helps avoid overfitting in the model. But if
independent variables are highly collinear, Lasso Regression chooses only one of them
and shrinks the other variables to zero. The equation below represents the Lasso
Regression objective:
N^{-1} Σ_{i=1}^{N} f(x_{i}, y_{i}, α, β)
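A brief sketch (invented data; scikit-learn's Lasso with an assumed penalty weight alpha) showing coefficients being driven exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
# Only the first two features actually matter in this toy setup.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of the penalty weight
print(model.coef_)                   # irrelevant features get coefficients of exactly 0
```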
7) Bayesian Regression
Bayesian Regression is used to find out the values of the regression coefficients. In
Bayesian linear regression, the posterior distribution of the coefficients is determined
instead of a single least-squares point estimate. Bayesian Linear Regression combines
ideas from Linear Regression and Ridge Regression but is more stable than simple
Linear Regression.
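A minimal sketch using scikit-learn's BayesianRidge, one common implementation of Bayesian linear regression; the data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=0.3, size=100)

model = BayesianRidge().fit(X, y)
# return_std=True gives a per-prediction uncertainty from the posterior
mean, std = model.predict(X[:3], return_std=True)
print(model.coef_)
print(mean, std)
```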
Now, we will look at some tree-based types of regression analysis that can also be used
to train regression models to make predictions with continuous values.
8) Decision Tree Regression
The decision tree, as the name suggests, works on the principle of conditions. It is an
efficient and strong algorithm used for predictive analysis. Its main attributes include
internal nodes, branches, and terminal (leaf) nodes.
Every internal node holds a "test" on an attribute, branches hold the outcomes of the
test, and every leaf node holds a class label (or, for regression, a predicted value).
Decision trees are used for both classification and regression, both of which are
supervised learning tasks. Decision trees are extremely sensitive to the data they are
trained on; small changes to the training set can result in significantly different tree
structures.
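A tiny illustrative sketch with scikit-learn's DecisionTreeRegressor on invented data; max_depth=3 is an assumed setting:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=80)

# max_depth limits how many "tests" can be chained; deeper trees overfit more easily
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.5]]))
```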
9) Random Forest Regression
Random forest addresses this sensitivity by permitting every individual tree to randomly
sample from the dataset with replacement, resulting in different trees. This is known as
bagging (bootstrap aggregating).
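A short sketch of bagging in action via scikit-learn's RandomForestRegressor (toy data; the hyperparameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Each of the 100 trees trains on a bootstrap sample (sampling with replacement),
# and the forest averages their predictions, smoothing out single-tree instability.
forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(forest.predict([[2.5]]))
```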
How to select the right regression model?
Each type of regression model performs differently, and model efficiency depends on
the structure of the data. Different selection criteria help determine which parameters
are necessary for creating predictions. There are a few methods to perform model
selection.
o Adjusted R-squared increases only when a new parameter genuinely improves
the model; low-quality parameters decrease it, and with it the model's efficiency.
o Predicted R-squared is a cross-validation-based measure that decreases when
the model overfits. Cross-validation partitions the data to determine
whether the model generalizes across the dataset.
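To make the two criteria concrete, here is a sketch computing adjusted R-squared from its usual formula, plus a cross-validated R-squared as a stand-in for predicted R-squared; the data is fabricated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
n, p = X.shape
# Adjusted R² penalizes parameters that do not pull their weight
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Cross-validated R²: tests whether the model generalizes to held-out data
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(adjusted_r2, cv_r2)
```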
Conclusion
The different types of regression analysis in data science and machine learning
discussed in this tutorial can be used to build the model depending upon the structure
of the training data in order to achieve optimum model accuracy.
Introduction to Multivariate Regression
In many practical situations, there is more than one variable involved: some are
dependent and some are independent, so single-variable regression is not enough
to analyze this kind of data.
1. Feature selection
The selection of features plays the most important role in multivariate
regression.
2. Normalizing Features
For better analysis, features need to be scaled to bring them into a specific
range. We can also rescale the value of each feature.
There are many algorithms that can be used for reducing the loss, such as
gradient descent (a sketch appears after this list).
4) Create a model that can achieve regression. If you are using linear
regression, use the equation
Y = mx + c
7) The loss/cost function will help us measure how accurate the hypothesis
value is.
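Pulling these steps together, here is a minimal gradient-descent sketch for the hypothesis Y = mx + c with a mean-squared-error cost; the data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 100, size=200)
y = 3.0 * x + 7.0 + rng.normal(scale=5.0, size=200)

# Step 2: normalize the feature to bring it into a comparable range
x_norm = (x - x.mean()) / x.std()

m, c = 0.0, 0.0   # parameters of the hypothesis Y = m*x + c
lr = 0.1          # learning rate (an assumed value)
for _ in range(500):
    y_hat = m * x_norm + c
    error = y_hat - y
    cost = np.mean(error ** 2)            # the loss/cost function (MSE)
    # Gradient descent: move m and c against the gradient of the cost
    m -= lr * 2 * np.mean(error * x_norm)
    c -= lr * 2 * np.mean(error)

print(m, c, cost)   # m and c fit the normalized data; cost should be small
```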