Module 1 Notes

The document discusses regression analysis techniques for predictive modeling. It defines regression analysis as evaluating the relationship between dependent and independent variables to enable forecasting or finding relationships between variables. It then describes 9 common types of regression analysis: simple linear regression, multiple linear regression, polynomial regression, logistic regression, ridge regression, lasso regression, Bayesian linear regression, decision tree regression, and random forest regression. The document explains when each technique is best applied and how models can be selected based on adjusted R-squared, predicted R-squared, and p-values.

MODULE 1

What is Regression Analysis?


A predictive modeling technique that evaluates the relation between a dependent
variable (i.e. the target variable) and independent variables is known as regression
analysis. Regression analysis can be used for forecasting, time series modeling, or
finding the relation between variables in order to predict continuous values. For
example, the relationship between household location and the household's power bill
is best studied through regression.

We can analyze data and perform data modeling using regression analysis. Here, we
fit a line or curve to the data points such that the distances between the data points
and the curve or line are minimized.

Need for Regression techniques


The applications of regression analysis and the regression method of forecasting can
help a small business, and indeed any business, build a better understanding of the
variables (or factors) that can impact its success in the weeks, months, and years
ahead.

Data are the figures that describe a business. Regression analysis helps analyze those
figures so that firms and businesses of any size can make better decisions. Regression
forecasting analyzes the relationships between data points, which can help you peek
into the future.

9 Types of Regression Analysis


The types of regression analysis that we are going to study here are:

1. Simple Linear Regression


2. Multiple Linear Regression
3. Polynomial Regression
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
7. Bayesian Linear Regression

The final two are algorithms that we also use to train regression models to create
predictions with continuous values:

8. Decision Tree Regression


9. Random Forest Regression

There are various types of regression models for creating predictions. These
techniques are distinguished mainly by three prime attributes: the number of
independent variables, the type of dependent variable, and the shape of the
regression line.
1) Simple Linear Regression

Linear regression is the most basic regression algorithm in machine learning. The
model relates a single independent variable to the dependent variable through a
linear relationship. When the number of independent variables increases, the model
is called multiple linear regression.

We denote simple linear regression by the following equation:

y = mx + c + e

where m is the slope of the line, c is an intercept, and e represents the error in the
model.

The best-fit line is determined by varying the values of m and c over different
combinations. The difference between the observed values and the predicted values
is called the prediction error, and the values of m and c are selected so as to
minimize that error.
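As a minimal sketch of that fitting process (the made-up data and the use of scikit-learn here are illustrative assumptions, not part of the original notes):

# Fit y = m*x + c on a small made-up dataset and report m and c.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # single independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # roughly y = 2x

model = LinearRegression().fit(x, y)                # minimizes squared prediction error
print("slope m:", model.coef_[0])                   # ~2.0
print("intercept c:", model.intercept_)             # ~0.0
print("prediction at x=6:", model.predict([[6.0]])[0])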

Points to keep in mind:

1. Note that a simple linear regression model is more susceptible to outliers; hence,
it should not be used with very large datasets.
2. There should be a linear relationship between the independent and dependent
variables.
3. There is only one independent variable and one dependent variable.
4. The type of regression line: a best-fit straight line.

2) Multiple Linear Regression

Simple linear regression allows a data scientist or data analyst to make predictions
about a target variable by training the model on a single predictor variable. A
multiple regression model extends this to more than one predictor variable.

Simple linear regression uses the following linear function to predict the value of a
target variable y from an independent variable x1:

y = b0 + b1x1

After fitting this linear equation to the observed data, we obtain the parameters b0
and b1 that best fit the data by minimizing the squared error.
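Extending the same idea to two predictors, a hedged sketch (synthetic data and scikit-learn are assumptions for illustration):

# Multiple linear regression: y = b0 + b1*x1 + b2*x2 on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print("b0:", model.intercept_)                 # ~1.0
print("b1, b2:", model.coef_)                  # ~[2.0, -3.0]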

Points to keep in mind:

1. Multiple regression can exhibit multicollinearity, autocorrelation, and
heteroscedasticity.
2. Multicollinearity increases the variance of the coefficient estimates and makes
the estimates very sensitive to minor changes in the model. As a result, the
coefficient estimates become unstable.
3. In the case of multiple independent variables, we can use forward selection,
backward elimination, or a stepwise approach for feature selection.

3) Polynomial Regression

In polynomial regression, the power of the independent variable is greater than 1.
The equation below is an example of a polynomial equation:

y = a + bx^{2}

In this regression technique, the best fit line is not a straight line. It is rather a curve
that fits into the data points.

Points to keep in mind:

1. Fitting a higher-degree polynomial to get a lower error can result in
overfitting. Plot the relationship to inspect the fit, and make sure the curve
matches the nature of the problem. The sketch below shows how comparing
fits of different degrees can help:
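A minimal sketch (synthetic quadratic data; numpy is an assumption for illustration):

# Polynomial regression: fit curves of several degrees and compare training error.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=0.2, size=30)   # true curve is quadratic

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
    residual = y - np.polyval(coeffs, x)
    print(degree, "-> training error:", np.sum(residual**2))

# Degree 10 gives the lowest training error but would generalize worse (overfitting);
# plotting np.polyval(coeffs, x) against the data points makes this visible.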
4) Logistic Regression

Logistic regression is a type of regression technique used when the dependent
variable is discrete, for example 0 or 1, or true or false. This means the target
variable can have only two values, and a sigmoid function models the relation
between the target variable and the independent variables.

The logistic function is used in logistic regression to create a relation between the
target variable and the independent variables. The equation below denotes logistic
regression:

log(p / (1 - p)) = b0 + b1x1 + ... + bnxn

where p is the probability of occurrence of the event (i.e. the probability that the
target equals 1).
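A small sketch of the idea (made-up binary data; scikit-learn is assumed for illustration):

# Logistic regression: sigmoid(b0 + b1*x) gives P(y = 1 | x).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])             # discrete binary target

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba([[2.0]])[0, 1]         # probability of class 1 at x = 2
print("P(y=1 | x=2):", p)
print("predicted class:", clf.predict([[2.0]])[0])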


5) Ridge Regression

Ridge regression is another type of regression in machine learning and is usually used
when there is high correlation between the independent variables. With highly
correlated predictors, the least-squares estimates remain unbiased, but their
variances become very large, so the estimates can land far from the true values.
Therefore, we introduce a bias term (the λI matrix below) into the ridge regression
equation, trading a small amount of bias for a large reduction in variance. It is a
powerful regression method in which the model is less susceptible to overfitting.

Below is the equation used to denote ridge regression, where λ (lambda) resolves the
multicollinearity issue:

β = (X^{T}X + λI)^{-1}X^{T}y
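The equation above translates directly into NumPy; a minimal sketch with made-up data (the dataset and λ value are assumptions for illustration):

# Ridge closed form: beta = (X^T X + lambda*I)^{-1} X^T y
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=50)

lam = 0.5                                   # lambda, the regularization strength
I = np.eye(X.shape[1])
# np.linalg.solve is equivalent to applying the matrix inverse in the formula
beta = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print("ridge coefficients:", beta)          # shrunk toward zero relative to plain OLS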

6) Lasso Regression

Lasso regression performs regularization along with feature selection. It penalizes the
absolute size of the regression coefficients. This pushes coefficient values toward
zero, a property that differs from ridge regression, where coefficients shrink but
rarely become exactly zero.
This is why lasso regression performs feature selection: only the required parameters
keep nonzero coefficients, and the rest are made zero. This helps avoid overfitting in
the model. But if independent variables are highly collinear, lasso regression chooses
only one of them and reduces the others to zero. The equation below represents the
lasso regression method:

min_{β} N^{-1}Σ^{N}_{i=1}(y_{i} - x_{i}β)^{2} + αΣ_{j}|β_{j}|
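A short sketch of the zeroing-out behavior (synthetic data and scikit-learn are illustrative assumptions):

# Lasso drives the coefficients of irrelevant features exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)  # only 2 features matter

model = Lasso(alpha=0.1).fit(X, y)           # alpha is the penalty strength
print("coefficients:", model.coef_)          # the three irrelevant features come out ~0.0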

7) Bayesian Linear Regression

Bayesian regression is used to find the values of the regression coefficients. In
Bayesian linear regression, the posterior distribution of the coefficients is determined
instead of the point estimates found by least squares. Bayesian linear regression
combines ideas from linear regression and ridge regression and is more stable than
simple linear regression.
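One practical consequence of having a posterior over the coefficients is that predictions come with uncertainty estimates. A hedged sketch using scikit-learn's BayesianRidge (the data are made up for illustration):

# Bayesian linear regression: coefficients get a posterior distribution,
# so each prediction has a mean and a standard deviation.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=80)

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[1.0, -1.0]], return_std=True)  # predictive mean and std
print("prediction:", mean[0], "+/-", std[0])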

Now, we will look at two tree-based types of regression analysis that can be used to
train regression models to create predictions with continuous values.

8) Decision Tree Regression

The decision tree, as the name suggests, works on the principle of conditions. It is
efficient and provides strong algorithms for predictive analysis. Its main attributes
are internal nodes, branches, and terminal (leaf) nodes.

Every internal node holds a "test" on an attribute, branches hold the outcomes of the
test, and every leaf node holds the class label (or, for regression, a predicted value).
Decision trees are used for both classification and regression, which are both
supervised learning tasks. Decision trees are extremely sensitive to the data they are
trained on: small changes to the training set can result in fundamentally different
tree structures.
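A minimal sketch (synthetic data and scikit-learn are assumptions for illustration):

# Decision tree regression: piecewise-constant predictions from learned splits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # limit depth to curb overfitting
print(tree.predict([[2.5]]))   # the mean of the training targets in the matching leaf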
9) Random Forest Regression

Random forest, as its name suggests, comprises a large number of individual decision
trees that work as a group or, as they say, an ensemble. Every individual decision tree
in the random forest produces a prediction; for classification, the class with the most
votes becomes the model's prediction, while for regression the trees' predictions are
averaged.

Random forest achieves this by permitting every individual tree to randomly sample
from the dataset with replacement, which brings about different trees. This is known
as bagging.
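A short sketch of bagging in action (made-up data; scikit-learn is assumed for illustration):

# Random forest regression: bag many trees and average their outputs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[2.5]]))   # the average of the 100 trees' predictions
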
How to select the right regression model?
Each type of regression model performs differently and the model efficiency depends
on the data structure. Different types of algorithms help determine which parameters
are necessary for creating predictions. There are a few methods to perform model
selection.

1. Adjusted R-squared and predicted R-squared: Models with larger adjusted
and predicted R-squared values are more efficient. These statistics help you
avoid the fundamental problem with regular R-squared: it always increases
when you add an independent variable. That property can lead to more
complex models, which can sometimes produce misleading results.


o Adjusted R-squared increases only when a new parameter improves the
model; low-quality parameters can decrease it.
o Predicted R-squared is computed by cross-validation and decreases when
the model fails to generalize. Cross-validation partitions the data to
determine whether the model is a generic model for the dataset.

2. P-values for the independent variables: In regression, p-values smaller than
the significance level indicate that a term is statistically significant (see the
sketch after this list). "Reducing the model" is the process of starting with all
the parameters in the model and then repeatedly removing the term with the
highest non-significant p-value until the model contains only significant terms.
3. Stepwise regression and best subsets regression: These are two automated
model-selection algorithms that pick which independent variables to include
in the regression equation. When there is a huge number of independent
variables and a variable-selection process is required, these automated
methods can be very helpful.
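A hedged sketch of reading these diagnostics (synthetic data; the statsmodels library is an assumption for illustration, as it reports adjusted R-squared and per-term p-values directly):

# Inspect adjusted R-squared and p-values for each independent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))                  # the third feature is irrelevant
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()
print("adjusted R-squared:", results.rsquared_adj)
print("p-values:", results.pvalues)            # the irrelevant term has a large p-value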

Conclusion
The different types of regression analysis in data science and machine learning
discussed in this tutorial can be used to build the model depending upon the structure
of the training data in order to achieve optimum model accuracy.
Introduction to Multivariate Regression

Multivariate regression is a type of machine learning algorithm that
involves multiple data variables for analysis. It is mostly considered a
supervised machine learning algorithm. The steps involved in multivariate
regression analysis are feature selection and feature engineering,
normalizing the features, selecting the loss function and hypothesis,
setting the hypothesis parameters, minimizing the loss function, testing
the hypothesis, and generating the regression model. The major advantage
of multivariate regression is that it identifies the relationships among
the variables associated with the dataset. It helps to find the correlation
between the dependent variable and multiple independent variables.
Multivariate linear regression is a commonly used machine learning
algorithm.

Why will a single regression model not work?


 As we know, regression analysis is mainly used to explore the
relationship between a dependent variable and an independent variable.
 In the real world, there are many situations where several independent
variables are influential, and for those we have to move beyond a single
regression model, which can take only one independent variable.

What is Multivariate Regression?


 Multivariate regression helps us measure the relationship between more
than one independent variable and more than one dependent variable. It
finds the relation between the variables (linearly related).
 It is used to predict the behavior of the outcome variable, the
association of the predictor variables, and how the predictor variables
are changing.
 It can be applied to many practical fields like politics, economics,
medicine, research work, and many different kinds of businesses.
 Multivariate regression is a simple extension of multiple regression.
 Multiple regression is used to predict the value of one variable based
on the collective values of more than one predictor variable.
 First, we will take an example to understand the use of multivariate
regression; after that, we will look at the solution to that issue.

Examples of Multivariate Regression


 An e-commerce company has collected data on its customers, such as age,
purchase history, and gender, and wants to find the relationships
between these different dependent and independent variables.
 A gym trainer has collected data on the clients coming to his gym and
wants to observe several things about each client: health, eating
habits (which kinds of products the client consumes every week), and
the client's weight. He wants to find the relations between these
variables.

As you have seen in the two examples above, in both situations there is
more than one variable, some dependent and some independent, so a single
regression is not enough to analyze this kind of data.

This is where multivariate regression comes into the picture.

1. Feature selection
The selection of features plays the most important role in multivariate
regression: we find the features that the dependent variable actually
depends on.

2. Normalizing Features
For better analysis, the features need to be scaled to bring them into a
specific range; scaling changes the value of each feature accordingly.

3. Select Loss Function and Hypothesis
The hypothesis is the function that predicts the target value from the
feature variables, and the loss function calculates the loss when the
hypothesis predicts a wrong value.

4. Set Hypothesis Parameters
Set the hypothesis parameters so that they reduce the loss function and
produce accurate predictions.

5. Minimize the Loss Function
Minimize the loss by running a loss-minimization algorithm over the
dataset; this adjusts the hypothesis parameters. Once the loss is
minimized, the model can be used for prediction.

There are many algorithms that can be used for reducing the loss, such
as gradient descent.

6. Test the Hypothesis Function
Check how correctly the hypothesis function predicts values by testing
it on test data.

Steps to follow to achieve Multivariate Regression

1) Import the necessary common libraries such as numpy and pandas.

2) Read the dataset using the pandas library.

3) As we have discussed above, normalize the data to get better results.
Why normalization? Because every feature has a different range of values.

4) Create a model that performs the regression. If you are using linear
regression, use the equation

Y = mx + c

in which x is the given input, m is the slope of the line, c is a
constant, and Y is the output variable.

5) Train the model using hyperparameters such as the learning rate,
epochs, and iterations; understand each hyperparameter and set it
according to the model.

6) As discussed above, the hypothesis plays an important role in the
analysis; check the hypothesis and measure the loss/cost function.

7) The loss/cost function helps us measure how true and accurate the
hypothesis values are.

8) Minimizing the loss/cost function helps the model improve its
predictions.

9) The loss can be defined as the sum of the squared differences
between the predicted values and the actual values, divided by twice
the size of the dataset.
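Written in the same notation as the earlier equations (this is step 9 restated in symbols, with N the size of the dataset, y_{i} the actual value, and ŷ_{i} the predicted value):

L = (2N)^{-1}Σ^{N}_{i=1}(ŷ_{i} - y_{i})^{2}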

10) To minimize the loss/cost function, use gradient descent: it starts
from a random value and finds the point where the loss function is
least.

By following the above steps, we can implement multivariate regression;
a compact sketch of the whole loop follows.
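A minimal NumPy sketch of steps 1 through 10 (the synthetic dataset, learning rate, and epoch count are assumptions for illustration, not part of the original notes):

# Normalize features, set a hypothesis y_hat = X @ w + b,
# and minimize the squared loss with gradient descent.
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(200, 3))          # three features, different ranges
y = X @ np.array([0.5, -1.2, 2.0]) + 4.0 + rng.normal(scale=1.0, size=200)

X = (X - X.mean(axis=0)) / X.std(axis=0)        # step 3: normalize the features

w = np.zeros(3)                                 # hypothesis parameters
b = 0.0
lr = 0.1                                        # step 5: learning-rate hyperparameter
N = len(y)

for epoch in range(500):                        # step 10: gradient descent
    y_hat = X @ w + b                           # the hypothesis
    error = y_hat - y
    loss = (error ** 2).sum() / (2 * N)         # step 9: the loss defined above
    w -= lr * (X.T @ error) / N                 # gradient step for the weights
    b -= lr * error.sum() / N                   # gradient step for the bias

print("final loss:", loss, "weights:", w, "bias:", b)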

Advantages of Multivariate Regression


 The multivariate technique allows finding a relationship between
variables or features
 It helps to find a correlation between independent and dependent
variables.

Disadvantages of Multivariate Regression


 Multivariate techniques are somewhat complex and require high-level
mathematical calculations.
 The multivariate regression model's output is not always easy to
interpret, sometimes because the loss and error outputs are not
identical.
 It is not well suited to small datasets; results are more reliable
with larger datasets.

Conclusion- Multivariate Regression


 The main reason to use multivariate regression is when more than one
variable is available; in that case, single linear regression will
not work.
 The real world mostly involves multiple variables or features, and
when multiple variables/features come into play, multivariate
regression is used.
