Module 1 Notes
We can analyze data and perform data modeling using regression analysis. Here, we
fit a line or curve to the data points such that the differences between the data
points and the fitted line or curve (the residuals) are minimized.
There are several algorithms we can use to train a regression model to make
predictions with continuous values.
There are various types of regression models for making predictions. These
techniques are mostly distinguished by three prime attributes: the number of
independent variables, the type of dependent variable, and the shape of the
regression line.
1) Simple Linear Regression
Linear regression is the most basic form of regression algorithm in machine learning.
The model has a single independent variable that holds a linear relationship with the
dependent variable. When the number of independent variables increases, it is called
a multiple linear regression model.
y = mx + c + e
where m is the slope of the line, c is an intercept, and e represents the error in the
model.
The best-fit line is determined by varying the values of m and c over different
combinations. The difference between an observed value and the predicted value is
called the predictor error. The values of m and c are selected so as to minimize the
predictor error.
1. Note that a simple linear regression model is susceptible to outliers; hence,
it should not be used in the case of big-sized data.
2. There should be a linear relationship between independent and dependent
variables.
3. There is only one independent variable and one dependent variable.
4. The type of regression line: a best fit straight line.
Simple linear regression allows a data scientist or data analyst to make predictions
about one variable by training the model on another variable. In a similar way, a
multiple regression model extends this to more than one independent variable.
Simple linear regression uses the following linear function to predict the value of a
target variable y from an independent variable x1:
y = b0 + b1x1
After fitting this linear equation to the observed data, we obtain the parameters b0
and b1 that best fit the data by minimizing the squared error.
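As a rough sketch (not part of the original notes), the fit can be computed with NumPy's least-squares solver; the toy data here is invented for illustration:

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Design matrix with a column of ones so the intercept b0 is estimated too.
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes the squared predictor error.
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # should come out close to 1 and 2
```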
3) Polynomial Regression
In polynomial regression, the power of the independent variable is greater than 1.
For example, the equation below represents a polynomial equation:
y = a + bx^2
In this regression technique, the best-fit line is not a straight line. It is rather a curve
that fits the data points.
1. Fitting a higher-degree polynomial to get a lower error can result in
overfitting. Plot the relationships to see the fit, and make sure that the
curve fits the nature of the problem. A sketch of how plotting can help
follows below.
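This is a minimal sketch, assuming NumPy and Matplotlib are available; the data and the degrees chosen are illustrative assumptions. Comparing fits of several degrees visually exposes how the high-degree curve chases the noise:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)  # the true curve is quadratic

xs = np.linspace(-3, 3, 200)
for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)       # fit a polynomial of the given degree
    plt.plot(xs, np.polyval(coeffs, xs), label=f"degree {degree}")

plt.scatter(x, y, color="black", s=15, label="data")
plt.legend()
plt.show()  # the degree-10 curve wiggles through the noise: a visual sign of overfitting
```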
4) Logistic Regression
The logistic function is used in Logistic Regression to create a relation between the
target variable and the independent variables. The equation below denotes logistic
regression, where p is the predicted probability of the positive class:
p = 1 / (1 + e^{-(b0 + b1x1)})
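As an illustrative sketch (fabricated data; scikit-learn assumed available), logistic regression can be fitted and used to produce class probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, and class 1 becomes more likely as x grows.
rng = np.random.default_rng(2)
x = rng.uniform(-4, 4, size=(200, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x[:, 0])))   # the logistic function
y = (rng.uniform(size=200) < p).astype(int)

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimates of b0 and b1
print(model.predict_proba([[1.0]]))    # probability of each class at x = 1
```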
5) Ridge Regression
Ridge Regression is another type of regression in machine learning and is usually used
when there is a high correlation between the independent variables. Under high
collinearity, the least-squares estimates remain unbiased, but their variances become
very large, which makes the model unstable. Therefore, we introduce a small amount
of bias through a penalty matrix in the equation of Ridge Regression. It is a powerful
regression method where the model is less susceptible to overfitting.
Below is the equation used to denote Ridge Regression, where λ (lambda) resolves the
multicollinearity issue:
β = (X^{T}X + λI)^{-1}X^{T}y
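This closed form can be evaluated directly with NumPy. The following is a sketch with invented, deliberately collinear data; λ = 1 is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # make two columns highly correlated
y = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=n)

lam = 1.0  # λ controls the strength of the penalty (assumed value)
# β = (XᵀX + λI)⁻¹ Xᵀy, the ridge regression estimate
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta)
```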
6) Lasso Regression
Lasso Regression performs regularization along with feature selection. It penalizes the
absolute size of the regression coefficients. This results in coefficient values shrinking
toward zero, a property that differs from Ridge Regression.
Therefore, we get feature selection in Lasso Regression: only the required parameters
are kept, and the rest are set to zero. This helps avoid overfitting in the model. But if
independent variables are highly collinear, Lasso Regression chooses only one of them
and shrinks the other variables to zero. The equation below represents the Lasso
Regression objective:
N^{-1} Σ_{i=1}^{N} f(x_{i}, y_{i}, α, β)
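A brief sketch (invented data; scikit-learn's Lasso with an assumed penalty weight alpha) showing coefficients being driven exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
# Only the first two features actually matter in this toy setup.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of the penalty weight
print(model.coef_)                   # irrelevant features get coefficients of exactly 0
```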
7) Bayesian Regression
Bayesian Regression is used to find out the values of the regression coefficients. In
Bayesian linear regression, the posterior distribution of the coefficients is determined
instead of a single least-squares point estimate. Bayesian Linear Regression combines
ideas from Linear Regression and Ridge Regression but is more stable than simple
Linear Regression.
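A minimal sketch using scikit-learn's BayesianRidge, one common implementation of Bayesian linear regression; the data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=0.3, size=100)

model = BayesianRidge().fit(X, y)
# return_std=True gives a per-prediction uncertainty from the posterior
mean, std = model.predict(X[:3], return_std=True)
print(model.coef_)
print(mean, std)
```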
Now, we will look at some tree-based types of regression analysis that can also be used
to train regression models to make predictions with continuous values.
8) Decision Tree Regression
The decision tree, as the name suggests, works on the principle of conditions. It is an
efficient and strong algorithm used for predictive analysis. Its main attributes include
internal nodes, branches, and terminal (leaf) nodes.
Every internal node holds a "test" on an attribute, branches hold the outcomes of the
test, and every leaf node holds a class label (or, for regression, a predicted value).
Decision trees are used for both classification and regression, both of which are
supervised learning tasks. Decision trees are extremely sensitive to the data they are
trained on; small changes to the training set can result in significantly different tree
structures.
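A tiny illustrative sketch with scikit-learn's DecisionTreeRegressor on invented data; max_depth=3 is an assumed setting:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=80)

# max_depth limits how many "tests" can be chained; deeper trees overfit more easily
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.5]]))
```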
9) Random Forest Regression
Random forest addresses this sensitivity by permitting every individual tree to randomly
sample from the dataset with replacement, resulting in different trees. This is known as
bagging (bootstrap aggregating).
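A short sketch of bagging in action via scikit-learn's RandomForestRegressor (toy data; the hyperparameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Each of the 100 trees trains on a bootstrap sample (sampling with replacement),
# and the forest averages their predictions, smoothing out single-tree instability.
forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(forest.predict([[2.5]]))
```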
How to select the right regression model?
Each type of regression model performs differently, and model efficiency depends on
the structure of the data. Different selection criteria help determine which parameters
are necessary for creating predictions. There are a few methods to perform model
selection.
o Adjusted R-squared increases only when a new parameter genuinely improves
the model; low-quality parameters decrease it, and with it the model's efficiency.
o Predicted R-squared is a cross-validation-based measure that decreases when
the model overfits. Cross-validation partitions the data to determine
whether the model generalizes across the dataset.
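To make the two criteria concrete, here is a sketch computing adjusted R-squared from its usual formula, plus a cross-validated R-squared as a stand-in for predicted R-squared; the data is fabricated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
n, p = X.shape
# Adjusted R² penalizes parameters that do not pull their weight
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Cross-validated R²: tests whether the model generalizes to held-out data
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(adjusted_r2, cv_r2)
```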
Conclusion
The different types of regression analysis in data science and machine learning
discussed in this tutorial can be used to build the model depending upon the structure
of the training data in order to achieve optimum model accuracy.
Introduction to Multivariate Regression
In many practical situations, there is more than one variable involved: some are
dependent and some are independent, so single-variable regression is not enough
to analyze this kind of data.
1. Feature selection
The selection of features plays the most important role in multivariate
regression.
2. Normalizing Features
For better analysis, features need to be scaled to bring them into a specific
range. We can also rescale the value of each feature.
There are many algorithms that can be used for reducing the loss, such as
gradient descent (a sketch appears after this list).
4) Create a model that can achieve regression. If you are using linear
regression, use the equation
Y = mx + c
7) The loss/cost function will help us measure how accurate the hypothesis
value is.
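Pulling these steps together, here is a minimal gradient-descent sketch for the hypothesis Y = mx + c with a mean-squared-error cost; the data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 100, size=200)
y = 3.0 * x + 7.0 + rng.normal(scale=5.0, size=200)

# Step 2: normalize the feature to bring it into a comparable range
x_norm = (x - x.mean()) / x.std()

m, c = 0.0, 0.0   # parameters of the hypothesis Y = m*x + c
lr = 0.1          # learning rate (an assumed value)
for _ in range(500):
    y_hat = m * x_norm + c
    error = y_hat - y
    cost = np.mean(error ** 2)            # the loss/cost function (MSE)
    # Gradient descent: move m and c against the gradient of the cost
    m -= lr * 2 * np.mean(error * x_norm)
    c -= lr * 2 * np.mean(error)

print(m, c, cost)   # m and c fit the normalized data; cost should be small
```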