REGRESSION ANALYSIS
Introduction to Regression
Regression analysis is a fundamental set of statistical and machine learning methods that predict
a continuous outcome variable (y) based on the value of one or more predictor variables (x).
It models the relationship between a dependent (target) variable and one or more independent
(predictor) variables.
Regression is a supervised learning technique that helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time series modelling, and investigating
cause-and-effect relationships between variables.
Regression fits a line or curve to the data points on a target-predictor graph in such a way that
the vertical distances between the data points and the regression line are as small as possible.
The distance between datapoints and line tells whether a model has captured a strong relationship or
not.
• The function of regression analysis is given by: y = f(x)
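The idea of keeping the vertical distances small can be sketched in a few lines of Python. The sketch assumes the usual least-squares criterion (the sum of squared vertical distances), which the text above does not name explicitly. The data points and the three candidate lines are made up for illustration.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (assumed for this sketch), roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Candidate lines given as (slope, intercept) pairs.
candidates = {
    "y = 2x + 1": (2.0, 1.0),
    "y = 1.5x + 2": (1.5, 2.0),
    "y = 3x": (3.0, 0.0),
}

errors = {name: sum_squared_residuals(xs, ys, a, b)
          for name, (a, b) in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: {err:.2f}")
print("best candidate:", best)
```

The line that tracks the points most closely gets the smallest total. A regression procedure searches over all possible lines rather than three hand-picked ones.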
3. Neutral Correlation: There is no relationship between the changes in variables X and Y. In this
case, the values appear completely random and do not show any sign of correlation, as shown in the
accompanying image.
Causation
Causation is a relationship between two variables in which x causes y. This is sometimes written as
x implies y. Regression is different from causation: a regression model can show that two variables
are associated, but that alone does not establish that one causes the other. Causation indicates
that one event is the result of the occurrence of the other event, i.e. there is a causal
relationship between the two events.
Whether the relationship between the input features (variables) and the output (target) variable is
linear or non-linear is fundamental. This distinction has significant implications for the choice
of algorithms, model complexity, and predictive performance.
A linear relationship creates a straight line when plotted on a graph; a non-linear relationship
creates a curve instead.
Example:
Linear: the relationship between the hours spent studying and the grades obtained in a class.
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one variable is
associated with a proportional change in another variable. Mathematically, it can be represented as
y = a * x + b, where y is the output, x is the input, and a and b are constants.
Linear Models: The goal is to find the best-fitting line (a plane in higher dimensions) for the data points.
Linear models are interpretable and work well when the relationship between variables is close to
being linear.
Limitations: Linear models may perform poorly when the relationship between variables is non-
linear. In such cases, they may underfit the data, meaning they are too simple to capture the
underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is not
proportional to the change in another variable. Non-linear relationships can take various forms,
such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support vector
machines with non-linear kernels, and neural networks can capture nonlinear relationships. These
models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data are
complex or when interactions between variables are non-linear. They have the capacity to capture
intricate patterns.
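As a minimal illustration of a non-linear model, the sketch below fits a depth-1 regression tree (often called a stump), the simplest form of the decision trees mentioned above. It is a teaching sketch, not a production implementation: the helper names `fit_stump` and `predict_stump` and the step-shaped data are hypothetical, and the code assumes all x values are distinct.

```python
def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the split that minimises squared error.

    Assumes all x values are distinct. Returns (threshold, left_mean, right_mean).
    """
    points = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(points)):
        # Candidate threshold halfway between two neighbouring x values.
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(model, x):
    """Predict the mean of the side of the split that x falls on."""
    threshold, left_mean, right_mean = model
    return left_mean if x <= threshold else right_mean

# Hypothetical step-shaped data that no single straight line fits well.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 11, 10, 11, 30, 31, 30, 31]

model = fit_stump(xs, ys)
print("threshold:", model[0])
print("prediction at x = 2:", predict_stump(model, 2))
print("prediction at x = 7:", predict_stump(model, 7))
```

The stump captures the jump between x = 4 and x = 5 with a single split. Deeper trees and ensembles such as random forests repeat the same idea to follow more complex shapes.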
Types of Regression
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is used
when there is a single independent variable (predictor) and one dependent variable (target).
Purpose: Linear regression is used to establish a linear relationship between two variables and
make predictions based on this relationship. It's suitable for simple scenarios where there's only one
predictor.
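For a single predictor, the least-squares line has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies it to hypothetical study-hours and grade figures invented for illustration.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return slope, intercept

# Hypothetical data: hours studied and grade obtained.
hours = [1, 2, 3, 4, 5]
grades = [52, 58, 61, 70, 74]

slope, intercept = fit_simple_linear(hours, grades)
predicted = slope * 6 + intercept  # predicted grade for 6 hours of study

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
print(f"predicted grade for 6 hours: {predicted:.1f}")
```

With these made-up numbers the fitted line is roughly y = 5.6x + 46.2, so each extra hour of study is associated with about 5.6 more grade points.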
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Purpose: Multiple regression allows you to model the relationship between the dependent variable
and multiple predictors simultaneously. It is used when there are multiple factors that may influence
the target variable, and you want to understand their combined effect and make predictions based
on all these factors.
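With several predictors, the coefficients can be found by solving the normal equations (X^T X) b = X^T y. The sketch below does this in plain Python so that every step is visible; in practice a numerical library would normally be used. The function names and the data are hypothetical, and the targets are generated exactly from y = 1 + 2*x1 + 3*x2 so the result can be checked.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_regression(rows, ys):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical predictors (x1, x2) with targets from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coeffs = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coeffs])
```

Because the targets contain no noise, the recovered coefficients match the generating values 1, 2 and 3 up to rounding. Each coefficient describes the effect of its predictor with the other predictor held fixed.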
Polynomial Regression:
Polynomial regression is a form of multiple regression in which powers of the independent variable
(x^2, x^3, ...) serve as additional predictors. It is used when the relationship between
the independent and dependent variables is non-linear.
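Written out, a polynomial model of degree d in a single predictor x can be sketched as follows. The coefficient symbols \beta_k and the error term \varepsilon are notation assumed here, not taken from the text above.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon
```

The curve is non-linear in x but linear in the coefficients, so treating x, x^2, ..., x^d as separate predictors lets the same least-squares machinery used for multiple regression estimate them.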
Logistic Regression:
Logistic regression is used when the dependent variable is binary (0 or 1). It models the probability
of the dependent variable belonging to a particular class.
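A minimal sketch of the idea, assuming a single predictor and plain gradient descent on the log loss (one of several possible fitting methods; the text above does not specify one). The pass/fail data, the learning rate and the number of epochs are hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors (predicted probability minus actual label).
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
# The classes overlap slightly, so the fitted weights stay finite.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 2 + b)   # estimated pass probability after 2 hours
p_high = sigmoid(w * 7 + b)  # estimated pass probability after 7 hours

print(f"P(pass | 2 hours) = {p_low:.2f}")
print(f"P(pass | 7 hours) = {p_high:.2f}")
```

The model outputs a probability rather than a class. A class label is obtained by applying a cut-off, commonly 0.5, to that probability.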
Limitations of Regression
Regression analysis is a powerful statistical method used to examine the relationship between one
dependent variable and one or more independent variables. However, it has some limitations that
should be considered when interpreting results:
1. Linearity Assumption:
- Linear regression assumes a linear relationship between the dependent and independent variables.
If the true relationship is non-linear, the results may be inaccurate.
2. Assumption of Independence:
- The observations should be independent of each other. If there is dependence among observations,
it can lead to biased and inefficient estimates.
3. Homoscedasticity Assumption:
- Homoscedasticity means constant variance of the errors across all levels of the independent variable.
If the variance of the errors is not constant (heteroscedasticity), the estimates become less
efficient and the usual standard errors can be unreliable.
4. Normality of Residuals:
- While regression does not require the normal distribution of variables, it assumes normality of the
residuals. Departure from normality may affect the reliability of statistical inferences.
5. Multicollinearity:
- High correlation among independent variables can lead to multicollinearity. This makes it
challenging to assess the individual contributions of each variable and can inflate standard errors.
6. Outliers and Influential Points:
- Outliers and influential data points can significantly impact regression results. They may distort
the parameter estimates and affect the overall fit of the model.
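One common way to quantify the multicollinearity described in point 5 is the variance inflation factor (VIF), which the text above does not mention and is introduced here as an illustration. For a model with exactly two predictors it reduces to 1 / (1 - r^2), where r is the correlation between them. The predictor names and values below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors for a sales model.
ad_spend = [10, 20, 30, 40, 50, 60]
promo_spend = [12, 19, 33, 38, 52, 61]  # moves almost in step with ad_spend
store_visits = [5, 1, 4, 2, 6, 3]       # only weakly related to ad_spend

vif_high = vif_two_predictors(ad_spend, promo_spend)
vif_low = vif_two_predictors(ad_spend, store_visits)

print(f"ad_spend vs promo_spend:  VIF = {vif_high:.1f}")
print(f"ad_spend vs store_visits: VIF = {vif_low:.2f}")
```

A VIF close to 1 means the predictor shares little information with the other one. Values above roughly 5 to 10 are often treated as a warning sign, although that cut-off is a rule of thumb rather than a strict test.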
Regression analysis is a statistical method used to examine the relationship between one
dependent variable and one or more independent variables. Its application is widespread across
various fields due to its versatility and ability to model and analyze relationships between variables.
2. Marketing:
Sales Forecasting: Regression analysis can be used to predict future sales based on variables
such as advertising spending, promotional activities, and market trends.
Market Research: It helps analyze the impact of different variables on consumer behavior,
such as the influence of price or brand recognition on product sales.
3. Healthcare:
Disease Prediction: Regression models can be applied to predict the likelihood of disease
based on various health indicators.
Drug Dosage Determination: Regression analysis helps determine the appropriate dosage of
a drug based on patient characteristics.
4. Environmental Science:
Climate Change Modelling: Regression can be used to analyze the relationship between
environmental variables (temperature, precipitation) and changes in climate.
Pollution Impact Assessment: Regression models help assess the impact of various factors
on pollution levels.
6. Engineering:
Quality Control: Regression helps analyze the relationship between manufacturing
parameters and product quality.
System Reliability: Regression models can be used to predict the reliability of systems based
on various operational parameters.
7. Education:
Student Performance: Regression analysis can be applied to understand the factors
influencing student performance, such as study hours, attendance, and socio-economic
background.
Teacher Effectiveness: Regression models help analyze the impact of different teaching
methods or interventions on student outcomes.
8. Sports Analytics:
Player Performance Prediction: Regression analysis is used to predict player performance
based on various statistics and playing conditions.
Team Success Prediction: It helps analyze the factors influencing the success of a sports
team, such as player skill, coaching strategies, and team dynamics.
These are just a few examples, and regression analysis can be applied in many other fields to
gain insights, make predictions, and inform decision-making based on observed data.