Machine Learning and Linear Regression
Introduction to Machine Learning
• The ability to perform the tasks comes from an underlying model, which is the result of the learning process.
• The model is generated from huge volumes of data, huge in both breadth and depth, reflecting the real world
in which the processes are performed.
• The data is searched for patterns in the form of trends, cycles, associations, etc.
• Too many permutations and combinations are possible to enumerate by hand, e.g. genetic code mapping.
Machine Learning Happens in a Mathematical Space
• Data representing the real world is a collection of attributes that define an entity.
Supervised Machine Learning
• The model thus generated is used to make predictions about future instances.
E.g. building a model to predict the resale value of a car based on its mileage, age, colour, etc.
• The term “linear” in linear regression refers to the fact that the method models the data
as a linear combination of the explanatory variables.
Least Squares is a method that uses the sample data to find the values of b0
and b1 that minimize the sum of squared differences between the observed values of y
(denoted by yi) and the estimated values of y (denoted by ŷi).
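The closed-form least-squares estimates can be sketched directly from the definition above. The data here is made up purely for illustration:

```python
import numpy as np

# Hypothetical sample: x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([66.0, 69.0, 76.0, 80.0, 85.0])

# Least-squares estimates:
#   b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x  # estimated values ŷi
print(b0, b1)
```

Any other choice of b0, b1 would give a larger sum of squared residuals on this sample.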
Applying the Least Squares method:
b0 = 60
b1 = 5
ŷ = 60 + 5X
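With the fitted line ŷ = 60 + 5X, a prediction is just a plug-in:

```python
# Fitted line from the slide: y_hat = 60 + 5 * X
def predict(x):
    return 60 + 5 * x

print(predict(4))   # 80
print(predict(10))  # 110
```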
Coefficient of Determination
SST, a.k.a. the Total Sum of Squares
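The coefficient of determination can be computed as R² = 1 − SSE/SST, where SSE is the residual (error) sum of squares. A minimal sketch with hypothetical observed and fitted values:

```python
import numpy as np

y = np.array([65.0, 72.0, 74.0, 81.0, 84.0])       # observed values
y_hat = np.array([65.2, 70.1, 75.0, 79.9, 84.8])   # fitted values (hypothetical)

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # residual (error) sum of squares
r2 = 1 - sse / sst                  # coefficient of determination
print(r2)
```

R² close to 1 means the regression explains most of the variation in y.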
Transform the data
• to remove skew
• positive skew – apply a log transform
• negative skew – square the variable
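The effect of a log transform on a positively skewed variable can be checked numerically. The sample below is hypothetical; skewness is computed as the third standardized moment:

```python
import numpy as np

def sample_skew(a):
    # Third standardized moment: positive -> right tail, negative -> left tail
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** 3) / np.std(a) ** 3

# Hypothetical positively skewed sample: a few large values stretch the right tail
x = np.array([1, 2, 2, 3, 3, 3, 4, 10, 25, 60], dtype=float)
log_x = np.log(x)  # log transform compresses the right tail

print(sample_skew(x), sample_skew(log_x))
```

The skewness drops markedly after the transform, which is why log-transforming a right-skewed predictor often helps the regression.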
Assumption:
Heteroscedasticity
• The word “heteroscedasticity” comes from the Greek heteros → different,
• and scedasticity → conditional variance, i.e. the variation in the residuals given X.
The OLS assumption is that this variation is independent of X.
• So if there is no heteroscedasticity, there is homoscedasticity, i.e. uniform variance.
• Heteroscedasticity: when the error variance changes in a
systematic pattern with changes in the X value.
Remedies for Heteroscedasticity:
A critical question: why is it so important to detect
heteroscedasticity?
• Biased standard-error estimation: inferences based on OLS
estimates become incorrect (though the coefficient estimates themselves remain
unbiased).
Possible solution: transformation
• Looking at the scatterplot of the squared residuals against X,
decide on an appropriate transformation of X.
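The squared-residuals-vs-X idea can be sketched numerically: fit an OLS line, then compare the squared residuals at low and high X. The data below is simulated to be heteroscedastic on purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Hypothetical data whose error spread grows with x (heteroscedastic by design)
y = 3 + 2 * x + rng.normal(0, 0.5 * x)

b1, b0 = np.polyfit(x, y, 1)     # OLS fit
resid = y - (b0 + b1 * x)

# Crude check in the spirit of the squared-residuals-vs-X plot:
# compare mean squared residuals in the lower and upper halves of x
lo = np.mean(resid[:100] ** 2)
hi = np.mean(resid[100:] ** 2)
print(hi / lo)  # well above 1 here, signalling error variance rising with x
```

A ratio near 1 would suggest homoscedasticity; a large ratio points toward a transformation of X or y (or weighted least squares) as a remedy.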
Regression Diagnostics
What is Collinearity?
Effects of collinearity
Detection of multicollinearity: Simple Signs
How to detect Multicollinearity
• None of the coefficients has a significant t-statistic
• Pair-wise correlations are high
• A negative coefficient appears when theory suggests a positive
relationship
How to Measure Collinearity
Multicollinearity: Detection & Removal
• Independent variables have significant correlation with each other
• Check with the Variance Inflation Factor (VIF)
• VIF > 5 => multicollinearity
• Why is it a problem?
– It inflates the standard errors and destabilizes the coefficients of the
independent variables
• If more than one variable has VIF > 5, at least one of them must be
removed
• Remove them one by one, and keep the model that maximizes R2
VIF-Steps
• VIFj measures the proportion of the variance of one predictor explained by all the other
predictors
• VIFj = 1/(1 − R2j), where R2j is the R2 from regressing predictor j on the remaining predictors
• VIF = 1 indicates no collinearity
• Compute the VIF for all predictors
• Drop the one with the largest VIF above the cut-off (common cut-offs: 2, 5, 10)
• Re-compute and repeat until all VIFs are below the cut-off
• VIF = 2.5 for a variable means the R2 of this variable with the other
predictors is 0.6, which is fairly high
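The VIF steps above can be sketched with plain numpy, computing R2j by regressing each predictor on the others. The three-predictor dataset is simulated so that x1 and x2 are nearly collinear:

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on all the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent predictor
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 large; x3 near 1
```

With the VIF > 5 cut-off, x1 or x2 would be dropped first, then the VIFs re-computed.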
Problem
[Table residue: three datasets with lagged predictors At-1 and Pt-1; the values did not survive extraction.]
Advantages
• Simple to implement, and the output coefficients are easy to interpret.
Disadvantages
1. Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Drop the car_name column (not useful as a predictor)
mpg_df = mpg_df.drop('car_name', axis=1)
3. Replace the numbers in the categorical origin column with the actual region names
mpg_df['origin'] = mpg_df['origin'].replace({1: "America", 2: "Europe", 3: "Asia"})
4. Analyse the distribution of the columns
mpg_df.describe().transpose()
5. On inspecting the records, if we find "?" in the columns, replace it with NaN
mpg_df = mpg_df.replace('?', np.nan)
mpg_df[mpg_df.isnull().any(axis=1)]   # rows with missing values
6. Replace the missing values with the mean/median/mode, depending on the type of variable
mpg_df.median()   # inspect the medians before imputing
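Median imputation for a numeric column can be sketched end to end. The small frame below is a hypothetical stand-in for mpg_df, where the 'hp' column was read as text because of '?' markers:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for mpg_df
df = pd.DataFrame({'hp': ['130', '165', '?', '150'],
                   'mpg': [18.0, 15.0, 18.0, 16.0]})

df = df.replace('?', np.nan)                   # '?' -> missing
df['hp'] = df['hp'].astype(float)              # now numeric, with NaN
df['hp'] = df['hp'].fillna(df['hp'].median())  # median imputation
print(df['hp'].tolist())
```

The median is preferred over the mean here because it is robust to the skew and outliers discussed earlier.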
Bivariate Analysis
sns.pairplot(mpg_df_attr, diag_kind='kde')  # density curves on the diagonal
sns.pairplot(mpg_df_attr)                   # histograms on the diagonal