Passing Reference Multiple Regression
Dibyojyoti Bhattacharjee
Department of Statistics, Assam University, Silchar. Email: djb.stat@gmail.com
1. Introduction
Multiple regression is an extension of the bivariate regression technique: it includes several independent variables (also called predictors) instead of only one, as in bivariate regression. A multiple regression model is used when a relationship is to be established between a dependent variable Y and its potential predictors (X1, X2, X3, …). In general, the response (dependent) variable may be related to the k regressors (predictors) by the relation,
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon    …(1)
where,
β0 = the constant of regression, also called the intercept,
β1, β2, …, βk = the partial regression coefficients associated with the k independent variables,
ε = the residual of the regression, i.e. the error term associated with the regression model,
x1, x2, …, xk = the independent variables (also called regressors or predictors), and
y = the dependent or response variable.
The coefficient βj gives the expected change in the response variable (Y) for a unit change in xj when all the other regressors xi (i ≠ j) are held constant.
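As an illustration, the sketch below fits model (1) by ordinary least squares with NumPy. The data, coefficient values and variable names are simulated assumptions made purely for illustration, not taken from any dataset in this note.

# Fitting model (1) by ordinary least squares with NumPy.
# All data below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3                                   # 30 observations, 3 predictors
X = rng.normal(size=(n, k))                    # hypothetical regressors x1, x2, x3
y = 2.0 + 1.5*X[:, 0] - 0.8*X[:, 1] + 0.3*X[:, 2] + rng.normal(scale=0.5, size=n)

X_design = np.column_stack([np.ones(n), X])    # prepend a column of 1s for the intercept
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # estimates of beta_0, beta_1, ..., beta_k
print("Estimated coefficients:", np.round(b, 3))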
In assessing a multiple regression model, three questions arise:
Fit: Does the overall regression show any improvement with the inclusion of the new variable?
Parsimony: Does each predictor contribute significantly to the relationship?
Stability: Are the predictors strongly related to each other?
3.1 Coefficient of Determination (R²)
The most common measure of overall fit is the coefficient of determination, R².
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}    …(2)
The value of R² lies between 0 and 1. Multiplying it by 100 gives the percentage of variation in the response variable that is explained by the regressors.
3.2 Adjusted R²
As the number of predictors in the model increases, the value of the coefficient of determination R² keeps increasing. This can give the impression that the more predictors there are, the better the regression fit. To discourage this, an adjustment can be made to the R² statistic that penalizes the inclusion of useless predictors. The adjusted coefficient of determination for n observations and k predictors is,
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}    …(3)
For a given set of data, the adjusted R² is never larger than R². As predictors are added to the model, R² cannot decline and generally rises, but the adjusted R² may rise, remain the same or fall, depending on whether the added predictors increase R² sufficiently.
3.3 Standard Error
Another important measure of fit is the standard error (SE) of the regression, which is a function of the sum of squares of the residuals (SSE). For n observations and k predictors, SE is given by,
SE = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}    …(4)
If all predictions were perfect, SE would be equal to zero; thus, the smaller the SE, the better the regression fit. Whenever we report the results of a multiple regression we should spell out the estimated values of the parameters along with the values of R² and SE.
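The following sketch, again on simulated data, computes the three fit measures just described, R², the adjusted R² and the standard error of the regression, directly from equations (2), (3) and (4).

# Computing R-squared (2), adjusted R-squared (3) and the standard error (4)
# from the residuals of a simulated regression.
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k)])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.4, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)                 # residual (error) sum of squares
tss = np.sum((y - y.mean()) ** 2)              # total sum of squares

r2 = 1 - sse / tss                             # equation (2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # equation (3)
se = np.sqrt(sse / (n - k - 1))                # equation (4)
print(round(r2, 4), round(r2_adj, 4), round(se, 4))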
Statistically, the regression is possible as long as the number of observations (sample size) exceeds the number of predictors. But according to Evan’s rule, if n is the sample size and k is the number of predictors, then n/k ≥ 10 (at least 10 observations per predictor), while as per Doane’s rule n/k ≥ 5 (at least 5 observations per predictor).
i.e. Total sum of squares (TSS) = Sum of squares due to Regression (SSR) + Error sum of squares (SSE)
The ANOVA calculations can be summarized in the table as follows:
Source of variation   Sum of squares   df       Mean sum of squares   F value       p-value
Regression            SSR              k        MSR = SSR/k           F = MSR/MSE
Error                 SSE              n-k-1    MSE = SSE/(n-k-1)
Total                 TSS              n-1
The p-value depends on the calculated value of F, i.e. MSR/MSE. It is the probability of obtaining an F statistic at least as large as the one observed if the null hypothesis (H0) were true. If the p-value is less than the chosen significance level (generally 0.05), the null hypothesis is rejected, i.e. we accept that the sample gives reasonable evidence to believe that at least one of the regression coefficients is different from zero.
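As a worked illustration of the ANOVA arithmetic, the sketch below computes MSR, MSE, F and the p-value from assumed sums of squares; the SSR and SSE values are borrowed from the worked example that appears later in this note, so the results should come close to the F and Significance F reported there.

# ANOVA arithmetic: MSR, MSE, F and the p-value from assumed sums of squares.
from scipy.stats import f

n, k = 30, 3
ssr, sse = 0.09025, 0.0353                 # values taken from the worked example below
msr = ssr / k                              # mean square for regression
mse = sse / (n - k - 1)                    # mean square for error
F = msr / mse
p_value = f.sf(F, k, n - k - 1)            # upper-tail area of the F(k, n-k-1) distribution
print(round(F, 2), p_value)                # close to the F and Significance F shown below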
The significance of an individual regression coefficient is tested with the statistic
t = \frac{\hat{\beta}_j}{s_j} \sim t_{n-k-1}    …(6)
where sj is the standard error associated with the estimated regression coefficient β̂j.
¹ Koteswara Rao Kadiyala (1970). Testing for the independence of regression disturbances. Econometrica, 38, 97-117.
² In case the Data Analysis sub-menu is not present in the Data menu, it has to be activated by loading the Analysis ToolPak add-in from Excel’s add-in options.
The next table helps us obtain the regression equation, as it spells out the values of the partial regression coefficients. The P-value column helps us understand which of the regression coefficients are significant. From the table below we see that the regression equation is:
y = 0.1973 - 1.0444 X1 + 0.0033 X2 + 0.0035 X3    …(7)
Regression Statistics
Multiple R 0.847935052
R Square 0.718993852
Adjusted R Square 0.686570066
Standard Error 0.036832698
Observations 30
ANOVA
              df     SS        MS       F        Significance F
Regression    3      0.09025   0.0301   22.17    2.45×10⁻⁷
Residual      26     0.0353    0.0014
Total         29     0.1255
              Coefficients    Standard Error   t Stat     P-value
Intercept     0.197315072     0.270216157      0.730212   0.471789405
X1            -1.04441399     0.834357321      -1.25176   0.22180273
X2            0.00330776      0.001171418      2.823722   0.008988729
X3            0.00345843      0.000445547      7.762213   3.10002×10⁻⁸
From the P-value column we can also see that the variables X2 and X3 have a significant impact in the regression equation, as in both these cases the P-values are less than 0.05.
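For readers who prefer software other than Excel, a regression summary of the same layout can be obtained with Python’s statsmodels, as in the sketch below. The data frame here is simulated (the note’s original dataset is not reproduced), so the numbers will differ; only the form of the output is being illustrated.

# Reproducing an Excel-style regression summary with statsmodels on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
df = pd.DataFrame({"X1": rng.normal(size=30),
                   "X2": rng.normal(size=30),
                   "X3": rng.normal(size=30)})
df["Y"] = 0.2 - 1.0*df["X1"] + 0.5*df["X2"] + 0.3*df["X3"] + rng.normal(scale=0.4, size=30)

X = sm.add_constant(df[["X1", "X2", "X3"]])    # adds the intercept column
result = sm.OLS(df["Y"], X).fit()
print(result.summary())                        # R-squared, F test, coefficients, t stats, p-values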
A categorical predictor such as smoking status, with the three categories non-smoker, occasional smoker and chain smoker, is represented by two binary predictors, i.e. one less than the total number of categories. The binary predictors in this case shall be,
Non-smoker = 1 if the respondent is a non-smoker
= 0, otherwise
Occasional smoker = 1 if the respondent is an occasional smoker
= 0, otherwise
There is no need to include ‘chain smoker’ as a binary predictor, since ‘non-smoker = 0’ and ‘occasional smoker = 0’ together imply that the respondent is a chain smoker. However, it is better not to include many such multinomial variables in the regression equation, as each multinomial variable gives rise to several binary predictors that must be included in the regression model. Inclusion of several such variables may lead to violation of both Evan’s rule and Doane’s rule discussed earlier.
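A minimal sketch of constructing such binary predictors with pandas is given below; the smoking-status values are hypothetical.

# Turning a three-category smoking-status variable into two binary predictors.
import pandas as pd

df = pd.DataFrame({"smoking_status": ["non-smoker", "occasional smoker",
                                      "chain smoker", "non-smoker"]})
# drop_first drops the alphabetically first category, here 'chain smoker',
# which therefore serves as the reference category (both indicators equal 0).
dummies = pd.get_dummies(df["smoking_status"], drop_first=True).astype(int)
print(dummies)        # columns: 'non-smoker' and 'occasional smoker'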
11. Multicollinearity
When the independent variables of the regression model, X1, X2, …, Xk, are intercorrelated instead of being independent of each other, we have multicollinearity. If any two independent variables are correlated, we have collinearity. Multicollinearity inflates the variances of the estimated regression coefficients, and hence the confidence intervals for the true coefficients β1, β2, …, βk, making the t statistics less reliable. Thus, the role of each of the regression coefficients gets mixed up.
It is good practice to calculate the correlation coefficient between every pair of predictors and test whether the values are significant, i.e. whether the correlations differ from zero at α = 0.05 in a two-tailed test. If the correlation between any pair turns out to be significant, then only one of the two variables should be retained in the model. Klein’s rule states that when the correlation between a pair of predictors exceeds the multiple correlation coefficient (R), dropping one of the variables of the pair from the model may be suggested. However, the correlation coefficient looks only at pairwise relations between variables, whereas a general test of multicollinearity should look at more complex relationships among the predictors. For example, X2 might be a linear function of X1, X3 and X4 even though its pairwise correlations are not very large. The variance inflation factor (VIF) for each predictor provides a more comprehensive test. The VIF for the jth predictor is given by,
VIF_j = \frac{1}{1 - R_j^2}    …(14)
where Rj² is the coefficient of determination when predictor j is regressed on all the other predictors (excluding Y). The minimum value of VIF is 1. As a rule of thumb, any VIF > 10 may be considered strong evidence that the jth variable is related to the other k - 1 predictors. Even so, removing a relevant predictor from a model should be considered carefully, as it may lead to misspecification of the model.
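A sketch of computing the VIF of equation (14) with statsmodels is given below; the predictors are simulated, with X2 deliberately made a near-linear function of X1 so that the inflated VIFs are visible.

# Variance inflation factors (14) for simulated predictors X1, X2, X3.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
df = pd.DataFrame({"X1": rng.normal(size=30)})
df["X2"] = 0.8 * df["X1"] + rng.normal(scale=0.3, size=30)   # deliberately collinear with X1
df["X3"] = rng.normal(size=30)

X = sm.add_constant(df)                         # include the intercept in the design matrix
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                                     # X1 and X2 show clearly inflated values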
Autocorrelated errors lead to estimated standard errors, and hence confidence intervals, that are too narrow and t statistics that are too large. Thus, the model fit may be overstated. The most widely used test for the detection of autocorrelation is the Durbin-Watson test.
Here, H0: Errors are nonautocorrelated.
H1: Errors are autocorrelated.
The test statistic is given by,
DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
In general,
DW < 2 suggests positive autocorrelation (commonly seen)
DW ≈ 2 suggests no autocorrelation (ideal)
DW > 2 suggests negative autocorrelation (rare)
A way to get rid of first-order autocorrelation is to transform both variables by taking their first differences. Thus, we have,
\Delta x_t = x_t - x_{t-1}
\Delta y_t = y_t - y_{t-1}
The transformation reduces the number of observations by 1. Other possible methods of eliminating autocorrelated errors can be studied in books on econometrics.
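The sketch below computes the Durbin-Watson statistic both directly from its formula and via statsmodels, and then forms the first differences just described; the residuals and variables are simulated for illustration.

# Durbin-Watson statistic and the first-difference transformation, on simulated series.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
e = np.cumsum(rng.normal(size=30))              # artificially autocorrelated residuals e_t
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # direct use of the DW formula above
print(round(dw, 3), round(durbin_watson(e), 3)) # the two agree; both well below 2 here

x = rng.normal(size=30)
y = rng.normal(size=30)
dx, dy = np.diff(x), np.diff(y)                 # first differences; one observation is lost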
15. Standardized Residuals
Dividing each residual by its standard error gives the standardized residual. As a rule of thumb, a standardized residual whose absolute value exceeds 2 is unusual, and one whose absolute value is 3 or more would be considered an outlier.
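A small sketch of this rule of thumb follows; it uses the internally studentized residuals from statsmodels as the standardized residuals and flags observations beyond 2 and 3 in absolute value. The data and the planted unusual observation are, of course, artificial.

# Flagging unusual observations and outliers via studentized residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=30)
y = 1 + 2 * x + rng.normal(size=30)
y[0] += 8                                       # plant one unusual observation

fit = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = fit.get_influence().resid_studentized_internal
print(np.where(np.abs(std_resid) > 2)[0])       # unusual observations (|value| > 2)
print(np.where(np.abs(std_resid) > 3)[0])       # likely outliers (|value| >= 3)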
16. Outliers
Outliers cause several problems in regression, including loss of fit. An outlier may be an error in recording the data; in case of an error it is advisable to check and enter the correct value. But often it is difficult to identify such an error.
17. Variable Transformation
Sometimes a linear relationship between the response and the predictors cannot be assumed. Some variable transformation can then be performed to improve the fit. In the case of an exponential trend in the scatter plot, one can use a log transformation, as it reduces heteroscedasticity and improves the fit.
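A brief sketch of this idea: with a simulated exponential-looking response, regressing log(y) on x gives a noticeably better fit than regressing y on x directly.

# Log transformation of an exponential-looking response.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = np.linspace(1, 5, 40)
y = np.exp(0.8 * x + rng.normal(scale=0.2, size=40))   # exponential trend in y vs x

fit_linear = sm.OLS(y, sm.add_constant(x)).fit()        # y on x
fit_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()   # log(y) on x
print(round(fit_linear.rsquared, 3), round(fit_log.rsquared, 3))   # log model fits better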
18. Conclusion
Another interesting area of discussion is how the regression equation behaves when the assumptions on which it is based are violated. That is a rich area in its own right, introducing terms such as autocorrelation, multicollinearity, variance inflation factor and homoscedasticity, which have only been touched upon here; the discussion will be continued some other time. I hope this note helps young research scholars, students and teachers to understand multiple regression in a nutshell. The entire discussion is based on the chapter ‘Multiple Regression Analysis’ in the book ‘Applied Statistics in Business and Economics’ by Doane and Seward, published by Tata McGraw-Hill Publishing Company Limited, New Delhi, 2007, pages 558-603.