File4-Session3-Introduction To Regression

This document provides an introduction to linear regression. It discusses simple linear regression, in which one independent variable predicts a dependent variable. The key aspects covered include:
• The simple linear regression equation y = β0 + β1x + u
• Examples of dependent and independent variables
• Determining the regression line coefficients through least squares analysis
• Measures of variation: total, regression, and error sums of squares
• The coefficient of determination (r²), which indicates how much of the variation in the dependent variable is explained by the independent variable
• The standard error of the estimate, which measures variation around the regression line


Introduction to Regression
Introduction to linear regression

• Simple linear regression
• Multiple regression
• Finding relationships between dependent and independent variables
• Introduction to SPSS
Simple linear regression

y = β0 + β1x + u

y = dependent variable
x = independent variable
β0 = y-intercept of the line (constant); where the line cuts the y-axis
β1 = unknown parameter; slope of the line
u = random error component
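The population model above can be illustrated with a short simulation. This is a sketch with made-up parameter values (β0 = 2, β1 = 0.5), not data from the session:

```python
import numpy as np

# Hypothetical parameters for illustration only
beta0, beta1 = 2.0, 0.5
rng = np.random.default_rng(0)

x = rng.uniform(0, 10, size=200)   # independent variable
u = rng.normal(0, 1, size=200)     # random error component
y = beta0 + beta1 * x + u          # population model: y = b0 + b1*x + u

# With enough data, a least squares fit recovers values near the truth
b1_hat, b0_hat = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept]
```

The error term u is why the observed points do not lie exactly on the line; the fitted coefficients only approximate β0 and β1.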
Terminology for simple regression
y X
Dependent variable Independent variable
Explained variable Explanatory variable
Response variable Control variable
Predicted variable Predictor variable
Regressand Regressor
Examples

Sleepinghours = β0 + β1·Sporthours + u
Turnover = β0 + β1·Advertising + u
Score = β0 + β1·Attend + u

With a sample from the population, let (xi, yi): i = 1, …, n denote a random sample of size n from the population:

yi = β0 + β1xi + ui

File: dataspss-s4.1
DETERMINING THE EQUATION OF THE REGRESSION LINE

• Deterministic regression model: a mathematical model that produces an 'exact' output for a given input

  ŷ = β0 + β1x

• Probabilistic regression model: a model that includes an error term, allowing various output values to occur for a given input value

  yi = β0 + β1xi + εi

ŷ = predicted value of y
xi = value of the independent variable for the i-th observation
yi = actual value of the dependent variable for the i-th observation
β1 = population slope
β0 = population intercept
εi = error of prediction for the i-th observation
Simple Linear Regression Model

Yi = β0 + β1Xi + εi

[Figure: scatter plot of Y against X showing the regression line with intercept β0 and slope β1; for a given Xi, the observed value of Y lies a random error εi away from the predicted value on the line.]
Sample Regression Function (SRF)

SRF: ŷi = b0 + b1xi

ŷi = estimated value of Y for observation i
xi = value of X for observation i
b0 = Y-intercept: the value of Y when X is zero
b1 = slope of the regression line: the change in Y for a one-unit change in X

b1 > 0: the line goes up; positive relationship between X and Y
b1 < 0: the line goes down; negative relationship between X and Y
SIMPLE LINEAR REGRESSION MODEL (sample)

Simple linear regression:

SRF: ŷi = b0 + b1xi

where

b1 = SSxy / SSxx    and    b0 = ȳ − b1x̄

SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
SSxx = Σ(x − x̄)² = Σx² − (Σx)²/n
SSyy = Σ(Y − Ȳ)²
Sample Regression Function (SRF) (continued)

• b0 and b1 are obtained by finding the values that minimize the sum of the squared residuals (minimize the error). This process is called least squares analysis:

  Σ(i=1..n) (Yi − Ŷi)² = Σ(i=1..n) ei²

Yi = actual value of Y for observation i
Ŷi = predicted value of Y for observation i
ei = residual (error)

• b0 provides an estimate of β0
• b1 provides an estimate of β1
RESIDUAL ANALYSIS

ei = Yi − Ŷi

The residuals sum to zero: Σ(i=1..n) ei = 0
Simple example

xi | yi | xi − x̄ | (xi − x̄)yi | (xi − x̄)²
1  | 1  | −2     | −2         | 4
2  | 1  | −1     | −1         | 1
3  | 2  | 0      | 0          | 0
4  | 2  | 1      | 2          | 1
5  | 4  | 2      | 8          | 4
x̄ = 3 | ȳ = 2 |  | Total = 7 | Total = 10

X = experience (years)
Y = income (10 million VND)

b1 = 7/10 = 0.7 and b0 = ȳ − b1x̄ = 2 − 0.7 × 3 = −0.1, so the fitted line is

ŷ = −0.1 + 0.7x

(x is experience. Fitted to the data, the equation means that one extra year of experience raises income by 7 million VND.)
[Figure: scatter plot of the five data points with the fitted line ŷ = −0.1 + 0.7x; x-axis from 1 to 5, y-axis from 1 to 4.]
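The hand computation above can be checked in a few lines of Python using the SSxy/SSxx formulas from the earlier slide (a sketch, assuming numpy is available):

```python
import numpy as np

# Data from the worked example: x = experience (years), y = income (10M VND)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)

SSxy = np.sum((x - x.mean()) * (y - y.mean()))   # = 7
SSxx = np.sum((x - x.mean()) ** 2)               # = 10

b1 = SSxy / SSxx               # slope: 7/10 = 0.7
b0 = y.mean() - b1 * x.mean()  # intercept: 2 - 0.7*3 = -0.1
```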
Result of estimation by SPSS

Degrees of freedom:
• Regression: k
• Residual: n − k − 1
• Total: n − 1

ŷ = −0.1 + 0.7x
Meaning of b0 and b1

• Y-intercept (b0)
  • The average value of income (Y) is −0.1 (×10 million VND) when experience (X) is 0 years.

• Slope (b1)
  • Income (Y) is expected to increase by 0.7 (×10 million VND) for each one-year increase in experience.

b1 > 0: the line goes up; positive relationship between X and Y (increase)
b1 < 0: the line goes down; negative relationship between X and Y (decrease)
Measures of Variation

Total variation is made up of two parts:

SST = SSR + SSE

• SST (total sum of squares) = Σ(Yi − Ȳ)²
  Measures the variation of the Yi values around their mean Ȳ.
• SSR (regression sum of squares) = Σ(Ŷi − Ȳ)²
  Explained variation attributable to the relationship between X and Y.
• SSE (error sum of squares) = Σ(Yi − Ŷi)²
  Variation attributable to factors other than the relationship between X and Y.

Note: another notation for SSyy is SST; they are the same.
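The SST = SSR + SSE decomposition can be verified numerically for the session's worked example (x = experience, y = income, ŷ = −0.1 + 0.7x); a short sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)
y_hat = -0.1 + 0.7 * x                 # fitted values from the example

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares      = 6.0
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares = 4.9
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares      = 1.1
# the decomposition holds: SST == SSR + SSE
```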


Measures of Variation: The Sum of Squares (continued)

[Figure: for a point (Xi, Yi), the vertical distance to the mean Ȳ splits into two pieces around the fitted line Ŷi = b0 + b1Xi: SSE = Σ(Yi − Ŷi)², the distance from the point to the line, and SSR = Σ(Ŷi − Ȳ)², the distance from the line to Ȳ; SSyy = Σ(Yi − Ȳ)² is the total.]
Result of estimation by SPSS

Coefficient of Determination, r²

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r²:

r² = SSR / SST = regression sum of squares / total sum of squares

Note: 0 ≤ r² ≤ 1
The Coefficient of Determination r² and the Coefficient of Correlation r

r² = SSR / SSyy = b1²·SSxx / SSyy,   0 ≤ r² ≤ 1

r² = coefficient of determination: measures the % of variation in Y that is explained by the independent variable X in the regression model.

r = ±√r²,   −1 ≤ r ≤ 1

r = coefficient of correlation: measures how strong the relationship is between X and Y.

r > 0 if b1 > 0
r < 0 if b1 < 0
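Using the worked example's sums of squares (SSR = 4.9, SSyy = SST = 6, SSxx = 10, b1 = 0.7), both formulas for r² agree, and r takes the sign of b1; a quick check:

```python
import math

# Quantities from the worked example
SSR, SSyy, SSxx, b1 = 4.9, 6.0, 10.0, 0.7

r2 = SSR / SSyy                       # coefficient of determination ~ 0.817
r2_alt = b1 ** 2 * SSxx / SSyy        # same value via the slope formula
r = math.copysign(math.sqrt(r2), b1)  # correlation ~ 0.904, sign matches b1
```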
Examples of Approximate r² Values

• r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
• r² = 0: no linear relationship between X and Y; the value of Y does not depend on X (none of the variation in Y is explained by variation in X).
• 0 < r² < 1: weaker linear relationships between X and Y; some, but not all, of the variation in Y is explained by variation in X.
Standard Error of the Estimate

The standard deviation of the variation of observations around the regression line is estimated by:

SYX = √(SSE / (n − 2)) = √(Σ(Yi − Ŷi)² / (n − 2))

where
SSE = error sum of squares
n = sample size
Result of estimation by SPSS

SYX = √(SSE / (n − 2)) = √(1.1 / (5 − 2)) ≈ 0.6055
Inferences About the Slope

The standard error of the regression slope coefficient (b1) is estimated by:

Sb1 = SYX / √SSX = SYX / √(Σ(Xi − X̄)²)

where
Sb1 = estimate of the standard error of the least squares slope
SYX = √(SSE / (n − 2)) = standard error of the estimate
Result of estimation by SPSS

Sb1 = 0.1914854
Inference About the Slope: t Test

t test for a population slope: is there a linear relationship between X and Y?

Null hypothesis (H0) and alternative hypothesis (H1):
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

Test statistic, with d.f. = n − 2:

t = (b1 − β1) / Sb1

where b1 = regression slope coefficient, β1 = hypothesized slope, Sb1 = standard error of the slope

Result of estimation by SPSS:

t = (b1 − β1) / Sb1 = (0.7 − 0) / 0.1914854 ≈ 3.66
Inferences About the Slope: t Test Example

H0: β1 = 0    H1: β1 ≠ 0
Test statistic: t = 3.66
d.f. = 5 − 2 = 3,   α/2 = 0.025
Critical values: ±3.182 (from t tables)

Decision: 3.66 > 3.182, so reject H0.
Conclusion: there is sufficient evidence that years of experience affect income.
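The t statistic can be reproduced from the example's summary numbers; a sketch (the critical value 3.182 is read from the t table, as on the slide):

```python
import math

n = 5         # sample size
SSE = 1.1     # error sum of squares
SSxx = 10.0   # sum of squared deviations of x
b1 = 0.7      # estimated slope

S_YX = math.sqrt(SSE / (n - 2))   # standard error of the estimate ~ 0.6055
S_b1 = S_YX / math.sqrt(SSxx)     # standard error of the slope    ~ 0.1915
t = (b1 - 0) / S_b1               # test statistic ~ 3.66

t_crit = 3.182                # two-sided t-table value, d.f. = 3, alpha = 0.05
reject_H0 = abs(t) > t_crit   # True: the slope is significantly non-zero
```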
F Test for Significance

F test statistic:

F = MSR / MSE,   where MSR = SSR / k and MSE = SSE / (n − k − 1)

F follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom.
k = the number of independent (explanatory) variables in the regression model.
Result of estimation by SPSS

F = MSR / MSE = 4.9 / 0.3666667 ≈ 13.36
F Test for Significance Example

df1 = k = 1,   df2 = n − k − 1 = 5 − 1 − 1 = 3

H0: β1 = 0    H1: β1 ≠ 0    α = 0.05
Test statistic: F = MSR / MSE = 13.36
Critical value: F0.05(1, 3) = 10.128

Conclusion: 13.36 > 10.128, so reject H0 at α = 0.05. There is sufficient evidence that years of experience affect income.
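The F statistic can be reproduced the same way; note that for simple regression (k = 1) the F test and the t test agree, since F = t². A sketch:

```python
n, k = 5, 1
SSR, SSE = 4.9, 1.1

MSR = SSR / k             # mean square regression = 4.9
MSE = SSE / (n - k - 1)   # mean square error ~ 0.3667
F = MSR / MSE             # test statistic ~ 13.36

F_crit = 10.128           # F table value, df1 = 1, df2 = 3, alpha = 0.05
reject_H0 = F > F_crit    # True: the model is significant
```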
Introduction to SPSS (file: dataspss-s4.1)
Result of estimation by SPSS

Reading the results:
• R-squared ranges in value between 0 and 1.
• R² = 0: the model does nothing to help explain the variance in y.
• R² = 1: all sample points lie on the estimated regression line.
• Example: R² = 0.93 implies that the regression equation explains 93% of the variation in the dependent variable.

Sig. (significance): goodness of fit only if
• Sig. of a coefficient < 0.01 → significant at 1%; H0 is rejected.
• 0.01 ≤ Sig. < 0.05 → significant at 5%; H0 is rejected.
• 0.05 ≤ Sig. < 0.10 → significant at 10%; H0 is rejected (Sig. ≥ 0.10 → not significant).
Introduction to multiple regression

• Multiple regression
• Finding relationships between the dependent variable and the independent variables
• Dummy variables included
• Solutions and SPSS

Linear regression:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

y = dependent (or response) variable
x1, x2, …, xn = independent or predictor variables
β0 = y-intercept of the line (constant); where the line cuts the y-axis
β1, β2, …, βn = unknown parameters; slopes
ε = random error component

Example:

wage = β0 + β1educ + β2exper + u
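The coefficients of a multiple regression can be estimated by ordinary least squares. A sketch using invented wage data (the names educ and exper follow the slide, but the parameter values and observations are made up for illustration, not taken from dataspss-s4.2):

```python
import numpy as np

# Invented data generated from: wage = 1.0 + 0.6*educ + 0.1*exper + u
rng = np.random.default_rng(1)
n = 300
educ = rng.uniform(8, 18, n)    # years of education
exper = rng.uniform(0, 30, n)   # years of experience
u = rng.normal(0, 1, n)         # random error component
wage = 1.0 + 0.6 * educ + 0.1 * exper + u

# OLS: a column of ones gives the intercept; solve the least squares problem
X = np.column_stack([np.ones(n), educ, exper])
beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)
b0, b1, b2 = beta_hat   # estimates should land near 1.0, 0.6, 0.1
```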


Terminology for multiple regression

y                      x1, x2, …, xk
Dependent variable Independent variable
Explained variable Explanatory variable
Response variable Control variable
Predicted variable Predictor variable
Regressand Regressor
Example

• File: dataspss-s4.2
• Dependent variable?
• Independent variables?
• SPSS program
• Estimate and discuss
Think?

• Survey conducted with these variables:
  • Income
  • Age
  • Years of working experience
  • Education
  • Gender
  • ...

Think about which are the dependent and the independent variables.
Regression with dummy independent variables

• Independent variable: Gender (1 = female, 0 = male)
• If the estimated coefficient of gender is positive, the dependent variable is higher, on average, for females.
• If the estimated coefficient of gender is negative, the dependent variable is higher, on average, for males.
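The interpretation of a dummy coefficient can be seen in a tiny example with invented income values: with a single 0/1 regressor, the OLS intercept equals the male (0-group) mean and the slope equals the female-minus-male difference in means:

```python
import numpy as np

gender = np.array([1, 1, 1, 0, 0, 0], dtype=float)   # 1 = female, 0 = male
income = np.array([5.0, 6.0, 7.0, 4.0, 5.0, 6.0])    # invented values

X = np.column_stack([np.ones_like(gender), gender])
(b0, b1), *_ = np.linalg.lstsq(X, income, rcond=None)
# b0 = mean income of males (5.0)
# b1 = female mean - male mean (6.0 - 5.0 = 1.0): positive, so the
#      dependent variable is higher on average for females
```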
Samples of hypotheses

• An increase in education does not cause a rise in earnings.
• People's earnings are not positively influenced by their age.
• There is no significant relationship between earnings and gender.
Adjusted R²

Adjusted R-square identifies a good regression model once more variables are added. The higher the adjusted R-square, the better the model:

r²adj = 1 − (1 − r²) × (n − 1) / (n − k − 1)

(where n = sample size, k = number of independent variables)

• Helps control the number of independent variables added: it penalizes the excessive use of unimportant independent variables.
• Adjusted R-square is always smaller than r².
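The adjusted R² formula translates directly into code; a sketch using the simple regression example's numbers (r² = 4.9/6, n = 5, k = 1):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square: penalizes adding unimportant independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 4.9 / 6                         # ~ 0.817
r2_adj = adjusted_r2(r2, n=5, k=1)   # ~ 0.756, always below r2
```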
Collinearity
Collinearity: High correlation exists among two or more independent variables.
This means the correlated variables contribute redundant information to the
multiple regression model.
Including two highly correlated independent variables can adversely affect
the regression results.
No new information provided:
• Can lead to unstable coefficients (large standard error and low t-values).
• Coefficient signs may not match prior expectations.
Some Indications of Strong Collinearity

• Incorrect signs on the coefficients.
• Large change in the value of a previous coefficient when a new variable is added to the model.
• A previously significant variable becomes non-significant when a new independent variable is added.
• The estimate of the standard deviation of the model increases when a variable is added to the model.
Measuring Collinearity: Variance Inflationary Factor

The variance inflationary factor VIFj can be used to measure collinearity (available via the PHStat program):

VIFj = 1 / (1 − R²j)

where R²j is the coefficient of multiple determination of independent variable Xj with all other X variables.

• If VIFj = 1, Xj is uncorrelated with the other Xs.
• If VIFj > 10, Xj is highly correlated with the other Xs (a conservative rule reduces this to VIFj > 5).
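VIFj can be computed exactly as defined: regress each Xj on the remaining predictors and take 1/(1 − R²j). A sketch with invented data in which two predictors are nearly collinear:

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 - R2_j), R2_j from regressing X_j on the other columns."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2_j = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                   # unrelated predictor
vifs = vif(np.column_stack([x1, x2, x3]))   # vifs[0], vifs[1] large; vifs[2] ~ 1
```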
Section Summary
• Developed the multiple regression model.
• Tested the significance of the multiple regression model.
• Discussed r2, adjusted r2 and overall F test.
• Discussed using residual plots to check model assumptions.
• Tested individual regression coefficients.
• Used dummy variables.
• Evaluated interaction effects.
• Evaluated collinearity.
Regression and collinearity

• Choose Statistics to open the multicollinearity checks.
• Select Collinearity diagnostics.

Typical result from SPSS (dependent variable: Satisfaction):

• VIF > 10 → multicollinearity
• Tolerance < 0.1 → multicollinearity
Group assignment

• Check the database for the group assignment
• Develop a general regression model (multiple regression)
• Develop hypotheses
• Test the regression model, check collinearity, and write out the estimated regression model
• Present the results of hypothesis testing
• Develop possible solutions and think of solution ranking