Lecture - Regression - Compatibility Mode

This document provides an overview of simple linear regression models. It defines the key components of a linear regression equation, including the dependent variable Y, independent variable X, intercept β0, slope β1, and error term ε. It explains how the intercept and slope are estimated to minimize the sum of squared errors between predicted and actual Y values. Various sums of squares are also defined, including total, explained, and unexplained sums of squares, which relate to the coefficient of determination R2. Examples are given to illustrate predicting a dependent variable like store traffic based on advertising spending.

Uploaded by

Zee

Regression Analysis

Simple Linear Regression Model: A Graphical Illustration

Yi = β0 + β1Xi + εi

Where:
• Yi = dependent variable
• Xi = independent variable
• β0 = model parameter that represents the mean value of the dependent variable (Y) when the independent variable (X) is zero (the intercept)
• β1 = model parameter that represents the slope: the change in the mean value of the dependent variable associated with a one-unit increase in the independent variable
• εi = error term that describes the effects on Yi of all factors other than the value of Xi


Residual Value

• Model: Yi = β0 + β1Xi + εi; fitted line: ŷi = b0 + b1xi
• The residual is the difference between the actual and predicted values:
  ei = yi - ŷi = yi - (b0 + b1xi)
• b0 and b1 are chosen to minimize the residual sum of squares (SSE):
  SSE = Σ ei² = Σ (yi - ŷi)² = Σ (yi - (b0 + b1xi))²

Sum of Squares

• SST (total variation), Σ (yi - ȳ)²: the sum of squared prediction errors that would be obtained if we do not use x to predict y
• SSE (unexplained variation), Σ (yi - ŷi)²: the sum of squared prediction errors obtained when we use x to predict y
• SSM (explained variation), Σ (ŷi - ȳ)²: the reduction in the sum of squared prediction errors accomplished by using x to predict y
• Decomposition: Σ (yi - ȳ)² = Σ (yi - ŷi)² + Σ (ŷi - ȳ)²

Coefficient of Determination (r²)

r² = SSM / SST = Explained Variation / Total Variation

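The least-squares estimates b0 and b1 and the r² decomposition can be sketched in plain Python. This is a minimal illustration with made-up data, not the lecture's store-traffic sample.

```python
def fit_simple_ols(x, y):
    """Least-squares estimates: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(x, y, b0, b1):
    """r2 = SSM / SST, using the decomposition SST = SSE + SSM."""
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
    ssm = sst - sse                                        # explained variation
    return ssm / sst
```

For a perfectly linear sample such as x = [1, 2, 3, 4], y = [3, 5, 7, 9], the fit gives b0 = 1, b1 = 2, and r² = 1.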
Regression Analysis -- Example: Store Traffic Versus Advertising

Predict how the store traffic would change (e.g., increase by how many people) with a $1,000 increase in advertising.

[Figure: scatter plot of store traffic (0 to 2,000 consumers) against advertising dollars (0 to 1,000).]

-- Quantify the degree of linear association. If we could observe the store traffic (y) and ad spending (x) for all 200 stores: (x1,y1), (x2,y2), ..., (x200,y200), we could fit a regression line for the entire population. The linear equation might be hypothesized as:

Yi = β0 + β1*Xi + εi

where Yi = number of consumers visiting store i (dependent variable), Xi = ad spending in $1,000s for store i (independent variable), β0 = store traffic without advertising, β1 = change of store traffic given a $1,000 increase in advertising, and εi = other factors affecting y.

-- But instead we have data from a sample of 20 stores, from which we can perform sample linear regression: identify the straight line that most of the sample points fall upon!

Testing the Significance of Independent Variables

• H0: There is no linear relationship between the independent and dependent variables (β1 = 0)
• Ha: There is a linear relationship between the independent and dependent variables (β1 ≠ 0)

Plot Options for Predicted Values and Residuals

Residual = Observed - Predicted

1. *ZPRED: the standardized predicted values of the dependent variable.
2. *ZRESID: the standardized residuals.
3. *DRESID: deleted residuals; the residual for a case when that case is excluded from the regression computations.
4. *ADJPRED: adjusted predicted values; the predicted value for a case when that case is excluded from the regression computations.
5. *SRESID: studentized residuals.
6. *SDRESID: studentized deleted residuals.


The Residual Plot

Improving Your Model: Assessing the Impact of an Outlier

Outliers are observations with large residuals (large deviations of the actual score from the predicted score). Note that both the red and blue lines in the plot represent the distance of the outlier from the regression line at a particular value of enroll.

Check samples 14 and 17.


Interpret Regression Output

• r²: the proportion of total variation in y (store traffic) explained by the model Yi = β0 + β1Xi (ad).
• Estimate of β0: without any ad spending, on average about 149 people would visit a store on Saturday.
• Estimate of β1: on average, about 1.5 more consumers would visit a store if ad spending goes up by $1,000. Use the estimate b1 from the sample to infer the value of β1 in the population.

• H0: β1 = 0 (no ad effect on store traffic in the population)
• Ha: β1 ≠ 0

t-stat = (b1 - 0) / S_b1 = 1.5408 / 0.2130 = 7.234

Alternative way of judgment: the P-value. E.g., if you set α = 0.05, you can reject H0 here.

Fitted equation: Yi = 148.64 + 1.5 * Xi + ei
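The t-statistic for testing H0: β1 = 0 follows directly from the slope estimate and its standard error reported in the output:

```python
# t-statistic for H0: beta1 = 0, using the values reported in the lecture output.
b1 = 1.5408      # estimated slope
se_b1 = 0.2130   # standard error of the slope
t_stat = (b1 - 0) / se_b1   # approximately 7.234
```

With 18 degrees of freedom (n = 20 stores, two estimated parameters), the two-sided critical value at α = 0.05 is about 2.10, so a t-statistic of 7.234 clearly rejects H0.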
Multiple Regression

• A linear combination of predictor variables is used to predict the outcome (response) variable.
• The general form of the multiple regression model:
  Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
  where β1, β2, ..., βk are regression coefficients associated with the independent variables X1, X2, ..., Xk, and ε is the error term.
• The prediction equation:
  Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
  where Ŷ is the predicted Y score and b1, ..., bk are the (partial) regression coefficients.

Evaluating the Importance of Independent Variables

• Step 1: Consider the t-value for each βi.
• Step 2: Use standardized (beta) coefficients when the independent variables are in different units of measurement:
  Standardized[βk] = βk * (S_Xk / S_Y)
• Check for multicollinearity ("collinearity diagnostics" in SPSS): a problem is indicated when tolerance is less than 0.1, or VIF is larger than 10.
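The standardization in Step 2 can be sketched as follows; `statistics.stdev` is the sample standard deviation, and the data passed in are purely illustrative.

```python
import statistics

def standardized_beta(b_k, x_k, y):
    """Standardized[beta_k] = b_k * (S_Xk / S_Y)."""
    return b_k * (statistics.stdev(x_k) / statistics.stdev(y))
```

Because the result is unit-free, standardized betas let you compare the relative importance of predictors measured on different scales.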


Stepwise Regression

• Forward addition
   Start with no predictor variables in the regression equation, i.e. y = β0 + ε
   Add variables if they meet certain criteria in terms of the F-ratio
• Backward elimination
   Start with the full regression equation, i.e. y = β0 + β1x1 + β2x2 + ... + βrxr + ε
   Remove predictors based on the F-ratio
• Stepwise method
   The forward addition method is combined with removal of predictors that no longer meet the specified criteria at each step

ANOVA Is Linear Regression!

A categorical variable with more than two groups, e.g. groups 1, 2, and 3 (mutually exclusive):

y = β0 (value for group 1) + β1 (1 if in group 2) + β2 (1 if in group 3)

This is called "dummy coding": multiple binary variables are created to represent being in each category (or not) of a categorical variable.
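A minimal sketch of dummy coding for the three-group example above, with group 1 as the reference category:

```python
def dummy_code(group):
    """Map membership in group 1, 2, or 3 to indicators (d2, d3).

    Group 1 is the reference category, so it is coded (0, 0);
    its mean is absorbed into the intercept beta0.
    """
    return (1 if group == 2 else 0, 1 if group == 3 else 0)
```

Note that only two dummies are needed for three groups; including a third would duplicate the intercept and make the model unestimable.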

Regression with Dummy Variables

Y = β0 + β1D1 + β2D2 + β3D3 + β4X + ε

• For the rational buyer (reference group): Ŷ = a + b4X
• For brand-loyal consumers: Ŷ = a + b1 + b4X
• For variety seekers: Ŷ = a + b2 + b4X
• For impulse buyers: Ŷ = a + b3 + b4X

E.g. 2: Store Sales-Price-Advertising Data

Ultimate interest: the population linear association (e.g., to predict the change of unit sales with the changes in unit price and advertising):

Yi (sales) = β0 + β1 * x1i (price) + β2 * x2i (ad) + εi

StoreID  UnitsSold  Price($)  Advertise (1/0)
1        420        5.5       0
2        380        6         0
3        350        6.5       0
4        400        6         0
5        380        5         0
6        450        6.5       1
7        420        4.5       0
8        550        5         1
9        525        5.5       1
10       425        5.5       0
11       475        5         0
12       395        6.5       1
13       390        6.75      1
14       425        4.5       0


SPSS Output

• A larger "R Square" implies a better fit of the model to the sample data. Usually R Square should be at least 0.5 for the linear model to be considered adequate.
• The "Adjusted R Square" is used for regressions with more than one independent variable.
• A smaller "Significance F" implies that the overall regression model is more likely to be meaningful. On the other hand, if "Significance F" exceeds 0.05 (p > 0.05), the entire regression model is neither significant nor meaningful!

Estimated equation: sales = 637.47 - 40.95 * unit price + 61.84 * advertising

Is it so in the population under study?

• H1: There is NO price effect in the population of 200 stores. Rejected.
• H2: There is NO advertising effect in the population of 200 stores. Rejected.
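The estimated equation can be applied directly; this sketch simply plugs a unit price and the 0/1 advertising dummy into the fitted coefficients from the output above.

```python
def predicted_sales(unit_price, advertising):
    """Fitted equation from the SPSS output; advertising is a 0/1 dummy."""
    return 637.47 - 40.95 * unit_price + 61.84 * advertising
```

For example, at a $5.50 price without advertising the predicted unit sales are about 412, and running advertising adds 61.84 units at any given price.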
Adjust the Original Regression Model if Necessary

Yi (sales) = β0 + β1 * x1i (unit price) + β2 * x2i (advertising) + εi

• R Square (adjusted) = 0.470
• Significance of model = 0.002
• P-value for Unit Price = 0.004
• P-value for Advertising = 0.004

Therefore, the original linear model seems fine, suggesting that the change of unit sales is mainly subject to a negative influence from price and a positive influence from advertising. A doubt could come from the adjusted R-square (= 0.470), which might call for the researchers to collect additional and more detailed data for analysis (for example, getting the exact dollar amount of ad spending from each store that runs advertisements).

Summary of Multiple Regression Output

• Diagnostic outputs (R-square, Adjusted R-square, Significance F) indicate the general fit of the linear equation model to the sample data.
• Key outputs (coefficients, t-stat, P-value) indicate the significance of each independent variable in the regression model, and its marginal effect on changing the dependent variable in the population under study.
• Together, the output of multiple regression helps researchers decide the necessary adjustments to the model before providing information to the managers.

Summary

• T-test: compares means between two independent groups.
• ANOVA: compares means among more than two independent groups.
• Pearson's correlation coefficient (linear correlation): shows the linear correlation between two continuous variables.
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes.

Practice the examples!
