Unit 3 Regression Models
Time series linear model
Multiple regression and forecasting
us_change %>%
  pivot_longer(c(Consumption, Income), names_to = "Series") %>%
  autoplot(value) +
  labs(y = "% change")
[Figure: time plot of quarterly % changes in US Consumption and Income]
us_change %>%
  ggplot(aes(x = Income, y = Consumption)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)")
[Figure: scatterplot of Consumption against Income with fitted regression line]
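The regression summary that follows can be reproduced by fitting a simple time series linear model; a minimal sketch, assuming the fpp3 package (which supplies TSLM() and the us_change data):

```r
library(fpp3)

# Fit consumption on income with a time series linear model
# and print the regression summary
fit_cons <- us_change %>%
  model(TSLM(Consumption ~ Income))
report(fit_cons)
```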
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.582 -0.278 0.019 0.323 1.422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5445 0.0540 10.08 < 2e-16 ***
## Income 0.2718 0.0467 5.82 2.4e-08 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.591 on 196 degrees of freedom
## Multiple R-squared: 0.147, Adjusted R-squared: 0.143
## F-statistic: 33.8 on 1 and 196 DF, p-value: 2e-08
Example: US consumption expenditure

[Figure: time plots of quarterly % changes in US Consumption, Income, Production, Savings and Unemployment, plotted against Quarter]
Example: US consumption expenditure

[Figure: scatterplot matrix of the five series; pairwise correlations below]

              Income     Production   Savings     Unemployment
Consumption   0.384***   0.529***     −0.257***   −0.527***
Income                   0.269***     0.720***    −0.224**
Production                            −0.059      −0.768***
Savings                                            0.106
Example: US consumption expenditure
fit_consMR <- us_change %>%
  model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit_consMR)
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.906 -0.158 -0.036 0.136 1.155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25311 0.03447 7.34 5.7e-12 ***
## Income 0.74058 0.04012 18.46 < 2e-16 ***
## Production 0.04717 0.02314 2.04 0.043 *
## Unemployment -0.17469 0.09551 -1.83 0.069 .
## Savings -0.05289 0.00292 -18.09 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.31 on 193 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.763
## F-statistic: 160 on 4 and 193 DF, p-value: <2e-16
Example: US consumption expenditure

[Figure: time plot of actual (Data) and Fitted consumption]

[Figure: scatterplot of Fitted (predicted values) against Data (actual values)]
Example: US consumption expenditure

[Figure: residual diagnostics: residual time plot, ACF of residuals, histogram of residuals]
Some useful predictors
Trend
Linear trend
xt = t
• t = 1, 2, . . . , T
• Strong assumption that trend will continue.
Dummy variables
• If there are more than two categories, then the variable can be
coded using several dummy variables (one fewer than the total
number of categories).
Beware of the dummy variable trap!
• Using one dummy for each category gives too many dummy
variables!
• The regression will then be singular and inestimable.
• Either omit the constant, or omit the dummy for one category.
• The coefficients of the dummies are relative to the omitted
category.
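Base R avoids the trap automatically when a factor enters a model formula; a small illustration (not from the slides) showing that one level becomes the omitted baseline:

```r
# A 4-level quarter factor yields an intercept plus 3 dummies, not 4
q <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 2))
X <- model.matrix(~ q)
colnames(X)  # "(Intercept)" "qQ2" "qQ3" "qQ4" -- Q1 is the omitted baseline
ncol(X)      # 4 columns, so the design matrix stays full rank
```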
Uses of dummy variables
Seasonal dummies
• For quarterly data: use 3 dummies
• For monthly data: use 11 dummies
• For daily data: use 6 dummies
• What to do with weekly data?
Outliers
• If there is an outlier, you can use a dummy variable
to remove its effect.
Public holidays
• For daily data: if it is a public holiday, dummy=1,
otherwise dummy=0.
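One answer for weekly data (seasonal period ≈ 52.18, so 51 dummies would be unwieldy) is a small number of Fourier terms; a sketch assuming fpp3's fourier() special and a hypothetical weekly tsibble my_weekly with a value column:

```r
library(fpp3)

# `my_weekly` and `value` are placeholder names; K sets the number of
# Fourier sin/cos pairs used to approximate the seasonal pattern
fit <- my_weekly %>%
  model(TSLM(value ~ trend() + fourier(K = 6)))
```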
Beer production revisited

[Figure: Australian quarterly beer production (megalitres)]

Regression model
yt = β0 + β1 t + β2 d2,t + β3 d3,t + β4 d4,t + εt
where di,t = 1 if t is in quarter i and 0 otherwise.
Beer production revisited
fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.9 -7.6 -0.5 8.0 21.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 441.8004 3.7335 118.33 < 2e-16 ***
## trend() -0.3403 0.0666 -5.11 2.7e-06 ***
## season()year2 -34.6597 3.9683 -8.73 9.1e-13 ***
## season()year3 -17.8216 4.0225 -4.43 3.4e-05 ***
## season()year4 72.7964 4.0230 18.09 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924, Adjusted R-squared: 0.92
## F-statistic: 211 on 4 and 69 DF, p-value: <2e-16
Beer production revisited
augment(fit_beer) %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "Megalitres", title = "Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))
[Figure: actual and fitted beer production over time; second panel plots fitted values against actual values, coloured by quarter]
Beer production revisited

[Figure: residual diagnostics: residual time plot, ACF of residuals, histogram of residuals]
Beer production revisited

[Figure: forecasts of beer production with 80% and 95% prediction intervals]
Intervention variables
Spikes
Steps
Change of slope
Holidays
Trading days
z1 = # Mondays in month;
z2 = # Tuesdays in month;
⋮
z7 = # Sundays in month.
Distributed lags
Nonlinear trend

Piecewise linear trend with a bend (knot) at time τ:

x1,t = t
x2,t = { 0,      t < τ
       { t − τ,  t ≥ τ

NOT RECOMMENDED!
Example: Boston marathon winning times
[Figure: Boston marathon winning times in minutes]
fit_trends
## # A mable: 1 x 3
## linear exponential piecewise
## <model> <model> <model>
## 1 <TSLM> <TSLM> <TSLM>
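The mable above could have been created along these lines (a sketch assuming fpp3's boston_marathon data; the knot placement is an illustrative choice, not necessarily the one used for these slides):

```r
library(fpp3)

fit_trends <- boston_marathon %>%
  filter(Event == "Men's open division") %>%
  mutate(Minutes = as.numeric(Time) / 60) %>%  # convert winning time to minutes
  model(
    linear      = TSLM(Minutes ~ trend()),
    exponential = TSLM(log(Minutes) ~ trend()),
    piecewise   = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
  )
```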
Example: Boston marathon winning times

[Figure: forecasts from the linear, exponential and piecewise models with 95% prediction intervals]
Example: Boston marathon winning times

[Figure: residual diagnostics: residual time plot, ACF of residuals, histogram of residuals]
Residual diagnostics
Multiple regression and forecasting
Residual plots
Useful for spotting outliers and whether the linear model was
appropriate.
Residual patterns
Selecting predictors and forecast evaluation
Comparing regression models

Computer output for regression will always give the R² value. This is a useful summary of the model.
Comparing regression models

However . . .
• R² does not allow for "degrees of freedom".
• Adding any variable tends to increase the value of R², even if that variable is irrelevant.

To overcome this problem, we can use adjusted R²:

R̄² = 1 − (1 − R²) × (T − 1)/(T − k − 1)

where k = no. predictors and T = no. observations.

Maximizing R̄² is equivalent to minimizing σ̂², where

σ̂² = (1/(T − k − 1)) ∑_{t=1}^{T} ε_t²
Akaike’s Information Criterion
Corrected AIC

For small values of T, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed:

AICc = AIC + 2(k + 2)(k + 3)/(T − k − 3)
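The correction is simple enough to express directly; a small helper function (illustrative, not from the slides):

```r
# Bias-corrected AIC for a regression with k predictors and T observations
aicc <- function(aic, k, T) {
  aic + 2 * (k + 2) * (k + 3) / (T - k - 3)
}

aicc(100, k = 2, T = 30)  # 100 + 2*4*5/25 = 101.6
```

The correction term vanishes as T grows, so AICc and AIC agree for long series.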
Bayesian Information Criterion
Leave-one-out cross-validation
Cross-validation

Traditional evaluation
[Figure: the series is split over time into training data followed by test data]

Leave-one-out cross-validation (h = 1)
[Figure: each observation in turn is held out and forecast from the remaining data]
Comparing regression models
glance(fit_trends) %>%
  select(.model, r_squared, adj_r_squared, AICc, CV)
## # A tibble: 3 x 5
## .model r_squared adj_r_squared AICc CV
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 linear 0.728 0.726 452. 39.1
## 2 exponential 0.744 0.742 -779. 0.00176
## 3 piecewise 0.767 0.761 438. 34.8
Choosing regression variables

Warning!

Notes
Forecasting with regression

Ex-ante versus ex-post forecasts

Scenario based forecasting

Building a predictive regression model
yt = β0 + β1 x1,t−h + · · · + βk xk,t−h + εt
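Scenario-based forecasts like the US consumption example that follows can be generated with fable's scenarios() helper; a sketch assuming fpp3 (the model formula and scenario values are illustrative):

```r
library(fpp3)

fit_consMR <- us_change %>%
  model(TSLM(Consumption ~ Income + Savings + Unemployment))

# Two assumed futures for the predictors, four quarters ahead
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>%
    mutate(Income = 1, Savings = 0.5, Unemployment = 0),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income = -1, Savings = -0.5, Unemployment = 0),
  names_to = "Scenario"
)

fc <- fit_consMR %>% forecast(new_data = future_scenarios)
```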
US Consumption

[Figure: scenario-based forecasts of US consumption (% change) under income Increase and Decrease scenarios, with 80% and 95% prediction intervals]
Matrix formulation

Let y = (y1, …, yT)′, ε = (ε1, …, εT)′, β = (β0, …, βk)′, and let X be the T × (k + 1) matrix whose first column is ones and whose remaining columns contain the predictors. Then

y = Xβ + ε.

The residual variance estimate is

σ̂² = (1/(T − k − 1)) (y − Xβ̂)′(y − Xβ̂)    (Prove it!)
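The matrix expressions can be checked numerically in base R on synthetic data (an illustrative sketch; the variable names are mine):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

X <- cbind(1, x)                           # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
resid <- y - X %*% beta_hat
sigma2_hat <- sum(resid^2) / (n - 1 - 1)   # T - k - 1 with k = 1 predictor

fit <- lm(y ~ x)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # should be TRUE
all.equal(sigma2_hat, summary(fit)$sigma^2)             # should be TRUE
```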
Likelihood

y ∼ N(Xβ, σ²I).
So the likelihood is

L = 1/(σ^T (2π)^{T/2}) · exp(−(1/(2σ²)) (y − Xβ)′(y − Xβ))
Exercise to do!
yt = βxt + ϵt ,
Multiple regression forecasts

Optimal forecasts
ŷ∗ = E(y∗ | y, X, x∗) = x∗ β̂ = x∗ (X′X)⁻¹ X′ y

Fitted values
ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy
where H = X(X′X)⁻¹X′ is the "hat matrix".

Leave-one-out cross-validation can then be computed without refitting:

CV = (1/T) ∑_{t=1}^{T} [et/(1 − ht)]²,

where ht is the t-th diagonal element of H and et is the t-th residual.
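The shortcut means LOOCV needs only a single fit; a base-R check against brute-force refitting on synthetic data (illustrative; the names are mine):

```r
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 0.5 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
e <- residuals(fit)
h <- hatvalues(fit)            # diagonal elements of H = X (X'X)^{-1} X'
CV <- mean((e / (1 - h))^2)    # one fit, no refitting needed

# Brute force: refit with each observation left out and predict it
loo_errors <- sapply(seq_len(nrow(d)), function(t) {
  f <- lm(y ~ x, data = d[-t, ])
  d$y[t] - predict(f, newdata = d[t, ])
})
all.equal(CV, mean(loo_errors^2))  # should be TRUE: the identity is exact
```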
Correlation is not causation

Multicollinearity

If multicollinearity exists. . .
Exercise to do!
• Plot the winning time against the year for each event. Describe
the main features of the plot.
• Fit a regression line to the data for each event. Obviously the
winning times have been decreasing, but at what average rate
per year?
• Plot the residuals against the year. What does this indicate
about the suitability of the fitted lines?
• Predict the winning time for each race in the 2020 Olympics.
Give a prediction interval for your forecasts. What assumptions
have you made in these calculations?
Next Lecture!