Lecture 9: Linear Regression
Goals
• Develop basic concepts of linear regression from
a probabilistic framework
• Estimating parameters and hypothesis testing
with linear models
• Linear regression in R
Regression
• Technique used for the modeling and analysis of
numerical data
• Exploits the relationship between two or more
variables so that we can gain information about one of
them through knowing values of the other
• Regression can be used for prediction, estimation,
hypothesis testing, and modeling causal relationships
Regression Lingo
Y = X1 + X2 + X3

Y:                    X1, X2, X3:
Dependent Variable    Independent Variable
Outcome Variable      Predictor Variable
Response Variable     Explanatory Variable
Why Linear Regression?
• Suppose we want to model the dependent variable Y in terms
of three predictors, X1, X2, X3
Y = f(X1, X2, X3)
• Typically we will not have enough data to estimate f directly
• Therefore, we usually have to assume that it has some
restricted form, such as linear
Y = X1 + X2 + X3
Linear Regression is a Probabilistic Model
• Much of mathematics is devoted to studying variables
that are deterministically related to one another
y = " 0 + "1 x
y "y
! "0
"x
"1 =
#y
#x
!
! x
!
!
!
• But we’re interested in understanding the relationship
between variables related in a nondeterministic fashion
A Linear Probabilistic Model
• Definition: There exist parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation
$y = \beta_0 + \beta_1 x + \varepsilon$
• $\varepsilon$ is a random variable assumed to be $N(0, \sigma^2)$
True Regression Line
[Figure: true regression line $y = \beta_0 + \beta_1 x$, with observed points deviating from it by random amounts]
Implications
• The expected value of Y is a linear function of X, but for fixed
x, the variable Y differs from its expected value by a random
amount
• Formally, let x* denote a particular value of the independent
variable x, then our linear probabilistic model says:
$E(Y \mid x^*) = \mu_{Y|x^*}$ = mean value of Y when x is $x^*$
$V(Y \mid x^*) = \sigma^2_{Y|x^*}$ = variance of Y when x is $x^*$
Graphical Interpretation
[Figure: regression line $y = \beta_0 + \beta_1 x$, with conditional means $\mu_{Y|x_1} = \beta_0 + \beta_1 x_1$ and $\mu_{Y|x_2} = \beta_0 + \beta_1 x_2$ marked at $x_1$ and $x_2$]
• For example, if x = height and y = weight, then $\mu_{Y|x=60}$ is the average weight for all individuals 60 inches tall in the population
One More Example
Suppose the relationship between the independent variable height
(x) and dependent variable weight (y) is described by a simple
linear regression model with true regression line
$y = 7.5 + 0.5x$ and $\sigma = 3$
• Q1: What is the interpretation of $\beta_1 = 0.5$?
The expected change in weight associated with a 1-unit increase in height
• Q2: If x = 20, what is the expected value of Y?
$\mu_{Y|x=20} = 7.5 + 0.5(20) = 17.5$
• Q3: If x = 20, what is P(Y > 22)?
$P(Y > 22 \mid x = 20) = P\left(Z > \frac{22 - 17.5}{3}\right) = 1 - \Phi(1.5) = 0.067$
Estimating Model Parameters
• Point estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained by the principle of least squares, which minimizes the sum of squared deviations
$f(\beta_0, \beta_1) = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
[Figure: vertical deviations of the observed points from a candidate line with intercept $\beta_0$]
• The resulting estimators are
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
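A minimal sketch of these estimates in R, using small illustrative vectors x and y (not data from the lecture):

```r
# Least-squares estimates by hand and via lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

fit <- lm(y ~ x)   # gives the same estimates
coef(fit)
```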
Predicted and Residual Values
• Predicted, or fitted, values are values of y predicted by the least-squares regression line, obtained by plugging $x_1, x_2, \ldots, x_n$ into the estimated regression line
$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 x_1$
$\hat{y}_2 = \hat{\beta}_0 + \hat{\beta}_1 x_2$
• Residuals are the deviations of the observed values from the predicted values
$e_1 = y_1 - \hat{y}_1$
$e_2 = y_2 - \hat{y}_2$
[Figure: residuals $e_1, e_2, e_3$ shown as vertical distances between observed $y_i$ and fitted $\hat{y}_i$]
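In R, fitted values and residuals come directly from the fitted model object; a short sketch, assuming the fit from the previous example:

```r
# Fitted values and residuals from the least-squares fit
# (assumes fit <- lm(y ~ x) as in the previous sketch)
y_hat <- fitted(fit)       # y_hat_i = b0_hat + b1_hat * x_i
e     <- residuals(fit)    # e_i = y_i - y_hat_i
cbind(y, y_hat, e)         # observed, fitted, and residual for each point
```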
Residuals Are Useful!
• They allow us to calculate the error sum of squares (SSE):
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Which in turn allows us to estimate $\sigma^2$:
$\hat{\sigma}^2 = \frac{SSE}{n - 2}$
• As well as an important statistic referred to as the coefficient of determination:
$r^2 = 1 - \frac{SSE}{SST}$, where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
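These quantities are easy to compute in R; a sketch continuing the simple regression example above:

```r
# SSE, estimate of sigma^2, and r^2 (continuing the earlier sketch)
e          <- residuals(fit)
SSE        <- sum(e^2)
sigma2_hat <- SSE / (length(y) - 2)     # compare with summary(fit)$sigma^2
SST        <- sum((y - mean(y))^2)
r2         <- 1 - SSE / SST             # compare with summary(fit)$r.squared
```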
Multiple Linear Regression
• Extension of the simple linear regression model to two or
more independent variables
y = " 0 + "1 x1 + " 2 x 2 + ...+ " n x n + #
Expression = Baseline + Age + Tissue + Sex + Error
!
• Partial Regression Coefficients: βi ≡ effect on the
dependent variable when increasing the ith independent
variable by 1 unit, holding all other predictors
constant
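A minimal sketch of such a fit in R for the expression example (the data frame `dat` and its columns are hypothetical):

```r
# Multiple linear regression: Expression ~ Age + Tissue + Sex
# (hypothetical data frame `dat` with these columns)
fit <- lm(Expression ~ Age + Tissue + Sex, data = dat)
summary(fit)   # partial regression coefficients, standard errors, t-tests
```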
Categorical Independent Variables
• Qualitative variables are easily incorporated in regression
framework through dummy variables
• Simple example: sex can be coded as 0/1
• What if my categorical variable contains three levels:
$x_i = \begin{cases} 0 & \text{if AA} \\ 1 & \text{if AG} \\ 2 & \text{if GG} \end{cases}$
Categorical Independent Variables
• The previous coding forces an ordering and equal spacing on the three genotypes rather than treating them as distinct categories
• The solution is to set up a series of dummy variables; in general, for k levels you need k − 1 dummy variables
$x_1 = \begin{cases} 1 & \text{if AA} \\ 0 & \text{otherwise} \end{cases}$
$x_2 = \begin{cases} 1 & \text{if AG} \\ 0 & \text{otherwise} \end{cases}$

        x1   x2
AA       1    0
AG       0    1
GG       0    0
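In R, this coding is handled automatically by factors; a short sketch with a hypothetical genotype vector:

```r
# Dummy (treatment) coding of a three-level genotype in R
geno <- factor(c("AA", "AG", "GG", "AG", "AA"), levels = c("GG", "AG", "AA"))

model.matrix(~ geno)    # shows the k - 1 dummy columns R builds automatically
# In a regression, just include the factor and R handles the coding:
# fit <- lm(trait ~ geno)
```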
Hypothesis Testing: Model Utility Test (or
Omnibus Test)
• The first thing we want to know after fitting a model is whether
any of the independent variables (X’s) are significantly related to
the dependent variable (Y):
$H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$
$H_A: \text{at least one } \beta_i \neq 0$
• Test statistic:
$f = \frac{R^2 / k}{(1 - R^2) / [n - (k + 1)]}$
Rejection region: $F_{\alpha, k, n-(k+1)}$
Equivalent ANOVA Formulation of Omnibus Test
• We can also frame this in our now familiar ANOVA framework
- partition total variation into two components: SSE (unexplained
variation) and SSR (variation explained by linear model)
Source of Variation | df | Sum of Squares | MS | F
Regression | $k$ | $SSR = \sum (\hat{y}_i - \bar{y})^2$ | $MSR = SSR / k$ | $MSR / MSE$
Error | $n - (k+1)$ | $SSE = \sum (y_i - \hat{y}_i)^2$ | $MSE = SSE / [n - (k+1)]$ |
Total | $n - 1$ | $SST = \sum (y_i - \bar{y})^2$ | |

Rejection region: $F_{\alpha, k, n-(k+1)}$
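In R, the omnibus F test is reported by summary(); a minimal sketch with a hypothetical data frame `dat`:

```r
# Omnibus (model utility) test
# (hypothetical data frame `dat` with response y and predictors x1, x2, x3)
fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit)    # last line reports the F statistic, its df, and the p-value

# Equivalent comparison against the intercept-only model
null <- lm(y ~ 1, data = dat)
anova(null, fit)
```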
F Test For Subsets of Independent Variables
• A powerful tool in multiple regression analyses is the ability to
compare two models
• For instance say we want to compare:
Full Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$
Reduced Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
• Again, another example of ANOVA:
$f = \frac{(SSE_R - SSE_F) / (k - l)}{SSE_F / [n - (k + 1)]}$
where $SSE_R$ is the error sum of squares for the reduced model with $l$ predictors and $SSE_F$ is the error sum of squares for the full model with $k$ predictors
Example of Model Comparison
• We have a quantitative trait and want to test the effects at two
markers, M1 and M2.
Full Model: Trait = Mean + M1 + M2 + (M1*M2) + error
Reduced Model: Trait = Mean + M1 + M2 + error
(SSE R " SSE F ) /(3 " 2) (SSE R " SSE F )
f = =
SSE F /([100 " (3 + 1)] SSE F /96
Rejection Region : Fa, 1, 96
!
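This comparison is a one-line partial F test in R; a sketch assuming a hypothetical data frame `dat` with columns Trait, M1, and M2:

```r
# Partial F test comparing reduced and full models
full    <- lm(Trait ~ M1 * M2, data = dat)    # M1 + M2 + M1:M2
reduced <- lm(Trait ~ M1 + M2, data = dat)

anova(reduced, full)    # F test for the interaction term
```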
Hypothesis Tests of Individual Regression
Coefficients
• Hypothesis tests for each $\hat{\beta}_i$ can be done with simple t-tests:
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
$T = \frac{\hat{\beta}_i - \beta_i}{se(\hat{\beta}_i)}$
Critical value: $t_{\alpha/2, n-(k+1)}$
• Confidence intervals are equally easy to obtain:
$\hat{\beta}_i \pm t_{\alpha/2, n-(k+1)} \cdot se(\hat{\beta}_i)$
Checking Assumptions
• Critically important to examine data and check assumptions
underlying the regression model
Outliers
Normality
Constant variance
Independence among residuals
• Standard diagnostic plots include:
scatter plots of y versus xi (outliers)
qq plot of residuals (normality)
residuals versus fitted values (independence, constant variance)
residuals versus xi (outliers, constant variance)
• We’ll explore diagnostic plots in more detail in R
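As a preview, the standard diagnostic plots are built into R; a minimal sketch assuming a fitted model `fit`:

```r
# Standard diagnostic plots for a fitted model `fit`
par(mfrow = c(2, 2))
plot(fit)    # residuals vs fitted, normal QQ, scale-location, residuals vs leverage

# Or build individual plots directly:
qqnorm(resid(fit)); qqline(resid(fit))    # normality of residuals
plot(fitted(fit), resid(fit))             # constant variance
```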
Fixed -vs- Random Effects Models
• In ANOVA and Regression analyses our independent variables can
be treated as Fixed or Random
• Fixed Effects: variables whose levels are either sampled
exhaustively or are the only ones considered relevant to the
experimenter
• Random Effects: variables whose levels are randomly sampled
from a large population of levels
• Example from our recent AJHG paper:
Expression = Baseline + Population + Individual + Error
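One possible way to fit such a model in R is with the lme4 package, treating Individual as a random effect; this is a hedged sketch with a hypothetical data frame `expr_dat`, not the analysis from the paper:

```r
# Mixed model: Population as a fixed effect, Individual as a random intercept
# (hypothetical data frame `expr_dat` with columns Expression, Population, Individual)
library(lme4)

fit <- lmer(Expression ~ Population + (1 | Individual), data = expr_dat)
summary(fit)
```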