Lecture 9: Linear Regression
Goals

• Develop basic concepts of linear regression from a probabilistic framework
• Estimating parameters and hypothesis testing with linear models
• Linear regression in R
Regression

• Technique used for the modeling and analysis of numerical data
• Exploits the relationship between two or more variables so that we can gain information about one of them through knowing values of the others
• Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships
Regression Lingo

    Y = X₁ + X₂ + X₃

    Y                       X₁, X₂, X₃
    Dependent Variable      Independent Variable
    Outcome Variable        Predictor Variable
    Response Variable       Explanatory Variable


Why Linear Regression?

• Suppose we want to model the dependent variable Y in terms of three predictors, X₁, X₂, X₃:

    Y = f(X₁, X₂, X₃)

• Typically we will not have enough data to try and directly estimate f

• Therefore, we usually have to assume that it has some restricted form, such as linear:

    Y = X₁ + X₂ + X₃
Linear Regression is a Probabilistic Model

• Much of mathematics is devoted to studying variables that are deterministically related to one another:

    y = β₀ + β₁x

  where β₀ is the intercept and β₁ = Δy/Δx is the slope.

• But we're interested in understanding the relationship between variables related in a nondeterministic fashion
A Linear Probabilistic Model

• Definition: There exist parameters β₀, β₁, and σ² such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation

    y = β₀ + β₁x + ε

• ε is a random variable assumed to be N(0, σ²)

[Figure: the true regression line y = β₀ + β₁x, with observed points deviating from the line by random errors ε₁, ε₂, ε₃]
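As a quick illustration (not from the original slides; the parameter values below are arbitrary), this model is easy to simulate in R:

    # Simulate y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)
    set.seed(1)                       # for reproducibility
    n     <- 100
    beta0 <- 7.5                      # arbitrary illustrative values
    beta1 <- 0.5
    sigma <- 3
    x   <- runif(n, 50, 80)           # fixed values of the predictor
    eps <- rnorm(n, mean = 0, sd = sigma)
    y   <- beta0 + beta1 * x + eps    # responses scatter around the true line
    plot(x, y); abline(beta0, beta1)  # data with the true regression line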
Implications

• The expected value of Y is a linear function of X, but for fixed x, the variable Y differs from its expected value by a random amount

• Formally, let x* denote a particular value of the independent variable x; then our linear probabilistic model says:

    E(Y | x*) = μ_Y|x* = mean value of Y when x is x*

    V(Y | x*) = σ²_Y|x* = variance of Y when x is x*
Graphical Interpretation

[Figure: the line y = β₀ + β₁x, with the conditional means μ_Y|x₁ = β₀ + β₁x₁ and μ_Y|x₂ = β₀ + β₁x₂ marked at x₁ and x₂]

• For example, if x = height and y = weight, then μ_Y|x=60 is the average weight for all individuals 60 inches tall in the population
One More Example

Suppose the relationship between the independent variable height (x) and dependent variable weight (y) is described by a simple linear regression model with true regression line y = 7.5 + 0.5x and σ = 3

• Q1: What is the interpretation of β₁ = 0.5?
  The expected change in weight associated with a 1-unit increase in height

• Q2: If x = 20, what is the expected value of Y?

    μ_Y|x=20 = 7.5 + 0.5(20) = 17.5

• Q3: If x = 20, what is P(Y > 22)?

    P(Y > 22 | x = 20) = P(Z > (22 − 17.5)/3) = 1 − Φ(1.5) = 0.067
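A one-line check of Q2 and Q3 in R (pnorm is the normal CDF):

    mu <- 7.5 + 0.5 * 20                              # E(Y | x = 20) = 17.5
    pnorm(22, mean = mu, sd = 3, lower.tail = FALSE)  # P(Y > 22 | x = 20) = 0.0668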
Estimating Model Parameters

• Point estimates β̂₀ and β̂₁ are obtained by the principle of least squares, which minimizes the sum of squared vertical deviations

    f(β₀, β₁) = Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]²

[Figure: data points with a candidate regression line; least squares picks the line minimizing the summed squared vertical distances]

• The resulting estimates are

    β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

    β̂₀ = ȳ − β̂₁x̄
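A minimal sketch in R, reusing the simulated x and y from above (any paired numeric vectors work), showing that the closed-form estimates match lm():

    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
    b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
    fit <- lm(y ~ x)   # least-squares fit
    coef(fit)          # agrees with c(b0, b1)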
Predicted and Residual Values

• Predicted, or fitted, values are the values of y predicted by the least-squares regression line, obtained by plugging x₁, x₂, …, xₙ into the estimated regression line:

    ŷ₁ = β̂₀ + β̂₁x₁
    ŷ₂ = β̂₀ + β̂₁x₂

• Residuals are the deviations between observed and predicted values:

    e₁ = y₁ − ŷ₁
    e₂ = y₂ − ŷ₂

[Figure: observed points and the fitted line, with residuals e₁, e₂, e₃ drawn as vertical distances from each point yᵢ to its fitted value ŷᵢ]
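In R, fitted values and residuals come straight off the fitted model object (continuing fit from the previous sketch):

    head(fitted(fit))     # yhat_i = b0 + b1 * x_i
    head(residuals(fit))  # e_i = y_i - yhat_i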
Residuals Are Useful!

• They allow us to calculate the error sum of squares (SSE):

    SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

• Which in turn allows us to estimate σ²:

    σ̂² = SSE / (n − 2)

• As well as an important statistic referred to as the coefficient of determination:

    r² = 1 − SSE/SST,  where SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
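These quantities are easy to compute from the fit (a sketch continuing the example above; the manual r² should agree with summary(fit)$r.squared):

    sse    <- sum(residuals(fit)^2)   # error sum of squares
    sst    <- sum((y - mean(y))^2)    # total sum of squares
    sigma2 <- sse / (length(y) - 2)   # estimate of sigma^2
    r2     <- 1 - sse / sst           # coefficient of determination
    c(r2, summary(fit)$r.squared)     # the two should match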
Multiple Linear Regression

• Extension of the simple linear regression model to two or more independent variables:

    y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

    Expression = Baseline + Age + Tissue + Sex + Error

• Partial Regression Coefficients: βᵢ ≡ effect on the dependent variable when increasing the ith independent variable by 1 unit, holding all other predictors constant
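In R's formula syntax the expression example above would be fit as follows (a sketch; the data frame dat and its column names are assumptions):

    # dat is assumed to contain columns Expression, Age, Tissue, Sex
    fit2 <- lm(Expression ~ Age + Tissue + Sex, data = dat)
    summary(fit2)   # partial regression coefficients with standard errors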
Categorical Independent Variables

• Qualitative variables are easily incorporated in the regression framework through dummy variables

• Simple example: sex can be coded as 0/1

• What if my categorical variable contains three levels?

    xᵢ = 0 if AA, 1 if AG, 2 if GG
Categorical Independent Variables

• The previous coding treats genotype as a quantitative variable, forcing the AA→AG and AG→GG effects to be equal

• The solution is to set up a series of dummy variables. In general, for k levels you need k − 1 dummy variables:

    x₁ = 1 if AA, 0 otherwise
    x₂ = 1 if AG, 0 otherwise

         x₁   x₂
    AA    1    0
    AG    0    1
    GG    0    0
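R builds these dummy variables automatically when the predictor is a factor; model.matrix() shows the coding it uses (a minimal sketch):

    genotype <- factor(c("AA", "AG", "GG", "AG", "AA"))
    model.matrix(~ genotype)   # k - 1 = 2 dummy columns (AG and GG indicators)

Note that R drops the first factor level, so here AA plays the reference role that GG plays in the table above.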
Hypothesis Testing: Model Utility Test (or Omnibus Test)

• The first thing we want to know after fitting a model is whether any of the independent variables (X's) are significantly related to the dependent variable (Y):

    H₀: β₁ = β₂ = … = βₖ = 0
    Hₐ: at least one βᵢ ≠ 0

    f = (R²/k) / [(1 − R²)/(n − (k + 1))]

    Rejection Region: F_α,k,n−(k+1)
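The omnibus F appears at the bottom of summary() output and can be reproduced from R² directly (continuing fit2 from above):

    s   <- summary(fit2)
    s$fstatistic                   # F value with numerator/denominator df
    r2  <- s$r.squared
    k   <- s$fstatistic["numdf"]   # number of predictors
    dfe <- s$fstatistic["dendf"]   # n - (k + 1)
    (r2 / k) / ((1 - r2) / dfe)    # matches s$fstatistic["value"]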
Equivalent ANOVA Formulation of Omnibus Test

• We can also frame this in our now familiar ANOVA framework: partition total variation into two components, SSE (unexplained variation) and SSR (variation explained by the linear model)

    Source of Variation   df            Sum of Squares      MS                      F
    Regression            k             SSR = Σ(ŷᵢ − ȳ)²    MSR = SSR/k             MSR/MSE
    Error                 n − (k + 1)   SSE = Σ(yᵢ − ŷᵢ)²   MSE = SSE/(n − (k+1))
    Total                 n − 1         SST = Σ(yᵢ − ȳ)²

    Rejection Region: F_α,k,n−(k+1)
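In R, anova() on a fitted lm object prints this table (with the regression row split sequentially by term):

    anova(fit2)   # df, sums of squares, mean squares, and F for each term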
F Test For Subsets of Independent Variables

• A powerful tool in multiple regression analyses is the ability to compare two models

• For instance, say we want to compare:

    Full Model:    y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε
    Reduced Model: y = β₀ + β₁x₁ + β₂x₂ + ε

• Again, another example of ANOVA. With SSE_R = error sum of squares for the reduced model with l predictors, and SSE_F = error sum of squares for the full model with k predictors:

    f = [(SSE_R − SSE_F)/(k − l)] / [SSE_F/(n − (k + 1))]
Example of Model Comparison

• We have a quantitative trait measured on n = 100 individuals and want to test the effects at two markers, M1 and M2:

    Full Model:    Trait = Mean + M1 + M2 + (M1*M2) + error
    Reduced Model: Trait = Mean + M1 + M2 + error

    f = [(SSE_R − SSE_F)/(3 − 2)] / [SSE_F/(100 − (3 + 1))] = (SSE_R − SSE_F) / (SSE_F/96)

    Rejection Region: F_α,1,96
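In R this partial F test is a single anova() call on the nested fits (a sketch; trait, m1, m2 are assumed column names):

    full    <- lm(trait ~ m1 * m2, data = dat)   # includes the m1:m2 interaction
    reduced <- lm(trait ~ m1 + m2, data = dat)
    anova(reduced, full)                         # F test of the extra term(s)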
Hypothesis Tests of Individual Regression Coefficients

• Hypothesis tests for each βᵢ can be done by simple t-tests:

    H₀: βᵢ = 0
    Hₐ: βᵢ ≠ 0

    T = (β̂ᵢ − βᵢ) / se(β̂ᵢ)

    Critical value: t_α/2,n−(k+1)

• Confidence intervals are equally easy to obtain:

    β̂ᵢ ± t_α/2,n−(k+1) · se(β̂ᵢ)


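In R, summary() reports exactly these t-tests, and confint() the corresponding intervals (continuing fit2):

    summary(fit2)$coefficients   # estimate, std. error, t value, Pr(>|t|)
    confint(fit2, level = 0.95)  # 95% confidence intervals for each beta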
Checking Assumptions

• Critically important to examine the data and check the assumptions underlying the regression model:
  - Outliers
  - Normality
  - Constant variance
  - Independence among residuals

• Standard diagnostic plots include:
  - scatter plots of y versus xᵢ (outliers)
  - qq plot of residuals (normality)
  - residuals versus fitted values (independence, constant variance)
  - residuals versus xᵢ (outliers, constant variance)

• We'll explore diagnostic plots in more detail in R
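As a preview, R's built-in plot method for lm objects draws four standard diagnostics:

    par(mfrow = c(2, 2))   # arrange the four plots in a 2x2 grid
    plot(fit2)             # residuals vs fitted, QQ plot, scale-location, leverage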


Fixed vs. Random Effects Models

• In ANOVA and regression analyses our independent variables can be treated as fixed or random

• Fixed Effects: variables whose levels are either sampled exhaustively or are the only ones considered relevant to the experimenter

• Random Effects: variables whose levels are randomly sampled from a large population of levels

• Example from our recent AJHG paper:

    Expression = Baseline + Population + Individual + Error
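A hedged sketch of how such a mixed model is commonly fit in R with the lme4 package (not the paper's actual code; the data frame and column names are assumptions):

    library(lme4)
    # dat is assumed to contain columns Expression, Population, Individual
    fit3 <- lmer(Expression ~ Population + (1 | Individual), data = dat)  # Individual as a random intercept
    summary(fit3)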
