Correlation, Linear and Logistic Regression
Introduction
• Data are frequently given in pairs where one
variable is dependent on the other
• E.g.
– Weight and height
– Age and blood pressure
– Birth weight and gestational age
– House rent and income
– Plasma volume and weight
Introduction
• It is usually desirable to express their
relationship by finding an appropriate
mathematical equation/model.
• To form the equation/model, collect the data
on these two variables.
• Let the observations be denoted by (X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn).
• We need at least two quantitative variables.
• Goal: Test whether the level of the response variable is associated with (depends on) the level of the explanatory variable.
• Goal: Measure the strength of the association between the two variables (Correlation).
• Goal: Use the level of the explanatory variable to predict the level of the response variable (Regression).
Introduction
• However, before trying to quantify this
relationship, plot the data and get an idea of
their nature.
• Plot these points on the XY plane to obtain a scatter diagram.
Introduction
[Scatter diagram: relationship between the heights of fathers (x-axis, 62–74 inches) and the heights of their oldest sons (y-axis, 62–73 inches)]
1. Simple Linear
Correlation
(Karl Pearson’s Coefficient of linear correlation)
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
Simple Linear Correlation
• Properties
– −1 ≤ r ≤ 1
– r is a pure number without any unit
– If r is close to 1 → strong positive linear relationship
– If r is close to −1 → strong negative linear relationship
– If r = 0 → no linear correlation
Simple Linear Correlation
• Example: Heights of 10 fathers (X) together with their oldest
sons (Y) are given below (in inches). Find the degree of
relationship between Y and X.
Father (X)   Oldest son (Y)
63           65
64           67
70           69
72           70
65           64
67           68
68           71
66           63
70           70
71           72
Simple Linear Correlation
• Calculate the correlation coefficient for the
above data!
• r = 0.7776 ≈ 0.78
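To make the calculation concrete, here is a minimal Python sketch (standard library only) that reproduces r for the father/son data above:

```python
# Pearson's correlation coefficient for the father/son height data.
import math

x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]  # father heights (inches)
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]  # oldest-son heights (inches)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
ss_yy = sum((yi - mean_y) ** 2 for yi in y)

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 4))  # 0.7776
```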
Rank correlation coefficient
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n} $$

where di is the difference between the two ranks assigned to the ith point.
Rank correlation coefficient
Subject   Rank 1   Rank 2    d    d²
A            2        2      0    0
B            1        3     −2    4
C            4        4      0    0
D            5        6     −1    1
E            6        5      1    1
F            3        1      2    4
Rank correlation coefficient
• Σdi² = 10, n = 6.

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 10}{6(6^2 - 1)} = 1 - 0.29 = 0.71 $$
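A matching sketch for the rank table above; the only assumption is the naming of the two rank columns (rank_1 and rank_2 here), read directly off the table:

```python
# Spearman's rank correlation from the rank table above.
rank_1 = [2, 1, 4, 5, 6, 3]  # subjects A..F, first ranking
rank_2 = [2, 3, 4, 6, 5, 1]  # subjects A..F, second ranking

n = len(rank_1)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_1, rank_2))  # 10

r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(round(r_s, 2))  # 0.71
```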
• When we look at the scatter plot, the linear model looks reasonable, but in practice any function of x might be possible. Even so, we almost always begin with the linear model.
• The question now is: given a set of data, what is the best-fitting straight line? Let's look at the scatter plot for n = 30 with a few possible guesses.
Let L2 be the candidate line Y = β0 + β1X.
[Figure: the line L2 drawn through the scatter; the observed points P1–P4 at X1–X4 lie off the line, the corresponding fitted points Q1–Q4 lie on it, and the residual u1 is the vertical distance from P1 to Q1.]
A GENERAL SIMPLE REGRESSION MODEL
[Figure: actual values of Y and fitted values Ŷ plotted against X at X1–X4.]
The discrepancies between the actual and fitted values of Y are known as the residuals.
Simple Linear Regression
• Y on X means Y is the dependent variable and
X is the independent one.
• The purpose of a regression equation is to use
one variable to predict another.
The Method of Least Squares

$$ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
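A minimal sketch of the least-squares fit applied to the father/son data from the correlation example, using the SSxy/SSxx slope formula and the standard intercept formula above:

```python
# Least-squares line for the father/son data:
# slope = SS_xy / SS_xx, intercept = mean(y) - slope * mean(x).
x = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
y = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)

b1 = ss_xy / ss_xx            # slope
b0 = mean_y - b1 * mean_x     # intercept
print(round(b1, 3), round(b0, 2))  # 0.771 15.79
```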
Assumption 7:
• The probability distribution of ε is normal.
Assumption 8:
• The values of ε associated with any two observed values of y are independent. That is, the value of ε associated with one value of y has no effect on the values of ε associated with other y values.
Computer output for the above example
[Computer output not reproduced]
Potential misuse of model
Multivariate Analysis
• Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
• It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
• All types of multivariate analyses involve the
construction of a mathematical model to describe
the association between independent and
dependent variables
Multivariate Analysis
• A large number of multivariate models have been
developed for specialized purposes, each with a
particular set of assumptions underlying its
applicability.
• The choice of the appropriate model is based on
the underlying design of the study, the nature of
the variables, as well as assumptions regarding
the inter‐relationship between the exposures and
outcomes under investigation.
Multiple Linear Regression

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon $$

where
• y is the dependent variable
• x1, x2, …, xk are the independent variables
• E(y) = β0 + β1x1 + β2x2 + ⋯ + βkxk is the deterministic portion of the model
• ε is the random error component
• βi determines the contribution of the independent variable xi
Analyzing a Multiple Regression Model
Step 1:
Hypothesize the deterministic component of the model. This component relates the mean, E(y), to the independent variables x1, x2, …, xk. This involves the choice of the independent variables to be included in the model.
Step 4:
Statistically evaluate the usefulness of the model
Step 5:
When satisfied that the model is useful, use it for
prediction, estimation and other purposes.
Fitting The Model: The Least Squares Approach
Minimizing the sum of squared errors gives p + 1 linear equations. These are easily solved, but not in closed form, one equation at a time. If we write the model in matrix form, we can express the solution in closed form:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} $$
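To illustrate the matrix solution, here is a sketch with numpy on a small hypothetical data set (the X and y values below are made up purely for demonstration):

```python
# Closed-form least squares in matrix form: beta_hat = (X'X)^{-1} X'y.
import numpy as np

# Hypothetical design matrix: a column of ones for the intercept,
# then two predictors x1 and x2.
X = np.array([[1, 2.0, 1.0],
              [1, 4.0, 3.0],
              [1, 6.0, 2.0],
              [1, 8.0, 5.0],
              [1, 10.0, 4.0]])
y = np.array([3.1, 7.2, 8.9, 13.8, 15.1])

# Explicit normal-equations solution (fine for illustration) ...
beta = np.linalg.inv(X.T @ X) @ X.T @ y
# ... though np.linalg.lstsq is the numerically safer route in practice.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)      # [b0, b1, b2]
print(beta_ls)   # same values
```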
Assumption for Random Error
• The logit is the log odds:

$$ \operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) $$

• Equivalently, the model for the probability itself is

$$ P = \frac{1}{1 + e^{-(a + b_1X_1 + b_2X_2 + \cdots + b_nX_n)}} $$

• Linear regression?
Dot-plot: Data from the Table
[Dot-plot: signs of coronary disease (Yes/No) plotted against age, 0–100 years.]
Logistic regression (2)
Table 2. Prevalence (%) of signs of CD according to age group

Age group    n    Diseased    %
20–29        5        0        0
30–39        6        1       17
40–49        7        2       29
50–59        7        4       57
60–69        5        4       80
70–79        2        2      100
80–89        1        1      100
Dot-plot: Data from Table 2
[Plot: percentage diseased (0–100%) against age group, rising steadily with age.]
Logistic function (1)

$$ P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$

[Figure: S-shaped logistic curve; probability of disease (0.0 to 1.0) plotted against x.]
Logistic transformation
$$ P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$

$$ \ln\!\left(\frac{P(y \mid x)}{1 - P(y \mid x)}\right) = \alpha + \beta x $$

The left-hand side is the logit of P(y|x).
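A small sketch of the logistic function and its inverse, the logit; the values of alpha and beta are hypothetical, chosen only to show the mapping between the probability and linear scales:

```python
# The logistic function and its inverse (the logit), as defined above.
import math

def logistic(x, alpha, beta):
    """P(y|x) = e^(a+bx) / (1 + e^(a+bx))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def logit(p):
    """ln(p / (1-p)): maps a probability back to the linear scale."""
    return math.log(p / (1.0 - p))

alpha, beta = -5.0, 0.1   # hypothetical coefficients
for age in (30, 50, 70):
    p = logistic(age, alpha, beta)
    # logit(p) recovers alpha + beta*age exactly
    print(age, round(p, 3), round(logit(p), 2))
```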
Advantages of Logit
Odds of disease when x = 1: P(y|x=1) / [1 − P(y|x=1)] = e^(α+β)
Odds of disease when x = 0: P(y|x=0) / [1 − P(y|x=0)] = e^α

$$ OR = \frac{e^{\alpha+\beta}}{e^{\alpha}} = e^{\beta}, \qquad \ln(OR) = \beta $$
Interpretation of coefficient
• β = the increase in the logarithm of the odds ratio for a one-unit increase in x
• Test of the hypothesis that β = 0 (Wald test):

$$ \chi^2 = \frac{\beta^2}{\operatorname{Var}(\beta)} \qquad (1\ \text{df}) $$
Signs of CD by age group:

               CD Present (1)   CD Absent (0)
Age 55+ (1)         21                6
Age <55 (0)         22               51

Coefficient = 2.094   SE = 0.529   Coeff/SE = 3.96

OR = e^2.094 = 8.1
Wald test = 3.96² = 15.7 with 1 df (p < 0.05)
95% CI = e^(2.094 ± 1.96 × 0.529) = (2.9, 22.9)
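All of the quantities on this slide can be reproduced directly from the 2×2 table; a minimal sketch:

```python
# Odds ratio, Wald statistic and 95% CI from the 2x2 table above.
import math

a, b = 21, 6    # age 55+: CD present, CD absent
c, d = 22, 51   # age <55: CD present, CD absent

or_ = (a * d) / (b * c)                   # 8.11
coef = math.log(or_)                      # 2.094 (logistic coefficient)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # 0.529
z = coef / se                             # 3.96
wald = z ** 2                             # 15.7, chi-square with 1 df

lo = math.exp(coef - 1.96 * se)           # 2.9
hi = math.exp(coef + 1.96 * se)           # 22.9
print(round(or_, 1), round(z, 2), round(lo, 1), round(hi, 1))
```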
Logistic regression
• Significance tests
• The process by which coefficients are tested for
significance for inclusion or elimination from the
model involves several different techniques.
I) Z-test
• The significance of each variable can be assessed by treating

$$ Z = \frac{b}{se(b)} $$

as a standard normal deviate.
• The corresponding P-values are then easily found from the table of the Z-distribution.
Logistic regression
• II) Likelihood‐Ratio Test:
• The likelihood‐ratio test uses the ratio of the
maximized value of the likelihood function for
the full model (L1) over the maximized value
of the likelihood function for the simpler
model (L0).
Logistic regression
• Deviance
– Before proceeding to the likelihood-ratio test, we need to know about the deviance, which is analogous to the residual sum of squares from a linear model.
– The deviance of a model is −2 times its log likelihood.
– As a model’s ability to predict outcomes improves,
the deviance falls. Poorly‐fitting models have
higher deviance.
Logistic regression
• Deviance
– If a model perfectly predicts outcomes, the
deviance will be zero. This is analogous to the
situation in linear regression, where the residual
sum of squares falls to 0 if the model predicts the
values of the dependent variable perfectly.
– Based on the deviance, it is possible to construct a statistic analogous to r² for logistic regression, commonly referred to as the pseudo r².
Logistic regression
• Deviance
– If G1² is the deviance of the model with the variables, and G0² is the deviance of the null model, the pseudo r² of the model is:

$$ r^2 = 1 - \frac{G_1^2}{G_0^2} = 1 - \frac{\ln L_1}{\ln L_0} $$
– One might think of it as the proportion of
deviance explained.
– The likelihood ratio test, which makes use of the
deviance , is analogous to the F‐test from linear
regression.
Logistic regression
• Deviance
– In its most basic form, it can test the hypothesis that all the coefficients in a model are equal to 0:
H0: β1 = β2 = … = βk = 0
– The test statistic has a chi‐square distribution, with k
degrees of freedom.
– If we want to test whether a subset consisting of q
coefficients in a model are all equal to zero, the test
statistic is the same, except that for L0 we use the
likelihood from the model without the coefficients,
and L1 is the likelihood from the model with them.
– This chi‐square has q degrees of freedom.
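A sketch of the deviance, pseudo r², and LR statistic using the earlier 2×2 CD/age-group example (21/6 and 22/51). With a single binary predictor the maximum-likelihood fitted probabilities are simply the observed group proportions, so no iterative fitting is needed:

```python
# Deviance, LR test and pseudo r^2 on the 2x2 CD/age data above.
import math

def ll(events, total, p):
    """Binomial log likelihood of `events` successes out of `total`."""
    return events * math.log(p) + (total - events) * math.log(1 - p)

# Null model: one common probability of disease for all 100 subjects
ll_null = ll(43, 100, 43 / 100)
# Model with the age-group term: one probability per group
ll_full = ll(21, 27, 21 / 27) + ll(22, 73, 22 / 73)

dev_null = -2 * ll_null   # G0^2
dev_full = -2 * ll_full   # G1^2

lr = dev_null - dev_full              # LR statistic, chi-square, 1 df
pseudo_r2 = 1 - dev_full / dev_null   # = 1 - ln L1 / ln L0
print(round(lr, 2), round(pseudo_r2, 3))  # approx. 18.7 and 0.137
```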
Logistic regression
• Assumptions
• Logistic regression is popular in part because it enables
the researcher to overcome many of the restrictive
assumptions of OLS regression:
1. Logistic regression does not assume a linear relationship
between the dependents and the independents. It may
handle nonlinear effects even when exponential and
polynomial terms are not explicitly added as additional
independents because the logit link function on the left‐
hand side of the logistic regression equation is non‐
linear. However, it is also possible and permitted to add
explicit interaction and power terms as variables on the
right‐hand side of the logistic equation, as in OLS
regression.
Logistic regression
• Assumptions
2. The dependent variable need not be normally
distributed.
3. The dependent variable need not be
homoscedastic for each level of the
independents; that is, there is no homogeneity
of variance assumption.
Logistic regression
• However, other assumptions still apply:
– Meaningful coding. Logistic coefficients will be
difficult to interpret if not coded meaningfully. The
convention for binomial logistic regression is to
code the dependent class of greatest interest as 1
and the other class as 0.
– Inclusion of all relevant variables in the
regression model
– Exclusion of all irrelevant variables
Logistic regression
• However, other assumptions still apply:
– Error terms are assumed to be independent
(independent sampling). Violations of this
assumption can have serious effects. Violations are
apt to occur, for instance, in correlated samples and
repeated measures designs, such as before‐after or
matched‐pairs studies, cluster sampling, or time‐
series data. That is, subjects cannot provide multiple
observations at different time points. In some cases,
special methods are available to adapt logistic models
to handle non‐independent data.
Logistic regression
• However, other assumptions still apply:
– Linearity. Logistic regression does not require linear relationships between the independents and the dependent, as OLS regression does, but it does assume a linear relationship between the independents and the logit of the dependent.
– No multicollinearity: To the extent that one
independent is a linear function of another
independent, the problem of multicollinearity will
occur in logistic regression, as it does in OLS
regression. As the independents increase in
correlation with each other, the standard errors of the
logit (effect) coefficients will become inflated.
Logistic regression
• However, other assumptions still apply:
– No outliers. As in OLS regression, outliers can affect
results significantly. The researcher should analyze
standardized residuals for outliers and consider
removing them or modeling them separately.
Standardized residuals >2.58 are outliers at the .01
level, which is the customary level (standardized
residuals > 1.96 are outliers at the less‐used .05 level).
– Large samples. Unlike OLS regression, logistic regression uses maximum likelihood estimation (MLE) rather than ordinary least squares to derive its parameters, and MLE is reliable only in large samples.
Logistic regression
• MLE relies on large-sample asymptotic normality, which means that the reliability of the estimates declines when there are few cases for each observed combination of independent variables.
• That is, in small samples one may get high
standard errors. In the extreme, if there are too
few cases in relation to the number of variables,
it may be impossible to converge on a solution.
• Very high parameter estimates (logistic
coefficients) may signal inadequate sample size.
Multiple Logistic Regression
• More than one independent variable
– Dichotomous, ordinal, nominal, continuous …
$$ \ln\!\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i $$

• Interpretation of βi
– The increase in log-odds for a one-unit increase in xi, with all the other xs held constant
– Measures the association between xi and the log-odds, adjusted for all the other xs
Effect modification
• Effect modification
– Can be modelled by including interaction terms
$$ \ln\!\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 $$
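In practice the interaction term is simply the product x1·x2 added as an extra column of the design matrix; a minimal sketch with hypothetical 0/1 coding:

```python
# Effect modification: the interaction term is the product x1*x2
# appended as an extra column of the design matrix.
import numpy as np

x1 = np.array([0, 0, 1, 1])   # e.g. exposure (hypothetical coding)
x2 = np.array([0, 1, 0, 1])   # e.g. effect modifier
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
print(X)
# Columns: intercept, x1, x2, x1*x2 (the beta3 term in the model above)
```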
Statistical testing
• Question
– Does model including given independent variable
provide more information about dependent
variable than model without this variable?
• Three tests
– Likelihood ratio statistic (LRS)
– Wald test
– Score test
Likelihood ratio statistic
• LR statistic:

$$ LRS = -2\ln\!\left(\frac{L_{\text{model 2}}}{L_{\text{model 1}}}\right) = \bigl(-2\ln L_{\text{model 2}}\bigr) - \bigl(-2\ln L_{\text{model 1}}\bigr) $$
$$ \ln\!\left(\frac{P}{1-P}\right) = \alpha + \beta_1\,\mathrm{Exc} + \beta_2\,\mathrm{Smk} = 0.7102 + 1.0047\,\mathrm{Exc} + 0.7005\,\mathrm{Smk} $$

(SE of β1 = 0.2614; SE of β2 = 0.2664)
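From the fitted coefficients and standard errors above, the adjusted odds ratios and their 95% confidence intervals follow directly; a minimal sketch:

```python
# Adjusted odds ratios and 95% CIs from the fitted model above.
import math

coefs = {"Exc": (1.0047, 0.2614), "Smk": (0.7005, 0.2664)}  # (beta, SE)

for name, (b, se) in coefs.items():
    or_ = math.exp(b)
    lo, hi = math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)
    print(f"{name}: OR = {or_:.2f} (95% CI {lo:.2f}, {hi:.2f})")
# Exc: OR = 2.73 (95% CI 1.64, 4.56)
# Smk: OR = 2.01 (95% CI 1.20, 3.40)
```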