Multilevel Modeling Using R
Second Edition
Authors......................................................................................................................ix
1 Linear Models...................................................................................................1
Simple Linear Regression................................................................................2
Estimating Regression Models with Ordinary Least Squares................... 2
Distributional Assumptions Underlying Regression..................................3
Coefficient of Determination........................................................................... 4
Inference for Regression Parameters.............................................................5
Multiple Regression..........................................................................................7
Example of Simple Linear Regression by Hand...........................................9
Regression in R................................................................................................ 11
Interaction Terms in Regression................................................................... 14
Categorical Independent Variables.............................................................. 15
Checking Regression Assumptions with R................................................ 18
Summary.......................................................................................................... 21
Additional Options......................................................................................... 56
Parameter Estimation Method................................................................. 56
Estimation Controls................................................................................... 56
Comparing Model Fit................................................................................ 57
lme4 and Hypothesis Testing................................................................... 58
Summary.......................................................................................................... 61
Note................................................................................................................... 61
1
Linear Models
y_i = β_0 + β_1 x_i + ε_i, (1.1)
where y_i is the dependent variable for individual i in the dataset, and x_i is the
independent variable for subject i (i = 1, …, N). The terms β_0 and β_1 are the
intercept and slope of the model, respectively. In a graphical sense, the inter-
cept is the point where the line in Equation (1.1) crosses the y-axis at x = 0. It is
also the mean, specifically the conditional mean, of y for individuals with a
value of 0 on x, and it is this latter definition that will be most useful in actual
practice. The slope, β1, expresses the relationship between y and x. Positive
slope values indicate that larger values of x are associated with correspond-
ingly larger values of y, while negative slopes mean that larger x values are
associated with smaller ys. Holding everything else constant, larger values
of β1 (positive or negative) indicate a stronger linear relationship between
y and x. Finally, εi represents the random error inherent in any statistical
model, including regression. It expresses the fact that for any individual i,
the model will not generally provide a perfect predicted value of y_i, denoted
ŷ_i and obtained by applying the regression model as
ŷ_i = β_0 + β_1 x_i. (1.2)
Estimating Regression Models with Ordinary Least Squares
The goal of OLS is to minimize the sum of the squared differences between the
observed values of y and the model-predicted values of y, across the sample.
This difference, known as the residual, is written as
e_i = y_i − ŷ_i. (1.3)
The sum of these squared residuals across the full sample is then

∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} (y_i − ŷ_i)². (1.4)
The actual mechanism for finding the linear equation that minimizes the sum
of squared residuals involves the partial derivatives of the sum of squares
function with respect to the model coefficients, β_0 and β_1. We will leave these
mathematical details to excellent references, such as Fox (2016). It should be
noted that in the context of simple linear regression, the OLS criteria reduce
to the following equations, which can be used to obtain b0 and b1 as
b_1 = r(s_y/s_x) (1.5)
and
b_0 = ȳ − b_1 x̄ (1.6)
where
r is the Pearson product moment correlation coefficient between x and y,
sy is the sample standard deviation of y,
sx is the sample standard deviation of x,
ȳ is the sample mean of y, and
x̄ is the sample mean of x.
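As a quick sketch of Equations (1.5) and (1.6) in R, suppose we have the following summary statistics (the numbers are invented purely for illustration):

```r
r    <- 0.50  # hypothetical correlation between x and y
s_y  <- 2     # hypothetical standard deviation of y
s_x  <- 4     # hypothetical standard deviation of x
ybar <- 10    # hypothetical mean of y
xbar <- 3     # hypothetical mean of x

b1 <- r * (s_y / s_x)   # Equation (1.5): slope
b0 <- ybar - b1 * xbar  # Equation (1.6): intercept
c(b1 = b1, b0 = b0)     # b1 = 0.25, b0 = 9.25
```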
Coefficient of Determination
When the linear regression model has been estimated, researchers generally
want to measure the relative magnitude of the relationship between the vari-
ables. One useful tool for ascertaining the strength of relationship between
x and y is the coefficient of determination, which is the squared multiple
correlation coefficient, denoted R2 in the sample. R2 reflects the proportion of
the variation in the dependent variable that is explained by the independent
variable. Mathematically, R2 is calculated as
R² = SSR/SST = ∑_{i=1}^{n} (ŷ_i − ȳ)² / ∑_{i=1}^{n} (y_i − ȳ)² = 1 − SSE/SST = 1 − ∑_{i=1}^{n} (y_i − ŷ_i)² / ∑_{i=1}^{n} (y_i − ȳ)². (1.7)
The terms in Equation (1.7) are as defined previously. The value of this sta-
tistic always lies between 0 and 1, with larger numbers indicating a stronger
linear relationship between x and y, implying that the independent variable
is able to account for more variance in the dependent. R2 is a very commonly
used measure of the overall fit of the regression model and, along with the
parameter inference discussed below, serves as the primary mechanism by
which the relationship between the two variables is quantified.
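The equivalence of the two forms of Equation (1.7) can be verified with a few lines of R on simulated data (the variable names here are illustrative only):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)  # simulated data with a known linear relationship
fit <- lm(y ~ x)

yhat <- fitted(fit)
SST <- sum((y - mean(y))^2)     # total sum of squares
SSR <- sum((yhat - mean(y))^2)  # regression sum of squares
SSE <- sum((y - yhat)^2)        # error sum of squares

# Both forms of Equation (1.7) match summary(fit)$r.squared
c(SSR / SST, 1 - SSE / SST, summary(fit)$r.squared)
```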
Inference for Regression Parameters
In order to draw inferences about the regression parameters, we first need an estimate of the variance of the residuals, calculated as

s_e² = ∑_{i=1}^{n} e_i² / (n − p − 1), (1.8)
where
ei is the residual value for individual i,
n is the sample size, and
p is the number of independent variables (1 in the case of simple
regression).
Then
s_{b1} = √(1/(1 − R²)) · s_e / √(∑_{i=1}^{n} (x_i − x̄)²). (1.9)
s_{b0} = s_{b1} · √(∑_{i=1}^{n} x_i² / n). (1.10)
Given that the sample intercept and slope are only estimates of the popu-
lation parameters, researchers are quite often interested in testing hypoth-
eses to infer whether the data represent a departure from what would be
expected in what is commonly referred to as the null case; i.e. whether the idea of the
null value holding true in the population can be rejected. Most frequently
(though not always) the inference of interest concerns testing that the popu-
lation parameter is 0. In particular, a non-0 slope in the population means
that x is linearly related to y. Therefore, researchers typically are interested
in using the sample to make inferences about whether the population slope
is 0 or not. Inferences can also be made regarding the intercept, and again the
typical focus is on whether this value is 0 in the population.
Inferences about regression parameters can be made using confidence
intervals and hypothesis tests. Much as with the confidence interval of the
mean, the confidence interval of the regression coefficient yields a range of
values within which we have some level of confidence (e.g. 95%) that the
population parameter value resides. If our particular interest is in whether x
is linearly related to y, then we would simply determine whether 0 is in the
interval for β1. If so, then we would not be able to conclude that the popu-
lation value differs from 0. The absence of a statistically significant result
(i.e. an interval not containing 0) does not imply that the null hypothesis
is true, but rather it means that there is not sufficient evidence available in
the sample data to reject the null. Similarly, we can construct a confidence
interval for the intercept, and if 0 is within the interval, we would conclude
Linear Models 7
that the value of y for an individual with x = 0 could plausibly be, but is not
necessarily, 0. The confidence intervals for the slope and intercept take the
following forms:
b_1 ± t_cv s_{b1} (1.11)
and
b_0 ± t_cv s_{b0} (1.12)
Here the parameter estimates and their standard errors are as described pre-
viously, while tcv is the critical value of the t distribution for 1-α/2 (e.g. the
.975 quantile if α = .05) with n-p-1 degrees of freedom. The value of α is equal
to 1 minus the desired level of confidence. Thus, for a 95% confidence inter-
val (0.95 level of confidence), α would be 0.05.
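Equations (1.11) and (1.12) are easy to compute directly in R, using the qt function to obtain the critical value; the estimate, standard error, and sample size below are hypothetical:

```r
b1   <- 0.50  # hypothetical slope estimate
s_b1 <- 0.10  # hypothetical standard error of the slope
n <- 100      # hypothetical sample size
p <- 1        # one predictor (simple regression)

alpha <- 0.05
t_cv <- qt(1 - alpha / 2, df = n - p - 1)    # critical t value
ci <- c(b1 - t_cv * s_b1, b1 + t_cv * s_b1)  # Equation (1.11)
round(ci, 3)  # approximately (0.302, 0.698)
```

In practice, the confint function performs this calculation automatically for a fitted lm object.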
In addition to confidence intervals, inference about the regression param-
eters can also be made using hypothesis tests. In general, the forms of this
test for the slope and intercept respectively are
t_{b1} = (b_1 − β_1)/s_{b1} (1.13)
and
t_{b0} = (b_0 − β_0)/s_{b0} (1.14)
The terms β1 and β0 are the parameter values under the null hypothesis.
Again, most often the null hypothesis posits that there is no linear relation-
ship between x and y (β1 = 0) and that the value of y = 0 when x = 0 (β0 = 0).
For simple regression, each of these tests is conducted with n-2 degrees of
freedom.
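Similarly, the test statistic in Equation (1.13) and its p-value can be computed by hand in R; the inputs below are hypothetical:

```r
b1   <- 0.50  # hypothetical slope estimate
s_b1 <- 0.10  # hypothetical standard error
df   <- 98    # hypothetical n - p - 1

t_b1 <- (b1 - 0) / s_b1  # Equation (1.13), with null value beta_1 = 0
p_val <- 2 * pt(abs(t_b1), df = df, lower.tail = FALSE)  # two-tailed p-value
t_b1   # 5
p_val  # well below 0.001
```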
Multiple Regression
The linear regression model can very easily be extended to allow for multiple
independent variables at once. In the case of two regressors, the model takes
the form

y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + ε_i. (1.15)

In many ways, this model is interpreted in the same manner as that for simple linear regres-
sion. The only major difference between simple and multiple regression
interpretation is that each coefficient is interpreted in turn holding constant
the value of the other independent variable. In particular, the parameters
are estimated by b0, b1, and b2, and inferences about these parameters are
made in the same fashion with regard to both confidence intervals and
hypothesis tests. The assumptions underlying this model are also the same
as those described for the simple regression model. Despite these similari-
ties, there are three additional topics regarding multiple regression that we
need to consider here. These are the inference for the set of model slopes as a
whole, an adjusted measure of the coefficient of determination, and the issue
of collinearity among the independent variables. Because these issues will
be important in the context of multilevel modeling as well, we will address
them in detail here.
With respect to model inference, for simple linear regression the most
important parameter is generally the slope, so that inference for it will be
of primary concern. When there are multiple x variables in the model, the
researcher may want to know whether the independent variables taken as a
whole are related to y. Therefore, some overall test of model significance is
desirable. The null hypothesis for this test is that all of the slopes are equal
to 0 in the population; i.e. none of the regressors are linearly related to the
dependent variable. The test statistic for this hypothesis is calculated as
F = (SSR/p) / (SSE/(n − p − 1)) = ((n − p − 1)/p) · (R²/(1 − R²)). (1.16)
Here terms are as defined in Equation (1.7). This test statistic is distributed
as an F with p and n-p-1 degrees of freedom. A statistically significant result
would indicate that one or more of the regression coefficients are not equal
to 0 in the population. Typically, the researcher would then refer to the tests
of individual regression parameters, which were described above, in order to
identify which were not equal to 0.
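Equation (1.16) and its p-value translate directly into R; the R², n, and p values below are hypothetical:

```r
R2 <- 0.25  # hypothetical coefficient of determination
n  <- 103   # hypothetical sample size
p  <- 2     # hypothetical number of predictors

F_stat <- ((n - p - 1) / p) * (R2 / (1 - R2))  # Equation (1.16)
p_val <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
round(F_stat, 1)  # 16.7
p_val             # far below 0.001
```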
A second issue to be considered by researchers in the context of multi-
ple regression is the notion of adjusted R2. Stated simply, the inclusion of
additional independent variables in the regression model will always yield
higher values of R2, even when these variables are not statistically signifi-
cantly related to the dependent variable. In other words, there is a capitaliza-
tion on chance that occurs in the calculation of R2. As a consequence, models
including many regressors with negligible relationships with y may produce
an R2 that would suggest the model explains a great deal of variance in y. An
option for measuring the variance explained in the dependent variable that
accounts for this additional model complexity would be quite helpful to the
researcher seeking to understand the true nature of the relationship between
the set of independent variables and the dependent. Such a measure exists in
the form of the adjusted R2 value, which is commonly calculated as
R_A² = 1 − (1 − R²) · ((n − 1)/(n − p − 1)). (1.17)
RA2 only increases with the addition of an x if that x explains more variance
than would be expected by chance. RA2 will always be less than or equal to
the standard R2. It is generally recommended to use this statistic in practice
when models containing many independent variables are used.
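Equation (1.17) is the quantity that R's summary function for lm models reports as "Adjusted R-squared"; a quick sketch with hypothetical values:

```r
R2 <- 0.25  # hypothetical R-squared
n  <- 103   # hypothetical sample size
p  <- 2     # hypothetical number of predictors

R2_adj <- 1 - (1 - R2) * ((n - 1) / (n - p - 1))  # Equation (1.17)
round(R2_adj, 3)  # 0.235, slightly below the unadjusted 0.25
```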
A final important issue specific to multiple regression is that of collin-
earity, which occurs when one independent variable is a linear combina-
tion of one or more of the other independent variables. In such a case,
regression coefficients and their corresponding standard errors can be
quite unstable, resulting in poor inference. It is possible to investigate the
presence of collinearity using a statistic known as the variance inflation
factor (VIF). In order to obtain the VIF for x_j, we would first regress x_j onto all of
the other independent variables and obtain the resulting R²_{x_j} value. We then
calculate
VIF_j = 1/(1 − R²_{x_j}). (1.18)
The VIF will become large when R²_{x_j} is near 1, indicating that x_j has very little
unique variation when the other independent variables in the model are con-
sidered. That is, if the other p−1 regressors can explain a high proportion of
x_j, then x_j does not add much to the model, above and beyond the other p−1
regressors. Collinearity in turn leads to high sampling variation in b_j, result-
ing in large standard errors and unstable parameter estimates. Conventional
rules of thumb have been proposed for determining when an independent
variable is highly collinear with the set of other p-1 regressors. Thus, the
researcher might consider collinearity to be a problem if VIF > 5 or 10 (Fox,
2016). The typical response to collinearity is to either remove the offending
variable(s) or use an alternative approach to conducting the regression analy-
sis such as ridge regression or regression following a principal components
analysis.
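The VIF can be computed by hand via Equation (1.18), or obtained from the vif function in the car package (assuming car is installed; it is used elsewhere in this chapter). The data below are simulated for illustration:

```r
library(car)  # assumes the car package is installed

set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)  # x2 deliberately overlaps with x1
y  <- 1 + x1 + x2 + rnorm(200)
fit <- lm(y ~ x1 + x2)

# Hand calculation for x1: regress x1 on the other predictor(s)
R2_x1 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - R2_x1)  # Equation (1.18)

vif(fit)  # matches the hand calculation; values > 5 or 10 signal trouble
```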
Example of Simple Linear Regression by Hand
To illustrate these computations, consider a sample of 440 college students for whom GPA and test anxiety scores are available. Descriptive statistics for the two variables appear in Table 1.1.

TABLE 1.1
Descriptive Statistics and Correlation for GPA and Test Anxiety
Variable Mean Standard Deviation Correlation
GPA 3.12 0.51 –0.30
Anxiety 35.14 10.83
We can use this information to obtain estimates for both the slope and
intercept of the regression model using Equations (1.5) and (1.6). First, the
slope is calculated as
b_1 = −0.30(0.51/10.83) = −0.014,
indicating that individuals with higher test anxiety scores will generally
have lower GPAs. Next, we can use this value and information in the table to
calculate the intercept estimate: b0 = 3.12 − ( −0.014)(35.14) = 3.63 .
The resulting estimated regression equation is then
predicted GPA = 3.63 − 0.014(Anxiety). Thus, this model would predict that for a
1-point increase in the anxiety assessment score, GPA would decrease by
0.014 points.
In order to better understand the strength of the relationship between test
anxiety and GPA, we will want to calculate the coefficient of determination.
To do this, we need both the SSR and SST, which take the values 10.65 and
115.36, yielding
R² = 10.65/115.36 = 0.09.
This result suggests that approximately 9% of the variation in GPA is
explained by variation in test anxiety scores. Using this R² value and Equation
(1.16), we can calculate the F-statistic testing whether any of the model slopes (in
this case there is only one) are different from 0 in the population:
F = ((440 − 1 − 1)/1) · (0.09/(1 − 0.09)) = 438(0.10) = 43.80.
This test has p and n-p-1 degrees of freedom, or 1 and 438 in this situation.
The p-value of this test is less than 0.001, leading us to conclude that the
slope in the population is indeed significantly different from 0 because the
p-value is less than the Type I error rate specified. Thus, test anxiety is lin-
early related to GPA. The same inference could be conducted using the t-test
for the slope. First, we must calculate the standard error of the slope estimate:
s_{b1} = √(1/(1 − R²)) · s_e/√(∑(x_i − x̄)²).

The standard error of the estimate, s_e, is computed from the residual sum of squares (here 104.71):
s_e = √(104.71/(440 − 1 − 1)) = √0.24 = 0.49.
In turn, the sum of squared deviations for x (anxiety) was 53743.64, and we
previously calculated R2 = 0.09. Thus, the standard error for the slope is
s_{b1} = √(1/(1 − 0.09)) · (0.49/√53743.64) = 1.05(0.002) = 0.002.
The test statistic for the null hypothesis that β1 = 0 is calculated as
t = (b_1 − 0)/s_{b1} = −0.014/0.002 = −7.00,
with n-p-1 or 438 degrees of freedom. The p-value for this test statistic value is
less than 0.001 and thus we can probabilistically infer that in the population,
the value of the slope is not 0, with the best sample point estimate being −0.014.
Finally, we can also draw inferences about β_1 through a 95% confidence
interval, as shown in Equation (1.11). For this calculation, we will need to
determine the value of the t distribution with 438 degrees of freedom that
corresponds to the 1-0.05/2 or 0.975 point in the distribution. We can do so
by using a t table in the back of a textbook, or through standard computer
software such as SPSS. In either case, we find that the critical value for this
example is 1.97. The confidence interval can then be calculated as

−0.014 ± 1.97(0.002) = (−0.018, −0.010).

Because 0 does not fall within this interval, we again conclude that the population slope differs from 0.
Regression in R
In R, the function call for fitting linear regression is lm, which is part of the
stats library that is loaded by default each time R is started on your com-
puter. The basic form for a linear regression model using lm is:
lm(formula, data)
where formula defines the linear regression form and data indicates the
dataset used in the analysis, examples of which appear below. Returning to
the previous example, predicting GPA from measures of physical (BStotal)
and cognitive academic anxiety (CTA.tot), the model is defined in R as:

Model1.1 <- lm(GPA ~ CTA.tot + BStotal, data = Cassidy)
This line of R code is referred to as a function call, and defines the regres-
sion equation. The dependent variable, GPA, is separated from the independent
variables, CTA.tot and BStotal, by ~, and the independent variables themselves
are joined by +. The dataset, Cassidy, is also given here, after the regression
equation has been defined. Finally, the
output from this analysis is stored in the object Model1.1. In order to view
this output, we can type the name of this object in R, and hit return to obtain
the following:
Call:
lm(formula = GPA ~ CTA.tot + BStotal, data = Cassidy)
Coefficients:
(Intercept) CTA.tot BStotal
3.61892 -0.02007 0.01347
The output obtained from the basic function call will return only values
for the intercept and slope coefficients, lacking information regarding model
fit (e.g. R2) and significance of model parameters. Further information on our
model can be obtained by requesting a summary of the model.
summary(Model1.1)
Call:
lm(formula = GPA ~ CTA.tot + BStotal, data = Cassidy)
Residuals:
Min 1Q Median 3Q Max
-2.99239 -0.29138 0.01516 0.36849 0.93941
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.618924 0.079305 45.633 < 2e-16 ***
CTA.tot -0.020068 0.003065 -6.547 1.69e-10 ***
BStotal 0.013469 0.005077 2.653 0.00828 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the model summary we can obtain information on model fit (over-
all F test for significance, R2 and standard error of the estimate), parameter
significance tests, and a summary of residual statistics. As the F test for the
overall model is somewhat abbreviated in this output, we can request the
entire ANOVA result, including sums of squares and mean squares, by using
the anova(Model1.1) function call.
Analysis of Variance Table

Response: GPA
Df Sum Sq Mean Sq F value Pr(>F)
CTA.tot 1 10.316 10.3159 43.8125 1.089e-10 ***
BStotal 1 1.657 1.6570 7.0376 0.00828 **
Residuals 426 100.304 0.2355
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A list of the components stored within the fitted model object can be obtained with the attributes function:

attributes(Model1.1)
$names
[1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
[6] "assign"        "qr"            "df.residual"   "na.action"     "xlevels"
[11] "call"          "terms"         "model"
$class
[1] "lm"
This is a list of attributes or information that can be pulled out of the fitted
regression model. In order to obtain this information from the fitted model,
we can call for the particular attribute. For example, if we wished to obtain
the predicted GPAs for each individual in the sample, we would simply type
the following, followed by the enter key:
Model1.1$fitted.values
1 3 4 5 8 9 10 11 12
2.964641 3.125996 3.039668 3.125454 2.852730 3.152391 3.412460 3.011917 2.611103
13 14 15 16 17 19 23 25 26
3.158448 3.298923 3.312121 2.959938 3.205183 2.945928 2.904979 3.226064 3.245318
27 28 29 30 31 34 35 37 38
2.944573 3.171646 2.917635 3.198584 3.206267 3.073204 3.258787 3.118584 2.972594
39 41 42 43 44 45 46 48 50
2.870630 3.144980 3.285454 3.386064 2.871713 2.911849 3.166131 3.051511 3.251917
Thus, for example, the predicted GPA for subject 1 based on the prediction
equation would be 2.96. By the same token, we can obtain the regression
residuals with the following command:
Model1.1$residuals
1 3 4 5 8 9
-0.4646405061 -0.3259956916 -0.7896675749 -0.0254537419 0.4492704297 -0.0283914353
10 11 12 13 14 15
-0.1124596847 -0.5119169570 0.0888967457 -0.6584484215 -0.7989228998 -0.4221207716
16 17 19 23 25 26
-0.5799383942 -0.3051829226 -0.1459275978 -0.8649791080 0.0989363702 -0.2453184879
27 28 29 30 31 34
-0.4445727235 0.7783537067 -0.8176350301 0.1014160133 0.3937331779 -0.1232042042
35 37 38 39 41 42
0.3412126654 0.4814161689 0.9394056837 -0.6706295541 -0.5449795748 -0.4194540531
43 44 45 46 48 50
-0.4960639410 -0.0717134535 -0.4118490187 0.4338687432 0.7484894275 0.4480825762
From this output, we can see that the predicted GPA for the first individ-
ual in the sample was approximately 0.465 points above the actual GPA (i.e.
observed GPA − predicted GPA = 2.5 − 2.965 = −0.465).
Interaction Terms in Regression
Interaction terms can be included in a regression model in order to determine whether the relationship between an independent variable and the outcome depends on the level of another independent variable. Fitting a model that includes the interaction of CTA.tot and BStotal yields the following summary:

Residuals:
Min 1Q Median 3Q Max
-2.98711 -0.29737 0.01801 0.36340 0.95016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8977792 0.2307491 16.892 < 2e-16 ***
CTA.tot -0.0267935 0.0060581 -4.423 1.24e-05 ***
BStotal -0.0057595 0.0157812 -0.365 0.715
CTA.tot:BStotal 0.0004328 0.0003364 1.287 0.199
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the slope for the interaction is denoted CTA.tot:BStotal, takes the
value 0.0004, and is non-significant (t = 1.287, p = 0.199), which indicates that
the level of physical anxiety symptoms (BStotal) does not change or mod-
erate the relationship between cognitive test anxiety (CTA.tot) and GPA.
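For reference, output like that shown above can be produced by adding the interaction via the * operator in the model formula. Since the book's Cassidy dataset is not reproduced here, the sketch below simulates stand-in data with the same variable names:

```r
# Simulated stand-in for the Cassidy data (names match the text's variables)
set.seed(7)
Cassidy <- data.frame(CTA.tot = rnorm(100, 35, 10),
                      BStotal = rnorm(100, 20, 5))
Cassidy$GPA <- 3.6 - 0.02 * Cassidy$CTA.tot +
  0.01 * Cassidy$BStotal + rnorm(100, sd = 0.4)

# GPA ~ CTA.tot * BStotal expands to CTA.tot + BStotal + CTA.tot:BStotal
Model1.2 <- lm(GPA ~ CTA.tot * BStotal, data = Cassidy)
names(coef(Model1.2))  # includes the "CTA.tot:BStotal" interaction term
```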
Categorical Independent Variables
Regression models can also include categorical independent variables coded as dummy variables. For example, we can fit a model predicting GPA from CTA.tot and a dummy variable Male (1 = male, 0 = not male):

Call:
lm(formula = GPA ~ CTA.tot + Male, data = Acad)
Residuals:
Min 1Q Median 3Q Max
-3.01149 -0.29005 0.03038 0.35374 0.96294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.740318 0.080940 46.211 < 2e-16 ***
CTA.tot     -0.015184   0.002117  -7.173 3.16e-12 ***
Male        -0.222594   0.047152  -4.721 3.17e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example the slope for the dummy variable Male is negative and sig-
nificant (β = −0.223, p < .001) indicating that males have a significantly lower
mean GPA than do females.
Depending on the format in which the data are stored, the lm function
is capable of dummy coding a categorical variable itself. If the variable has
been designated as categorical (this often happens if you have read your data
from an SPSS file in which the variable is designated as such) then when the
variable is used in the lm function it will automatically dummy code the
variable for you in your results. For example, if instead of using the Male
variable as described above, we used Gender as a categorical variable coded
as female and male, we would obtain the following results from the model
specification and summary commands.
Call:
lm(formula = GPA ~ CTA.tot + Gender, data = Acad)
Residuals:
Min 1Q Median 3Q Max
-3.01149 -0.29005 0.03038 0.35374 0.96294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.740318 0.080940 46.211 < 2e-16 ***
CTA.tot -0.015184 0.002117 -7.173 3.16e-12 ***
Gender[T.male] -0.222594 0.047152 -4.721 3.17e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that R now labels the slope as Gender[T.male], indicating that the variable has been auto-
matically dummy coded so that male is 1 and not male is 0.
In the same manner, categorical variables consisting of more than two
categories can also be easily incorporated into the regression model, either
through direct use of the categorical variable or dummy coded prior to
analysis. In the following example, the variable Ethnicity includes three
possible groups, African American, Other, and Caucasian. By including this
variable in the model call, we are implicitly requesting that R automatically
dummy code it for us.
Call:
lm(formula = GPA ~ CTA.tot + Ethnicity, data = Acad)
Residuals:
Min 1Q Median 3Q Max
-2.95019 -0.30021 0.01845 0.37825 1.00682
Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    3.670308   0.079101  46.400  < 2e-16 ***
CTA.tot                       -0.015002   0.002147  -6.989 1.04e-11 ***
Ethnicity[T.African American] -0.482377   0.131589  -3.666 0.000277 ***
Ethnicity[T.Other]            -0.151748   0.136150  -1.115 0.265652
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given that we have slopes for African American and Other, we know that
Caucasian serves as the reference category, which is coded as 0. Results indi-
cate a significant negative slope for African American (β = −.482, p < .001), and
a non-significant slope for Other (β = −.152, p > .05) indicating that African
Americans have a significantly lower GPA than Caucasians, but the Other
ethnicity category was not significantly different from Caucasian in terms
of GPA.
Finally, let us consider some issues associated with allowing R to auto dummy
code categorical variables. First, R will always use the first category of the
variable (by default, the first level in alphabetical order) as the reference group.
If a different reference group is desired, the variable should first be converted
to a factor, e.g.:

Male <- as.factor(Male)

Checking Regression Assumptions with R
Diagnostic plots and tests for the assumptions described earlier can be obtained using the car package:

library(car)
residualPlots(Model1.1)
The residualPlots command returns tests for curvature, including Tukey's test for nonadditivity (Tukey,
1949). A non-significant result, such as that found for this example, indicates
that no interaction is required in the model. The other tests included here are
for the squared term of each independent variable. For example, given that the
Test stat values for CTA.tot and BStotal are not significant, we can conclude that
neither of these variables has a quadratic relationship with GPA (Figure 1.1).
FIGURE 1.1
Diagnostic residual plots for regression model predicting GPA from CTA.tot and BStotal.
Summary
Chapter 1 introduced the reader to the basics of linear modeling using R.
This treatment was purposely limited, as there are a number of good texts
available on this subject, and it is not the main focus of this book. However,
many of the core concepts presented here for the general linear model (GLM) apply to multilevel
modeling as well, and thus are of key importance as we move into these
more complex analyses. In addition, much of the syntactical framework
presented here will reappear in subsequent chapters. In particular, readers
should leave this chapter comfortable with interpretation of coefficients in
linear models, as well as the concept of variance explained in an outcome
variable. We would encourage you to return to this chapter frequently as
needed in order to reinforce these basic concepts. In addition, we would rec-
ommend that you also refer to the initial chapter dealing with the basics of
using R when questions regarding data management and installation of spe-
cific R libraries become an issue. Next, in Chapter 2 we will turn our atten-
tion to the conceptual underpinnings of multilevel modeling before delving
into their estimation in Chapters 3 and 4.
2
An Introduction to Multilevel Data Structure
In a nested design, each individual at the first level of the data appears in
only one level of a higher-level variable such as school. Thus, students are
nested within school. Such designs can be contrasted with a crossed data
structure whereby individuals at the first level appear in multiple levels of
the second variable. In our example, students might be crossed with after-
school organizations if they are allowed to participate in more than one.
For example, a given student might be on the basketball team as well as in
the band. The focus of this book is almost exclusively on nested designs,
which give rise to multilevel data. Other examples of nested designs might
include a survey of job satisfaction for employees from multiple depart-
ments within a large business organization. In this case, each employee
works within only a single division in the company, which leads to a
nested design. Furthermore, it seems reasonable to assume that employ-
ees working within the same division will have correlated responses on
the satisfaction survey, as much of their view regarding the job would
be based exclusively upon experiences within their division. For a third
such example, consider the situation in which clients of several psycho-
therapists working in a clinic are asked to rate the quality of each of their
therapy sessions. In this instance, there exist three levels in the data: time,
in the form of individual therapy session, client, and therapist. Thus, ses-
sion is nested in client, who in turn is nested within therapist. All of this
data structure would be expected to lead to correlated scores on a therapy-
rating instrument.
Intraclass Correlation
In cases where individuals are clustered or nested within a higher-level unit
(e.g. classrooms, schools, school districts), it is possible to estimate the cor-
relation among individuals’ scores within the cluster/nested structure using
the intraclass correlation (denoted ρI in the population). The ρI is a measure
of the proportion of variation in the outcome variable that occurs between
groups versus the total variation present, and ranges from 0 (no variance
between clusters) to 1 (all variance between clusters, with no within-cluster
variance). ρI can also be conceptualized as the correlation for the dependent mea-
sure for two individuals randomly selected from the same cluster. It can be
expressed as
ρ_I = τ²/(τ² + σ²) (2.1)
where
τ2 = Population variance between clusters
σ2 = Population variance within clusters
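As a quick numeric illustration of Equation (2.1), suppose (hypothetically) that the between-cluster variance is 1 and the within-cluster variance is 3; one quarter of the total variance then lies between clusters:

```r
tau2   <- 1  # hypothetical between-cluster variance (tau^2)
sigma2 <- 3  # hypothetical within-cluster variance (sigma^2)

rho_I <- tau2 / (tau2 + sigma2)  # Equation (2.1)
rho_I  # 0.25
```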
å (n - 1)S
j =1
j
2
j
sˆ 2 =
N -C
where
nj
2
å (y ij - yj)
S = variance within cluster j =
j
i =1
(n j - 1)
nj = sample size for cluster j
N = total sample size
C = total number of clusters
The within-cluster variance can be estimated from the sample as

σ̂² = Σ_{j=1}^{C} (nj − 1)Sj² / (N − C) (2.2)

where
Sj² = variance within cluster j = Σ_{i=1}^{nj} (yij − ȳj)² / (nj − 1)
nj = sample size for cluster j
N = total sample size
C = total number of clusters
ŜB² = Σ_{j=1}^{C} nj(ȳj − ȳ)² / [ñ(C − 1)] (2.3)

where
ȳj = mean on response variable for cluster j
ȳ = overall mean on response variable
and

ñ = [1 / (C − 1)] [N − (Σ_{j=1}^{C} nj²) / N]
τ̂² = ŜB² − σ̂² / ñ. (2.4)
Using these variance estimates, we can in turn calculate the sample estimate
of ρI:
ρ̂I = τ̂² / (τ̂² + σ̂²). (2.5)
Note that Equation (2.5) assumes that the clusters are of equal size. Clearly,
such will not always be the case, in which case this equation will not hold.
However, the purpose for its inclusion here is to demonstrate the principle
underlying the estimation of ρI, which holds even as the equation might
change.
In order to illustrate estimation of ρI, let us consider the following dataset.
Achievement test data were collected from 10,903 third-grade examinees
nested within 160 schools. School sizes range from 11 to 143, with a mean
size of 68.14. In this case, we will focus on the reading achievement test
score, and will use data from only five of the schools, in order to make the
calculations by hand easy to follow. First, we will estimate σ̂². To do so,
we must estimate the variance in scores within each school. These values
appear in Table 2.1.
TABLE 2.1
School Size, Mean, and Variance of Reading Achievement Test
School N Mean Variance
767 58 3.952 5.298
785 29 3.331 1.524
789 64 4.363 2.957
815 39 4.500 6.088
981 88 4.236 3.362
Total 278 4.149 3.916
σ̂² = Σ_{j=1}^{C} (nj − 1)Sj² / (N − C)
   = [(58 − 1)5.3 + (29 − 1)1.5 + (64 − 1)2.9 + (39 − 1)6.1 + (88 − 1)3.4] / (278 − 5)
   = (302.1 + 42 + 182.7 + 231.8 + 295.8) / 273 = 1054.4 / 273 = 3.9
The school means, which are needed in order to calculate ŜB², appear in
Table 2.1 as well. First, we must calculate ñ:
ñ = [1 / (5 − 1)] [278 − (58² + 29² + 64² + 39² + 88²) / 278]
  = (1/4)(278 − 63.2) = (1/4)(214.8) = 53.7
TABLE 2.2
Between Subjects Intercept and Slope, and within Subjects Variation on
These Parameters by School
School Intercept U0j Slope U1j
1 1.230 −1.129 0.552 0.177
2 2.673 0.314 0.199 −0.176
3 2.707 0.348 0.376 0.001
4 2.867 0.508 0.336 −0.039
5 2.319 −0.040 0.411 0.036
Overall 2.359 0.375
Using this value, we can then calculate ŜB² for the five schools in our small
sample using Equation (2.3):

ŜB² = [58(3.952 − 4.149)² + 29(3.331 − 4.149)² + 64(4.363 − 4.149)² + 39(4.500 − 4.149)² + 88(4.236 − 4.149)²] / [53.7(5 − 1)]
    = (2.251 + 19.405 + 2.931 + 4.805 + 0.666) / 214.8 = 30.057 / 214.8 = 0.140

Using Equation (2.4), we then obtain

τ̂² = 0.140 − 3.9 / 53.7 = 0.140 − 0.073 = 0.067
We have now calculated all of the parts that we need to estimate ρI for the
population,
ρ̂I = 0.067 / (0.067 + 3.9) = 0.017
This result indicates that there is very little correlation of examinees’ test
scores within the schools. We can also interpret this value as the proportion
of variation in the test scores that is accounted for by the schools.
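The hand calculation above can be reproduced in a few lines of base R, using the school sizes, means, and variances from Table 2.1. Because R carries the intermediate quantities at full precision rather than the rounded values used in the text, the results differ very slightly (e.g. ρ̂I ≈ 0.0173 rather than 0.017).

```r
# School sizes, means, and variances from Table 2.1
n <- c(58, 29, 64, 39, 88)
m <- c(3.952, 3.331, 4.363, 4.500, 4.236)
v <- c(5.298, 1.524, 2.957, 6.088, 3.362)

N <- sum(n)     # total sample size (278)
C <- length(n)  # number of clusters (5)

sigma2 <- sum((n - 1) * v) / (N - C)                  # Equation 2.2
ntilde <- (N - sum(n^2) / N) / (C - 1)                # adjusted average cluster size
ybar   <- sum(n * m) / N                              # overall mean
SB2    <- sum(n * (m - ybar)^2) / (ntilde * (C - 1))  # Equation 2.3
tau2   <- SB2 - sigma2 / ntilde                       # Equation 2.4
rho_I  <- tau2 / (tau2 + sigma2)                      # Equation 2.5, approx 0.017
rho_I
```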
Given that r̂I is a sample estimate, we know that it is subject to sampling
variation, which can be estimated with a standard error as in Equation (2.6):
sρI = (1 − ρI)(1 + (n − 1)ρI) √{2 / [n(n − 1)(N − 1)]}. (2.6)
The terms in 2.6 are as defined previously, and the assumption is that
all clusters are of equal size. As noted earlier in the chapter, this latter
condition is not a requirement, however, and an alternative formulation
exists for cases in which it does not hold. However, 2.6 provides suffi-
cient insight for our purposes into the estimation of the standard error
of the ICC.
The ICC is an important tool in multilevel modeling, in large part because
it is an indicator of the degree to which the multilevel data structure might
impact the outcome variable of interest. Larger values of the ICC are indica-
tive of a greater impact of clustering. Thus, as the ICC increases in value, we
must be more cognizant of employing multilevel modeling strategies in our
data analysis. In the next section, we will discuss the problems associated
with ignoring this multilevel structure, before we turn our attention to meth-
ods for dealing with it directly.
Random Intercept
As we transition from the one-level regression framework of Chapter 1 to
the MLM context, let’s first revisit the basic simple linear regression model
of Equation (1.1), y = β0 + β1x + ε. Here, the dependent variable y is expressed
as a function of an independent variable, x, multiplied by a slope coeffi-
cient, β1, an intercept, β0, and random variation from subject to subject, ε.
We defined the intercept as the conditional mean of y when the value of x
is 0. In the context of a single-level regression model such as this, there is
one intercept that is common to all individuals in the population of inter-
est. However, when individuals are clustered together in some fashion (e.g.
within classrooms, schools, organizational units within a company), there
will potentially be a separate intercept for each of these clusters; that is,
there may be different means for the dependent variable for x = 0 across the
different clusters. We say potentially here because if there is in fact no cluster
effect, then the single intercept model of 1.1 will suffice. In practice, assess-
ing whether there are different means across the clusters is an empirical
question, which we describe below. It should also be noted that in this dis-
cussion we are considering only the case where the intercept is cluster spe-
cific, but it is also possible for β1 to vary by group, or even other coefficients
from more complicated models.
Allowing for group-specific intercepts and slopes leads to the following
notation commonly used for the level 1 (micro level) model in multilevel
modeling:
yij = β0j + β1jx + εij (2.7)
where the subscripts ij refer to the ith individual in the jth cluster. As we con-
tinue our discussion of multilevel modeling notation and structure, we will
begin with the most basic multilevel model: predicting the outcome from just
an intercept which we will allow to vary randomly for each group.
yij = β0j + εij. (2.8)
The cluster-specific intercept can in turn be expressed at level 2 as the sum
of an overall mean and a random effect for cluster j:

β0j = γ00 + U0j. (2.9)

Substituting (2.9) into the level-1 model yields

y = γ00 + U0j + β1x + ε. (2.10)
Equation (2.10) is termed the full or composite model in which the multiple
levels are combined into a unified equation.
Often in MLM, we begin our analysis of a dataset with this simple random
intercept model, known as the null model, which takes the form
yij = γ00 + U0j + εij. (2.11)
While the null model does not provide information regarding the impact
of specific independent variables on the dependent, it does yield important
information regarding how variation in y is partitioned between variance
among the individuals σ2 and variance among the clusters τ2. The total vari-
ance of y is simply the sum of σ2 and τ2. In addition, as we have already seen,
these values can be used to estimate ρI. The null model, as will be seen in
later sections, is also used as a baseline for model building and comparison.
Random Slopes
It is a simple matter to expand the random intercept model in 2.9 to accom-
modate one or more independent predictor variables. As an example, if we
add a single predictor (xij) at the individual level (level 1) to the model, we
obtain
yij = β0j + β1jx + εij (2.13)

Level 2:

β0j = γ00 + U0j (2.14)

β1j = γ10 (2.15)
This model now includes the predictor and the slope relating it to the depen-
dent variable, γ10, which we acknowledge as being at level 1 by the subscript
10. We interpret γ10 in the same way that we did β1 in the linear regression
model; i.e. a measure of the impact on y of a 1-unit change in x. In addition,
we can estimate ρI exactly as before, though now it reflects the correlation
between individuals from the same cluster after controlling for the indepen-
dent variable, x. In this model, both γ10 and γ00 are fixed effects, while σ2 and
τ2 remain random.
One implication of the model in 2.12 is that the dependent variable is
impacted by variation among individuals (σ2), variation among clusters (τ2),
an overall mean common to all clusters (γ00), and the impact of the indepen-
dent variable as measured by γ10, which is also common to all clusters. In
practice there is no reason that the impact of x on y would need to be com-
mon for all clusters, however. In other words, it is entirely possible that rather
than a single γ10 common to all clusters, there is actually a unique effect for
the cluster of γ10 + U1j, where γ10 is the average relationship of x with y across
clusters, and U1j is the cluster-specific variation of the relationship between
the two variables. This cluster-specific effect is assumed to have a mean of 0
and to vary randomly around γ10. The random slopes model is

yij = (γ00 + γ10xij) + (U0j + U1jxij + εij). (2.16)

Written in this way, we have separated the model into its fixed (γ00 + γ10xij)
and random (U0j + U1jxij + εij) components. Model 2.16 simply states that
there is an interaction between cluster and x, such that the relationship of x
and y is not constant across clusters.
Heretofore we have discussed only one source of between-group variation,
which we have expressed as τ², and which is the variation among clusters in
the intercept. However, Model 2.16 adds a second such source of between-group
variation: the variance of the cluster-specific slopes, which can be
estimated as

τ̂1² = Σj (U1j − Ū1)² / (J − 1) (2.17)

for the slopes, with an analogous equation for the intercept random variance.
Doing so, we obtain τ̂0² = 0.439 and τ̂1² = 0.016. In other words, much more of
the between-cluster variation is associated with the intercepts than with
the slopes.
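These two variance estimates can be reproduced directly from the U0j and U1j columns of Table 2.2, as a quick base-R check:

```r
# U0j and U1j values for the five schools, from Table 2.2
U0 <- c(-1.129, 0.314, 0.348, 0.508, -0.040)
U1 <- c(0.177, -0.176, 0.001, -0.039, 0.036)

J <- length(U0)
tau0_sq <- sum((U0 - mean(U0))^2) / (J - 1)  # intercept variance, approx 0.439
tau1_sq <- sum((U1 - mean(U1))^2) / (J - 1)  # slope variance (Eq. 2.17), approx 0.016
```

These are simply the sample variances of the two columns, so `var(U0)` and `var(U1)` give the same answers.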
Centering
Centering simply refers to the practice of subtracting the mean of a vari-
able from each individual value. This implies the mean for the sample of the
centered variables is 0, and implies that each individual’s (centered) score
represents a deviation from the mean, rather than whatever meaning its raw
value might have. In the context of regression, centering is commonly used,
for example, to reduce collinearity caused by including an interaction term
in a regression model. If the raw scores of the independent variables are used
to calculate the interaction, and then both the main effects and interaction
terms are included in the subsequent analysis, it is very likely that collin-
earity will cause problems in the standard errors of the model parameters.
Centering is a way to help avoid such problems (e.g. Iversen, 1991). Such
issues are also important to consider in MLMs, in which interactions are
frequently employed. In addition, centering is also a useful tool for avoid-
ing collinearity caused by highly correlated random intercepts and slopes
in MLMs (Wooldridge, 2004). Finally, centering provides a potential advan-
tage in terms of interpretation of results. Remember from our discussion in
Chapter 1 that the intercept is the value of the dependent variable when the
independent variable is set equal to 0. In many applications the indepen-
dent variable cannot reasonably be 0 (e.g. a measure of vocabulary), however,
which essentially renders the intercept as a necessary value for fitting the
regression line but not one that has a readily interpretable value. However,
when x has been centered, the intercept takes on the value of the dependent
variable when the independent is at its mean. This is a much more useful
interpretation for researchers in many situations, and yet another reason
why centering is an important aspect of modeling, particularly in the mul-
tilevel context.
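A small simulated example in base R illustrates the interpretive point: centering leaves the slope untouched but moves the intercept to the mean of x, and since the fitted line always passes through (x̄, ȳ), the centered intercept equals the mean of y.

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)  # a predictor far from zero (e.g. a test score)
y <- 2 + 0.5 * x + rnorm(100)

fit_raw      <- lm(y ~ x)               # intercept: expected y at x = 0
fit_centered <- lm(y ~ I(x - mean(x)))  # intercept: expected y at the mean of x

coef(fit_raw)[1]                          # extrapolated well outside the data
coef(fit_centered)[1]                     # equals mean(y), readily interpretable
coef(fit_raw)[2] - coef(fit_centered)[2]  # slope is unchanged (difference is zero)
```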
assume about the data, and how they differ from one another. For the tech-
nical details we refer the interested reader to Bryk and Raudenbush (2002)
or de Leeuw and Meijer (2008), both of which provide excellent resources
for those desiring a more in-depth coverage of these methods. Our purpose
here is to provide the reader with a conceptual understanding that will aid
in their understanding of application of MLMs in practice.
As the number of level-2 clusters increases, the difference in value for MLE
and REML estimates becomes very small (Snijders and Bosker, 1999).

Level 2:

βhj = γh0 + γh1zj + Uhj. (2.19)
The additional piece of the equation in 2.19 is γh1zj, where γh1 is the slope
relating the school's average vocabulary score (zj) to the level-1 coefficient.
In other words, the mean school performance is related directly to the
coefficient linking the individual vocabulary score to the individual reading
score. For our specific example, we can combine 2.18 and 2.19 in order to
obtain a single equation for the two-level MLM:

yij = γ00 + γ10xij + γ01zj + γ11xijzj + U0j + U1jxij + εij. (2.20)
Each of these model terms has been defined previously in the chapter: γ00
is the intercept or the grand mean for the model, γ10 is the fixed effect of
variable x (vocabulary) on the outcome, U0j represents the random variation
for the intercept across groups, and U1j represents the random variation for
the slope across groups. The additional pieces of the equation in 2.20 are γ01
and γ11. γ01 represents the fixed effect of the level-2 variable z (average
vocabulary) on the outcome. The new term in Model 2.20 is the cross-level
interaction, γ11xijzj. As the name implies, the cross-level interaction
is simply the interaction between the level-1 and level-2 predictors. In this
context, it is the interaction between an individual’s vocabulary score and
the mean vocabulary score for their school. The coefficient for this interac-
tion term, γ11, assesses the extent to which the relationship between an
examinee’s vocabulary score and their reading achievement is moderated by the
mean for the school that they attend. A large significant value for this
coefficient would indicate that the relationship between a person’s vocabulary
test score and their overall reading achievement is dependent on the level of
vocabulary achievement at their school.
Here, the subscript k represents the level-3 cluster to which the individual
belongs. Prior to formulating the rest of the model, we must evaluate if the
slopes and intercepts are random at both levels 2 and 3, or only at level 1,
for example. This decision should always be based on the theory surround-
ing the research questions, what is expected in the population, and what is
revealed in the empirical data. We will proceed with the remainder of this
discussion under the assumption that the level-1 intercepts and slopes are
random for both levels 2 and 3, in order to provide a complete description
of the most complex model possible when three levels of data structure are
present. When the level-1 coefficients are not random at both levels, the terms
in the following models for which this randomness is not present would sim-
ply be removed. We will address this issue more specifically in Chapter 4,
when we discuss the fitting of three-level models using R.
The level-2 and level-3 contributions to the MLM described in 2.13 appear
below.
Level 2:

β0jk = γ00k + U0jk
β1jk = γ10k + U1jk

Level 3:

γ00k = δ000 + V00k
γ10k = δ100 + V10k

We can then use simple substitution to obtain the expression for the level-1
intercept and slope in terms of both level-2 and level-3 parameters:

β0jk = δ000 + V00k + U0jk

and (2.23)

β1jk = δ100 + V10k + U1jk
In turn, these terms can be substituted into the level-1 model to provide the
full three-level MLM:

yijk = δ000 + V00k + U0jk + (δ100 + V10k + U1jk)xijk + εijk. (2.24)
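As a preview of Chapter 4, a model with this structure can be fit in lme4 using the nesting operator / in the random-effects term; (x | school/classroom) expands to random intercepts and slopes for x at both the school and classroom-within-school levels. The data below are simulated purely to make the sketch self-contained; with so few clusters and no true slope variation, the fit may warn of a singular or non-converged solution.

```r
library(lme4)

set.seed(7)
d <- expand.grid(student = 1:10, classroom = 1:4, school = 1:10)
school_eff <- rnorm(10, sd = 0.7)
d$x <- rnorm(nrow(d))
d$y <- 1 + 0.5 * d$x + school_eff[d$school] + rnorm(nrow(d))
d$school    <- factor(d$school)
d$classroom <- factor(d$classroom)

# Random intercepts and slopes for x at both levels of nesting;
# (x | school/classroom) expands to (x | school) + (x | school:classroom)
fit3 <- lmer(y ~ x + (x | school/classroom), data = d)
```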
Summary
The goal of this chapter was to introduce the basic theoretical underpin-
nings of multilevel modeling, but not to provide an exhaustive technical
discussion of these issues, as there are a number of useful sources avail-
able in this regard, which you will find among the references at the end of
the text. However, what is given here should stand you in good stead as we
move forward with multilevel modeling using R software. We recommend
that while reading subsequent chapters you make liberal use of the informa-
tion provided here, in order to gain a more complete understanding of the
output that we will be examining from R. In particular, when interpreting
output from R, it may be very helpful for you to come back to this chapter
for reminders on precisely what each model parameter means. In the next
two chapters we will take the theoretical information from Chapter 2 and
apply it to real datasets using two different R libraries, nlme and lme4, both
of which have been developed to conduct multilevel analyses with continu-
ous outcome variables. In Chapter 5, we will examine how these ideas can
be applied to longitudinal data, and in Chapters 7 and 8, we will discuss
multilevel modeling for categorical dependent variables. In Chapter 9, we
will diverge from the likelihood-based approaches described here, and dis-
cuss multilevel modeling within the Bayesian framework, focusing on appli-
cation, and learning when this method might be appropriate and when it
might not.
3
Fitting Two-Level Models in R
summary(Model3.0)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ 1 + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-2.3229 -0.6378 -0.2138 0.2850 3.8812
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.3915 0.6257
Residual 5.0450 2.2461
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.30675 0.05498 78.34
ρ̂I = 0.392 / (0.392 + 5.045) = 0.072.
We interpret this value to mean that the correlation of reading test scores
among students within the same schools is approximately 0.072.
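Rather than transcribing the variance components from the printed output, ρ̂I can also be computed programmatically with VarCorr(). The sketch below uses simulated data and object names as stand-ins for the Achieve dataset and Model3.0 of the text.

```r
library(lme4)

# Simulated stand-in for the Achieve data: 50 schools of 30 students each
set.seed(123)
school <- factor(rep(1:50, each = 30))
y <- 4.3 + rnorm(50, sd = 0.6)[school] + rnorm(1500, sd = 2.2)
m0 <- lmer(y ~ 1 + (1 | school))

vc <- as.data.frame(VarCorr(m0))
tau2   <- vc$vcov[vc$grp == "school"]    # between-school variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-school (residual) variance
icc <- tau2 / (tau2 + sigma2)
icc
```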
In order to fit a model with vocabulary test score as the independent vari-
able using lmer, we submit the following syntax in R.
In the first part of the function call we define the formula for the model
fixed effects, which is very similar to model definition of linear regression
using lm(). The statement geread~gevocab essentially says that the read-
ing score is predicted with the vocabulary score fixed effect. The call in
parentheses defines the random effects and the nesting structure. If only a
random intercept is desired, the syntax for the intercept is “1.” In this exam-
ple, (1|school) indicates that only a random intercepts model will be used
and that the random intercept varies within school. This corresponds to the
data structure of students nested within schools. Fitting this model, which is
saved in the output object Model3.1, we obtain the following output.
summary(Model3.1)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.0823 -0.5735 -0.2103 0.3207 4.4334
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09978 0.3159
Residual 3.76647 1.9407
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.023356 0.049309 41.03
gevocab 0.512898 0.008373 61.26
R2² = 1 − (σ²M1 / B + τ²M1) / (σ²M0 / B + τ²M0)
where B is the average size of the level-2 units, or schools in this case. R
provides us with the number of individuals in the sample, 10,320, and the
number of schools, 160, so that we can calculate B as 10,320/160 = 64.5. We
can now estimate
R2² = 1 − (3.76647/64.5 + .09978) / (5.045/64.5 + .3915)
    = 1 − .1582/.4697 = 1 − .3368 = .663
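The level-2 R² computation is easy to wrap in a small base-R helper, shown here with the variance estimates printed for the null model (Model3.0) and the gevocab model (Model3.1); carried through at full precision, the expression evaluates to approximately 0.66.

```r
# Snijders & Bosker style level-2 R^2: proportional reduction in
# level-2 variance when moving from the null model (M0) to model M1
r2_level2 <- function(sigma2_1, tau2_1, sigma2_0, tau2_0, B) {
  1 - (sigma2_1 / B + tau2_1) / (sigma2_0 / B + tau2_0)
}

r2_level2(sigma2_1 = 3.76647, tau2_1 = 0.09978,
          sigma2_0 = 5.045,   tau2_0 = 0.3915, B = 64.5)
```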
The model in the previous example was quite simple, only incorporating one
level-1 predictor. In many applications, researchers will have predictor vari-
ables at both level 1 (student) and level 2 (school). Incorporation of predic-
tors at higher levels of analysis is very straightforward in R, and is done in
Scaled residuals:
Min 1Q Median 3Q Max
-3.0834 -0.5729 -0.2103 0.3212 4.4336
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1003 0.3168
Residual 3.7665 1.9408
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.0748819 0.1140074 18.20
gevocab 0.5128708 0.0083734 61.25
senroll -0.0001026 0.0002051 -0.50
Note that in this particular function call, senroll is included only in the
fixed part of the model and not in the random part. This variable thus only
has a fixed (average) effect and is the same across all schools. We will see
shortly how to incorporate a random coefficient in this model.
From these results we can see that, in fact, enrollment did not have a statis-
tically significant relationship with reading achievement (t = −0.50). In addi-
tion, notice that there were some minor changes in the estimates of the other
model parameters, but a fairly large change in the correlation between the
fixed effect of the gevocab slope and the fixed effect of the intercept, from
−0.758 to −0.327. The slope for senroll and the intercept were very strongly
negatively correlated, and the slopes of the fixed effects exhibited virtually
no correlation (−0.002). As noted before, these correlations are typically not
R2² = 1 − (3.7665/64.5 + .1003) / (5.045/64.5 + .3915)
    = 1 − .1587/.4697 = 1 − .3378 = .662
These values are nearly identical to those obtained for the model without
senroll. This is because there was very little change in the variances when
senroll was included in the model, thus the amount of variance explained
at each level was largely unchanged.
summary(Model3.3)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + age + gevocab * age + (1 | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.0635 -0.5706 -0.2108 0.3191 4.4467
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09875 0.3143
Residual 3.76247 1.9397
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 5.187208 0.866786 5.984
gevocab -0.028077 0.188145 -0.149
age -0.029368 0.008035 -3.655
gevocab:age 0.005027 0.001750 2.873
Looking at the output from Model3.3, both age (t = −3.65) and the interaction
between age and vocabulary (gevocab:age, t = 2.87) are statistically
significant predictors of reading. Focusing on the interaction, the sign on
the coefficient is positive indicating an enhancing effect: as age increases,
the relationship between reading and vocabulary becomes stronger.
Interestingly, when both age and the interaction are included in the model,
the relationship between vocabulary score and reading performance is no
longer statistically significant.
Next, let’s examine a model that includes a cross-level interaction.
summary(Model3.4)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + senroll + gevocab * senroll + (1 |
school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.1228 -0.5697 -0.2090 0.3188 4.4359
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1002 0.3165
Residual 3.7646 1.9403
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.748e+00 1.727e-01 10.118
gevocab 5.851e-01 2.986e-02 19.592
senroll 5.121e-04 3.186e-04 1.607
gevocab:senroll -1.356e-04 5.379e-05 -2.520
summary(Model3.5)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + (gevocab | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.7102 -0.5674 -0.2074 0.3176 4.6775
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 0.1025 0.3202
gevocab 0.0193 0.1389 0.52
Residual 3.6659 1.9147
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.34411 0.03271 132.8
gevocab 0.52036 0.01442 36.1
Scaled residuals:
Min 1Q Median 3Q Max
-3.6735 -0.5682 -0.2091 0.3184 4.6840
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 1.022e-01 0.319700
gevocab 1.902e-02 0.137918 0.53
age 2.509e-05 0.005009 -0.28 -0.96
Residual 3.664e+00 1.914181
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.343872 0.032686 132.895
gevocab 0.519277 0.014350 36.187
age -0.008882 0.003822 -2.324
summary(Model3.7)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + age + (gevocab | school) + (age |
school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.6937 -0.5681 -0.2081 0.3182 4.6744
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.914e-02 0.1707120
gevocab 1.919e-02 0.1385422 1.00
school.1 (Intercept) 7.272e-02 0.2696677
age 7.522e-07 0.0008673 1.00
Residual 3.665e+00 1.9143975
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.343789 0.032644 133.065
summary(Model3.6a)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + age + (1 | school) + (-1 + gevocab
| school) +
(-1 + age | school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.6084 -0.5664 -0.2058 0.3087 4.5437
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09547 0.3090
school.1 gevocab 0.01833 0.1354
school.2 age 0.00000 0.0000
Residual 3.66719 1.9150
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.351259 0.032014 135.917
gevocab 0.521053 0.014133 36.869
age -0.008652 0.003807 -2.272
Centering Predictors
As per the discussion in Chapter 2, it may be advantageous to center predic-
tors, especially when interactions are incorporated. Centering predictors can
provide easier interpretation of interaction terms as well as help alleviating
issues of multicollinearity arising from inclusion of both main effects and
interactions in the same model. Recall that centering of a variable entails the
subtraction of a mean value from each score on the variable. Centering of
predictors can be accomplished through R by the creation of new variables.
For example, returning to Model3.3, grand mean centered gevocab and age
variables can be created with the following syntax:
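The centering syntax itself is missing from this excerpt; based on the variable names Cgevocab and Cage that appear in the subsequent output, it presumably resembled the lines below. A toy stand-in for the Achieve data frame keeps the sketch self-contained.

```r
# Hypothetical stand-in for the Achieve data (only the two relevant columns)
Achieve <- data.frame(gevocab = c(3.1, 4.2, 5.0, 4.4),
                      age     = c(98, 107, 101, 110))

# Grand-mean centering: subtract each variable's mean from every score
Achieve$Cgevocab <- Achieve$gevocab - mean(Achieve$gevocab)
Achieve$Cage     <- Achieve$age - mean(Achieve$age)
```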
Once mean centered versions of the predictors have been created, they can
be incorporated into the model in the same manner as before.
summary(Model3.3C)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ Cgevocab + Cage + Cgevocab * Cage + (1 |
school)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.0635 -0.5706 -0.2108 0.3191 4.4467
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09875 0.3143
Residual 3.76247 1.9397
Number of obs: 10320, groups: school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.332327 0.032062 135.124
Cgevocab 0.512480 0.008380 61.159
Cage -0.006777 0.003917 -1.730
Cgevocab:Cage 0.005027 0.001750 2.873
Focusing on the fixed effects of the model, there are some changes in their
values. These differences are likely due to the presence of multicollinearity
issues in the original uncentered model. The interaction is still significant
(t = 2.87); however, there is now a significant effect of vocabulary (t = 61.15),
and age is no longer a significant predictor (t = −1.73). Focusing on the
interaction, recall that when predictors are centered, each main effect can be
interpreted as the effect of that variable when the other variable is held at
its mean. Since the sign on the interaction is positive, the positive effect
of vocabulary on reading ability becomes stronger as age increases.
Additional Options
Parameter Estimation Method
By default, lme4 uses restricted maximum likelihood estimation (REML).
However, it also allows for the use of maximum likelihood estimation (ML)
instead. Model3.8 demonstrates syntax for fitting a multilevel model using
ML. In order to change the estimation method, the call is REML = FALSE.
Estimation Controls
Sometimes a correctly specified model will not reach a solution (converge)
using the default settings for model convergence. Many times, this problem
can be fixed by changing the default estimation controls using the control
option. Quite often, convergence issues can be fixed by changing the model
iteration limit (maxIter), or by changing the model optimizer (opt). In
order to specify which controls will be changed, R must be given a list of
controls and their new values. For example, control=list(maxIter=100,
opt=”optim”) would change the maximum number of iterations to 100
and the optimizer to “optim.” These control options are placed in the R code
in the same manner as choice of estimation method (separated from the rest
of the syntax with a comma). A comprehensive list of estimation controls can
be found on the R help ?lme4 pages.
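The maxIter and opt control names mentioned above come from older versions of the package; in current lme4 releases the analogous settings are passed through lmerControl(), whose optimizer and optCtrl arguments are documented under ?lmerControl. A hedged sketch with simulated data:

```r
library(lme4)

set.seed(99)
d <- data.frame(g = factor(rep(1:30, each = 20)), x = rnorm(600))
d$y <- 1 + 0.5 * d$x + rnorm(30, sd = 0.4)[d$g] + rnorm(600)

# Switch to the bobyqa optimizer and raise its function-evaluation limit
fit <- lmer(y ~ x + (1 | g), data = d,
            control = lmerControl(optimizer = "bobyqa",
                                  optCtrl = list(maxfun = 1e5)))
```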
anova(Model3.0,Model3.1)
refitting model(s) with ML (instead of REML)
Data: Achieve
Models:
Model3.0: geread ~ 1 + (1 | school)
Model3.1: geread ~ gevocab + (1 | school)
          Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
Model3.0   3 46270 46292 -23132    46264
Model3.1   4 43132 43161 -21562    43124 3139.9      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Referring to the AIC and BIC statistics, recall that smaller values reflect bet-
ter model fit. For Model3.1, the AIC and BIC are 43132 and 43161, respectively,
whereas for Model3.0 the AIC and BIC were 46270 and 46292. Given that the
values for both statistics are smaller for Model3.1, we would conclude that
it provides a better fit to the data. Since these models are nested, we may
also look at the chi-square test for deviance. This test yielded a statistically
significant chi-square (χ2 = 3139.9, p < .001) indicating that Model3.1 provided
a significantly better fit to the data than does Model3.0. Substantively, this
means that we should include the predictor variable gevocab, a conclusion
that the results of the hypothesis test also support.
θ ± zα SEB (3.1)
where
θ = parameter estimate as calculated from original data
zα = critical value from the standard normal distribution
SEB = bootstrap standard error; i.e. standard deviation of the bootstrap
sample
The normal bootstrap is very similar to the standard error approach, the
difference being that the B samples are generated from a normal distribu-
tion with marginal statistics (i.e. means, variances) equal to those of the raw
data, rather than through resampling. In other respects, it is equivalent to
the standard error method. Two other approaches for calculating confidence
intervals are available using confint. The first of these applies only to the
fixed effects, and is based on the assumption of normally distributed errors.
The confidence interval is then calculated as

bx1 ± zα SEB (3.2)

where
bx1 = coefficient linking X1 to Y
zα = critical value from the standard normal distribution
SEB = MLE standard error
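For the gevocab slope reported earlier for Model3.1 (estimate 0.512898, standard error 0.008373), this normal-theory interval can be computed by hand:

```r
b  <- 0.512898  # gevocab slope from the Model3.1 output
se <- 0.008373  # its standard error

# 95% Wald interval: estimate plus/minus 1.96 standard errors
b + c(-1, 1) * qnorm(0.975) * se  # roughly (0.496, 0.529)
```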
The final confidence interval method that we can use is called the profile
confidence interval, and is not based on any assumptions about the distribu-
tion of the errors. Instead, we select as the bounds of the confidence interval
two points on either side of the MLE estimate with likelihood values
corresponding to the desired level of confidence.
2.5 % 97.5 %
.sig01 0.2664702 0.3812658
.sig02 0.2798048 0.7387527
.sig03 0.1105445 0.1644085
.sigma 1.8879782 1.9408446
(Intercept) 4.2844713 4.4028048
Cgevocab 0.4913241 0.5511338
confint(Model3.5, method=c("boot"), boot.type=c("basic"))
Computing bootstrap confidence intervals ...
2.5 % 97.5 %
.sig01 0.2638048 0.3767120
.sig02 0.3131177 0.7405494
.sig03 0.1148660 0.1662661
.sigma 1.8900770 1.9412491
(Intercept) 4.2795679 4.4078229
Cgevocab 0.4909470 0.5484041
confint(Model3.5, method=c("boot"), boot.type=c("norm"))
Computing bootstrap confidence intervals ...
2.5 % 97.5 %
.sig01 0.2633918 0.3799952
.sig02 0.3302355 0.7388453
.sig03 0.1140536 0.1655553
.sigma 1.8871793 1.9429654
(Intercept) 4.2830980 4.4057360
Cgevocab 0.4935254 0.5488512
confint(Model3.5, method=c("Wald"))
2.5 % 97.5 %
.sig01 NA NA
.sig02 NA NA
.sig03 NA NA
.sigma NA NA
(Intercept) 4.2799887 4.4082236
Cgevocab 0.4921036 0.5486107
confint(Model3.5, method=c("profile"))
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 0.2623302 0.3817036
.sig02 0.2926177 0.7139976
.sig03 0.1141790 0.1656869
.sigma 1.8884541 1.9414934
(Intercept) 4.2790205 4.4083316
Cgevocab 0.4920097 0.5490748
The last section of the output contains the confidence intervals for the fixed
and random effects of Model3.5. The top three rows correspond to the ran-
dom intercept (.sig01), the correlation between the random intercept and
random slope (.sig02), and the random slope (.sig03). According to these
confidence intervals, since 0 does not appear in the interval, there is statis-
tically significant variation in the intercept across schools CI95[.262, .382],
the slope for gevocab across schools CI95[.114, .166], and there is a sig-
nificant relationship between the random slope and the random intercept
CI95[.293, .713].
The bottom two rows correspond to the fixed effects. According to the con-
fidence intervals, the intercept (γ00) is significant CI95[4.279, 4.408] and the
slope for gevocab is significant CI95[.492, .549], each corresponding with the
t values presented in the output previously displayed in the chapter.
Summary
In Chapter 3, we put the concepts that we learned in Chapter 2 to work
using R. We learned the basics of fitting two-level models when the depen-
dent variable is continuous, using the lme4 package. Within this multilevel
framework, we learned how to fit the null model, as well as the random inter-
cept and random slopes models. We also included independent variables at
both levels of the data, and finally, learned how to compare the fit of models
with one another. This last point will prove particularly useful as we engage
in the process of selecting the most parsimonious model (i.e. the simplest)
that also adequately explains the dependent variable. Of greatest import in
this chapter, however, is understanding how to fit multilevel models using
lme4 in R, and correctly interpreting the resultant output. If you have mas-
tered those skills, then you are ready to move to Chapter 4, where we extend
the model to include a third level in the hierarchy. As we will see, the actual
fitting of these three-level models is very similar to that for the two-level
models we have studied here.
Note
1. In these models, gevocab and age were grand mean centered to fix conver-
gence issues.
4
Three-Level and Higher Models
In this chapter we will continue working with the Achieve data that were
described in Chapter 3. In our Chapter 3 examples, we included two levels of
data structure, students within schools, along with associated predictors of
reading achievement at each level. We will now add a third level of structure,
classroom, which is nested within schools. In this context, then, students are
nested within classrooms, which are in turn nested within schools.
The syntax for fitting three-level models closely parallels what we
have already seen in Chapter 3. Let's begin by defining a null model for
prediction of student reading achievement, where regressors might include
student-level characteristics, classroom-level characteristics, and school-level
characteristics. The syntax to fit a three-level null model appears below, with
the results being stored in the object Model4.0.
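The fitting call itself is not reproduced in this excerpt; reconstructed from the formula line of the summary output that follows, and assuming the lme4 package is loaded and the Achieve data frame is in the workspace, it would be:

```r
# Three-level null model: random intercepts for classrooms nested
# within schools, with no predictors
library(lme4)
Model4.0 <- lmer(geread ~ 1 + (1 | school/class), data = Achieve)
```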
As can be seen, the syntax for fitting a random intercepts model with
three levels is very similar to that for the random intercepts model with two
levels. In order to define a model with more than two levels, we need to
include the variables denoting the higher levels of the nesting structures,
here school (school-level influence) and class (classroom-level influence),
and designate the nesting structure of the levels (students within classrooms
within schools). Nested structure in lmer is defined as A/B, where A is the
higher-level data unit (e.g. school) and B is the lower-level data unit (e.g.
classroom). The intercept (1) is denoted as a random effect by its inclusion in
the parentheses.
summary(Model4.0)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ 1 + (1 | school/class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-2.3052 -0.6290 -0.2094 0.3049 3.8673
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.2727 0.5222
school (Intercept) 0.3118 0.5584
Residual 4.8470 2.2016
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.30806 0.05499 78.34
confint(Model4.0, method=c("profile"))1
2.5 % 97.5 %
.sig01 0.4516596 0.5958275
.sig02 0.4658891 0.6561674
.sigma 2.1710519 2.2328459
(Intercept) 4.1997929 4.4160128
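Model4.1 adds one fixed predictor at each level of the data: student vocabulary (gevocab), classroom enrollment (clenroll), and school enrollment (cenroll). Its fitting call is not shown in this excerpt; reconstructed from the formula line of the summary output that follows, it would be:

```r
# Three-level model with one predictor at each level of the data
Model4.1 <- lmer(geread ~ gevocab + clenroll + cenroll + (1 | school/class),
                 data = Achieve)
```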
summary(Model4.1)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + clenroll + cenroll + (1 | school/
class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.2212 -0.5673 -0.2079 0.3184 4.4736
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.09047 0.3008
school (Intercept) 0.07652 0.2766
Residual 3.69790 1.9230
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.675e+00 2.081e-01 8.050
gevocab 5.076e-01 8.427e-03 60.233
clenroll 1.899e-02 9.559e-03 1.986
cenroll -3.721e-06 3.642e-06 -1.022
We can see from the output for Model4.1 that the student’s vocabulary
score (t = 60.23) and size of the classroom (t = 1.99) are statistically significantly
positive predictors of student reading achievement score, but the size of the
school (t = −1.02) does not significantly predict reading achievement. As a
side note, the significant positive relationship between size of the classroom
and reading achievement might seem to be a bit confusing, suggesting that
students in larger classrooms had higher reading achievement test scores.
However, in this particular case, larger classrooms very frequently included
multiple teacher’s aides, so that the actual adult to student ratio might be
lower than in classrooms with fewer students. In addition, estimates for the
random intercepts of classroom nested in school and school have decreased
in value from those of the null model, suggesting that when we account for
the three fixed effects, some of the mean differences between schools and
between classrooms are accounted for.
Confidence intervals for the model effects can be obtained with the con-
fint() function.
confint(Model4.1, method=c("profile"))1
2.5 % 97.5 %
.sig01 2.312257e-01 3.660844e-01
.sig02 2.055494e-01 3.398754e-01
.sigma 1.896242e+00 1.950182e+00
Comparing the variance components of Model4.1 with those of the null
model, we see that inclusion of the classroom and school enrollment
variables, along with student vocabulary score, results in a model that
explains approximately 26% of the variance in the reading score, above and
beyond the null model.
We can also ascertain whether including the predictor variables resulted
in a better fitting model, as compared to the null model. As we saw in
Chapter 3, we can compare models by examining the AIC and BIC values for
each, where lower values indicate better fit. Using the anova() command,
we can compare Model4.0 (the null model) with Model4.1.
anova(Model4.0, Model4.1)
Data: Achieve
Models:
Model4.0: geread ~ 1 + (1 | school/class)
Model4.1: geread ~ gevocab + clenroll + cenroll + (1 | school/
class)
         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
Model4.0  4 46150 46179 -23071    46142
Model4.1  7 43101 43152 -21544    43087 3054.6      3  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For the original null model, AIC and BIC were 46,150 and 46,179, respec-
tively, which are both larger than the AIC and BIC for Model4.1. Therefore,
we would conclude that this latter model including a single predictor vari-
able at each level provides better fit to the data, and thus is preferable to the
null model with no predictors. We could also look at the chi-square deviance
test since these are nested models. The chi-square test is statistically signifi-
cant, indicating Model4.1 is a better fit to the data.
Using lmer, it is very easy to include both single-level and cross-level
interactions in the model once the higher-level structure is understood. For
example, we may have a hypothesis stating that the impact of vocabulary
score on reading achievement varies depending upon the size of the school
that a student attends. In order to test this hypothesis, we will need to include
the interaction between vocabulary score and size of the school, as is done
in Model4.2 below.
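The call for Model4.2 falls at a page break in the original; reconstructed from the formula line of the summary output that follows, it would be:

```r
# Add the cross-level interaction between student vocabulary and
# school enrollment to the three-level model
Model4.2 <- lmer(geread ~ gevocab + clenroll + cenroll + gevocab*cenroll +
                 (1 | school/class), data = Achieve)
```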
summary(Model4.2)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + clenroll + cenroll + gevocab *
cenroll + (1 |
school/class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.1902 -0.5683 -0.2061 0.3183 4.4724
Random effects:
Groups Name Variance Std.Dev.
class:school (Intercept) 0.08856 0.2976
school (Intercept) 0.07513 0.2741
Residual 3.69816 1.9231
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.752e+00 2.100e-01 8.341
gevocab 4.900e-01 1.168e-02 41.940
clenroll 1.880e-02 9.512e-03 1.977
cenroll -1.316e-05 5.628e-06 -2.337
gevocab:cenroll 2.340e-06 1.069e-06 2.190
In this example we can see that, other than the inclusion of the higher-level
nesting structure in the random effects line, defining a cross-level interaction
in a model with more than two levels is no different than was the case for the
two-level models that we worked with in Chapter 3. In terms of hypothesis
testing results, we find that student vocabulary (t = 41.94) and classroom size
(t = 1.98) remain statistically significant positive predictors of reading ability.
In addition, both the cross-level interaction between vocabulary and school
size (t = 2.19) and the impact of school size alone (t = −2.34) are also statisti-
cally significant predictors of the reading score. The statistically significant
interaction term indicates that the impact of student vocabulary score on
reading achievement is dependent to some degree on the size of the school,
so that the main effects for school and vocabulary cannot be interpreted
in isolation, but must be considered in light of one another. The interested
reader is referred to Aiken and West (1991) for more detail regarding the
interpretation of interactions in regression.
We can compare the fit of Model4.2 with that of Model4.1 using the anova() function:
anova(Model4.1, Model4.2)
refitting model(s) with ML (instead of REML)
Data: Achieve
Models:
Model4.1: geread ~ gevocab + clenroll + cenroll + (1 | school/
class)
Model4.2: geread ~ gevocab + clenroll + cenroll + gevocab *
cenroll + (1 |
Model4.2: school/class)
         Df   AIC   BIC logLik deviance Chisq Chi Df Pr(>Chisq)
Model4.1  7 43101 43152 -21544    43087
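The output that follows belongs to Model4.3, a four-level null model that adds school corporation (corp) as the highest level of nesting. The fitting call falls at a page break in the original, but the random-effects structure shown in the output implies a call along these lines:

```r
# Four-level null model: classrooms nested within schools nested
# within school corporations
Model4.3 <- lmer(geread ~ 1 + (1 | corp/school/class), data = Achieve)
```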
Scaled residuals:
Min 1Q Median 3Q Max
-2.2995 -0.6305 -0.2131 0.3029 3.9448
Random effects:
Groups Name Variance Std.Dev.
class:(school:corp) (Intercept) 0.27539 0.5248
school:corp (Intercept) 0.08748 0.2958
corp (Intercept) 0.17726 0.4210
Residual 4.84699 2.2016
Number of obs: 10320, groups: class:(school:corp), 568;
school:corp, 160; corp, 59
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.32583 0.07198 60.1
confint(Model4.3, method=c("profile"))
                2.5 %    97.5 %
.sig01 0.4544043 0.5983855
.sig02 0.1675851 0.4083221
.sig03 0.3147483 0.5430904
.sigma 2.1710534 2.2328418
(Intercept) 4.1838496 4.4679558
In order to ensure that the dataset is being read by R as one thinks it should
be, we can first find the summary of the sample sizes for the different data
levels which occurs toward the bottom of the printout. There were 10,320
students (groups) nested within 568 classrooms (within schools within cor-
porations) nested within 160 schools (within corporations) nested within 59
school corporations. This matches what we know about the data; therefore,
we can proceed with the interpretation of the results. Given that this is a null
model with no fixed predictors, our primary focus is on the intercept esti-
mates for the random effects and their associated confidence intervals. We
can see from the confidence interval results that each level of
the data yielded intercepts that were statistically significantly different from
0 (given that 0 does not appear in any of the confidence intervals), indicating
that mean reading achievement scores differed among the classrooms, the
schools, and the school corporations.
We can also fit a model where the gender coefficient is allowed to vary across both random
effects in our three-level model. Below is the lmer command sequence for
fitting this model.
Model4.4 <- lmer(geread~gevocab+gender +(gender|school/class),
data=Achieve)
summary(Model4.4)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + gender + (gender | school/class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.2043 -0.5680 -0.2069 0.3171 4.4455
Random effects:
Groups Name Variance Std.Dev. Corr
class:school (Intercept) 0.148931 0.38592
gender 0.019807 0.14074 -0.62
school (Intercept) 0.033166 0.18212
gender 0.006805 0.08249 0.58
Residual 3.692436 1.92157
Number of obs: 10320, groups: class:school, 568; school, 160
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.015564 0.075629 26.651
gevocab 0.509097 0.008408 60.550
gender 0.017237 0.039243 0.439
Interpreting these results, we first see that there is not a statistically signifi-
cant relationship between the fixed gender effect and reading achievement
(t = .439). In other words, across classrooms and schools the difference in
mean reading achievement for males and females is not shown to be statisti-
cally significant, when accounting for vocabulary scores. The estimate for
the gender random coefficient term at the school level is approximately 0.006,
and is approximately 0.02 at the classroom nested in school level. Thus, it
appears that the relationship between gender and reading achievement var-
ies more across classrooms than it does across schools, at least descriptively.
As noted above, in Model4.4, the coefficients for gender were allowed to
vary randomly across both classes and schools. However, there may be some
situations in which we want a coefficient to vary across only certain levels of
the data; in Model4.5, for example, the gender slope is permitted to vary
across classrooms but not across schools.
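The call that produced Model4.5 falls at a page break in the original; reconstructed from the formula line of the summary output that follows, it would be:

```r
# Random intercept only for school; random intercept and gender slope
# for classroom
Model4.5 <- lmer(geread ~ gevocab + gender + (1 | school) + (gender | class),
                 data = Achieve)
```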
summary(Model4.5)
Linear mixed model fit by REML ['lmerMod']
Formula: geread ~ gevocab + gender + (1 | school) + (gender |
class)
Data: Achieve
Scaled residuals:
Min 1Q Median 3Q Max
-3.0875 -0.5734 -0.2105 0.3229 4.4559
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 9.891e-02 0.314498
class (Intercept) 2.566e-03 0.050653
gender 2.307e-06 0.001519 1.00
Residual 3.765e+00 1.940395
Number of obs: 10320, groups: school, 160; class, 8
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.999576 0.080559 24.821
gevocab 0.513165 0.008379 61.244
gender 0.019808 0.038415 0.516
The flexibility of the lmer() definition for random effects allows for many
different random effects situations to be modeled. For example, see Models4.6
and 4.7, below. Model4.6 allows gender to vary only across schools and the
intercept to vary only across classrooms. Model4.7 depicts four nested levels
where the intercept is allowed to vary across corporation, school, and class-
room, but the gender slope is only allowed to vary across classroom.
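The calls for these two models are not included in this excerpt; sketches consistent with the descriptions above, assuming uniquely coded school and class identifiers in the Achieve data frame, might read:

```r
# Model4.6: gender slope varies only across schools; the intercept
# varies only across classrooms
Model4.6 <- lmer(geread ~ gevocab + gender + (gender | school) + (1 | class),
                 data = Achieve)

# Model4.7: intercepts vary across corporation, school, and classroom,
# but the gender slope is allowed to vary only across classrooms
Model4.7 <- lmer(geread ~ gevocab + gender + (1 | corp) + (1 | school) +
                 (gender | class), data = Achieve)
```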
Summary
Chapter 4 is very much an extension of Chapter 3, demonstrating the use
of R to extend two-level models to data structured at three or more
levels. In practice, such complex multilevel data are relatively rare. However,
as we saw in this chapter, faced with this type of data, we can use lmer to
model it appropriately. Indeed, the basic framework that we employed in the
two-level case works equally well for the more complex data featured in this
chapter. If you have read the first four chapters, you should now feel fairly
comfortable analyzing most common multilevel models with a continuous
outcome variable. We will next turn our attention to the application of mul-
tilevel models to longitudinal data. Of key importance as we change direc-
tions a bit is that the core ideas that we have already learned, including the
fitting of the null, random intercept, and random coefficients models, as well
as inclusion of predictors at different levels of the data, do not change when
we have longitudinal data. As we will see, application of multilevel models
in this context is no different from that in Chapters 3 and 4. What is different
is the way in which we define the levels of the data. Heretofore, level 1 has
generally been associated with individual people. With longitudinal data,
however, level 1 will refer to a single measurement in time, while level 2 will
refer to the individual subject. By recasting longitudinal data in this man-
ner, we make available to ourselves the flexibility and power of multilevel
models.
Note
1. This is the final section of code using the confint() function to bootstrap confidence
intervals. For full R code and output, see Chapter 3.
5
Longitudinal Data Analysis
Using Multilevel Models
Level 2:
π0i = β00 + β01(Zi) + r0i
π1i = β10 + r1i
π2i = β20 + r2i
where: Yit is the outcome variable for individual i at time t; the π terms are the
level-1 regression coefficients; the β terms are the level-2 regression coefficients;
εit is the level-1 error; the r terms are the level-2 random effects; Tit is a dedicated
time predictor variable; Xit is a time-varying predictor variable; and Zi is a time-
invariant predictor. Thus, as can be seen in Equation (5.1), although there is new
notation to define specific longitudinal elements, the basic framework for the
multilevel model is essentially the same as we saw for the two-level model in
Chapter 3. The primary difference is that now we have three different types
of predictors: a time predictor, time-varying predictors, and time-invariant
predictors. Given their unique role in longitudinal modeling, it is worth
spending just a bit of time defining each of these predictor types.
Of the three types of predictors that are possible in longitudinal models,
a dedicated time variable is the only one that is necessary in order to make
the multilevel model longitudinal. This time predictor, which is literally an
index of the time point at which a particular measurement was made, can
be very flexible with time measured in fixed intervals or in waves. If time
is measured in waves, they can be waves of varying length from person to
person, or they can be measured on a continuum. It is important to note
that when working with time as a variable it is often worthwhile to rescale
the time variable so that the first measurement occasion is the zero point,
thereby giving the intercept the interpretation of baseline, or initial status on
the dependent variable.
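For instance, if the waves were recorded as 1 through 6, subtracting the minimum shifts the zero point to the first occasion (a minimal sketch; the variable names are illustrative):

```r
# Rescale time so the first measurement occasion is the zero point,
# giving the model intercept the interpretation of baseline status
time_raw <- c(1, 2, 3, 4, 5, 6)    # waves as originally recorded
time0 <- time_raw - min(time_raw)  # now coded 0 through 5
```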
The other two types of predictors (time varying and time invariant) differ
in terms of how they are measured. A predictor is time varying when it is
measured at multiple points in time, just as is the outcome variable. In the
context of education, a time-varying predictor might be the number of hours
in the previous 30 days a student has spent studying. This value could be
recorded concurrently with the student taking the achievement test serving
as the outcome variable. On the other hand, a predictor is time invariant
when it is measured at only one point in time, and its value does not change
across measurement occasions. An example of this type of predictor would
be gender, which might be recorded at the baseline measurement occasion,
and which is unlikely to change over the course of the data collection period.
In the context of applying multilevel models to longitudinal data problems,
time-varying predictors will appear at level 1 because they are associated
with specific measurements, whereas time-invariant predictors will appear
at level 2 or higher, because they are associated with the individual (or a
higher data level) across all measurement conditions.
This code takes all of the time-invariant variables directly from the raw
person-level data, while also consolidating the repeated measurements into
a single variable called values, and also creates a variable measuring time
called ind. At this point, we may wish to do some recoding and renaming of
variables. Renaming of variables can be accomplished via the names() function,
and recoding can be done via recode(var, recodes, as.factor.result,
as.numeric.result=TRUE, levels) from the car package.
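A toy version of this wide-to-long conversion, using base R's stack() (which produces the values and ind columns mentioned above) and names() for renaming, might look like this; the data frame and column names here are purely illustrative:

```r
# Illustrative wide-format data: one row per person, one column per occasion
wide <- data.frame(ID = 1:3,
                   Score1 = c(10, 12, 11),
                   Score2 = c(13, 15, 14))

# stack() consolidates the repeated measures into a column called 'values'
# and creates an occasion indicator called 'ind'
long <- cbind(ID = rep(wide$ID, times = 2),
              stack(wide, select = c("Score1", "Score2")))

# Rename the columns via names()
names(long) <- c("ID", "Language", "Time")
```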
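The call that produced the null growth model, Model5.0, is not reproduced in this excerpt; reconstructed from the formula line of the summary output below, it would be:

```r
# Null growth model: Time as the only predictor, with a random
# intercept for each person
Model5.0 <- lmer(Language ~ Time + (1 | ID), data = LangPP)
```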
summary(Model5.0)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + (1 | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
-6.4040 -0.4986 0.0388 0.5669 4.8636
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 232.14 15.236
Residual 56.65 7.526
Number of obs: 18228, groups: ID, 3038
Fixed effects:
Estimate Std. Error t value
(Intercept) 197.21573 0.29356 671.80
Time 3.24619 0.03264 99.45
confint(Model5.0, method=c("profile"))1
2.5 % 97.5 %
.sig01 14.842887 15.640551
.sigma 7.442335 7.611610
(Intercept) 196.640288 197.791167
Time 3.182207 3.310165
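Model5.1 adds the grammar test score as a predictor; its fitting call, reconstructed from the formula line of the summary output that follows, would be:

```r
# Growth model with Grammar added as a predictor of language scores
Model5.1 <- lmer(Language ~ Time + Grammar + (1 | ID), data = LangPP)
```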
summary(Model5.1)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + Grammar + (1 | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
-6.6286 -0.5260 0.0374 0.5788 4.6761
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 36.13 6.011
Residual 56.64 7.526
Number of obs: 18216, groups: ID, 3036
Fixed effects:
Estimate Std. Error t value
(Intercept) 73.703551 1.091368 67.53
Time 3.245483 0.032651 99.40
Grammar 0.630888 0.005523 114.23
From these results, we see that, again, Time is positively related to scores
on the language assessment, indicating that they increased over time. In
addition, Grammar is also statistically significantly related to language test
scores (t = 114.23), meaning that higher grammar test scores were associated
with higher language scores.
If we wanted to allow the growth rate to vary randomly across individuals,
we would use the following R command.
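The command itself falls at a page break in the original; reconstructed from the formula line of the summary output that follows, it would be:

```r
# Allow the growth rate (the Time slope) to vary randomly
# across individuals
Model5.2 <- lmer(Language ~ Time + Grammar + (Time | ID), data = LangPP)
```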
summary(Model5.2)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + Grammar + (Time | ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
-5.3163 -0.5240 -0.0012 0.5363 5.0321
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 21.577 4.645
Time 3.215 1.793 0.03
Residual 45.395 6.738
Number of obs: 18216, groups: ID, 3036
Fixed effects:
Estimate Std. Error t value
(Intercept) 54.470619 1.002595 54.33
Time 3.245483 0.043740 74.20
Grammar 0.729119 0.005083 143.46
confint(Model5.2, method=c("profile"))1
2.5 % 97.5 %
.sig01 4.36967084 4.9175027
.sig02 -0.06472334 0.1250178
.sig03 1.70949317 1.8765765
.sigma 6.65369657 6.8231795
(Intercept) 51.93115382 57.0105666
Time 3.15974007 3.3312255
Grammar 0.71620566 0.7420327
In this model, the random effect for Time is assessing the extent to which
growth over time differs from one person to the next. Results show that the
random effect for Time is statistically significant, given that the 95% confidence
interval does not include 0. Thus, we can conclude that growth rates in lan-
guage scores over the six time points do differ across individuals in the sample.
We could add a third level of data structure to this model by including
information regarding schools, within which examinees are nested. To fit
this model we use the following code in R:
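The code itself is not shown in this excerpt; reconstructed from the formula line of the summary output that follows, it would be:

```r
# Three-level growth model: measurements nested within examinees
# nested within schools
Model5.3 <- lmer(Language ~ Time + (1 | school/ID), data = LangPP)
```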
summary(Model5.3)
Linear mixed model fit by REML ['lmerMod']
Formula: Language ~ Time + (1 | school/ID)
Data: LangPP
Scaled residuals:
Min 1Q Median 3Q Max
-6.4590 -0.5026 0.0400 0.5658 4.8580
Random effects:
Groups Name Variance Std.Dev.
ID:school (Intercept) 187.18 13.681
school (Intercept) 69.11 8.313
Residual 56.65 7.526
Number of obs: 18228, groups: ID:school, 3038; school, 35
Fixed effects:
Estimate Std. Error t value
(Intercept) 197.33787 1.48044 133.30
Time 3.24619 0.03264 99.45
Using the anova() command, we can compare the fit of the three-level
and two-level versions of this model.
anova(Model5.0, Model5.3)
refitting model(s) with ML (instead of REML)
Data: LangPP
Models:
Model5.0: Language ~ Time + (1 | ID)
Model5.3: Language ~ Time + (1 | school/ID)
         Df    AIC    BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
Model5.0  4 135168 135199 -67580   135160
Model5.3  5 134648 134687 -67319   134638 521.99      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Given that the AIC for Model5.3 is lower than that for Model5.0 where
school is not included as a variable, and the chi-square test for deviance is
significant (χ2 = 521.99, p < .001), we can conclude that inclusion of the school
level of the data leads to better model fit.
When some measurement occasions are missing, traditional approaches to longitudinal
analysis require either that incomplete cases be dropped entirely,
or the researcher must work with a greatly reduced sample size.
multilevel models are able to use the available data from incomplete observa-
tions, thereby not reducing sample size as dramatically as do other approaches
for modeling longitudinal data, nor requiring special missing data methods.
Repeated measures ANOVA is traditionally one of the most common meth-
ods for analysis of change. However, when it is used with longitudinal data,
the assumptions upon which repeated measures rests may be too restrictive. In
particular, the assumption of sphericity (i.e. equal variances of outcome
variable differences) may be unreasonable given that variability can change con-
siderably as time passes. On the other hand, analyzing longitudinal data from
a multilevel modeling perspective does not require the assumption of spheric-
ity. In addition, it also provides flexibility in model definition, thus allowing
for information about the anticipated effects of time on error variability to be
included in the model design. Finally, multilevel models can easily incorporate
predictors from each of the data levels, thereby allowing for more complex data
structures. In the context of longitudinal data, this means that it is possible to
incorporate measurement occasion (level 1), individual (level 2), and cluster
(level 3) characteristics. We saw an example of this type of analysis in Model5.3.
On the other hand, in the context of repeated measures ANOVA or MANOVA,
incorporating these various levels of the data would be much more difficult.
Thus, the use of multilevel modeling in this context not only has the benefits
listed above pertaining specifically to longitudinal analysis, but it brings the
added capability of simultaneous analysis of multiple levels of influence.
Summary
In Chapter 5, we saw that the multilevel modeling tools we studied together
in Chapters 2 through 4 can be applied in the context of longitudinal data.
The key to this analysis is the treatment of each measurement in time as a
level-1 data point, and the individuals on whom the measurements are made
as level 2. Once this shift in thinking is made, the methodology remains very
much the same as what we employed in the standard multilevel models in
Chapters 3 and 4. By modeling longitudinal data in this way, we are able to
incorporate a wide range of data structures, including individuals (level 2)
nested within a higher level of data (level 3).
Note
1. This is the final section of code using the confint() function to bootstrap confidence
intervals. For full R code and output, see Chapter 3.
6
Graphing Data in Multilevel Contexts
Graphing data is an important step in the analysis process. Far too often
researchers skip the graphing of their data and move directly into analy-
sis without the insights that can come from first giving the data a careful
visual examination. It is certainly tempting for researchers to bypass data
exploration through graphical analysis and move directly into formal statis-
tical modeling, given that the models are generally the tools used to answer
research questions. However, if proper attention is not paid to the graphing
of data, the formal statistical analyses may be poorly informed regarding
the distribution of variables, and relationships among them. For example,
a model allowing only a linear relationship between a predictor and a cri-
terion variable would be inappropriate if there is actually a nonlinear rela-
tionship between the two variables. Using graphical tools first, it would be
possible for the researcher to see the nonlinearities in such a case, and to
account for them appropriately in the model. Perhaps one of the most
eye-opening examples of the dangers of not plotting data can be found in Anscombe
(1973). In this classic paper, Anscombe shows the results of four regression
models that are essentially equivalent in terms of the means and standard
deviations of the predictor and criterion variable, with the same correlation
between the regressor and outcome variables in each dataset. However, plots
of the data reveal drastically different relationships among the variables.
Figure 6.1 shows these four datasets and the regression equation and the
squared multiple correlation for each. First, note that the regression coef-
ficients are identical across the models, as are the squared multiple correla-
tion coefficients. However, the actual relationships between the independent
and dependent variables are drastically different! Clearly, these data do not
come from the same data-generating process. Thus, modeling the four situa-
tions in the same fashion would lead to mistaken conclusions regarding the
nature of the relationships in the population. The moral of the story here is
clear: plot your data!
The plotting capabilities in R are truly outstanding. It is capable of pro-
ducing high-quality graphics with a great deal of flexibility. As a simple
example, consider the Anscombe data from Figure 6.1. These data are actu-
ally included with R and can be loaded into a session with the following
command:
data(anscombe)
[Figure 6.1: four scatterplots, one per Anscombe dataset, each plotting Y against X along with the fitted regression equation and R².]
FIGURE 6.1
Plot of the Anscombe (1973) data, illustrating that the same set of summary statistics does not
necessarily reveal the same type of information.
anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
The way in which one can plot the data for the first dataset (i.e. x1 and y1
above), is as follows:
plot(anscombe$y1 ~ anscombe$x1)
Notice here that $y1 extracts the column labeled y1 from the data frame,
just as $x1 extracts the variable x1. The ~ symbol in the function call treats
the variable on its left as the dependent variable, plotted on the ordinate (y-axis),
while the variable on its right is treated as the independent variable and plot-
ted on the abscissa (x-axis). Alternatively, one can rearrange the terms so that
the independent variable comes first, with a comma separating it from the
dependent variable:
plot(anscombe$x1, anscombe$y1)
Both the former and the latter approaches lead to the same plot.
There are many options that can be called upon within the plotting frame-
work. The plot function has six base parameters, and can also accept many
additional graphical parameters, including those documented in the par function.
The par function has more than 70 graphical parameters that can be used to modify
a basic plot. Discussing all of the available plotting parameters is beyond
the scope of this chapter, however. Rather, we will discuss some of the most
important parameters to consider when plotting in R.
The parameters ylim and xlim modify the starting and ending points for
the y- and x-axes, respectively. For example, ylim=c(2, 12) will produce a
plot with the y-axis scaled from 2 to 12. R typically automates this
process, but it can be useful for the researcher to tweak this setting, such as
by setting the same axis limits across multiple plots. The ylab and xlab
parameters are used to create labels for the y- and x-axes, respectively. For example,
ylab=“Dependent Variable” will produce a plot with the y-axis labeled
“Dependent Variable.” The main parameter is used for the main title of a
plot. Thus, main=“Plot of Observed Values” would produce a plot
with the title “Plot of Observed Values” above the graph. There is also a
sub parameter that provides a subtitle that is on the bottom of a plot, cen-
tered and below the xlab label. For example, sub=“Data from Study 1”
would produce such a subtitle. In some situations, it is useful to include text,
equations, or a combination in a plot. Text is quite easy to include, which
involves using the text function. For example, text(2, 5, "Include
This Text") would place the text “Include This Text” in the plot centered
at x = 2 and y = 5.
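Putting several of these parameters together on the first Anscombe dataset might look like the following sketch (the axis limits, labels, and text coordinates here are merely illustrative choices):

```r
# Illustrative sketch combining several plotting parameters; the anscombe
# data frame ships with base R, and all limits and labels here are arbitrary.
plot(anscombe$x1, anscombe$y1,
     ylim=c(2, 12), xlim=c(4, 14),
     xlab="Independent Variable", ylab="Dependent Variable",
     main="Plot of Observed Values", sub="Data from Study 1")
text(6, 10, "Include This Text")  # text centered at x = 6, y = 10
```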
Equations can also be included in graphs. Doing so requires the use of
expression within the text function call. The function expression
allows for the inclusion of an unevaluated expression (i.e. it displays what is
written out). The particular syntax for various mathematical expressions is
described on the plotmath help page (i.e. ?plotmath), and R provides a
demonstration of this functionality via demo(plotmath), which displays the
various mathematical expressions that can be rendered in a figure. As an
example, in order to add the R2 information to a plot, we could use syntax
along the following lines:
text(5.9, 9.35, expression(italic(R)^2 == .67))
The values of 5.9 (on the x-axis) and 9.35 (on the y-axis) are simply where
we thought the text looked best, and can be easily adjusted to suit the user’s
preferences.
Combining actual text and mathematical expressions requires using the
paste function in conjunction with the expression function. For example,
if we wanted to add “The Value of R2 = 0.67” we would replace the previous
text syntax with the following:
text(5.9, 9.35, expression(paste("The Value of ", italic(R)^2 == 0.67, sep="")))
Here, paste is used to bind together the text, which is contained within
the quotes, and the mathematical expression. Note that sep="" is used so
that there are no spaces added between the parts that are pasted together.
Although we did not include it in our figure, the model-implied regres-
sion line can be easily included in a scatterplot. One way to do this is
to go through the abline function, and include the intercept (denoted
a) and the slope (denoted b). Thus, abline(a = 3, b = .5) would
add a regression line to the plot that has an intercept at 3 and a slope
of .5. Alternatively, to automate things a bit, using abline(lm.object)
will extract the intercept and slope and include the regression line in our
scatterplot.
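As a brief sketch of the automated approach, using Anscombe’s first dataset (for which the fitted line is known to be approximately intercept 3 and slope .5):

```r
# Fit the simple regression for Anscombe's first dataset and overlay the
# fitted line on the scatterplot.
lm.anscombe1 <- lm(y1 ~ x1, data=anscombe)
plot(anscombe$x1, anscombe$y1)
abline(lm.anscombe1)   # extracts the intercept and slope automatically
coef(lm.anscombe1)     # intercept close to 3, slope close to .5
```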
Finally, notice in Figure 6.2 that we have broken the y-axis to make it
very clear that the plot does not start at the origin. Whether or not this is
needed may be debatable. Note that base R does not include this option,
but we prefer to include it in many situations. Such a broken axis can be
created with the axis.break function from the plotrix package.
The zigzag break is requested via the style="zigzag" option in axis.
break, along with the particular axis on which to place the break (1 for
the x-axis, 2 for the y-axis). By default, R will scale the axes to generally
appropriate ranges. However, when the origin is not shown, R does not
insert a break in the axis by default, as some have argued it should.
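A minimal sketch of such a break, assuming the plotrix package has been installed from CRAN:

```r
# Draw a simple scatterplot and add a zigzag break on the y-axis; the guard
# skips the example gracefully if plotrix is not installed.
if (requireNamespace("plotrix", quietly=TRUE)) {
  plot(anscombe$x1, anscombe$y1, ylim=c(4, 11))
  plotrix::axis.break(axis=2, style="zigzag")  # axis=2 is the y-axis
}
```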
Now, let us combine the various pieces of information that we have dis-
cussed in order to yield Figure 6.2, which was generated with the following
syntax.
data(anscombe)
text(5.9, 10.15,
expression(italic(hat(Y))==3+italic(X)*.5))
FIGURE 6.2
Plot of Anscombe’s Data Set 1, with the regression line of best fit included, axis breaks, and text
denoting the regression line and value of the squared multiple correlation coefficient.
FIGURE 6.3
Pairs plot of the Cassidy GPA data. The plot shows the bivariate scatterplot for each of the three
variables.
pairs(
cbind(GPA=Cassidy$GPA, CTA_Total=Cassidy$CTA.tot,
BS_Total=Cassidy$BStotal))
Given that our data contain p variables, there will be p*(p-1)/2 unique scat-
terplots (here 3). The plots below the “principal diagonal” are the same as
those above the principal diagonal, the only difference being the x and y axes
are reversed. Such a pairs plot allows multiple bivariate relationships to be
visualized simultaneously. Of course, one can quantify the degree of linear
relation with a correlation. Code to do this can be given as follows, using
listwise deletion:
cor(na.omit(
cbind(GPA=Cassidy$GPA, CTA_Total=Cassidy$CTA.tot,
BS_Total=Cassidy$BStotal)))
There are other options available for dealing with missing data (see ?cor),
but here we have used the na.omit function wrapped around the cbind
function so as to obtain a listwise-deleted dataset from which the correlation
matrix is computed.
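To make the listwise/pairwise distinction concrete, here is a small sketch using a made-up data frame (not the Cassidy data):

```r
# Toy data with missing values: na.omit() drops any row containing an NA
# (listwise deletion), while use="pairwise.complete.obs" retains all
# available pairs for each individual correlation.
d <- data.frame(x=c(1, 2, 3, NA, 5),
                y=c(2, 4, NA, 8, 9))
cor(na.omit(d))                      # listwise deletion
cor(d, use="pairwise.complete.obs")  # pairwise deletion
```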
hist(resid.1,
freq=FALSE, main="Density of Residuals for Model 1",
xlab="Residuals")
lines(density(resid.1))
This code first requests that a histogram be produced for the residuals.
Note that freq=FALSE is used, which instructs R to make the y-axis scaled
in terms of probability, not the default, which is frequency. The solid line
represents the density estimate of the residuals, corresponding closely to the
bars in the histogram (Figure 6.4).
FIGURE 6.4
Histogram with overlaid density curve of the residuals from the GPA model from Chapter 1.
qqnorm(scale(resid.1))
qqline(scale(resid.1))
Notice that above we use the scale function, which standardizes the
residuals to have a mean of 0 (already done due to the regression model) and
a standard deviation of 1.
We can see in Figure 6.5 that points in the Q-Q plot diverge from the line
for the higher end of the distribution. This is consistent with the histogram
in Figure 6.4 which shows a shorter tail on the high end as compared to
the low end of the distribution. This plot, along with the histogram, reveals
that the model residuals do diverge from a perfectly normal distribution. Of
course, the degree of non-normality can be quantified (e.g. with skewness
and kurtosis measures), but for this chapter we are most interested in visual-
izing the data, and especially in looking for gross violations of assumptions.
FIGURE 6.5
Plot comparing the observed quantiles of the residuals to the theoretical quantiles from the
standard normal distribution. When points fall along the line (which has slope 1 and goes
through the origin), then the sample quantiles follow a normal distribution.
Achieve.940.767 <- Achieve[Achieve$corp == 940 & Achieve$school == 767, ]
dotplot(
class ~ geread,
data=Achieve.940.767, jitter.y = TRUE, ylab="Classroom",
main="Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
This R command literally identifies the rows in which the corp is equal to
940 (equality checks require two equal signs) and in which the school is equal
to 767. The class ~ geread part of the code instructs the function to plot
geread for each of the classrooms. The jitter.y parameter is used, which
will jitter, or slightly shift around, overlapping data points in the graph. For
example, if multiple students in the same classroom have the same score for
geread, using jitter will shift those points on the y-axis to make clear that
there are multiple values at the same x value. Finally, as before, labels can be
specified. Note that the use of \'geread\' in the main title is to put geread
in a single quote in the title. Calling the dotplot function using these R
commands yields Figure 6.6.
FIGURE 6.6
Dotplot of classrooms in school 767 (within corporation 940) for geread.
This dotplot shows the dispersion of geread for the classrooms in school
767. From this plot, we can see that students in each of the four classrooms
had generally similar reading achievement scores. However, it is also clear
that classrooms 2 and 4 each have multiple students with outlying scores that
are higher than those of other individuals within the school. We hope that
it is clear how a researcher could make use of this type of plot when exam-
ining the distributions of scores for individuals at a lower level of data (e.g.
students) nested within a higher level (e.g. classrooms or schools).
Because the classrooms within a school are arbitrarily numbered, we can
alter the order in which they appear in the graph in order to make the dis-
play more meaningful. Note that if the order of the classrooms had not been
arbitrary in nature (e.g. honors classes were numbered 1 and 2), then we
would need to be very careful about changing the ordering. However, in this
case, no such concerns are necessary. In particular, the use of the reorder
function on the left-hand side of the ~ symbol will reorder the classrooms
in ascending order of the mean of the variable of interest (here, geread).
Thus, we can modify Figure 6.6 in order to have the classes placed in
ascending order by the mean of geread.
dotplot(
reorder(class, geread) ~ geread,
data=Achieve.940.767, jitter.y = TRUE, ylab="Classroom",
main="Dotplot of \'geread\' for Classrooms in School 767,
Which is Within Corporation 940")
From Figure 6.7, it is easier to see the within-class and between-class vari-
ability for school 767. Visually, at least, it is clear that classroom 3 is more
homogeneous (smaller variance) and lower performing (smaller mean) than
classrooms 2 and 4.
Although dotplots such as those in Figures 6.6 and 6.7 are useful, creating
one for each school would yield so much visual information that it may be
difficult to draw any meaningful conclusions regarding our data. Therefore,
suppose that we ignored the classrooms and schools, and instead focused on
the highest level of data: corporation. Using what we have already learned, it
is possible to create dotplots of student achievement for each of the corpora-
tions. To do so, we would use the following code:
dotplot(
reorder(corp, geread) ~ geread,
data=Achieve, jitter.y = TRUE, ylab="Corporation",
main="Dotplot of \'geread\' for All Corporations")
FIGURE 6.7
Dotplot of classrooms in school 767 (within corporation 940) for geread, with the classrooms
ordered by the mean (lowest to highest).
However, interpreting the resulting plot (Figure 6.8) is challenging, due to
(a) the sheer volume of the data and (b) the need to remain sensitive to the
nesting structure of the data.
One method for visualizing such large and complex data would be to focus
on a higher level of data aggregation, such as the classroom, rather than on
the individual students. Recall, however, that classrooms are simply num-
bered from 1 to n within each school, and are not, therefore, given identifiers
that mark them as unique across the entire dataset. Therefore, if we wish to
focus on achievement at the classroom level, we must first create a unique
classroom number. We can use the following R code in order to create such
a unique identifier, which augments the Achieve data with a new column
of unique classroom identifiers, which we call Classroom_Unique.
After forming this unique identifier for the classrooms, we will then aggre-
gate the data within the classrooms in order to obtain the mean of the vari-
ables within those classrooms. We do this by using the aggregate function:
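The code itself is not reproduced here, but a hedged sketch of the two steps might look as follows; the toy data frame and the name Achieve.Class are hypothetical stand-ins so that the sketch is self-contained:

```r
# Toy stand-in for the Achieve data so that this sketch is self-contained
Achieve <- data.frame(corp=c(940, 940, 940, 8050),
                      school=c(767, 767, 785, 101),
                      class=c(1, 2, 1, 1),
                      geread=c(4.2, 5.1, 3.8, 6.0))
# Step 1: build a classroom identifier that is unique across the dataset
Achieve$Classroom_Unique <- paste(Achieve$corp, Achieve$school,
                                  Achieve$class, sep="_")
# Step 2: aggregate to obtain the classroom mean of geread
Achieve.Class <- aggregate(geread ~ Classroom_Unique + corp,
                           data=Achieve, FUN=mean)
```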
FIGURE 6.8
Dotplot of students in corporations, with the corporations ordered by the mean (lowest to
highest).
jitter.y = TRUE,
ylab="Corporation", main="Dotplot of Classroom Mean \'geread\'
Within the Corporations")
Of course, we still know the nesting structure of the classrooms within the
schools and the schools within the corporations. We are aggregating here for
purposes of plotting, but not modeling the data. We want to remind readers
of the potential dangers of aggregation bias discussed earlier in this chapter.
With this caveat in mind, consider Figure 6.9, which shows that within the
corporations, classrooms do vary in terms of their mean level of achievement
(i.e. the within line/corporation spread) as well as between corporations (i.e.
the change in the lines).
FIGURE 6.9
Dotplot of geread for corporations, with the corporations ordered by the mean (lowest to high-
est) of the classroom aggregated data. The dots in the plot are the mean of the classrooms
scores within each of the corporations.
We can also use dotplots to gain insights into reading performance within
specific school corporations. But, again, this would yield a unique plot, such
as the one above, for each corporation. Such a graph may be useful when
interest concerns a specific corporation, or when one wants to assess vari-
ability in specific corporations.
We hope to have demonstrated that dotplots can be useful tools for gain-
ing an understanding of the variability that exists, or that does not exist, in
the variable(s) of interest. Of course, looking only at a single variable can be
limiting. Another particularly useful function for multilevel data that can
be found in the lattice package is xyplot. This function creates a graph
very similar to a scatterplot matrix for a pair of variables, but accounting for
the nesting structure in the data. For example, the following code produces
an xyplot for geread (y-axis) by gevocab (x-axis), accounting for school
corporation.
xyplot(geread ~ gevocab | corp, data=Achieve,
strip=strip.custom(strip.names=FALSE, strip.levels=c(FALSE,
TRUE)))
Our use of the optional strip argument adds the corporation number to the
“strip” above each bivariate plot, while the strip.names=FALSE subcommand
removes the “corp” variable name itself from those strips.
Of course, any sort of specific conditioning of interest can be done for a
particular graph. For example, we might want to plot the schools in, say, cor-
poration 940, which can be done by extracting from the Achieve data only
corporation 940, as is done to produce Figure 6.11.
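A hedged sketch of that extraction and plot follows; the subset name Achieve.940 is hypothetical, and a toy data frame stands in for the real data so the sketch is self-contained:

```r
library(lattice)  # lattice ships with R and provides xyplot
# Toy stand-in; with the real data the subset would be
#   Achieve.940 <- Achieve[Achieve$corp == 940, ]
Achieve.940 <- data.frame(school=factor(c(767, 767, 785, 785)),
                          geread=c(4.5, 6.0, 3.9, 5.2),
                          gevocab=c(4.0, 6.5, 3.5, 5.0))
# One panel per school within the selected corporation
xyplot(geread ~ gevocab | school, data=Achieve.940)
```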
We have now discussed two lattice functions that can be quite useful for
visualizing grouped/nested data. An additional plotting strategy involves
assessment of the residuals from a fitted model, as doing so can help discern
when there are violations of assumptions, much as we saw earlier in this
chapter when discussing single-level regression models. Because residuals
FIGURE 6.10
Xyplot of geread (y-axis) as a function of gevocab (x-axis) by corporation.
hist(scale(resid(Model3.1)),
freq=FALSE, ylim=c(0, .7), xlim=c(-4, 5),
main="Histogram of Standardized Residuals from Model 3.1",
xlab="Standardized Residuals")
lines(density(scale(resid(Model3.1))))
box()
The only differences from the way that we plotted residuals with hist earlier
in the chapter are purely cosmetic in nature. In particular, here we used the
box function to draw a box around the plot and specified the limits of the
y- and x-axes.
FIGURE 6.11
Xyplot of geread (y-axis) as a function of gevocab (x-axis) by school within corporation 940.
FIGURE 6.12
Histogram and density plot for standardized residuals from Model 3.1.
Alternatively, a Q-Q plot can be used to evaluate the assumption of normal-
ity, as described earlier in the chapter (Figure 6.13). The code to do such a plot is:
qqnorm(scale(resid(Model3.1)))
qqline(scale(resid(Model3.1)))
Clearly, the Q-Q plot (and the associated histogram) illustrates that there are
issues at the high end of the distribution of residuals. The issue, as it turns
out, is not uncommon in educational research: ceiling effects. In particular,
an examination of the plots we have created reveals that there is a
non-trivial number of students who achieved the maximum score on geread.
The multilevel model assumes that the distribution of residuals follows a nor-
mal distribution. However, when a maximum value is reached, it is necessar-
ily the case that the residuals will not be normally distributed because, as in
this case, a fairly large number of individuals have the same residual value.
The plotting capabilities available in R are impressive. Unfortunately, we
were only able to highlight a few of the most useful plotting functions in this
chapter. For the most part, the way to summarize the graphical ability of R
and the available packages is, “if you can envision it, you can implement it.”
There are many great resources available for graphics in R, and an internet
search will turn up many great online resources that are freely available.
FIGURE 6.13
Q-Q plot of the standardized residuals from Model 3.1.
Scaled residuals:
Min 1Q Median 3Q Max
-2.1620 -0.6063 -0.1989 0.3045 4.8067
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1063 0.326
Residual 3.8986 1.974
Number of obs: 10765, groups: school, 163
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.952e+00 5.146e-02 8.143e+02 37.94 <2e-16 ***
npaverb 4.390e-02 7.372e-04 9.585e+03 59.56 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From these results, we can see that the fixed effect npaverb has a sta-
tistically significant positive relationship with geread. Using the effects
package, we can visualize this relationship in the form of the line of best fit,
with an accompanying confidence interval. In order to obtain this graph, we
will use the command sequence below.
library(effects)
plot(predictorEffects(Model6.1))
Scaled residuals:
Min 1Q Median 3Q Max
-2.1458 -0.6085 -0.1988 0.3057 4.7966
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.09759 0.3124
Residual 3.89872 1.9745
Number of obs: 10765, groups: school, 163
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.662e+00 1.077e-01 1.801e+02 15.437 < 2e-16 ***
npaverb 4.356e-02 7.467e-04 1.055e+04 58.345 < 2e-16 ***
ses 4.290e-03 1.413e-03 1.709e+02 3.036 0.00277 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From these results, we see that higher levels of school SES are associated
with higher reading test scores, as are higher scores on the verbal reasoning
test. We can plot these relationships and the associated confidence intervals
simultaneously as below.
plot(predictorEffects(Model6.2))
It is also possible to plot only one of these relationships per graph. We also
added a more descriptive title for each of the following plots, using the main
subcommand.
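For example, a single-predictor display with a descriptive title might be requested as in the sketch below. Since the fitted Model6.2 is not reconstructed here, a simple stand-in lm model on the built-in cars data is used, and the guard skips the example if the effects package is not installed:

```r
# Plot a single predictor effect with a custom main title via the effects
# package; m is a hypothetical stand-in for a fitted model such as Model6.2.
if (requireNamespace("effects", quietly=TRUE)) {
  library(effects)
  m <- lm(dist ~ speed, data=cars)
  plot(predictorEffects(m, ~speed),
       main="Effect of Speed on Stopping Distance")
}
```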
One very useful aspect of these plots is the inclusion of the 95% confidence
region around the line. From this, we can see that our confidence regard-
ing the actual nature of the relationships in the population is much greater
for verbal reasoning than for school SES. Of course, given the much larger
level-1 sample size, this makes perfect sense.
It is also possible to include categorical independent variables in the model
and to plot their effects, as with student gender in Model6.3.
Scaled residuals:
Min 1Q Median 3Q Max
-2.1691 -0.6094 -0.1993 0.3040 4.8117
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.106 0.3256
Residual 3.903 1.9756
Number of obs: 10720, groups: school, 163
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.991e+00 7.783e-02 3.290e+03 25.586 <2e-16 ***
npaverb 4.396e-02 7.387e-04 9.533e+03 59.506 <2e-16 ***
gender -2.765e-02 3.835e-02 1.065e+04 -0.721 0.471
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(predictorEffects(Model6.3, ~gender))
Finally, we can also use the effects package to graphically probe inter-
actions among independent variables. For example, consider Model6.4, in
which the variables npaverb and npanverb (nonverbal reasoning) are used
as predictors of reading score. In addition, the interaction of these two vari-
ables is also included in the model.
Scaled residuals:
Min 1Q Median 3Q Max
-2.2001 -0.6070 -0.1904 0.3154 4.8473
Random effects:
Groups Name Variance Std.Dev.
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.192e+00 8.592e-02 4.637e+03 25.516 <2e-16 ***
npaverb 1.923e-02 1.761e-03 1.075e+04 10.920 <2e-16 ***
npanverb 2.896e-03 1.639e-03 1.075e+04 1.766 0.0774 .
npaverb:npanverb 2.678e-04 2.692e-05 1.076e+04 9.950 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(predictorEffects(Model6.4, ~npaverb*npanverb))
The interaction effect is presented in two plots, one for each variable. The
set of line plots on the right focuses on the relationship between npaverb
and geread, at different levels of npanverb. Conversely, the set of figures
on the left switches the focus to the relationship of npanverb and reading
score at different levels of npaverb. Let’s focus on this left-side plot.
We see that R has arbitrarily plotted the relationship between npanverb
and geread at 5 levels of npaverb: 1, 30, 50, 70, and 100. The relationship
between npanverb and geread is very weak when npaverb=1, and
increases in value (i.e. the line becomes steeper) as the value of npaverb
increases in value. Therefore, we would conclude that nonverbal reason-
ing is more strongly associated with reading test score for individuals who
exhibit a higher level of verbal reasoning. For a discussion as to the details
of how these calculations are made, the reader is referred to the effects
package documentation (https://cran.r-project.org/web/packages/effects/
effects.pdf).
It is also possible to plot three-way interactions using the effects pack-
age. Model6.5 includes a measure of memory, in addition to the verbal and
nonverbal reasoning scores. The model summary information demonstrates
that there is a statistically significant three-way interaction among the inde-
pendent variables.
Scaled residuals:
Min 1Q Median 3Q Max
-2.3007 -0.6041 -0.1930 0.3169 4.8402
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.08988 0.2998
Residual 3.73981 1.9339
Number of obs: 10758, groups: school, 163
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.028e+00 1.594e-01 9.900e+03 12.722 < 2e-16 ***
plot(predictorEffects(Model6.5, ~npaverb*npanverb*npamem))
For the two-way interaction that was featured in Model6.4, we saw that
the relationship between npanverb and geread was stronger for higher
levels of npaverb. If we focus on the relationship between npanverb
and geread in the three-way interaction, we see that this pattern contin-
ues to hold, and that it is amplified somewhat for larger values of
npamem. In other words, the relationship between nonverbal reasoning and
reading test score is stronger for larger values of verbal reasoning, and
stronger still for the combination of higher verbal reasoning and higher
memory scores.
Summary
The focus of Chapter 6 was on graphing multilevel data. Exploration of data
using graphs is always recommended for any data analysis problem, and can
be particularly useful in the context of multilevel modeling, as we have seen
here. We saw how a scatterplot matrix can provide insights into relationships
among variables that may not be readily apparent from a simple review of
model coefficients. In addition, we learned of the power of dotplots to reveal
interesting patterns at multiple levels of the data structure. In particular, with
dotplots we were able to visualize mean differences among classrooms in a
school, as well as among individuals within a classroom. Finally, graphical
tools can also be used to assess the important assumptions underlying linear
models in general, and multilevel models in particular, including normality
and homogeneity of residual variance. In short, analysts should always be
mindful of the power of pictures as they seek to understand relationships
in their data.
7
Brief Introduction to Generalized Linear Models
In this chapter, we introduce generalized linear models (GLiMs) and their
applications in the single-level context. In the next chapter, we will
then expand upon our discussion here to include the multilevel variants of
these models, and how to fit them in R. In the following sections of this
chapter, we will focus on three broad types of GLiMs, including those for
categorical outcomes (dichotomous, ordinal, and nominal), counts or rates of
events that occur very infrequently, and counts or rates of events that occur
somewhat more frequently. After their basic theoretical presentation, we will
then describe how these single-level GLiMs can be fit using functions in R.
ln[p(Y = 1) / (1 − p(Y = 1))] = β0 + β1x. (7.1)
(The outcomes could also be assigned other values, though 1 and 0 are probably
the most commonly used in practice.) This outcome is linked to an independent
variable, x, by the slope (β1) and intercept (β0). Indeed, the right side
of this equation should look very familiar, as it is identical to the standard
linear regression model. However, the left side is quite different from what
we see in linear regression, due to the presence of the logistic link function,
also known as the logit. Within the parentheses lie the odds that the outcome
variable will take the value of 1. For our coronary artery example, 1 is the
value for having coronary artery disease and 0 is the value for not having
it. In order to render the relationship between this outcome and the inde-
pendent variable (time walking on treadmill until fatigue) linear, we need
to take the natural log of these odds. Thus, the logit link for this problem is
the natural log of the odds of an individual having coronary artery disease.
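A minimal numeric sketch of the logit, using an illustrative probability:

```r
# If the probability of the target outcome is .8, the odds are .8/.2 = 4,
# and the logit is the natural log of those odds.
p <- 0.8
odds <- p / (1 - p)  # 4
log(odds)            # the logit, approximately 1.386
```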
Interpretation of the slope and intercept in the logistic regression model is
analogous to that in the linear regression context. A positive value
of β1 would indicate that the larger the value of x, the greater the log odds of
the target outcome occurring. The parameter β0 is the log odds of the target
event occurring when the value of x is 0. Logistic regression models can be
fit easily in R using the glm function, which is part of the stats package
included with the basic installation of R. In the following
section, we will see how to call this function and how to interpret the results
we obtain from it.
The data were read into a data frame called coronary, using the methods
outlined in the chapter on data management in R. The logistic regression
model can then be fit in R using the following command sequence, where
group refers to the outcome variable, and time is the number of seconds
walked on the treadmill.
coronary.logistic<-glm(group~time, family=binomial)
By default, the glm command will treat the higher value as the target. In this
case 0 = healthy and 1 = disease. Therefore, the numerator of the logit will be
1, or disease. It is possible to change this so that the lower number is the tar-
get, and the interested reader can refer to help(glm) for more information
in this regard. This is a very important consideration, as the results would
be completely misinterpreted if R used a different specification than the user
thinks was used. The results of the summary command appear below.
Call:
glm(formula = group ~ time, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1387 -0.3077 0.1043 0.5708 1.5286
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.488949 5.876693 2.295 0.0217 *
coronary$time -0.016534 0.007358 -2.247 0.0246 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The z value is the test statistic for the null hypothesis that the coefficient is equal to 0.
Next to z is the p-value for the test. Using standard practice, we would
conclude that a p-value less than 0.05 indicates statistical significance. In
addition, R provides a simple heuristic for interpreting these results based on
the *. For this example, the p-value of 0.0246 for time means that there is a
statistically significant relationship between time on the treadmill to fatigue
and the odds of an individual having coronary artery disease. The negative
sign for the estimate further tells us that more time spent on the treadmill
was associated with a lower likelihood of having heart disease.
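As an illustrative, hedged sketch, the estimated coefficients above can be converted into a predicted probability of disease for a given treadmill time using the inverse logit (the plogis function); the time value chosen here is arbitrary:

```r
# Predicted probability of coronary artery disease from the fitted equation
b0 <- 13.488949        # intercept estimate from the summary output above
b1 <- -0.016534        # slope estimate for time (in seconds)
plogis(b0 + b1 * 900)  # predicted probability at 900 seconds, roughly .20
```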
One common approach to assessing the quality of the model fit to the data
is by examining the deviance values. For example, the residual deviance
compares the fit between a model that is fully saturated, meaning that it
perfectly fits the data, and our proposed model. Residual deviance is mea-
sured using a χ2 statistic that compares the predicted outcome value with
the actual value, for each individual in the sample. If the predictions are
very far from the actual responses, this χ2 will tend to be a large value, indi-
cating that the model is not very accurate. In the case of the residual devi-
ance, we know that the saturated model will always provide optimal fit to
the data at hand, though in practice it may not be particularly useful for
understanding the relationship between x and y in the population because it
will have a separate model parameter for every cell in the contingency table
relating the two variables, and thus may not generalize well. The proposed
model will always be more parsimonious than the saturated model (i.e. it will
have fewer parameters), and will therefore generally be more interpretable and
generalizable to other samples from the same population, assuming that it does in
fact provide adequate fit to the data. With appropriately sized samples, the
residual deviance can be interpreted as a true χ2 test, and the p-value can
be obtained in order to determine whether the fit of the proposed model is
significantly worse than that of the saturated model. The null hypothesis for
this test is that model fit is adequate; i.e. the fit of the proposed model is close
to that of the saturated model. With a very small sample such as this one,
this approximation to the χ2 distribution does not hold (Agresti, 2002), and
we must therefore be very careful in how we interpret the statistic. For peda-
gogical purposes, let’s obtain the p-value for a χ2 of 12.966 with 18 degrees
of freedom. This value is 0.7936, which is larger than the α cut-off of 0.05,
indicating that we cannot reject the null hypothesis that the proposed model
fits the data as well as the saturated model. Thus, we would retain the proposed model as
being sufficient for explaining the relationships between the independent
and dependent variables.
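The p-value referenced here can be computed directly in R from the chi-square distribution function:

```r
# Upper-tail probability for a chi-square statistic of 12.966 with 18 df
1 - pchisq(12.966, df=18)  # approximately 0.79
```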
The other deviance statistic for assessing fit that is provided by R is the
null deviance, which tests the null hypothesis that the proposed model does
not fit the data better than a model in which the average odds of having coronary
artery disease are used as the predicted outcome for every time value
(i.e. that x is not linearly predictive of the probability of having coronary heart
disease). A significant result here would suggest that the proposed model
is better than no model at all. Again, however, we must interpret this test
with caution when our sample size is very small, as is the case here. For this
example, the p-value of the null deviance test (χ2 = 27.726 with 19 degrees of
freedom) was 0.0888. As with the residual deviance test, the result is not
statistically significant at α = 0.05, suggesting that the proposed model does
not provide better fit than the null model with no relationships. Of course,
given the small sample size, we must interpret both hypothesis tests with
some caution.
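Both p-values discussed above can be reproduced directly with pchisq; the deviance values and degrees of freedom below are transcribed from the model output described in the text.

```r
# Upper-tail chi-square p-values for the two deviance tests
p_residual <- 1 - pchisq(12.966, df = 18)  # residual deviance test
p_null <- 1 - pchisq(27.726, df = 19)      # null deviance test
round(p_residual, 4)  # approximately 0.7936
round(p_null, 4)      # approximately 0.0888
```

Neither test is statistically significant at α = 0.05, consistent with the interpretation given above.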
Finally, R also provides the AIC value for the model. As we have seen in
previous chapters, AIC is a useful statistic for comparing the fit of differ-
ent, and not necessarily nested, models with smaller values indicating better
relative fit. If we wanted to assess whether including additional independent
variables or interactions improved model fit, we could compare AIC values
among the various models to ascertain which was optimal. For the current
example, there are no other independent variables of interest. However, it is
possible to obtain the AIC for the intercept-only model using the following
command. The purpose behind doing so would be to determine whether
including the time walking on the treadmill actually improved model fit,
after the penalty for model complexity was applied.
coronary.logistic.null<-glm(group~1, family=binomial)
AIC(coronary.logistic.null)
The AIC for this intercept-only model was 29.726, which is larger than the
16.966 for the model including time. Based on AIC, along with the hypothesis
test results discussed above, we would therefore conclude that the full model
including time provided better fit to the outcome of coronary artery disease.
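As a check on the reported values, note that for these models AIC can be recovered by hand as the residual deviance plus twice the number of estimated parameters; this is simple arithmetic on the numbers quoted above, not additional R output.

```r
# AIC = residual deviance + 2 * (number of estimated parameters)
aic_full <- 12.966 + 2 * 2  # proposed model: intercept + slope for time
aic_null <- 27.726 + 2 * 1  # intercept-only model
aic_full  # 16.966, matching the reported AIC
aic_null  # 29.726, matching the reported AIC
```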
For the next example, consider a dietician who collected data from a sample of
adults who were under a physician’s care for a health issue directly related to
obesity. Members of the sample were randomly assigned to either a control
condition in which they received no special instruction in how to plan and
prepare healthy meals from scratch, or a treatment condition in which they
did receive such instruction. The outcome of interest was a rating provided
two months after the study began in which each subject indicated the extent
to which they prepared their own meals. The response scale ranged from 0
(Prepared all of my own meals from scratch) to 4 (Never prepared any of my
own meals from scratch), so that lower values were indicative of a stronger
predilection to prepare meals at home from scratch. The dietician is inter-
ested in whether there are differences in this response between the control
and treatment groups.
One commonly used model for ordinal data such as these is the cumulative
logits model, which is expressed as:

logit[P(Y ≤ j)] = ln[P(Y ≤ j)/(1 − P(Y ≤ j))]. (7.2)
In this model, there are J-1 logits where J is the number of categories in the
dependent variable, and Y is the actual outcome value. Essentially, this
model compares the likelihood of the outcome variable taking a value of j or
lower, versus outcomes larger than j. For the current example there would be
four separate logits:
ln[p(Y = 0)/(p(Y = 1) + p(Y = 2) + p(Y = 3) + p(Y = 4))] = β01 + β1x

ln[(p(Y = 0) + p(Y = 1))/(p(Y = 2) + p(Y = 3) + p(Y = 4))] = β02 + β1x

ln[(p(Y = 0) + p(Y = 1) + p(Y = 2))/(p(Y = 3) + p(Y = 4))] = β03 + β1x (7.3)

ln[(p(Y = 0) + p(Y = 1) + p(Y = 2) + p(Y = 3))/p(Y = 4)] = β04 + β1x
In the cumulative logits model there is a single slope relating the indepen-
dent variable to the ordinal response, and each logit has a unique intercept.
In order for a single slope to apply across all logits we must make the propor-
tional odds assumption, which states that this slope is identical across logits.
In order to fit the cumulative logits model to our data in R, we use the polr
function from the MASS package (made available with library(MASS)), as in this example.

cooking.cum.logit<-polr(cook~treatment, method=c("logistic"))
The dependent variable, cook, must be an R factor object, and the inde-
pendent variable can be either a factor or numeric. In this case, treat-
ment is coded as 0 (control) and 1 (treatment). To ensure that cook is a
factor, we use cook<-as.factor(cook) prior to fitting the model. Using
summary(cooking.cum.logit) after fitting the model, we obtain the fol-
lowing output.
Call:
polr(formula = cook ~ treatment, method = c("logistic"))
Coefficients:
Value Std. Error t value
treatment -0.7963096 0.3677003 -2.165649
Intercepts:
Value Std. Error t value
0|1 -2.9259 0.4381 -6.6783
1|2 -1.7214 0.3276 -5.2541
2|3 -0.2426 0.2752 -0.8816
3|4 1.3728 0.3228 4.2525
Residual Deviance: 293.1349
AIC: 303.1349
After the function call, we see the results for the independent variable,
treatment. The coefficient value is −0.796, indicating that a higher value
on the treatment variable (i.e. treatment = 1) was associated with a greater
likelihood of providing a lower response on the cooking item. Remember
that lower responses to the cooking item reflected a greater propensity to
eat scratch-made food at home. Thus, in this example those in the treatment
condition had a greater likelihood of eating scratch-made food at home.
Adjacent to the coefficient value is the standard error for the slope, which is
divided into the coefficient in order to obtain the t-statistic residing in the
final column. We note that there is not a p-value associated with this t-statistic,
because in the generalized linear model context this value only follows
the t distribution asymptotically (i.e. for large samples). In smaller samples, it
serves simply as an indication of the relative magnitude of the relationship between
the treatment and outcome variables. In this context, we might consider the
relationship to be “significant” if the t-value exceeds 2, which is approxi-
mately the t critical value for a two-tailed hypothesis test with α = 0.05 and
infinite degrees of freedom. Using this criterion, we would conclude that
there is indeed a statistically significant negative relationship between treat-
ment condition and self-reported cooking behavior. Furthermore, by expo-
nentiating the slope we can also calculate the relative odds of a higher-level
response to the cooking item between the two groups. Much as we did in
the dichotomous logistic regression case, we use the equation eβ1 to convert
the slope to an odds ratio. In this case, the value is 0.451, indicating that the
odds of a treatment group member selecting a higher-level response (less
self-cooking behavior) are only 0.451 times as large as those of the control group. Note
Brief Introduction to Generalized Linear Models 123
that this odds ratio applies to any pair of adjacent categories, such as 0 versus
1, 1 versus 2, 2 versus 3, or 3 versus 4.
R also provides the individual intercepts along with the residual deviance,
and AIC for the model. The intercepts are, as with dichotomous logistic
regression, the log odds of the target response when the independent vari-
able is 0. In this example, a treatment of 0 corresponds to the control group.
Thus, the intercept represents the log odds of the target response for the
control condition. As we saw above, it is possible to convert this to the odds
scale through exponentiating the estimate. The first intercept provides the
log odds of a response of 0 versus all other values for the control group; i.e.
plans and prepares all meals for oneself versus all other options. The intercept
for this logit is −2.9259, which yields e−2.9259 = 0.054. We can interpret
this to mean that the odds of a member of the control group planning and
preparing all of their own meals, versus any less self-reliant response, are 0.054.
In other words, it is highly unlikely that a member of the control group will do this.
We can use the deviance, along with the appropriate degrees of freedom,
to obtain a test of the null hypothesis that the model fits the data. The follow-
ing command line in R will do this for us.
1-pchisq(deviance(cooking.cum.logit), df.residual(cooking.cum.logit))
[1] 0
The p-value is extremely small (rounded to 0), indicating that the model
as a whole does not provide very good fit to the data. This could mean that
to obtain better fit, we need to include more independent variables with a
strong relationship to the dependent. However, if our primary interest is in
determining whether there are treatment differences in cooking behavior,
then this overall test of model fit may not be crucial, since we are able to
answer the question regarding the relationship of treatment to the cooking
behavior item.
In the multinomial logistic regression model, the researcher selects one of
the dependent variable categories to be the baseline against which all other
categories are compared. More formally, the multinomial logistic regression
model can be expressed as
ln[p(Y = i)/p(Y = j)] = βi0 + βi1x (7.4)
In this model, category j will always serve as the reference group against
which the other categories, i, are compared. There will be a different logit for
each non-reference category, and each of the logits will have a unique inter-
cept (βi0) and slope (βi1). Thus, unlike with the cumulative logits model in
which a single slope represented the relationship between the independent
variable and the outcome, in the multinomial logits model we have multiple
slopes for each independent variable, one for each logit. Therefore, we do not
need to make the proportional odds assumption, which makes this model
a useful alternative to the cumulative logits model when that assumption is
not tenable. The disadvantage of using the multinomial logits model with
an ordinal outcome variable is that the ordinal nature of the data is ignored.
Any of the categories can serve as the reference, with the decision being
based on the research question of most interest (i.e. against which group
would comparisons be most interesting), or on pragmatic concerns such as
which group is the largest, should the research questions not serve as the
primary deciding factor. Finally, it is possible to compare the results for two
non-reference categories, a and b, using the equation

ln[p(Y = a)/p(Y = b)] = (βa0 − βb0) + (βa1 − βb1)x (7.5)
For the present example, we will set the conservative group to be the refer-
ence, and fit a model in which age is the independent variable and politi-
cal viewpoint is the dependent. In order to do so, we will use the function
multinom within the nnet package, which will need to be installed prior to
running the analysis. We would then use the command library(nnet) to
make the functions in this library available. The data were read into the R
data frame politics, containing the variables age and viewpoint, which
were coded as C (conservative), M (moderate), or L (liberal) for each indi-
vidual in the sample. Age was expressed as the number of years of age. The
R command to fit the multinomial logistic regression model is
politics.multinom<-multinom(viewpoint~age, data=politics), producing
the following output.
# weights: 9 (4 variable)
initial value 1647.918433
final value 1617.105227
converged
This message simply indicates the initial and final values of the maximum
likelihood fitting function, along with the information that the model con-
verged. In order to obtain the parameter estimates and standard errors, we
use summary(politics.multinom).
Call:
multinom(formula = viewpoint ~ age, data = politics)
Coefficients:
(Intercept) age
L 0.4399943 -0.016611846
M 0.3295633 -0.004915465
Std. Errors:
(Intercept) age
L 0.1914777 0.003974495
M 0.1724674 0.003415578
Residual Deviance: 3234.210
AIC: 3242.210
Based on these results, we see that the slope relating age to the logit comparing
self-identification as liberal (L) versus conservative is −0.0166, indicating that
older individuals had a lower likelihood of being liberal versus conservative. In order
to determine whether this relationship is statistically significant, we can cal-
culate a 95% confidence interval using the coefficient and the standard error
for this term. This interval is constructed as
−0.0166 ± 2(0.0040)
−0.0166 ± 0.008
( −0.0246, −0.0086).
Because 0 is not in this interval, it is not a likely value of the coefficient in the
population, leading us to conclude that the coefficient is statistically signifi-
cant. In other words, we can conclude that in the population, older individu-
als are less likely to self-identify as liberal than as conservative. We can also
construct a confidence interval for the coefficient relating age to the logit for
moderate to conservative:
−0.0049 ± 2(0.0034)
−0.0049 ± 0.0068
( −0.0117 , 0.0019).
Thus, because 0 does lie within this interval, we cannot conclude that there
is a significant relationship between age and the logit. In other words, age is
not related to the political viewpoint of an individual when it comes to com-
paring moderate versus conservative. Finally, we can calculate estimates for
comparing L and M by applying Equation (7.5).
When these are taken together, we would conclude that older individuals are
less likely to be liberal than conservative, and less likely to be liberal than
moderate.
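The L-versus-M comparison can be sketched numerically from the coefficients above: subtracting the M logit from the L logit yields the logit for L versus M. Note that this gives point estimates only; a proper standard error would also require the covariance between the two sets of estimates, which is not shown in the printed output.

```r
# Coefficients transcribed from the multinom output (reference category C)
b_L <- c(intercept = 0.4399943, age = -0.016611846)
b_M <- c(intercept = 0.3295633, age = -0.004915465)
b_LM <- b_L - b_M  # logit comparing L to M
b_LM["age"]  # approximately -0.0117: older individuals are less likely to be liberal than moderate
```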
In all other respects, the Poisson model is similar to other regression models
in that the relationship between the independent and dependent variables
is expressed through the slope, β1. And again, the assumption underlying
the Poisson model is that the mean is equal to the variance. This assump-
tion is typically expressed by stating that the overdispersion parameter ϕ = 1.
The ϕ parameter appears in the Poisson distribution density, and thus is a
key component in the fitting function used to determine the optimal model
parameter estimates.
[Figure: histogram of the number of babies living in the home; the vertical axis shows frequency.]
We can see that 0 was the most common response by individuals in the
sample, with the maximum number being 3.
In order to fit the model with the glm function, we would use the following
function call, followed by summary(babies.poisson) to obtain the output below.

babies.poisson<-glm(babies~sei, family=c("poisson"), data=ses_babies)
Call:
glm(formula = babies ~ sei, family = c("poisson"), data =
ses_babies)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7312 -0.6914 -0.6676 -0.6217 3.1345
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.268353 0.132641 -9.562 <2e-16 ***
sei -0.005086 0.002900 -1.754 0.0794 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1237.8 on 1496 degrees of freedom
Residual deviance: 1234.7 on 1495 degrees of freedom
(3 observations deleted due to missingness)
AIC: 1803
Number of Fisher Scoring iterations: 6
These results show that sei did not have a statistically significant relationship
with the number of children under six months old living in the home (p = 0.0794).
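Because the Poisson model uses a log link, the slope can be exponentiated to yield a rate ratio, much as we exponentiated slopes in logistic regression to obtain odds ratios; here is a small sketch using the printed estimate for sei.

```r
b_sei <- -0.005086  # slope from the Poisson regression output
rate_ratio <- exp(b_sei)
rate_ratio  # approximately 0.995: each additional point of sei multiplies the expected count by about 0.995
```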
We can use the following command to obtain the p-value for the test of the null
hypothesis that the model fits the data.

1-pchisq(deviance(babies.poisson), df.residual(babies.poisson))
[1] 0.9999998
The resulting p is clearly not significant at α = 0.05, suggesting that the
model does appear to fit the data adequately. The AIC of 1,803 will be useful
as we compare the relative fit of the Poisson regression model with that of
other models for count data. One alternative when the data are overdispersed
is the quasipoisson model, which estimates the dispersion parameter ϕ rather
than fixing it at 1, and which can be fit by changing the family argument, as in
summary(babies.quasipoisson<-glm(babies~sei, family=c("quasipoisson"),
data=ses_babies)), producing the following output.
Call:
glm(formula = babies ~ sei, family = c("quasipoisson"), data =
ses_babies)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7312 -0.6914 -0.6676 -0.6217 3.1345
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.268353 0.150108 -8.45 <2e-16 ***
sei -0.005086 0.003282 -1.55 0.121
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be
1.280709)
Null deviance: 1237.8 on 1496 degrees of freedom
Residual deviance: 1234.7 on 1495 degrees of freedom
(3 observations deleted due to missingness)
AIC: NA
Number of Fisher Scoring iterations: 6
As noted above, the coefficients themselves are the same in the quasipois-
son and Poisson regression models. However, the standard errors in the for-
mer are somewhat larger than those in the latter. In addition, the estimate of
ϕ is provided for the quasipoisson model, and is 1.28 in this case. While this
is not exactly equal to 1, it is also not markedly larger, suggesting that the
data are not terribly overdispersed. We can test for model fit as we did with
the Poisson regression, using the command
1-pchisq(deviance(babies.quasipoisson), df.residual(babies.
quasipoisson))
[1] 0.9999998
And, as with the Poisson, the quasipoisson model also fits the data
adequately.
A second alternative to the Poisson model when data are overdispersed is
a regression model based on the negative binomial distribution. The mean
of the negative binomial distribution is identical to that of the Poisson, while
the variance is
var(Y) = µ + µ²/θ. (7.7)
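As a quick numerical illustration of Equation (7.7): the mean value below is made up purely for illustration, while θ = 0.605 is the estimate reported for this example later in the chapter.

```r
mu <- 0.25     # hypothetical mean count, for illustration only
theta <- 0.605 # dispersion estimate from the glm.nb output
var_nb <- mu + mu^2 / theta  # negative binomial variance, Equation (7.7)
var_nb  # approximately 0.353, larger than the Poisson variance of mu = 0.25
```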
This model can be fit in R using the glm.nb function from the MASS package,
which produces the following output.
Call:
glm.nb(formula = babies ~ sei, data = ses_babies, init.theta =
0.60483559440229,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6670 -0.6352 -0.6158 -0.5778 2.1973
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.260872 0.156371 -8.063 7.42e-16 ***
sei -0.005262 0.003386 -1.554 0.120
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 x log-likelihood: -1749.395
The parameter estimates and standard errors are very similar to those from
both the Poisson and the quasipoisson models. Indeed, the hypothesis tests
provide the same answer for all three models: there is not a statistically significant
relationship between sei and the number of babies living in the home.
In addition to the parameter estimates and standard errors, we also obtain
an estimate of θ of 0.605. In terms of determining which model is optimal,
we can compare the AIC from the negative binomial (1755.4) to that of the
Poisson (1803), to conclude that the former provides somewhat better fit to
the data than the latter. In short, it would appear that the data are somewhat
overdispersed as the model designed to account for this (negative binomial)
provides better fit than the Poisson, which assumes no overdispersion. From
a more practical perspective, the results of the two models are very similar,
and a researcher using α = 0.05 would reach the same conclusion regarding
the lack of relationship between sei and the number of babies living in the
home, regardless of which model they selected.
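The negative binomial AIC of 1755.4 cited above can be recovered from the reported 2 × log-likelihood, since AIC = −2 × log-likelihood + 2 × (number of estimated parameters); here three parameters are estimated (the intercept, the slope for sei, and θ).

```r
neg2_loglik <- 1749.395  # printed in the output as "2 x log-likelihood: -1749.395"
aic_nb <- neg2_loglik + 2 * 3
aic_nb  # 1755.395, i.e. the 1755.4 quoted in the text
```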
Summary
Chapter 7 marks a major change in direction in terms of the type of data upon
which we will focus. Through the first six chapters, we have been concerned
with models in which the dependent variable is continuous, and generally
assumed to be normally distributed. In Chapter 7 we learned about a variety
of models designed for categorical dependent variables. In perhaps the sim-
plest instance, such variables can be dichotomous, so that logistic regression
is most appropriate for data analysis. When the outcome variable has more
than two ordered categories, we see that logistic regression can be easily
extended in the form of the cumulative logits model. For dependent vari-
ables with unordered categories, the multinomial logits model is the typical
choice, and can be easily employed with R. Finally, we examined dependent
variables that are counts, in which case we may choose the Poisson regres-
sion, the quasipoisson model, or the negative binomial model, depending
upon how frequently the outcome being counted occurs. As with Chapter 1,
the goal of Chapter 7 was primarily to provide an introduction to the single-
level versions of the multilevel models to come. In Chapter 8 we will see that
the model types described here can be extended into the multilevel context
using our old friends lme and lmer.
8
Multilevel Generalized Linear Models (MGLMs)
library(lme4)
attach(mathfinal)
summary(model8.1<-glmer(score2~numsense+(1|school), family=binomial, na.action=na.omit))
Scaled residuals:
Min 1Q Median 3Q Max
-5.2722 -0.7084 0.2870 0.6448 3.4279
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.2888 0.5374
Number of obs: 9316, groups: school, 40
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.659653 0.305358 -38.18 <2e-16 ***
numsense 0.059177 0.001446 40.94 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The function call is similar to what we saw with linear models in Chapter 3.
In terms of interpretation of the results, we first examine the variability in inter-
cepts from school to school. This variation is presented as both the variance and
standard deviation of the U0j terms from Chapter 2 (τ02), which are 0.2888 and
0.5374, respectively, for this example. The modal value of the intercept across
schools is −11.659653. With regard to the fixed effect, the slope of numsense,
we see that higher scores are associated with a greater likelihood of passing the
state math assessment, with the slope being 0.059177 (p < .05). (Remember that
R models the larger value of the outcome in the numerator of the logit, and in
this case, passing was coded as 1 whereas failing was coded as 0.) The standard
error, test statistic, and p-value appear in the next three columns. The results are
statistically significant (p < 0.001), leading to the conclusion that overall, num-
ber sense scores are positively related to the likelihood of a student achieving a
passing score on the assessment. Finally, we see that the correlation between the
slope and intercept is strongly negative (−0.955). Given that this is an estimate of
the relationship between two fixed effects, we are not particularly interested in
it. Information about the residuals appears at the very end of the output.
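As in the single-level logistic regression of Chapter 7, the fixed-effect slope can be exponentiated to express the numsense effect as an odds ratio; a small sketch from the printed estimate:

```r
b_numsense <- 0.059177  # fixed-effect slope from the glmer output
exp(b_numsense)  # approximately 1.061: each additional number sense point increases the odds of passing by about 6%
```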
As we discussed in Chapter 3, it is useful for us to obtain confidence inter-
vals for parameter estimates of both the fixed and random effects in the model.
With lme4, we have several options in this regard, using the confint func-
tion, as we saw in Chapter 3. Indeed, precisely the same options are available
for multilevel generalized models fit using glmer, as was the case for mod-
els fit using the lmer function. We would refer the reader to Chapter 3 for a
review of how each of these methods works. In the following text, we demon-
strate the use of each of these confidence interval types for model8.1, and then
discuss implications of these results for interpretation of the problem at hand.
#Percentile Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("perc"))
#Basic Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("basic"))
#Normal Bootstrap
confint(model8.1, method=c("boot"), boot.type=c("norm"))
#Wald
confint(model8.1, method=c("Wald"))
2.5 % 97.5 %
.sig01 NA NA
(Intercept) -12.25814446 -11.06116177
numsense 0.05634366 0.06201022
#Profile
confint(model8.1, method=c("profile"))
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 0.42422171 0.6968439
(Intercept) -12.26384797 -11.0669380
numsense 0.05637074 0.0620358
For the random effect, all of the methods for calculating confidence inter-
vals yield similar results, in that the lower bound is approximately 0.41 to
0.42 and the upper bound is between 0.67 and 0.70. Regardless of the method,
we see that 0 is not in the interval, and thus would conclude that the ran-
dom intercept variance is statistically significant; i.e. there are differences in
the intercept across schools. Similarly, the confidence intervals for the fixed
effects (intercept and the coefficient for numsense) also did not include 0,
which indicates that they were statistically significant as well. In particu-
lar, this would lead us to the conclusion that there is a positive relationship
between numsense and the likelihood of receiving a passing test score,
which we noted above.
In addition to the model parameter estimates, the results from glmer also
include information about model fit, in particular values for the AIC and
BIC, which we will use shortly to compare models. Next, we fit a random
coefficients model in which the slope relating numsense to the outcome is
allowed to vary across schools, using the following command.
summary(model8.2<-glmer(score2~numsense+(numsense|school), family=binomial))
Scaled residuals:
Min 1Q Median 3Q Max
-4.7472 -0.6942 0.2839 0.6352 3.6943
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.170e+01 4.65843
numsense 4.105e-04 0.02026 -1.00
Number of obs: 9316, groups: school, 40
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.901304 0.836336 -15.43 <2e-16 ***
numsense 0.064903 0.003735 17.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We will focus on aspects of the output for the random coefficients model
that differ from those of the random intercepts. In particular, note that we
have an estimate of τ12, the variance of the U1j estimates for specific schools.
This value, 0.00041, is relatively small when compared to the variance of the
intercepts across schools (21.70, printed as 2.170e+01), meaning that the relationship of number
sense with the likelihood of an individual receiving a passing score on the
math achievement test is relatively similar across the schools. The modal
slope across schools is 0.064903, again indicating that individuals with
higher number sense scores also have a higher likelihood of passing the
math assessment. Finally, it is important to note that the correlation between
the random components of the slope and intercept, the standardized version
of τ01, is very strongly negative.
As with the random intercept model, we can obtain confidence intervals
for the random and fixed effects of the random coefficients model. The same
options are available as was the case for the random intercept model. For this
example, we will use the Wald, profile, and percentile bootstrap confidence
intervals.
#Wald
confint(model8.2, method=c("Wald"))
2.5 % 97.5 %
.sig01 NA NA
.sig02 NA NA
.sig03 NA NA
(Intercept) -14.54049182 -11.26211656
numsense 0.05758284 0.07222221
#Profile
confint(model8.2, method=c("profile"))
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 3.33785318 6.47047746
.sig02 -0.99922647 -0.98878396
.sig03 0.01428772 0.02843558
(Intercept) -14.66242289 -11.28916466
numsense 0.05768995 0.07285651
#Percentile Bootstrap
confint(model8.2, method=c("boot"), boot.type=c("perc"))
Computing bootstrap confidence intervals ...
2.5 % 97.5 %
.sig01 3.18387655 6.01887096
.sig02 -0.99869150 -0.99035490
.sig03 0.01367105 0.02621692
(Intercept) -14.88458123 -11.29933632
numsense 0.05781441 0.07362581
None of the confidence intervals for the random effects included 0, lead-
ing us to conclude that each of them is likely to be different from 0 in the
population.
The inclusion of AIC and BIC in the glmer output allows for a direct comparison
of model fit, thus aiding in the selection of the optimal model for the
data. As a brief reminder, AIC and BIC are both measures of unexplained
variation in the data with a penalty for model complexity. Therefore, mod-
els with lower values provide relatively better fit. Comparison of either AIC
or BIC between Models 8.1 (AIC = 9835.9, BIC = 9857.4) and 8.2 (AIC = 9768.7,
BIC = 9804.4) reveals that the latter provides better fit to the data. We do need
to remember that AIC and BIC are not significance tests, but rather mea-
sures of relative model fit. In addition to the relative fit indices, we can also
compare the fit of the two models using the anova command, as we demon-
strated in Chapter 3.
anova(model8.1, model8.2)
Data: NULL
Models:
model8.1: score2 ~ numsense + (1 | school)
model8.2: score2 ~ numsense + (numsense | school)
          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model8.1   3 9835.9 9857.4 -4915.0   9829.9
model8.2   5 9768.7 9804.4 -4879.4   9758.7 71.208      2  3.447e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
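The likelihood ratio test in the anova table can be verified by hand: the χ2 statistic is the difference in the two deviances, with degrees of freedom equal to the difference in the number of estimated parameters.

```r
chisq <- 9829.9 - 9758.7  # deviance difference (the table reports 71.208 from unrounded deviances)
df <- 5 - 3               # difference in parameter counts
p <- 1 - pchisq(chisq, df)
p  # essentially zero, in line with the reported 3.447e-16
```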
Next, we examine the relationship between gender (coded so that 1 = female
and 0 = male) and the likelihood of passing the state math
assessment, as well as the relationship of passing and number sense score.
In addition, we include a level-2 predictor, the proportion of students in the
school receiving free lunch. In order to fit the model with these additional
variables, we would use the following command to obtain the subsequent
output. Note that this is a random intercept model, with no random coefficients.
summary(model8.3<-glmer(score2~numsense+female+L_Free+(1|school), family=binomial, data=mathfinal.nomiss, na.action=na.omit))
Scaled residuals:
Min 1Q Median 3Q Max
-4.8524 -0.6875 0.2734 0.6301 3.7430
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.2824 0.5314
Number of obs: 6810, groups: school, 32
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.969032 0.424692 -28.183 <2e-16 ***
numsense 0.063229 0.001761 35.911 <2e-16 ***
female -0.021276 0.058786 -0.362 0.7174
L_Free -0.008820 0.003733 -2.363 0.0181 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These results indicate that being female is not significantly related to one’s
likelihood of passing the math achievement test; i.e. there are no gender dif-
ferences in the likelihood of passing. Scores on the numsense subscale were
positively associated with the likelihood of passing the test, and attending
a school with a larger proportion of students receiving free lunch (L_Free)
was associated with a lower likelihood of passing. Next, we allow the
coefficient for female to vary randomly across schools.
summary(model8.4<-glmer(score2~numsense+female+L_Free+(female|school), family=binomial, data=mathfinal.nomiss, na.action=na.omit))
Scaled residuals:
Min 1Q Median 3Q Max
-4.6373 -0.6880 0.2752 0.6279 3.6198
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 0.22883 0.4784
female 0.01029 0.1015 1.00
Number of obs: 6810, groups: school, 32
Fixed effects:
The variance estimate for the random coefficient effect for gender is 0.01029.
The coefficients for the other variables in model8.4 are quite similar to those
in model8.3. In order to ascertain whether the random coefficient for female
is statistically significant, we can examine the 95% confidence interval, in
this case again using the percentile bootstrap.
These results show that the correlation between the random intercept and
the random coefficient for female (.sig02) is not different from 0, nor is the
coefficient linking female to the outcome variable, because in both cases the
confidence interval includes 0.
We can use the relative fit indices to make some judgments regarding
which of these two models might be optimal for better understanding the
population. Both AIC and BIC are very slightly smaller for model8.3 (7064.0
and 7098.1), as compared to model8.4 (7065.3 and 7113.1), indicating that
the simpler model (without the female random coefficient term) might be
preferable. In addition, the results of the likelihood ratio test, which appear
below, reveal that the fit of the two models is not statistically significantly
different. Given this statistically equivalent fit, we would prefer the simpler
model, without the random coefficient effect for female.
anova(model8.3, model8.4)
Data: NULL
Models:
model8.3: score2 ~ numsense + female + L_Free + (1 | school)
model8.4: score2 ~ numsense + female + L_Free + (female |
school)
          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model8.3   5 7064.0 7098.1 -3527.0   7054.0
model8.4   7 7065.3 7113.1 -3525.7   7051.3 2.6122      2     0.2709
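Again, the likelihood ratio test can be reproduced by hand; with 2 degrees of freedom the upper-tail χ2 probability has the closed form exp(−x/2), so the reported p-value is easy to check.

```r
chisq <- 2.6122  # reported chi-square statistic with 2 degrees of freedom
p <- 1 - pchisq(chisq, df = 2)
p                # approximately 0.271, matching the reported 0.2709
exp(-chisq / 2)  # the same value, via the closed form for df = 2
```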
summary(model8.5<-clmm(as.factor(score)~computation+(1|school), data=mathfinal, na.action=na.omit))
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.3844 0.62
Number of groups: school 40
Coefficients:
Estimate Std. Error z value Pr(>|z|)
computation 0.06977 0.00143 48.78 <2e-16 ***
Signif. Codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 13.6531 0.3049 44.77
2|3 17.2826 0.3307 52.26
One initial point to note is that the syntax for clmm is very similar in form
to that for lmer. As with most R model syntax, the outcome variable (score)
is separated from the fixed effect (computation) by ~, and the random
effect, school, is included in parentheses along with 1, to denote that we are
fitting a random intercepts model. We should also note that the dependent
variable needs to be a factor, leading to our use of as.factor(score) in
the command sequence. It is important to state at this point that, currently,
there is not an R package available to fit a random coefficients model for the
cumulative logits model.
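To see how estimates of this kind translate into predicted outcomes, the thresholds and slope from model8.5 can be converted into category probabilities. The short sketch below is in Python, purely for numeric illustration; it assumes the standard ordinal-package parameterization, logit P(Y ≤ j) = θ_j − βx, ignores the school random effect (i.e. a "typical" school), and the computation score of 200 is a hypothetical value.

```python
import math

def plogis(z):
    """Logistic CDF: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Estimates from model8.5: thresholds theta_1|2 and theta_2|3, plus the
# computation slope. clmm parameterizes logit P(Y <= j) = theta_j - b*x.
theta = {"1|2": 13.6531, "2|3": 17.2826}
b = 0.06977

def category_probs(x):
    """Return P(Y = 1), P(Y = 2), P(Y = 3) at computation score x,
    ignoring the school random effect."""
    p_le1 = plogis(theta["1|2"] - b * x)
    p_le2 = plogis(theta["2|3"] - b * x)
    return p_le1, p_le2 - p_le1, 1.0 - p_le2

p1, p2, p3 = category_probs(200)  # probabilities at a hypothetical score
```

As the computation score rises, probability mass shifts out of the lowest category and into the highest, which is exactly what the positive slope implies.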
An examination of the results presented above reveals that the variance
and standard deviation of intercepts across schools are 0.3844 and 0.62,
respectively. Given that the variation is not near 0, we would conclude that
there appear to be differences in intercepts from one school to the next. In
addition, we see that there is a significant positive relationship between per-
formance on the computation aptitude subtest and performance on the math
achievement test, indicating that examinees who have higher computation
skills also are more likely to attain higher ordinal scores on the achievement
test; e.g. pass versus fail or pass with distinction versus pass. We also obtain
estimates of the model intercepts, which are termed thresholds by clmm. As
was the case for the single-level cumulative logits model, the intercept rep-
resents the log odds of the likelihood of one response versus the other (e.g. 1
versus 2) when the value of the predictor variable is 0. A computation score
of 0 would indicate that the examinee did not correctly answer any of the
items on the test. Applying this fact to the first intercept presented above,
along with the exponentiation of the intercept that was demonstrated in the
previous chapter, we can conclude that the odds of a person with a computation score of 0 falling in the lowest achievement category (i.e. failing the exam) are e^13.6531 ≈ 850,092.12 to 1, or quite high! Finally, we also have available the AIC value (13365.74),
which we can use to compare the relative fit of this to other models.
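The odds arithmetic above is easy to verify by exponentiating the threshold directly, sketched here in Python for illustration only:

```python
import math

# Exponentiate the first threshold (1|2) from model8.5 to express the
# log odds as odds, as done in the text.
threshold_1_2 = 13.6531
odds = math.exp(threshold_1_2)  # approximately 850,092 to 1
```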
As an example of fitting models with both level-1 and level-2 variables,
let’s include the proportion of students receiving free lunch in the schools
(L_Free) as an independent variable along with the computation score.
summary(model8.6<-clmm(as.factor(score)~computation+L_Free+(1|school), data=mathfinal, na.action=na.omit))
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.4028 0.6347
Number of groups: school 34
Coefficients:
Estimate Std. Error z value Pr(>|z|)
computation 0.074606 0.001698 43.940 <2e-16 ***
L_Free -0.007612 0.004099 -1.857 0.0633 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 14.2381 0.4288 33.21
2|3 17.8775 0.4545 39.33
Given that we have already discussed the results of the previous model in
some detail, we will not reiterate those basic ideas again. However, it is impor-
tant to note those aspects that are different here. Specifically, the variability
in the intercepts declined somewhat with the inclusion of the school-level
variable, L_Free. In addition, we also see that there is not a statistically
significant relationship between the proportion of individuals receiving free
lunch in the school and the likelihood that an individual student will obtain
a higher achievement test score. Finally, a comparison of the AIC values for
the computation-only model (13365.74) and the computation and free-lunch
model (10080.25) shows that model8.6 provides a somewhat better fit to the
data than does model8.5, given its smaller AIC value. In other words, in
terms of the model fit to the data, we are better off including both free-lunch
and computation score when modeling the three-level achievement outcome
variable, even though L_Free is not statistically significantly related to the
outcome variable. Note that the anova command is not available for models
fit with clmm.
As of the writing of this book (December 2018), lme4 does not provide
for the fitting of multilevel ordinal logistic regression models. Therefore, the
clmm function within the ordinal package represents perhaps the most
straightforward mechanism for fitting such models, albeit with its own limi-
tations. As can be seen above, the basic fitting of these models is not complex,
and indeed the syntax is similar to that of lme4. In addition, the ordinal
package also allows for the fitting of ordered outcome variables in the non-
multilevel context (see the clm function), and for multinomial outcome
variables (see the clmm2 function, discussed below). As such, it represents
another method available for fitting such models in a unified framework.
In this example, patients who were recovering from a heart attack and entering rehabilitation agreed to be randomly assigned to either a new exercise treatment program or the standard
treatment protocol. Of particular interest to the researcher heading up this
study is the relationship between treatment condition and the number of
cardiac warning incidents. The new approach to rehabilitation is expected
to result in fewer such incidents as compared to the traditional method. In
addition, the researcher has also collected data on the sex of the patients,
and the number of hours that each rehabilitation facility is open during the
week. This latter variable is of interest as it reflects the overall availability of
the rehabilitation programs. The new method of conducting cardiac rehabili-
tation is coded in the data as 1, while the standard approach is coded as 0.
Males are also coded as 1 while females are assigned a value of 0.
summary(model8.7<-glmer(heart~trt+sex+(1|rehab), family=poisson, data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-5.906 -1.695 -0.881 0.756 40.163
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 1.216 1.103
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.83408 0.11181 7.46 8.67e-14 ***
trt -0.45612 0.03389 -13.46 < 2e-16 ***
sex 0.39305 0.03344 11.76 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In terms of the function call, the syntax for Model 8.7 is virtually identical
to that used for the dichotomous logistic regression model. The dependent
and independent variables are linked in the usual way that we have seen in
R: heart~trt+sex. Here, the outcome variable is heart, which reflects the
frequency of the warning signs for heart problems that we described above.
The independent variables are treatment (trt) and sex of the individual,
while the specific rehabilitation facility is contained in the variable rehab.
In this model, we are fitting a random intercept-only, with no random slope
and no rehabilitation-center-level variables.
The results of the analysis indicate that there is variation among the inter-
cepts from rehabilitation facility to rehabilitation facility, with a variance
of 1.216. As a reminder, the intercept reflects the mean frequency of events
when (in this case) both of the independent variables are 0; i.e. females in the
control condition. The variance of the intercepts across the 110 rehabilitation centers is 1.216, and this non-zero value suggests that the intercept does differ from center to center. Put another way, we can conclude that the mean number of cardiac warning signs varies across rehabilitation centers, and that the average female in the control condition is expected to have approximately e^0.834 ≈ 2.3 such incidents over the course of six months. In addition, these results reveal a statistically
significant negative relationship between heart and trt, and a statistically
significant positive relationship between heart and sex. Remember that
the new treatment is coded as 1 and the control as 0, so that a negative rela-
tionship indicates that there are fewer warning signs over six months for
those in the treatment than those in the control group. Also, given that males
were coded as 1 and females as 0, the positive slope for sex means that males
have more warning signs on average than do females.
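Because the Poisson model works on the log scale, the fixed effects in model8.7 are most easily interpreted after exponentiation, which converts them into multiplicative (rate ratio) effects on the expected count. This is the standard interpretation step, sketched in Python for illustration:

```python
import math

# Fixed effects from model8.7 are on the log scale; exponentiating
# turns them into multiplicative effects on the expected count.
coefs = {"(Intercept)": 0.83408, "trt": -0.45612, "sex": 0.39305}

rate_ratios = {name: math.exp(b) for name, b in coefs.items()}
# exp(-0.45612) is about 0.63: patients in the new treatment are expected
# to have roughly 37% fewer warning incidents than controls, holding sex
# and rehabilitation center constant.
```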
summary(model8.8<-glmer(heart~trt+sex+(trt|rehab), family=poisson, data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-6.109 -1.552 -0.725 0.640 31.917
Random effects:
Groups Name Variance Std.Dev. Corr
rehab (Intercept) 1.869 1.367
trt 1.844 1.358 -0.62
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.52852 0.14124 3.742 0.000183 ***
trt -0.12222 0.14749 -0.829 0.407310
sex 0.34415 0.03523 9.769 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The syntax for the inclusion of random slopes in the model is identi-
cal to that used with logistic regression and thus will not be commented
on further here. The random effect for slopes across rehabilitation centers
was estimated to be 1.844, which indicates that there is some differential
center effect to the impact of treatment on the number of cardiac warning
signs experienced by patients. Indeed, the variance for the random slopes
is approximately the same magnitude as the variance for the random inter-
cepts, which indicates that these two random effects are quite comparable
in magnitude. The correlation of the random slope and intercept model
components is fairly large and negative (−0.62), meaning that the greater
the number of cardiac events in a rehab center, the lower the impact of the
treatment on the number of such events. The average slope for treatment
across centers was no longer statistically significant, which indicates that
when we account for the random coefficient effect for treatment, the treat-
ment effect itself goes away.
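One way to appreciate the size of the random slope variance in model8.8 is to describe the implied spread of center-specific treatment effects. Assuming the usual normal distribution for the random effects, roughly 95% of centers should have treatment slopes within about two standard deviations of the fixed slope; the sketch below (Python, illustrative only) uses the estimates from the output above.

```python
# Center-specific treatment slopes in model8.8 are assumed normally
# distributed around the fixed slope, with the estimated random-slope
# standard deviation.
fixed_slope = -0.12222
slope_sd = 1.358

lower = fixed_slope - 1.96 * slope_sd
upper = fixed_slope + 1.96 * slope_sd
# Roughly 95% of centers are expected to have treatment slopes in
# (lower, upper): a wide range spanning both negative and positive
# effects, consistent with the non-significant average slope.
```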
As with the logistic regression, we can compare the fit of the two models
using both information indices, and a likelihood ratio test.
anova(model8.7,model8.8)
Data: rehab_data
Models:
model8.7: heart ~ trt + sex + (1 | rehab)
model8.8: heart ~ trt + sex + (trt | rehab)
         Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model8.7  4 11470 11490 -5731.1    11462
model8.8  6 10555 10584 -5271.3    10543 919.63      2  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model8.9<-glmer(heart~trt+sex+hours+(1|rehab), family=poisson, data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-5.907 -1.688 -0.887 0.755 40.150
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 1.148 1.071
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.80895 0.10979 7.368 1.73e-13 ***
trt -0.45617 0.03389 -13.459 < 2e-16 ***
sex 0.39319 0.03344 11.758 < 2e-16 ***
hours 0.24530 0.11142 2.201 0.0277 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These results show that the more hours a center is open the more warning
signs patients who attend will experience over a six-month period. In other
respects, the parameter estimates for model8.9 do not differ substantially
from those of the earlier models, generally revealing similar relationships
among the independent and dependent variables.
We can also make comparisons among the various models in order to
determine which yields the best fit to our data. Given that the AIC and BIC
values for model8.8 are lower than those of model8.9, we would conclude
that model8.8 yields the best fit to the data. In addition, below are the results
for the likelihood ratio tests comparing these models with one another.
anova(model8.8,model8.9)
Data: rehab_data
Models:
model8.9: heart ~ trt + sex + hours + (1 | rehab)
model8.8: heart ~ trt + sex + (trt | rehab)
         Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model8.9  5 11468 11492 -5728.7    11458
model8.8  6 10555 10584 -5271.3    10543 914.95      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(model8.7,model8.9)
Data: rehab_data
Models:
model8.7: heart ~ trt + sex + (1 | rehab)
model8.9: heart ~ trt + sex + hours + (1 | rehab)
         Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
These results show that model8.8 fits the data significantly better than
model8.9, which in turn fits the data significantly better than model8.7.
Earlier, we found that model8.8 also fit the data better than model8.7, based
on the likelihood ratio test results, and AIC/BIC values. Thus, given all of
these results, we would conclude that model8.8 provides the best fit to the
data, from among the three that we tried here.
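The quantities in these anova() tables are all derived from the log-likelihoods, so the comparison of model8.7 and model8.8 can be reconstructed by hand: AIC = −2·logLik + 2·df, and the likelihood ratio statistic is the difference in deviances. A short check in Python (for illustration; the closed-form survival function exp(−x/2) is exact only for a chi-square with 2 degrees of freedom):

```python
import math

# Log-likelihoods and parameter counts from the anova() output above.
loglik = {"model8.7": -5731.1, "model8.8": -5271.3}
df = {"model8.7": 4, "model8.8": 6}

# AIC = -2*logLik + 2*df reproduces the AIC column.
aic = {m: -2 * loglik[m] + 2 * df[m] for m in loglik}

# The likelihood ratio statistic is the difference in deviances.
chisq = -2 * loglik["model8.7"] - (-2 * loglik["model8.8"])
chi_df = df["model8.8"] - df["model8.7"]

# For 2 degrees of freedom the chi-square survival function is exp(-x/2),
# so the p-value here is vanishingly small, matching "< 2.2e-16".
p_value = math.exp(-chisq / 2)
```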
Recall that the defining characteristic of the Poisson distribution is the equality
of the mean and variance. In some instances, however, the variance of
a variable may be larger than the mean, leading to the problem of over-
dispersion, which we described in Chapter 7. In the previous chapter we
described alternative statistical models for such situations, including
one based on the quasipoisson distribution, which took the same form
as the Poisson, except that it relaxed the requirement of equal mean and
variance. It is possible to fit the quasipoisson distribution in the multi-
level modeling context as well, though not using lme4. The developer of
lme4 is not confident in the quasipoisson fitting algorithm, and has thus
removed this functionality from lme4, though alternative estimators for
overdispersed data are available using lme4. Rather, we would need to use the glmmPQL function from the MASS library (which builds on nlme). In this case, we would use the following syntax for the random intercept model with the quasipoisson estimator.
summary(model8.10<-glmmPQL(heart~trt+sex, random=~1|rehab, family=quasipoisson, data=rehab_data))
Random effects:
Formula: ~1 | rehab
(Intercept) Residual
StdDev: 0.6620581 4.010266
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: heart ~ trt + sex
Value Std.Error DF t-value p-value
(Intercept) 1.2306419 0.09601707 888 12.816908 0.0000
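The motivation for these alternatives to the standard Poisson model can be checked informally by comparing the sample mean and variance of the count outcome: a variance far above the mean points toward overdispersion. The sketch below is in Python, and the counts are hypothetical values standing in for the heart variable in rehab_data:

```python
# Informal overdispersion check: under a Poisson model the mean and
# variance of the counts should be close, so a variance/mean ratio well
# above 1 points toward quasi-Poisson or negative binomial models.
# The counts below are hypothetical stand-ins for rehab_data$heart.
counts = [0, 0, 1, 1, 2, 2, 3, 5, 8, 14]

n = len(counts)
mean = sum(counts) / n
variance = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance
dispersion_ratio = variance / mean  # >> 1 suggests overdispersion
```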
summary(model8.11<-glmer.nb(heart~trt+sex+(1|rehab),
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-0.4240 -0.4170 -0.4101 0.1304 9.6989
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 0.2079 0.456
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2140 0.1541 7.879 3.3e-15 ***
trt -0.5126 0.1708 -3.001 0.00269 **
sex 0.4189 0.1729 2.423 0.01541 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The function call includes the standard model setup in R for the fixed
effects (trt, sex), with the random effect (intercept within rehab center in this example) denoted in the same way as for the other glmer-based models.
put, after the function call, we see the table of parameter estimates, stan-
dard errors, test statistics, and p-values. These results are similar to those
described above, indicating the significant relationships between the fre-
quency of cardiac warning signs and both treatment and sex. The variance
associated with the random effect was estimated to be 0.2079. The findings
with respect to the fixed effects are essentially the same as those for the
standard Poisson regression model, with a statistically significant negative
relationship between treatment and the number of cardiac events, and a sig-
nificant positive relationship for sex. The 95% profile confidence intervals for
the fixed and random effects in model8.11 appear below.
confint(model8.11, method=c("profile"))
From these results, we can see that 0 is not included in any of the intervals,
meaning that they are all statistically significant.
As with the Poisson regression model, it is possible to fit a random coef-
ficients model for the negative binomial, using very similar R syntax as that
for glmer. In this case, we will fit a random coefficient for the trt variable,
as we did for the Poisson regression model.
summary(model8.12<-glmer.nb(heart~trt+sex+(trt|rehab),
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-0.4267 -0.4180 -0.4121 0.1375 8.0981
Random effects:
Groups Name Variance Std.Dev. Corr
rehab (Intercept) 0.36992 0.6082
trt 0.09906 0.3147 -1.00
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1155 0.1694 6.583 4.61e-11 ***
trt -0.3487 0.1986 -1.756 0.0791 .
sex 0.4000 0.1717 2.330 0.0198 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The variance estimate for the random trt effect is 0.09906, as compared
to the larger random intercept variance estimate of 0.36992. This result suggests that the differences in the mean number of cardiac events across rehab centers are greater than the cross-center differences in the treatment effect on
the number of events. The 95% percentile bootstrap confidence intervals for
the fixed and random effects appear below. Note that the profile confidence
interval approach did not converge, and thus wasn’t used here.
The intervals for the random intercept (.sig01), the random coefficient
(.sig03), the fixed intercept, and sex all excluded 0, meaning that these
terms can be seen as statistically significant. However, intervals for the cor-
relation between the random effects (.sig02) and the fixed treatment effect
(trt) all included 0. Thus, we would conclude that there is not a statistically
significant relationship between treatment condition and the number of car-
diac events, when the random treatment effect is included in the model. This
conclusion based on the confidence interval mirrors the result of the hypoth-
esis test in the original output for model8.12. There is, however, a difference
in the treatment effect across the rehab centers, given the statistically signifi-
cant random coefficient effect.
As with other models that we have examined in this book, it is possible to
include level-2 independent variables, such as number of hours the centers
are open, in the model, and compare the relative fit using the relative fit indi-
ces, as in Model8.13.
summary(model8.13<-glmer.nb(heart~trt+sex+hours+(1|rehab),
data=rehab_data))
Scaled residuals:
Min 1Q Median 3Q Max
-0.4223 -0.4161 -0.4090 0.1177 9.4494
Random effects:
Groups Name Variance Std.Dev.
rehab (Intercept) 0.1284 0.3583
Number of obs: 1000, groups: rehab, 110
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.20520 0.15109 7.977 1.5e-15 ***
trt -0.49358 0.16690 -2.957 0.00310 **
sex 0.40764 0.16925 2.408 0.01602 *
hours 0.26415 0.09546 2.767 0.00566 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
      (Intr) trt    sex
trt   -0.534
sex   -0.292 -0.246
hours -0.066  0.104 -0.082
As we have seen previously, the number of hours that the centers are open
is significantly positively related to the number of cardiac warning signs
over the six-month period of the study.
The 95% profile confidence intervals for the random and fixed effects
in model8.13 appear below. The interval for the random intercept effect
(.sig01) includes 0, leading us to conclude that under the negative binomial
model when the number of hours for which the center is open is included in
the model, there is not a statistically significant difference among the rehab
centers in terms of the number of cardiac events.
confint(model8.13, method=c("profile"))
Now that we have examined the random and fixed effects for each of the
three negative binomial models, we can compare their fit with one another
in order to determine which is optimal given the set of data at hand. We
can formally compare these three models using the likelihood ratio test, as
below. In addition, we will also compare the AIC and BIC values to provide
further evidence regarding the relative model fit.
anova(model8.11, model8.12)
Data: rehab_data
Models:
model8.11: heart ~ trt + sex + (1 | rehab)
model8.12: heart ~ trt + sex + (trt | rehab)
          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model8.11  5 3937.9 3962.5 -1964.0   3927.9
model8.12  7 3939.3 3973.7 -1962.7   3925.3 2.6163      2     0.2703
anova(model8.11, model8.13)
Data: rehab_data
Models:
First, the results of the likelihood ratio test show that the fit of Model8.12
was not significantly different than that of Model8.11, meaning that inclusion
of the random treatment effect did not improve model fit. This finding is
further reinforced by the lower AIC and BIC values for model8.11. Next, we
compared the fit of models8.11 and 8.13. Here, we see a statistically signifi-
cant difference in the model fit, with slightly lower AIC and BIC values asso-
ciated with model8.13, which included the number of hours that the rehab
centers were open. Thus, we would conclude that having this variable in the
model is associated with better fit to the data. Finally, models8.12 and 8.13
are not nested within one another, and thus cannot be compared using the
likelihood ratio test. However, the AIC and BIC values for model8.13 were
smaller than those of model8.12, suggesting that inclusion of the rehab center
hours is more important to yielding good fit to the data than is inclusion of
the random treatment condition effect.
Summary
In Chapter 8 we learned that the generalized linear models featured in
Chapter 7, which accommodate categorical dependent variables, can be eas-
ily extended to the multilevel context. Indeed, the basic concepts that we
learned in Chapter 2 regarding sources of variation, and various types of
models can be easily extended for categorical outcomes. In addition, R provides for easy fitting of such models through functions such as glmer and clmm. Therefore, in many ways Chapter 8 represents a review of material that by now should be familiar to us, even while applied in a different
scenario than we have seen up to now. Perhaps the most important point to
take away from this chapter is the notion that modeling multilevel data in
the context of generalized linear models is not radically different from the
normally distributed continuous dependent variable case, so that the same
types of interpretations can be made, and the same types of data structure
can be accommodated.
9
Bayesian Multilevel Modeling
the posterior distributions for each of the model parameters (e.g. regres-
sion coefficients, random-effect variances). From this posterior distribution,
parameter values are simulated a large number of times in order to obtain an
estimated posterior distribution. After each such sample is drawn, the pos-
terior is updated. This iterative sampling and updating process is repeated a
very large number of times (e.g. 10,000 or more) until there is evidence of convergence regarding the posterior distribution; i.e. values drawn in successive iterations are very similar to one another. The Markov Chain part
of MCMC reflects the process of sampling a current value from the posterior
distribution, given the previous sampled value, while Monte Carlo reflects
the random simulation of these values from the posterior distribution. When
the chain of values has converged, we are left with an estimate of the poste-
rior distribution of the parameter of interest (e.g. regression coefficient). At
this point, a single model parameter estimate can be obtained by calculating
the mean, median, or mode from the posterior distribution.
When using MCMC, the researcher must be aware of some technical
aspects of the estimation that need to be assessed in order to ensure that the
analysis has worked properly. The collection of 10,000 (or more) individual
parameter estimates forms a lengthy time series, which must be examined
to ensure that two things are true. First, the parameter estimates must con-
verge, and second the autocorrelation between different iterations in the pro-
cess should be low. Parameter convergence can be assessed through the use
of a trace plot, which is simply a graph of the parameter estimates in order
from the first iteration to the last. The autocorrelation of estimates is calcu-
lated for a variety of iterations, and the researcher will look for the distance
between estimates at which the autocorrelation becomes quite low. When
it is determined at what point the autocorrelation between estimates is suf-
ficiently low, the estimates are thinned so as to remove those that might be
more highly autocorrelated with one another than would be desirable. So, for
example, if the autocorrelation is low when the estimates are ten iterations
apart, the time series of 10,000 sample points would be thinned to include
only every tenth observation, in order to create the posterior distribution
of the parameter. The mean/median/mode of this distribution would then
be calculated using only the thinned values, in order to obtain the single
parameter estimate value that is reported by R. A final issue in this regard
is what is known as the burn-in period. Thinking back to the issue of distri-
butional convergence, the researcher will not want to include any values in
the posterior distribution for iterations prior to the point at which the time
series converged. Thus, iterations prior to this convergence are referred to as
having occurred during the burn-in period, and are not used in the calcula-
tion of posterior means/medians/modes. Each of these MCMC conditions
(number of iterations, thinning rate, and burn-in period) can be set by the
user in R, or default values can be used. In the remainder of this chapter we
will provide detailed examples of the diagnosis of MCMC results, and the
setting not only of MCMC parameters, but also of prior distributions.
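The relationship among these three settings is simple arithmetic: the number of draws retained for the posterior is (iterations − burn-in) / thinning rate. The sketch below (Python, illustrative; MCMCglmm's own counting may differ by a draw at the boundaries) reproduces the sample sizes that appear in the output later in this chapter.

```python
def retained_samples(nitt, burnin, thin):
    """Number of MCMC draws kept after discarding the burn-in period
    and keeping every thin-th iteration (approximately how MCMCglmm
    counts its retained samples)."""
    return (nitt - burnin) // thin

# The MCMCglmm defaults (nitt=13000, burnin=3000, thin=10) yield 1000
# retained samples; the custom settings used for model9.1b yield 1800.
default_n = retained_samples(13000, 3000, 10)
custom_n = retained_samples(100000, 10000, 50)
```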
library(MCMCglmm)
prime_time.nomiss<-na.omit(prime_time)
attach(prime_time.nomiss)
model9.1<-MCMCglmm(geread~gevocab, random=~school, data=prime_time.nomiss)
plot(model9.1)
summary(model9.1)
The function call for MCMCglmm is fairly similar to what we have seen in
previous chapters. One important point to note is that MCMCglmm does not
accommodate the presence of missing data. Therefore, before conducting the
analysis we needed to expunge all of the observations with missing data. We
created a dataset with no missing observations using the command prime_time.nomiss<-na.omit(prime_time), which created a new data frame called prime_time.nomiss containing no missing data. We then attached
this data frame and fit the multilevel model, indicating the random effect
[Figure: trace plots (left) and posterior density plots (right) for the model9.1 parameters, including the fixed effects and the units (residual) variance component; each chain spans iterations 4,000 to 13,000, with N = 1000 retained samples.]
For each model parameter, we have the trace plot on the left, showing the
entire set of parameter estimates as a time series across the 13,000 iterations.
On the right, we have a histogram of the distribution of parameter estimates.
Our purpose for examining these plots is to ascertain to what extent the
estimates have converged on a single value. As an example, the first pair of
graphs reflects the parameter estimates for the intercept. For the trace, con-
vergence is indicated when the time series plot hovers around a single value
on the y-axis, and does not meander up and down. In this case, it is clear
that the trace plot for the intercept shows convergence. This conclusion is
reinforced by the histogram for the estimate, which is clearly centered over
a single mean value, with no bimodal tendencies. We see similar results for
the coefficient of vocabulary, the random effect of school, and the residual.
Given that the parameter estimates appear to have successfully converged,
we can have confidence in the actual estimated values, which we will exam-
ine shortly.
Prior to looking at the parameter estimates, we want to assess the auto-
correlation of the estimates in the time series for each parameter. Our
purpose here is to ensure that the rate of thinning (taking every tenth
observation generated by the MCMC algorithm) that we used is sufficient
to ensure that any autocorrelation in the estimates is eliminated. In order
to obtain the autocorrelations for the random effects, we use the command
autocorr(model9.1$VCV), and obtain the following results.
, , school
school units
Lag 0 1.00000000 -0.05486644
Lag 10 -0.03926722 -0.03504799
Lag 50 -0.01636431 -0.04016879
Lag 100 -0.03545104 0.01987726
Lag 500 0.04274662 -0.05083669
, , units
school units
Lag 0 -0.0548664421 1.000000000
Lag 10 -0.0280445140 -0.006663408
Lag 50 -0.0098424151 0.017031804
Lag 100 0.0002654196 0.010154987
Lag 500 -0.0022835508 0.046769152
We read this table as follows: in the first section, we see results for the ran-
dom effect school. This output includes correlations involving the school
variance component estimates. Under the school column are the actual
autocorrelations for the school random effect estimate. Under the units
column are the cross correlations between estimates for the school random
effect and the residual random effect, at different lags. Thus, for example,
the correlation between the estimates for school and the residual with no
lag is –0.0549. The correlation between the school estimate 10 lags prior to
the current residual estimate is −0.035. In terms of ascertaining whether our
rate of thinning is sufficient, the more important numbers are in the school
column, where we see the correlation between a given school effect estimate
and the school effect estimate 10, 50, 100, and 500 estimates before. The auto-
correlation at a lag value of 10, −0.0393, is sufficiently small for us to have con-
fidence in our thinning the results at 10. We would reach a similar conclusion
regarding the autocorrelation of the residual (units), such that 10 appears
to be a reasonable thinning value for it as well. We can obtain the autocorre-
lations of the fixed effects using the command autocorr(model9.1$Sol).
Once again, it is clear that there is essentially no autocorrelation as far out
as a lag of 10, indicating that the default thinning value of 10 is sufficient for
both the intercept and the vocabulary test score.
, , (Intercept)
(Intercept) gevocab
Lag 0 1.000000000 -0.757915532
Lag 10 -0.002544175 -0.013266125
Lag 50 -0.019405970 0.007370979
Lag 100 -0.054852949 0.029253018
Lag 500 0.065853783 -0.046153346
, , gevocab
(Intercept) gevocab
Lag 0 -0.757915532 1.000000000
Lag 10 0.008583659 0.020942660
Lag 50 -0.001197203 -0.002538901
Lag 100 0.047596351 -0.022549594
Lag 500 -0.057219532 0.026075911
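The values in these tables are ordinary lag correlations of the stored chains: the chain is correlated with a copy of itself shifted k iterations. A minimal version of the calculation, in Python with a short hypothetical chain in place of a real MCMC sample:

```python
def lag_autocorr(chain, k):
    """Lag-k autocorrelation of a chain: correlate the series with
    itself shifted k positions (normalized by total variation)."""
    n = len(chain)
    mean = sum(chain) / n
    num = sum((chain[i] - mean) * (chain[i + k] - mean) for i in range(n - k))
    den = sum((x - mean) ** 2 for x in chain)
    return num / den

# A small, well-mixed hypothetical chain standing in for a stored sample.
chain = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.50, 0.51, 0.49,
         0.52, 0.50, 0.49, 0.51, 0.50, 0.52, 0.48, 0.51, 0.50, 0.49]

r0 = lag_autocorr(chain, 0)  # lag 0 is always exactly 1
r1 = lag_autocorr(chain, 1)  # small in magnitude for a well-mixed chain
```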
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 43074.14
G-structure: ~school
R-structure: ~units
We are first given information about the number of iterations, the thinning
interval, and the final number of MCMC values that were sampled (Sample
size) and used to estimate the model parameters. Next, we have the model
fit index, the DIC, which can be used for comparing various models and
selecting the one that provides optimal fit. The DIC is interpreted in much
the same fashion as the AIC and BIC, which we discussed in earlier chapters,
and for which smaller values indicate better model fit. We are then provided
with the posterior mean of the distribution for each of the random effects,
school and residual, which MCMCglmm refers to as units. The mean variance
estimate for the school random effect is 0.09962, with a 95% credibility inter-
val of 0.06991 to 0.1419. Remember that we interpret credibility intervals in
Bayesian modeling in much the same way that we interpret confidence inter-
vals in frequentist modeling. This result indicates that reading achievement
scores do differ across schools, because 0 is not in the interval. Similarly,
the residual variance also differs from 0. With regard to the fixed effect of
vocabulary score, which had a mean posterior value of 0.5131, we also con-
clude that the results are statistically significant, given that 0 is not in its 95%
credibility interval. We also have a p-value for this effect, and the intercept,
both of which are significant with values lower than 0.05. The positive value
of the posterior mean indicates that students with higher vocabulary scores
also had higher reading scores.
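The credibility intervals reported in the summary can also be extracted directly from the stored posterior samples using the coda package, which MCMCglmm loads. This is a sketch, assuming the fitted object is named model9.1 as in the text:

```r
# 95% highest posterior density intervals for the variance components (VCV)
# and the fixed effects (Sol) of the fitted MCMCglmm object
library(coda)
HPDinterval(model9.1$VCV)
HPDinterval(model9.1$Sol)
```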
In order to demonstrate how we can change the number of iterations, the
burn-in period, and the rate of thinning in R, we will reestimate Model9.1
with 100,000 iterations, a burn-in of 10,000, and a thinning rate of 50. This
will yield 1,800 samples for the purposes of estimating the posterior distri-
bution for each model parameter. The R commands for fitting this model,
followed by the relevant output, appear below.
model9.1b<-MCMCglmm(geread~gevocab, random=~school,
data=prime_time.nomiss, nitt=100000, thin=50, burnin=10000)
plot(model9.1b)
summary(model9.1b)
Bayesian Multilevel Modeling 167
[Trace and density plots for the school and units (residual) variance components of model9.1b, each based on N = 1800 retained samples.]
As with the initial model, all parameter estimates appear to have success-
fully converged. The results, in terms of the posterior means, are also very
similar to what we obtained using the default values for the number of itera-
tions, the burn-in period, and the thinning rate. This result is not surprising,
given that the diagnostic information for our initial model was all very posi-
tive. Nonetheless, it was useful for us to see how the default values can be
changed if we need to do so.
Iterations = 10001:99951
Thinning interval = 50
Sample size = 1800
DIC: 43074.19
G-structure: ~school
R-structure: ~units
model9.2<-MCMCglmm(geread~gevocab+senroll, random=~school,
data=prime_time.nomiss)
plot(model9.2)
autocorr(model9.2$VCV)
, , school
school units
Lag 0 1.000000000 -0.05429139
Lag 10 -0.002457293 -0.07661475
Lag 50 -0.020781555 -0.01761532
Lag 100 -0.027670953 0.01655270
Lag 500 0.035838857 -0.03714127
, , units
school units
Lag 0 -0.05429139 1.000000000
Lag 10 0.03284220 -0.004188523
Lag 50 0.02396060 -0.043733590
Lag 100 -0.04543941 -0.017212479
Lag 500 -0.01812893 0.067148463
autocorr(model9.2$Sol)
, , (Intercept)
, , gevocab
, , senroll
The summary results for the model with 40,000 iterations, and a thinning
rate of 100, appear below. It should be noted that the trace plots and his-
tograms of parameter estimates for Model9.2 indicated that convergence
had been attained. From these results we can see that the overall fit, based
on the DIC, is virtually identical to that of the model not including sen-
roll. In addition, the posterior mean estimate and associated 95% credible
interval for this parameter show that senroll was not statistically signifi-
cantly related to reading achievement; i.e. 0 is in the interval. Taken together,
we would conclude that school size does not contribute significantly to the
variation in reading achievement scores, nor to the overall fit of the model.
summary(model9.2)
Iterations = 3001:39901
Thinning interval = 100
Sample size = 1700
DIC: 43074.86
G-structure: ~school
R-structure: ~units
model9.3<-MCMCglmm(geread~gevocab, random=~school+gevocab,
data=prime_time.nomiss)
plot(model9.3)
summary(model9.3)
autocorr(model9.3$VCV)
, , school
, , gevocab
, , units
autocorr(model9.3$Sol)
, , (Intercept)
(Intercept) gevocab
Lag 0 1.00000000 -0.86375013
Lag 10 -0.01675809 0.01808335
Lag 50 -0.01334607 0.03583885
Lag 100 0.02850369 -0.01102134
Lag 500 0.03392102 -0.04280691
, , gevocab
(Intercept) gevocab
Lag 0 -0.863750126 1.0000000000
Lag 10 0.008428317 0.0008246964
Lag 50 0.007928161 -0.0470879801
Lag 100 -0.029552813 0.0237866610
Lag 500 -0.029554289 0.0425010354
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 42663.14
G-structure: ~school
~gevocab
R-structure: ~units
variable, we would first need to define our prior, as below. Step one in this
process is to create the covariance matrix (var) containing the prior of the
fixed effects in the model (intercept and memory). In this case, we set the
prior variances of the intercept and the coefficient for memory to 1 and 0.1,
respectively. We select a fairly small variance for the working memory coef-
ficient because we have much prior evidence in the literature regarding the
anticipated magnitude of this relationship. In addition, we will need to set
the priors for the error and random intercept terms. The inverse-Wishart
distribution variance structure is used, and here we set the value at 1 with a
certainty parameter of nu=0.002.
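The original prior and model call are not reproduced in the text; a sketch consistent with the description above follows. The prior means of 0 for the fixed effects, the geread outcome, and the data frame name achieve.nomiss are assumptions for illustration:

```r
# Prior covariance matrix for the fixed effects: variance 1 for the
# intercept and 0.1 for the npamem (working memory) coefficient
var <- diag(c(1, 0.1))

prior.model9.4 <- list(
  B = list(mu = c(0, 0), V = var),         # fixed effects (prior means of 0 assumed)
  G = list(G1 = list(V = 1, nu = 0.002)),  # random intercept for school
  R = list(V = 1, nu = 0.002)              # residual (units)
)

# Hypothetical data frame name; substitute the data set actually in use
model9.4 <- MCMCglmm(geread ~ npamem, random = ~school,
                     prior = prior.model9.4, data = achieve.nomiss)
```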
The model appears to have converged well, and the autocorrelations sug-
gest that the rate of thinning was appropriate.
[Trace and density plots for the model9.4 parameter estimates: the fixed effects (intercept and npamem) and the school and units variance components, each based on N = 1000 samples.]
autocorr(model9.4$VCV)
, , school
school units
Lag 0 1.000000000 -0.019548306
Lag 10 0.046970940 -0.001470008
Lag 50 -0.014670119 -0.051306845
Lag 100 0.020042317 0.013675599
Lag 500 -0.005250327 0.028681171
, , units
school units
Lag 0 -0.01954831 1.00000000
Lag 10 -0.01856012 0.03487098
Lag 50 0.05694637 0.01137949
Lag 100 -0.04096406 -0.03780291
Lag 500 0.02678024 0.02372686
autocorr(model9.4$Sol)
, , (Intercept)
(Intercept) npamem
Lag 0 1.00000000 -0.640686295
Lag 10 0.02871714 0.004803420
Lag 50 -0.03531602 0.011157302
Lag 100 0.01483541 -0.040542950
, , npamem
(Intercept) npamem
Lag 0 -0.6406862955 1.000000000
Lag 10 -0.0335729209 -0.022385089
Lag 50 0.0229034652 -0.002681217
Lag 100 0.0007594231 0.008694124
Lag 500 0.0311681203 -0.015291965
The summary of the model fit results appears below. Of particular interest
is the coefficient for the fixed-effect working memory (npamem). The
posterior mean is 0.01266, with a credible interval ranging from 0.01221 to
0.01447, indicating that the relationship between working memory and read-
ing achievement is statistically significant. It is important to note, however,
that the estimate of this relationship for the current sample is well below
that reported in prior research, which was incorporated into the prior
distribution. In this case, because the sample is so large, the effect of the prior
on the posterior distribution is very small. The impact of the prior would be
much greater were we working with a smaller sample.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 45908.05
G-structure: ~school
R-structure: ~units
As a point of comparison, we also fit the model using the default priors in
MCMCglmm to see what impact the informative priors had on the posteriors.
mathfinal.nomiss<-na.omit(mathfinal)
model9.5<-MCMCglmm(score2~numsense, random=~school,
family="ordinal", data=mathfinal.nomiss)
plot(model9.5)
autocorr(model9.5$VCV)
autocorr(model9.5$Sol)
summary(model9.5)
The default prior parameters are used, and the family is defined as
ordinal. In other respects, the function call is identical to that for the
continuous outcome variables that were the focus of the earlier part of
this chapter. The output from R appears below. From the trace plots and
histograms, we can see that convergence was achieved for each of the
model parameters, and the autocorrelations show that our rate of thin-
ning is sufficient.
[Trace and density plots for the model9.5 parameter estimates, each based on N = 1000 samples.]
, , school
school units
Lag 0 1.000000000 0.24070410
Lag 10 0.016565749 0.02285168
Lag 50 0.012622856 0.02073446
Lag 100 0.007855806 0.02231629
Lag 500 0.007233911 0.01822021
, , units
school units
Lag 0 0.24070410 1.00000000
Lag 10 0.02374442 0.00979023
Lag 50 0.02015865 0.00917857
Lag 100 0.01965188 0.00849276
Lag 500 0.01361470 0.00459030
, , (Intercept)
(Intercept) numsense
Lag 0 1.00000000 -0.09818969
Lag 10 0.00862290 -0.00878574
Lag 50 0.00688767 -0.00707115
Lag 100 0.00580816 -0.00603118
Lag 500 0.00300539 -0.00314349
, , numsense
(Intercept) numsense
Lag 0 -0.09818969 1.0000000
Lag 10 -0.00876214 0.00894084
Lag 50 -0.00704441 0.00723130
Lag 100 -0.00594502 0.00618679
Lag 500 -0.00315547 0.00328528
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 6333.728
G-structure: ~school
R-structure: ~units
Exactly the same command sequence that was used here would also be
used to fit a model for an ordinal variable with more than two categories.
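As a sketch, if the data contained a five-category rating (score5 here is a hypothetical variable name), the call would be unchanged apart from the outcome:

```r
# Same specification as model9.5, but with a polytomous ordinal outcome;
# score5 is a hypothetical 1-5 rating assumed to exist in the same data set
model9.5b <- MCMCglmm(score5 ~ numsense, random = ~school,
                      family = "ordinal", data = mathfinal.nomiss)
summary(model9.5b)
```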
attach(heartdata)
model9.6<-MCMCglmm(heart~trt+sex, random=~rehab,
family="poisson", data=heartdata)
plot(model9.6)
autocorr(model9.6$VCV)
autocorr(model9.6$Sol)
summary(model9.6)
, , rehab
rehab units
, , (Intercept)
, , trt
, , sex
An examination of the trace plots and histograms shows that the param-
eter estimation converged appropriately. In addition, the autocorrelations
are sufficiently small for each of the parameters so that we can have confi-
dence in our rate of thinning. Therefore, we can move to the discussion of the
model parameter estimates, which appear below.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 2735.293
G-structure: ~rehab
R-structure: ~units
In terms of the primary research question, the results indicate that the fre-
quency of cardiac risk signs was lower among those in the treatment condition
than those in the control, when accounting for the participants’ sex. In addi-
tion, there was a statistically significant difference in the rate of risk symptoms
between males and females. With respect to the random effects, the variance
in the outcome variable due to rehabilitation facility, as well as the residual,
were both significantly different from 0. The posterior mean effect of the rehab
facility was 0.5414, with a 95% credibility interval of 0.1022 to 1.009. This result
indicates that symptom frequency does differ among the facilities.
We may also be interested in examining a somewhat more complex explana-
tion of the impact of treatment on the rate of cardiac symptoms. For instance,
there is evidence from previous research that the number of hours the facili-
ties are open may impact the frequency of cardiac symptoms, by providing
more, or less, opportunity for patients to make use of their services. In turn, if
more participation in rehabilitation activities is associated with the frequency
of cardiac risk symptoms, we might expect the hours of operation to impact
them. In addition, it is believed that the impact of the treatment on the out-
come might vary among rehabilitation centers, leading to a random coeffi-
cients model. The R commands to fit the random coefficients (for treatment)
model, with a level-2 covariate (hours of operation) appear below, followed
by the resulting output. Note that, as we have seen in previous examples in
this chapter, in order to specify a random coefficients model, we include the
variables of interest (rehab and trt) in the random statement.
model9.7<-MCMCglmm(heart~trt+sex+hours, random=~rehab+trt,
family="poisson", data=heartdata)
plot(model9.7)
autocorr(model9.7$VCV)
autocorr(model9.7$Sol)
summary(model9.7)
, , rehab
, , trt
, , units
, , trt
, , sex
, , hours
The trace plots and histograms reveal that estimation converged for each
of the parameters estimated in the analysis, and the autocorrelations of esti-
mates are small. Thus, we can move on to interpretation of the parameter
estimates. The results of the model fitting revealed several interesting pat-
terns. First, the random coefficient term for treatment was statistically sig-
nificant, given that the credible interval ranged between 5.421 and 7.607, and
did not include 0. Thus, we can conclude that the impact of treatment on
the frequency of cardiac symptoms varies across the rehabilitation centers.
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 2731.221
G-structure: ~rehab
~trt
R-structure: ~units
Summary
The material presented in Chapter 9 represents a marked departure from
that presented in the first eight chapters of the book. In particular, methods
presented in the earlier chapters were built upon a foundation of maximum
likelihood estimation. Bayesian modeling, which is the focus of Chapter 9,
takes a fundamentally different approach, in which prior information about
the model parameters is combined with the observed data to yield posterior
distributions.
The purpose of this chapter is to introduce a wide array of topics in the area
of multilevel analysis that do not fit neatly in any of the other chapters in
the book. We refer to these as advanced issues because they represent exten-
sions, of one kind or another, on the standard multilevel modeling frame-
work that we have discussed heretofore. In this chapter we will describe how
the estimation of multilevel model parameters can be adjusted for the pres-
ence of outliers using robust or rank-based methods. As we will see, such
approaches provide the researcher with powerful tools for handling situa-
tions when data do not conform to the distributional assumptions underly-
ing the models that we discussed in Chapters 3 and 4. We will then turn
our attention to the problem of fitting multilevel models when there are a
large number of predictor variables. Such data structure can make it diffi-
cult to obtain reasonable parameter and standard error estimates. In order to
address this issue, penalized estimation procedures can be used in order to
identify only those predictors that are the most salient for a given model and
set of data. We will then turn our attention to multivariate multilevel data
problems, in which there are multiple dependent variables. This situation
corresponds to multivariate analysis of variance and multivariate regres-
sion in the single-level modeling framework. Next, we will examine mul-
tilevel generalized additive models, which can be used in situations where
the relationship between a predictor and an outcome variable is nonlinear.
We will finish out the chapter with a discussion of predicting level-2 out-
come variables with level-1 independent variables, and with a description of
approaches for power analysis/sample size determination in the context of
multilevel modeling.
Di = [ Σ (ei − eij)² ] / (k · MSr)        (10.1)
where
ei = residual for observation i for model containing all observations
eij = residual for observation i for model with observation j removed
k = number of independent variables
MSr = mean square of the residuals
There are no hard and fast rules for how large Di should be in order for
us to conclude that it represents an outlying observation. Fox (2016) recom-
mends that the data analyst flag observations that have Di values that are
unusual when compared to the rest of those in the dataset, and this is the
approach that we will recommend as well.
Another influence diagnostic, which is closely related to Di, is DFFITS.
This statistic compares the predicted value for individual i when the full
dataset is used ( ŷi), against the prediction for individual i when individual j
is dropped from the data ( ŷij). For individual j, DFFITS is calculated as:
DFFITSj = (ŷi − ŷij) / √(MSEj · hj)        (10.2)
where
MSEj = mean squared error when observation j is dropped from the data
hj = leverage value for observation j
As was the case with Di, there are no hard and fast rules about how large
DFFITSj should be in order to flag observation j as an outlier. Rather, we
examine the set of DFFITSj and focus on those that are unusually large (in
absolute value) when compared to the others. One final outlier detection tool
for single-level models that we will discuss here is the COVRATIO, which
measures the impact of an observation on the precision of model estimates.
For individual i, this statistic is calculated as
COVRATIOi = 1 / [ (1 − hi) · ( (n − k − 2 + Ei*²) / (n − k − 1) )^(k+1) ]        (10.3)
where
Ei* = studentized residual
n = total sample size
Fox (2016) states that COVRATIO values greater than 1 improve the preci-
sion of the model estimates, whereas those with values lower than 1 decrease
the precision of the estimate. Clearly, it is preferable for observations to
increase model precision, rather than decrease it.
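For a single-level regression, all three of these diagnostics are available in base R. The following sketch uses the built-in mtcars data purely for illustration:

```r
# Fit an ordinary regression and extract the three influence diagnostics
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd  <- cooks.distance(fit)  # Cook's distance, Equation (10.1)
dff <- dffits(fit)          # DFFITS, Equation (10.2)
cvr <- covratio(fit)        # COVRATIO, Equation (10.3)

# Inspect the observations whose values stand apart from the rest
round(head(sort(cd, decreasing = TRUE)), 3)
cvr[cvr < 1]  # observations that reduce the precision of the estimates
```

Because there are no fixed cutoffs, the last two lines simply sort and filter so that unusually large (or precision-reducing) observations can be inspected by eye.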
Hi,Fixed = Xi ( Σi=1..N Xi′ Vi⁻¹ Xi )⁻¹ Xi′ Vi⁻¹        (10.4)
where
Xi = matrix of fixed effects for subject i
Vi = covariance matrix of the fixed effects for subject i
N = number of level-2 units
Note that in Equation (10.4), the subject refers to the level-2 grouping vari-
able. Similarly, the leverage values based on the random effects for subjects
in the sample can be expressed as
Hi,Random = Zi D Zi′ Vi⁻¹ (I − Hi,Fixed)        (10.5)
where
Zi = matrix of random effects for subject i
D = covariance matrix of the random effects for subject i.
Advanced Issues in Multilevel Modeling 193
DMi = (1 / (m · Se²)) Σi=1..N ri′ (I − Hi,Fixed)′ Vi⁻¹ Xi ( Xi′ Vi⁻¹ Xi )⁻¹ Xi′ Vi⁻¹ (I − Hi,Fixed) ri        (10.6)
where
m = number of fixed-effects parameters
Se2 = estimated error variance
I = identity matrix
ri = yi − yˆ i
The interpretation of DMi is similar to that for Di, in that individual values
departing from the main body of values in the sample are seen as indicative
of potential outliers.
In addition to the influence statistics at each level, and DMi, there is also
a multilevel analog for DFFITS, known as MDFFITS. As in the single-level
case, MDFFITS is a measure of the amount of change in the model-predicted
values for the sample when an individual is removed versus when they
are retained. This measure is considered an indicator of potential outliers
with respect to the fixed effects, given the random-effect structure present
in the data. Likewise, the COVRATIO statistic can also be used to identify
potential outliers with respect to their impact on the precision with which
model parameters are estimated. The same rules of thumb for interpreting
MDFFITS and the COVRATIO that were described for single-level models
also apply in the context of multilevel modeling.
With respect to the random effects, the relative variance change (RVC;
Dillane, 2005) statistic can be used to identify potential outliers. The RVC
for one of the random variance components for a subject in the sample is
calculated as
RVCi = (θ̂i / θ̂) − 1        (10.7)
where
θ̂ = variance component of interest (e.g. residual, random intercept) esti-
mated using full sample
θ̂i = variance component of interest estimated excluding subject i
When RVC is close to 0, the observation does not have much influence on
the variance component estimate; i.e. is not likely to be an outlier. As with the
other statistics for identifying potential outliers, there is not a single agreed-
upon cut-value for identifying outliers using RVC. Rather, observations that
have unusual such values when compared to the full sample warrant special
attention in this regard.
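Although packages such as HLMdiag compute RVC for us, the statistic can also be obtained by brute force, refitting the model with each level-2 unit excluded in turn. The sketch below assumes the dental data and the lmer model used in the example that follows:

```r
library(lme4)

# Full-sample model and its random-intercept variance estimate
full <- lmer(distance ~ age + Sex + (1 | Subject), data = dental)
theta.full <- as.numeric(VarCorr(full)$Subject)

# RVC for the intercept variance: refit excluding each subject in turn
rvc.intercept <- sapply(unique(dental$Subject), function(s) {
  reduced <- lmer(distance ~ age + Sex + (1 | Subject),
                  data = subset(dental, Subject != s))
  as.numeric(VarCorr(reduced)$Subject) / theta.full - 1  # Equation (10.7)
})
```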
[Histogram of the distance measurements.]
We can see that there are some individual measurements at both ends of
the distance scale that are separated from the main body of measurements.
It’s important to remember that this graph does not organize the observa-
tions by the individual on whom the measurements were made, whereas in
the context of multilevel data we are primarily interested in examining the
data at that level. Nonetheless, this type of exploration does provide us with
some initial insights into what we might expect to find with respect to outli-
ers moving forward.
Next, we can examine a boxplot of the measurements by subject gender.
[Boxplots of distance by sex.]
The distribution of distance is fairly similar for males and females, though
there is one measurement for a male subject that is quite small when com-
pared with the other male distances, and indeed is small when compared
to the female measurements as well. For this sample, typical male distance
measurements are somewhat larger than are those for females. Finally, we
can examine the relationship between age and distance using a scatterplot.
[Scatterplot of distance by age.]
Here we can see that there are some relatively low-distance measurements
at ages 8, 12, and 14, and relatively large measurements at ages 10 and 12.
As we noted above, outlier detection in the multilevel modeling context is
focused on level 2, which is the child being measured. In order to calculate
the various statistics described above, we must first fit the standard multi-
level model, which we do with lmer. We are interested in the relationships
between age, sex, and the distance measure.
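The fitting command itself is not reproduced in the text; a call consistent with the output below would look like the following (the degrees of freedom and p-values in the output suggest that lmerTest is loaded, which is an assumption):

```r
library(lme4)
library(lmerTest)  # supplies the df and Pr(>|t|) columns in the summary

# Random-intercept model for distance with age and sex as predictors
model10.1 <- lmer(distance ~ age + Sex + (1 | Subject), data = dental)
summary(model10.1)
```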
Below is the portion of the output showing the random- and fixed-effects
estimates.
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 3.267 1.807
Residual 2.049 1.432
Number of obs: 108, groups: Subject, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.70671 0.83392 99.35237 21.233 < 2e-16 ***
library(HLMdiag)
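The construction of the subject.diag.1 object is not shown in the text. In the version of HLMdiag with the case-deletion interface, it could be built roughly as follows (the exact calls are an assumption):

```r
# Leave-one-out deletion diagnostics at the Subject (level-2) grouping;
# model10.1 is the lmer fit from above
subject.delete.1 <- case_delete(model10.1, group = "Subject", type = "both")
subject.diag.1 <- diagnostics(subject.delete.1)
```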
subject.diag.1
$`fixef_diag`
IDS COOKSD MDFFITS COVTRACE COVRATIO
1 M01 0.059430549 0.058372709 0.001283859 0.9992782
2 M02 0.017644430 0.016373372 0.153849114 1.1605606
3 M03 0.004288299 0.003930055 0.191718307 1.2028549
4 M04 0.047790333 0.045490961 0.122488826 1.1263693
5 M05 0.028992132 0.027350696 0.105918181 1.1084350
6 M06 0.012432106 0.011444600 0.183111466 1.1930342
7 M07 0.011856240 0.010908774 0.183384134 1.1935544
8 M08 0.018070149 0.017113021 0.152846918 1.1591818
9 M09 0.013029622 0.016481864 0.059973890 0.9113512
10 M10 0.129831010 0.151859927 0.276663246 0.7305381
11 M11 0.025922322 0.024178483 0.168673525 1.1770987
12 M12 0.018243029 0.017244419 0.188554582 1.1992358
13 M13 0.219383621 0.271711664 0.055608520 0.9199265
14 M14 0.002429481 0.002285124 0.214674902 1.2290509
15 M15 0.033221623 0.031746926 0.169870517 1.1781381
16 M16 0.025888407 0.024358538 0.126292525 1.1299977
17 F01 0.026282969 0.023991066 0.201711812 1.2127109
18 F02 0.003743791 0.003422279 0.255982553 1.2753945
19 F03 0.016469517 0.014842882 0.218707274 1.2321454
20 F04 0.052586476 0.048732087 0.134238794 1.1371272
$varcomp_diag
IDS sigma2 D11
M01 M01 -0.001876300 -0.056743115
M02 M02 0.023443524 0.009967692
M03 M03 0.012552130 0.039204688
M04 M04 -0.002001208 0.011194962
M05 M05 0.008555067 -0.005872565
M06 M06 0.038456882 0.015265800
M07 M07 0.027199899 0.023727561
M08 M08 -0.008971702 0.033333394
M09 M09 -0.241337636 0.085721068
M10 M10 0.029978359 -0.235840135
M11 M11 0.020531028 0.020415631
M12 M12 0.009505588 0.039682505
M13 M13 -0.225614426 0.076558841
M14 M14 0.022773285 0.044510352
M15 M15 -0.001355337 0.037240673
M16 M16 0.028174805 -0.008949734
F01 F01 0.016375549 0.022906873
F02 F02 0.031952503 0.041449235
F03 F03 0.022812888 0.027577068
F04 F04 0.033338696 -0.026626936
F05 F05 0.015156927 0.045817231
F06 F06 0.025880757 0.011610962
F07 F07 0.034511615 0.041047801
F08 F08 0.007504013 0.039603874
F09 F09 0.015156926 0.013292892
F10 F10 0.023958384 -0.196924903
F11 F11 0.028951674 -0.151253263
[Dotplot of Cook's Distance values by subject.]
Based on this graph, it appears that subjects M13, F10, F11, and M10 all have
Cook’s D values that are unusual when compared to the rest of the sample.
[Dotplot of MDFFITS values by subject.]
The same four observations had unusually large MDFFITS values, fur-
ther suggesting that they may be potential outliers. Finally, with regard to
the fixed effects, we can examine the COVRATIO to see whether any of the
observations decrease model estimate precision.
[Dotplot of COVRATIO values by subject.]
Recall that observations with values less than 1 are associated with
decreases in estimate precision, meaning that for this example subjects M10,
F10, F11, M09, and M13 each had COVRATIO values less than 1, indicating
that their presence in the sample was associated with a decrease in model
precision.
Potential outliers with regard to the random effects can also be identified
using this simple graphical approach.
[Dotplot of RVC values for the error variance by subject.]
As with Cook’s D and MDFFITS, observations with unusual RVC values are
identified as potential outliers, meaning that M09 and M13 would be likely
candidates with regard to the error variance. In terms of the random intercept
variance component, M10, F10, and F11 were the potential outlying observations.
[Dotplot of RVC values for the random intercept variance by subject.]
Taken together, these results indicate that there are some potential outliers
in the sample. Four individuals, M13, F10, F11, and M10 all were associated
with statistics indicating that they have outsized impacts on the fixed-effects
parameter estimates and standard errors. For the random error effect, M09
and M13 were both flagged as potential outliers, whereas for the random
intercept variance, M10, F10, and F11 signaled as potential outliers. Notice
that there is quite a bit of overlap in these results, with F10, F11, M10, and M13
showing up in multiple ways as potential outliers.
Having identified these values, we must next decide how we would like to
handle them. As we have already discussed, it is recommended that, if pos-
sible, an analysis strategy designed to appropriately model data with outliers
be used, rather than the researcher simply removing them (Staudenmayer
et al., 2009). Indeed, in the current example, if we were to remove the four
subjects who were identified as potential outliers using multiple statistics,
we would reduce our already somewhat small sample size from 27 down
to 23. Thus, modeling the data using the full sample would seem to be the
more attractive option. In the following section, we will discuss estimators
that can be used for modeling multilevel data when outliers may be present.
After discussing these from a theoretical perspective, we will then demon-
strate how to employ them using R.
In the standard multilevel model, the outcome is expressed as
yi = Xiβ + ZjU0j + εi, with U0j ~ N(0, Ψ) and εi ~ N(0, Λj)        (10.8)
where
β = vector of fixed effects
Λj = level-1 covariance matrix
Ψ = level-2 covariance matrix
Xi = design matrix for the fixed effects
Zj = design matrix for the level-2 random effects
Pinheiro et al. (2001) showed that this model can be rewritten as:
(yi, U0j)′ ~ t( (Xiβ, 0)′, [ ZjΨZj′ + Λj , ZjΨ ; ΨZj′ , Ψ ], v )        (10.9)
where
v = degrees of freedom for the t distribution
where
Y = dependent variable
X = matrix of independent variable values
β̂ = matrix of estimates of the fixed effects for the model
The rank-based estimator works with the residuals Y − Xβ̂, minimizing the
dispersion function

φ = Σ R(yij − ŷij) · (yij − ŷij)
reducing power for inference regarding these parameters. Given this combi-
nation of results, Kloke and McKean recommended that JR_SE be used as the
default method for estimating model parameter standard errors. However,
when the level-2 sample size is small, researchers should use JR_CS, unless
they know that the exchangeability assumption has been violated.
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 3.267 1.807
Residual 2.049 1.432
Number of obs: 108, groups: Subject, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.70671 0.83392 99.35237 21.233 < 2e-16 ***
age 0.66019 0.06161 80.00000 10.716 < 2e-16 ***
SexFemale -2.32102 0.76142 25.00000 -3.048 0.00538 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When we ignore the potential outliers, we would conclude that age has a
statistically significant positive relationship with distance, and that, on aver-
age, females have smaller distance measures than do males. In addition, the
variance component associated with subject is somewhat larger than that
associated with error, indicating that there is a non-trivial degree of differ-
ence in distance measurements among individuals.
Now let’s first fit this model using an approach based on ranks of the data,
rather than the raw data itself. In order to do this, we will need to install the
jrfit library from github. The commands for this installation appear below.
install.packages("devtools")
library(devtools)
install_github("kloke/jrfit")
We would then use the library command to load jrfit. After doing
this, we will need to create a matrix (X) that contains the independent vari-
ables of interest, age, and sex.
library(jrfit)
X<-cbind(dental$age, dental$Sex)
We are now ready to fit the model. Recall that there are two approaches
for estimating standard errors in the context of the rank-based approach,
one based on an assumption that the covariance matrix of within-subject
errors is compound symmetric, and the other using the sandwich estimator.
We will employ both for this example. First, we will fit the model with the
compound symmetry approach to standard error estimation.
model10.2.cs<-jrfit(X, dental$distance, dental$Subject, var.type='cs')
summary(model10.2.cs)
Coefficients:
Estimate Std. Error t-value p.value
X1 20.163737 1.421843 14.1814 < 2.2e-16 ***
X2 0.625000 0.063482 9.8454 < 2.2e-16 ***
X3 -2.163741 0.812548 -2.6629 0.008979 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that in these results, X1 denotes the intercept, X2 subject age, and X3
subject sex. These results are quite similar to those obtained using the stan-
dard multilevel model assuming that the data are normally distributed. The
results for the sandwich estimator standard errors appear below.
model10.2.sandwich<-jrfit(X, dental$distance, dental$Subject, var.type='sandwich')
summary(model10.2.sandwich)
Coefficients:
Estimate Std. Error t-value p.value
X1 20.16374 1.64352 12.2686 1.493e-12 ***
X2 0.62500 0.07846 7.9658 1.461e-08 ***
X3 -2.16374 0.86223 -2.5095 0.01839 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The standard errors produced using the sandwich estimator were some-
what larger than those assuming compound symmetry, though in the final
analysis the results with respect to relationships between the independent
variables and the response were qualitatively the same. In summary, then,
the standard errors yielded by the rank-based approach were somewhat
larger than those produced by the standard model, though they did not dif-
fer substantially in this example. In addition, the results of the three models
all yielded the same overall findings.
In order to fit the heavy-tailed estimators, we will need to use the heavy
library in R. In order to fit the model based on the t distribution, the follow-
ing command is used.
library(heavy)
model10.3.t <- heavyLme(distance ~ age + Sex, random = ~ age,
groups = ~ Subject, data = dental, family = Student())
summary(model10.3.t)
Random effects:
Formula: ~age; Groups: ~Subject
Scale matrix estimate:
(Intercept) age
(Intercept) 3.66883654
age -0.15707253 0.03418491
Within-Group scale parameter: 0.9017474
The estimated degrees of freedom for the t that yielded the best fit was
5.36305. In terms of the fixed effects, the standard errors were larger than
was the case for either the standard or rank-based models. Indeed, the stan-
dard error for sex was over twice as large for the robust t model as com-
pared to either the rank or standard model estimates. The result of this larger
standard error was a non-statistically significant test result for sex. In addi-
tion, the estimated coefficient for sex was also much smaller than the one
yielded by the other two modeling approaches. Thus, based on this result,
we would conclude that when children were older, they had larger distance
measurements, but that there were not any such differences between males
and females.
We can also fit this model using other heavy-tailed distributions, such as
the Cauchy, the slash, and a contaminated normal. The commands and asso-
ciated output for each of these models appear below.
Cauchy
Model10.4.cauchy <- heavyLme(distance ~ age + Sex, random = ~
age, groups = ~ Subject, data = dental, family = Cauchy())
summary(Model10.4.cauchy)
Random effects:
Formula: ~age; Groups: ~Subject
Scale matrix estimate:
(Intercept) age
(Intercept) 2.97619934
age -0.10957288 0.02658286
Within-Group scale parameter: 0.6567805
Slash
Model10.5.slash <- heavyLme(distance ~ age + Sex, random = ~
age, groups = ~ Subject, data = dental, family = slash())
summary(Model10.5.slash)
Random effects:
Formula: ~age; Groups: ~Subject
Scale matrix estimate:
(Intercept) age
(Intercept) 2.30753577
age -0.08770319 0.02131623
Within-Group scale parameter: 0.5798421
Contaminated
model10.1.contaminated <- heavyLme(distance ~ age + Sex,
random = ~ age, groups = ~ Subject,data = dental, family =
contaminated())
summary(model10.1.contaminated)
Random effects:
Formula: ~age; Groups: ~Subject
Scale matrix estimate:
(Intercept) age
(Intercept) 3.53020881
age -0.10475888 0.03117002
Within-Group scale parameter: 1.003603
Though the estimates differ slightly across the estimation methods, they all
yield the same basic result, which is that age is positively related to distance,
and there are no differences between males and females.
Given these results, the reader may reasonably ask which of these
methods should be used. There has not been a great deal of research comparing
these various methods with one another. However, one study (Finch, 2017)
did compare the rank-based, heavy-tailed, and standard multilevel models
with one another, in the presence of outliers. The results of this simulation
work showed that the rank-based approaches yielded the least biased param-
eter estimates, and had smaller standard errors than did the heavy-tailed
approaches. Certainly, those standard error results are echoed in the current
example, where the standard errors for the heavy-tailed approaches were
more than twice the size of those from the rank-based method. Given these
simulation results, it would seem that the rank-based results may be the best
to use when outliers are present, at least until further empirical work demon-
strates otherwise.
Multilevel Lasso
In some research contexts, the number of variables that can be measured
(p) approaches, or even exceeds, the number of individuals on whom such
measurements can be made (N). For example, researchers working with gene
assays may have thousands of measurements that were made on a sample
of only 10 or 20 individuals. The consequence of having such small samples
coupled with a large number of measurements is known as high-dimensional
data. In such cases, standard statistical models often do not work well, yield-
ing biased standard errors for the model parameter estimates (Bühlmann
and van de Geer, 2011). These biased standard errors in turn can lead to inac-
curate Type I error and power rates for inferences made about these parame-
ters. High dimensionality can also result in parameter estimation bias due to
the presence of collinearity (Fox, 2016). Finally, when p exceeds N, it may not
be possible to obtain parameter estimates at all using standard estimators.
In the context of standard single-level data structures, statisticians have
worked to develop estimation methods that can be used with high-dimen-
sional data, with one of the more widely used such approaches being reg-
ularization or shrinkage methods. Regularization methods involve the
application of a penalty to the standard estimator such that the coefficients
linking the independent variables to the dependent variables are made
smaller, or shrunken. The goal of this technique is that only those vari-
ables that are most strongly related to the dependent variable are retained
in the model, whereas the others are eliminated by having their coefficients
reduced (shrunken) to 0. This approach should eliminate from the model
independent variables that exhibit weak relationships to the dependent vari-
able, thereby rendering a reduced model.
One of the most popular regularization approaches is the least absolute
shrinkage and selection operator (lasso; Tibshirani, 1996), which can be
expressed as
$$e^2 = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{p}\left|\hat{\beta}_j\right| \quad (10.11)$$
where
yi = the observed value of the dependent variable for individual i
ŷi = the model-predicted value of the dependent variable for individual i
β̂ j = sample estimate of the coefficient for independent variable j
λ = shrinkage penalty-tuning parameter
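To build intuition for how this penalty operates, the soft-thresholding operator at the heart of lasso shrinkage can be sketched in a few lines of base R (the coefficient values below are made up for illustration):

```r
# Soft-thresholding: each coefficient is pulled toward 0 by lambda, and
# coefficients smaller than lambda in absolute value are set exactly to 0
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}
soft_threshold(c(-2.5, 0.3, 1.1), lambda = 0.5)
# -2.0  0.0  0.6
```

Note how the coefficient 0.3, whose magnitude falls below the penalty, is shrunken all the way to 0 and thereby dropped from the model.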
In summary, the goal of the lasso estimator is to eliminate from the model
those independent variables that contribute very little to the explanation of
the dependent variable, by setting their β̂ values to 0, while at the same time
retaining independent variables that are important in explaining y. The opti-
mal λ value is specific to each data analysis problem. A number of approaches
for identifying it have been recommended, including the use of cross-valida-
tion to minimize the mean squared error (Tibshirani, 1996), or selection of λ
that minimizes the Bayesian information criterion (BIC). This latter approach
was recommended by Schelldorfer, Bühlmann, and van de Geer (2011), who
showed that it works well in many cases. Zhao and Yu (2006) also found the
use of the BIC for this purpose to be quite effective. With this approach, sev-
eral values of λ are used, and the BIC values for the models are compared. The
model with the smallest BIC is then selected as being optimal.
Schelldorfer et al. (2011) described an extension of the lasso estimator that
can be applied to multilevel models. The multilevel lasso (MLL) utilizes the
lasso penalty function, with additional terms to account for the variance
components associated with multilevel models. The MLL estimator mini-
mizes the following function:
$$Q_\lambda\left(\beta,\tau^2,\sigma^2\right) := \frac{1}{2}\ln|V| + \frac{1}{2}\left(y_i - \hat{y}_i\right)'\,V^{-1}\left(y_i - \hat{y}_i\right) + \lambda\sum_{j=1}^{p}\left|\hat{\beta}_j\right| \quad (10.12)$$
where
τ2 = between cluster variance at level 2
σ2 = within cluster variance at level 1
V = covariance matrix
From Equation (10.12), we can see that model parameter estimates are
obtained with respect to penalization of level-1 coefficients, and otherwise
work similarly to the single-level lasso estimator. In order to conduct infer-
ence for the MLL model parameters, standard errors must be estimated.
However, the MLL algorithm currently does not provide standard error esti-
mates, meaning that inference is not possible. Therefore, interpretation of
results from analyses using MLL will focus on which coefficients are not
shrunken to 0, as we will see in the example below.
library(lmmlasso)
data(classroomStudy)
Model10.6<-lmer(classroomStudy$y~classroomStudy$X+(1|classroom
Study$grp), data=classroomStudy)
summary(Model10.6)
The syntax that we use here generally matches the structure demonstrated
in Chapter 3. Notice that the major difference from the examples
in the earlier chapter is that the independent fixed-effects variables are col-
lected in the matrix X. The resulting output for the fixed and random effects
appears below.
Random effects:
Groups Name Variance Std.Dev.
classroomStudy$grp (Intercept) 0.1583 0.3979
Residual 0.4745 0.6888
Number of obs: 156, groups: classroomStudy$grp, 44
Fixed effects:
                          Estimate Std. Error        df t value Pr(>|t|)
(Intercept)                0.02599    0.08611  26.90650   0.302    0.765
classroomStudy$Xsex       -0.07690    0.06046 141.64555  -1.272    0.206
classroomStudy$Xminority  -0.02015    0.06803 134.13265  -0.296    0.768
classroomStudy$Xmathkind  -0.63959    0.06356 148.47809 -10.063   <2e-16 ***
classroomStudy$Xses        0.05329    0.06119 143.19103   0.871    0.385
classroomStudy$Xyearstea  -0.01861    0.09656  28.41109  -0.193    0.849
classroomStudy$Xmathprep  -0.06041    0.10045  24.26140  -0.601    0.553
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From these results, we would conclude that the only statistically significant
effect was kindergarten math score (mathkind), with a coefficient of −0.64.
Now, let’s fit the same model using the multilevel lasso. In order to do this,
we will need to install and then load the lmmlasso package in R, which we
have done above. The R commands to fit the model, and to obtain a sum-
mary of the output, appear below.
Model10.7 <- lmmlasso(x=classroomStudy$X, y=classroomStudy$y,
z=classroomStudy$Z, grp=classroomStudy$grp, lambda=15, pdMat="pdIdent")
summary(Model10.7)
There are several things to note when using lmmlasso. First, the fixed-effects
variables are collected in the matrix X, which we have already commented on
when discussing the use of lmer with the classroomStudy data. When using
lmmlasso, the fixed effects will always need to be in such a matrix. Second, the
object classroomStudy$Z is a column of 1s in this case, indicating that we have
only a random intercept. If we were also fitting random slopes for one or more
independent variables, then each of those variables would need its own column
in the Z matrix. The cluster identifier itself appears in the grp argument. We
must set the value of lambda, which in this case is 15. We will try other values,
and compare model fit using AIC and BIC, selecting the lambda that minimizes
these values. Finally, the pdMat argument defines the covariance structure for
the random effects, in this case a multiple of the identity matrix.
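As a hypothetical illustration of the Z matrix structure, a version adding a random slope for one of the fixed effects (here ses) could be built as follows:

```r
# Hypothetical sketch: a Z matrix with a column of 1s for the random
# intercept and a second column holding the variable given a random slope
Z <- cbind(1, classroomStudy$X[, "ses"])
```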
The output for Model10.7 appears below.
Model fitted by ML for lambda = 15 :
AIC BIC logLik deviance objective
363.9 379.1 -176.9 353.9 186.0
Fixed effects:
|active set|= 3
Estimate
(Intercept) 0.02065969 (n)
sex -0.02376197
mathkind -0.57887176
Number of iterations: 4
Using a lambda of 15, only two fixed effects had non-zero coefficients, sex
and mathkind. Also, notice that the coefficients for these effects are smaller
than were those estimated using the standard multilevel model, reflecting the
shrinkage associated with the lasso. Let’s now fit a model with a lambda of 10.
Model10.8 <- lmmlasso(x=classroomStudy$X, y=classroomStudy$y,
z=classroomStudy$Z, grp=classroomStudy$grp, lambda=10, pdMat="pdIdent")
summary(Model10.8)
Fixed effects:
|active set|= 4
Estimate
(Intercept) 0.02039525 (n)
sex -0.04084089
mathkind -0.59798409
ses 0.00546273
Number of iterations: 5
Model10.9 <- lmmlasso(x=classroomStudy$X, y=classroomStudy$y,
z=classroomStudy$Z, grp=classroomStudy$grp, lambda=20, pdMat="pdIdent")
summary(Model10.9)
Fixed effects:
|active set|= 3
Estimate
(Intercept) 0.021169801 (n)
sex -0.006581567
mathkind -0.560115122
Number of iterations: 5
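The search across lambda values can also be scripted. The sketch below assumes that the fitted lmmlasso object exposes its BIC as fit$bic; check the package documentation before relying on that name:

```r
# Fit the model for several candidate lambdas and keep the BIC of each
lambdas <- c(10, 15, 20)
bics <- sapply(lambdas, function(l) {
  fit <- lmmlasso(x = classroomStudy$X, y = classroomStudy$y,
                  z = classroomStudy$Z, grp = classroomStudy$grp,
                  lambda = l, pdMat = "pdIdent")
  fit$bic
})
lambdas[which.min(bics)]  # lambda with the smallest BIC
```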
$y_{icp} = \gamma_{0p} + \gamma_{1p}x_{1ic} + U_{pc} + R_{icp}$ (10.13)
where
yicp = value on dependent variable p for individual i in cluster c
γ0p = intercept for variable p
x1ic = value on independent variable 1 for individual i in cluster c
γ1p = coefficient linking independent variable 1 to dependent variable p
Upc = random error component for cluster c on dependent variable p
Ricp = random error for dependent variable p for individual i in cluster c
Given that there are p dependent variables in model (10.13), the random-
effect variances in a univariate multilevel model (i.e. variances of Uc and Ric)
become the covariance matrices for these model terms:
$$\Sigma = \operatorname{cov}(R_{ic}), \qquad T = \operatorname{cov}(U_c) \quad (10.14)$$
$$y_{icp} = \sum_{s=1}^{m}\gamma_{0sp}\,d_{sicp} + \sum_{p=1}^{k}\sum_{s=1}^{m}\gamma_{1sp}\,x_{1sic}\,d_{sicp} + \sum_{p=1}^{k}\sum_{s=1}^{m}U_{spc}\,d_{sicp} + \sum_{p=1}^{k}\sum_{s=1}^{m}R_{sicp}\,d_{sicp} \quad (10.15)$$
This model yields hypothesis testing results for each of the response
variables, accounting for the presence of the others in the data. In order
to test the multivariate null hypothesis of group mean equality across all
of the response variables, we can fit a null multivariate model to the data
for which the independent variables are not included. The fit of this null
model can then be compared with that of the full model including the
independent variable(s) of interest to test the null hypothesis of no mul-
tivariate group mean differences, if the independent variable is categori-
cal. This comparison can be carried out using a likelihood ratio test. If the
resulting p-value is below the threshold (e.g. α = 0.05) then we would reject
the null hypothesis of no group mean differences, because the fit of the full
model including the group effect was better than that of the null model.
The reader interested in applying this model using R is encouraged to read
the excellent discussion and example provided in Chapter 16 of Snijders
and Bosker (2012).
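The likelihood ratio comparison described above can be sketched as follows; the model object names are hypothetical, and both models would need to be fit by maximum likelihood:

```r
# anova() on two nested fitted models returns the chi-square likelihood
# ratio test of the multivariate group effect
anova(null.model, full.model)
```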
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$ (10.16)
where
y = dependent variable
x = independent variable
βj = coefficient for model term j
The cubic spline then fits different versions of this model between each
pair of adjacent knots. The more knots in the GAM, the more piecewise poly-
nomials will be estimated, and the more potential detail about the relation-
ship between x and y will be revealed. GAMs take these splines and apply
them to a set of one or more predictor variables as in (10.17).
$y_i = \beta_0 + \sum_j f_j(x_i) + \varepsilon_i$ (10.17)
where
yi = value of outcome variable for subject i
β0 = intercept
fj = smoothing spline for independent variable j, such as a cubic spline
xi = value of independent variable for observation i
εi = random error, distributed N(0, σ2)
Each independent variable has a unique smoothing function, and the opti-
mal set of smoothing functions is found by minimizing the penalized sum
of squares criterion (PSS) in (10.18).
$$PSS = \sum_{i=1}^{N}\left\{y_i - \left[\beta_0 + \sum_{j=1}^{p} f_j(x_i)\right]\right\}^2 + \sum_{j=1}^{p}\lambda_j \int f_j''(t_j)^2\,dt_j \quad (10.18)$$
Here, yi is the value of the response variable for subject i, and λj is a tuning
parameter for variable j such that λj ≥ 0. The researcher can use λj to control
the degree of smoothing that is applied to the model. A value of 0 results
in an unpenalized function and relatively less smoothing, whereas values
approaching ∞ result in an extremely smoothed (i.e. linear) function relating
the outcome and the predictors. The GAM algorithm works in an iterative
fashion, beginning with the setting of β0 to the mean of Y. Subsequently, a
smoothing function is applied to each of the independent variables in turn,
minimizing the PSS. The iterative process continues until the smoothing
functions for the various predictor variables stabilize, at which point final
model parameter estimates are obtained. Based upon empirical research, a
recommended value for λj is 1.4 (Wood, 2006), which we will use here.
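In mgcv, on which the gamm4 package used later in this chapter is built, this recommendation is typically applied through the gamma argument of gam(); a sketch with simulated data:

```r
library(mgcv)
set.seed(10)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
# gamma inflates the smoothness penalty in the GCV/UBRE criterion;
# Wood (2006) suggests 1.4 to guard against undersmoothing
fit <- gam(y ~ s(x), gamma = 1.4)
```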
GAM rests on the assumption that the model errors are independent of
one another. However, when data are sampled from a clustered design, such
as measurements made longitudinally for the same individual, this assump-
tion is unlikely to hold, resulting in estimation problems, particularly with
respect to the standard error (Wang, 1998). In such situations, an alterna-
tive approach to modeling the data, which accounts for the clustering of
data points, is necessary. The generalized additive mixed model (GAMM)
accounts for the presence of clustering in the form of random effects (Wang,
1998). GAMM takes the form:
$y_i = \beta_0 + \sum_j f_j(x_i) + Z_i b + \varepsilon_i$ (10.19)
where
Zi = random effects in the model; e.g. person being measured
b = random-effects coefficients
Other model terms in (10.19) are the same as in (10.17). GAMM is fit in the same
fashion as GAM, with an effort to minimize PSS, and the use of λj to control
the degree of smoothing.
For this example, the reading score (geread) is the dependent variable, and the verbal score
(npaverb) is the predictor. In this case, however, we would use splines to
investigate whether there exists a nonlinear relationship between the two
variables. The R code for fitting this model appears below.
Model10.8<-gamm4(geread~s(npaverb), family=gaussian,
random=~(1|school), data=prime_time)
summary(Model10.8$mer)
Scaled residuals:
Min 1Q Median 3Q Max
-2.3364 -0.6007 -0.1976 0.3110 4.7359
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 0.1032 0.3213
Xr s(npaverb) 2.0885 1.4452
Residual 3.8646 1.9659
Number of obs: 10765, groups: school, 163; Xr, 8
Fixed effects:
Estimate Std. Error t value
X(Intercept) 4.34067 0.03215 135.00
Xs(npaverb)Fx1 2.30471 0.25779 8.94
Here we see estimates for variance associated with the school random
effect, along with the residual. The smoother also has a random component,
though it is not the same as a random slope for the linear portion of the model,
which we will see below. The fixed effects in this portion of the output cor-
respond to those that we would see in a linear model, so that X(Intercept)
is a standard intercept term, and Xs(npaverb)Fx1 is the estimate of a lin-
ear relationship between npaverb and geread. Here we see that the t-value
for this linear effect is 8.94, suggesting that there may be a significant linear
relationship between the two variables.
The nonlinear smoothed portion of the relationship can be obtained using
the following command.
summary(Model10.8$gam)
Family: gaussian
Link function: identity
Formula:
geread ~ s(npaverb)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.34067 0.03215 135 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.28
lmer.REML = 45286 Scale est. = 3.8646 n = 10765
When reading these results, note that the values for the intercept fixed
effect and its standard error are identical to the results from the mer output.
Of more interest when using GAM is the nature of the smoothed nonlinear
term, which appears as s(npaverb). We can see that this term is statistically
significant, with a p-value well below 0.05, meaning that there is a nonlinear
relationship between npaverb and geread. Also note that the adjusted R2
is 0.28, meaning that approximately 28% of the variance in the reading score
is associated with the model. In order to characterize the nature of the rela-
tionship between the reading and verbal scores, we can examine a plot of the
GAM function.
plot(Model10.8$gam)
[Figure: estimated smooth s(npaverb, 6.03) plotted against npaverb (0–100), with 95% confidence bands]
The plot includes the estimated smoothed relationship between the two
variables, represented by the solid line, and the 95% confidence interval of
the curve, which appears as the dashed lines. The relationship between the
two variables is positive, such that higher verbal scores are associated with
higher reading scores. However, this relationship is not strictly linear, as we
see a stronger relationship between the two variables for those with rela-
tively low verbal scores, as well as those for relatively higher verbal scores,
when compared with individuals whose verbal scores lie in the midrange.
Just as with linear multilevel models, it is also possible to include random
coefficients for the independent variables. The syntax for doing so, along
with the resulting output, appears below.
Model10.9<-gamm4(geread~s(npaverb), family=gaussian,
random=~(npaverb|school), data=prime_time)
summary(Model10.9$mer)
Linear mixed model fit by REML ['lmerMod']
Scaled residuals:
Min 1Q Median 3Q Max
-2.8114 -0.5995 -0.2001 0.3159 4.8829
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 2.414e-02 0.1554
npaverb 8.101e-05 0.0090 -1.00
Fixed effects:
Estimate Std. Error t value
X(Intercept) 4.32243 0.03308 130.671
Xs(npaverb)Fx1 2.18184 0.24110 9.049
The variance for the random effect of npaverb is smaller than are the vari-
ance terms for the other model terms, indicating its relative lack of import in
this case. We would interpret this result as indicating that the linear portion
of the relationship between npaverb and geread is relatively similar across
the schools.
summary(Model10.9$gam)
Family: gaussian
Link function: identity
Formula:
geread ~ s(npaverb)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.32243 0.03308 130.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.28
lmer.REML = 45167 Scale est. = 3.8052 n = 10765
The smoothed component of the model is essentially the same for the random
coefficients model as it was for the random intercept-only model. This result
is further demonstrated in the plot below, which appears very similar to that
for the random intercept-only model.
plot(Model10.9$gam)
[Figure: estimated smooth s(npaverb, 5.74) plotted against npaverb (0–100), with 95% confidence bands]
$x_{ig} = \xi_g + \upsilon_{ig}$ (10.20)
where
xig = value of independent variable for individual i in group g
ξg = group-level score on the independent variable
υig = error term for individual i in group g
The relationship between the level-2 dependent variable and the latent
level-2 predictor variable can then be written as
$y_g = \beta_0 + \beta_1 \xi_g + \epsilon_g$ (10.21)
where
yg = value of the dependent variable for group g
εg = error term associated with group g
β1 = coefficient linking the latent predictor with the level-2 outcome
β0 = intercept
In the estimation of this model, the best linear unbiased predictors (BLUPs)
of the group means must be used in the regression model, rather than the
standard unadjusted means, in order for the regression coefficient estimates
to be unbiased. Croon and van Veldhoven (2007) provide a detailed description
of how this model can be fit as a structural equation model. We will not delve
into these details here, but do refer the interested reader to this article. Of key
import here is that we see the possibility of linking these individual-level
variables with a group-level outcome variable, using the model described
above.
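The role of the BLUPs can be illustrated in base R: each observed group mean is shrunken toward the grand mean in proportion to its reliability (all values below are made up for illustration):

```r
tau2 <- 0.4; sigma2 <- 1.0        # between- and within-group variances
n_g <- c(5, 20, 50)               # group sizes
ybar_g <- c(2.0, 1.2, 0.8)        # observed group means
grand <- mean(ybar_g)
w <- tau2 / (tau2 + sigma2 / n_g) # reliability weight for each group
blup_g <- w * ybar_g + (1 - w) * grand  # shrunken (BLUP-style) group means
```

Small groups receive lower weights, so their observed means are pulled more strongly toward the grand mean.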
The micro-macro model can be fit in R using the MicroMacroMultilevel
library, which requires that the variables first be structured appropriately
for its use. A summary of the fitted model appears below.
micromacro.summary(model.output)
Call:
micromacro.lm( y.prime_time ~ BLUP.prime_time.nomiss.gemath +
BLUP.prime_time.nomiss.gelang + z.mean, ...)
Residuals:
Min 1Q Median 3Q Max
-13.25487 -2.442717 -0.0003431752 2.850811 12.2095
Coefficients:
                                   Estimate Uncorrected S.E. Corrected S.E.  df          t
(Intercept)                   -190.52742279      30.98161003    28.24900466 156  -6.7445712
BLUP.prime_time.nomiss.gemath   -0.01019005       0.02838026     0.02303021 156  -0.4424645
BLUP.prime_time.nomiss.gelang    0.01604510       0.02405509     0.01940692 156   0.8267722
z.mean                           3.04344765       0.32214229     0.29422655 156  10.3438919
Pr(>|t|) r
(Intercept) 2.824611e-10 0.47514746
BLUP.prime_time.nomiss.gemath 6.587660e-01 0.03540331
BLUP.prime_time.nomiss.gelang 4.096290e-01 0.06605020
z.mean 2.012448e-19 0.63783643
---
Residual standard error: 4.29361 on 156 degrees of freedom
Multiple R-squared: 0.3649649621, Adjusted R-squared: 0.3527527498
F-statistic: 29.88525 on 3 and 156 DF, p-value: 0
From these results, we can see that neither level-1 independent variable
was statistically significantly related to the dependent variable, whereas the
level-2 predictor, average daily attendance, was positively associated with
the school mean cognitive skills index score. The proportion of variance in the
cognitive index explained by the model (adjusted R²) was 0.353, or 35.3%.
library(simr)
x1 <- 1:20
cluster <- letters[1:25]
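The remaining setup steps are not shown above. Pieced together from the description that follows, they might look like the sketch below; the object names (sim_data, V2, sim_model1) and values are drawn from that description, but the exact code is an assumption:

```r
# Build the 500-row data frame (20 level-1 units x 25 clusters)
sim_data <- expand.grid(x1 = x1, cluster = cluster)
b <- c(1, 0.5)   # fixed effects: intercept = 1, slope = 0.5
V2 <- 0.25       # variance of the random intercept
s <- 1           # residual standard deviation
# makeLmer (from simr) assembles an lmer-style model from these pieces
sim_model1 <- makeLmer(y ~ x1 + (1 | cluster), fixef = b, VarCorr = V2,
                       sigma = s, data = sim_data)
```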
The preceding code specifies the level-1 (x1) and level-2 (cluster) sample
sizes, and places them in the object sim _ data using the expand.grid
function, which is part of base R. We then specify the fixed effects for the
intercept (1) and the slope (0.5) for our particular problem. The variance of
the random intercept term is then set at 0.25 for this example, and the covari-
ance matrix of the random effects is specified in the object V2. The standard
deviation for the error term is 1. Finally, the makeLmer command is used to
generate our model, which appears below. This output is useful for checking
that everything we think we specified is actually specified correctly, which
is the case here.
In order to generate the simulated data and obtain an estimate of the power
for detecting the coefficient of 0.5 given the sample size parameters we have
put in place, we will use the powerSim command from the simr library. We
must specify the model and the number of simulations that we would like to
use. In this case, we request 100 simulations. The results of these simulations
appear below.
powerSim(sim_model1, nsim=100)
Time elapsed: 0 h 0 m 20 s
These results indicate that for our sample of 500 individuals nested within
25 clusters, the power of rejecting the null hypothesis of no relationship
between the independent and dependent variables when the population coef-
ficient value is 0.5, is 100%. In other words, we are almost certain to obtain a
statistically significant result in this case with the specified sample sizes.
How would these power results change if our sample size was only 100
(10 individuals nested within each of 10 clusters)? The following R code will
help us to answer this question.
x1 <- 1:10
cluster <- letters[1:10]
powerSim(sim_model1, nsim=100)
Time elapsed: 0 h 0 m 13 s
Clearly, even for a sample size of only 100, the power remains at 100% for
the slope fixed effect.
Rather than simulate the data for one sample size at a time, it is also pos-
sible using simr to simulate the data for multiple sample sizes, and then plot
the results in a power curve. The following R code will take the simulated
model described above (sim _ model1), and do just this (Figure 10.1).
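One way to produce such a curve is simr's powerCurve function; the argument values below are assumptions rather than code from the text:

```r
# Estimate power at a series of level-1 sample sizes by varying x1,
# then plot the resulting power curve
pc <- powerCurve(sim_model1, along = "x1", nsim = 100)
plot(pc)
```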
FIGURE 10.1
Power curve using simr function.
This curve demonstrates that, given the parameters we have specified for
our model, the power for identifying the coefficient of interest as being dif-
ferent from 0 will exceed 0.8 when the number of individuals per cluster is
4 or more. There are a number of settings for these functions that the user
can adjust in order to tailor the resulting plot to their particular needs, and
we encourage the interested reader to investigate those in the software
documentation. We hope that this introduction has provided you with the
tools necessary to explore these additional functions further.
Summary
Our focus in Chapter 10 was on a variety of extensions to multilevel model-
ing. We first described models for use with unusual or difficult data situa-
tions, particularly when the normality of errors cannot be assumed, such as
in the presence of outliers. We saw that there are several options available
to the researcher in such instances, including estimators based on heavy-
tailed distributions, as well as rank-based estimators. In practice, we might
use multiple such options and compare their results in order to develop a
sense as to the nature of the model parameter estimates. We then turned
our attention to models designed for high-dimensional situations, in which
the number of independent variables approaches (or even surpasses) the
number of observations in the data set. Standard estimation algorithms will
frequently yield biased parameter estimates and inaccurate standard errors
in such cases. Penalized estimators such as the lasso can be used to reduce
the dimensionality of the data statistically, rather than through an arbitrary
selection by a researcher of predictors to retain. In addition to distributional
issues, and high dimensionality, we also learned about models that are
appropriate for situations in which the relationships between the indepen-
dent and dependent variables are not linear in nature. There are a number
of possible solutions for such a scenario, with our focus being on a spline-
based solution in the form of the GAM. This modeling strategy provides the
data analyst with a set of tools for selecting the optimal solution for a given
dataset, and for characterizing the nonlinearity present in the data both with
coefficient estimates, and graphically. We concluded the chapter with a dis-
cussion of the micro-macro modeling problem, in which level-1 variables are
to serve as predictors of level-2 outcome variables, and with a review of how
simulation can be used to conduct a power analysis in the multilevel model-
ing context. Both of these problems can be addressed using libraries in R, as
we have demonstrated in this chapter.
References
Agresti, A. (2002). Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons.
Aiken, L.S. & West, S.G. (1991). Multiple Regression: Testing and Interpreting Interactions.
Thousand Oaks, CA: Sage.
Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician, 27(1), 17–21.
Bickel, R. (2007). Multilevel Analysis for Applied Research: It’s Just Regression! New York:
Guilford Press.
Breslow, N. & Clayton, D.G. (1993). Approximate Inference in Generalized Linear
Mixed Models. Journal of the American Statistical Association, 88, 9–25.
Bryk, A.S. & Raudenbush, S.W. (2002). Hierarchical Linear Models. Newbury Park, CA:
Sage.
Bühlmann, P. & van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods,
Theory and Applications. Berlin, Germany: Springer-Verlag.
Crawley, M.J. (2013). The R Book. West Sussex, UK: John Wiley & Sons, Ltd.
Croon, M.A. & van Veldhoven, M.J. (2007). Predicting Group-Level Outcome
Variables from Variables Measured at the Individual Level: A Latent Variable
Multilevel Model. Psychological Methods, 12(1), 45–57.
de Leeuw, J. & Meijer, E. (2008). Handbook of Multilevel Analysis. New York: Springer.
Dillane, D. (2005) Deletion Diagnostics for the Linear Mixed Model. Ph.D. Thesis,
Trinity College, Dublin.
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Los Angeles: Sage.
Finch, W.H. (2017). Multilevel Modeling in the Presence of Outliers: A Comparison of
Robust Estimation Methods. Psicologica, 38, 57–92.
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models. Thousand
Oaks, CA: Sage.
Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London, UK, Chapman
and Hall.
Hofmann, D.A. (2007). Issues in Multilevel Research: Theory Development,
Measurement, and Analysis. In S.G. Rogelberg (Ed.). Handbook of Research
Methods in Industrial and Organizational Psychology, pp. 247–274. Malden, MA:
Blackwell.
Hogg, R.V. & Tanis, E.A. (1996). Probability and Statistical Inference. New York: Prentice
Hall.
Hox, J. (2002). Multilevel Analysis: Techniques and Applications. Mahwah, NJ: Erlbaum.
Iversen, G. (1991). Contextual Analysis. Newbury Park, CA: Sage.
Jaekel, L.A. (1972). Estimating Regression Coefficients by Minimizing the Dispersion
of Residuals. Annals of Mathematical Statistics, 43, 1449–1458.
Kloke, J. & McKean, J.W. (2013). Small Sample Properties of JR Estimators. Paper pre-
sented at the annual meeting of the American Statistical Association, Montreal,
QC, August.
Kloke, J.D., McKean, J.W., & Rashid, M. (2009). Rank-Based Estimation and Associated
Inferences for Linear Models with Cluster Correlated Errors. Journal of the
American Statistical Association, 104, 384–390.
Kreft, I.G.G. & de Leeuw, J. (1998). Introducing Multilevel Modeling. Thousand Oaks,
CA: Sage.
Kreft, I.G.G., de Leeuw, J., & Aiken, L. (1995). The Effect of Different Forms of
Centering in Hierarchical Linear Models. Multivariate Behavioral Research, 30,
1–22.
Kruschke, J.K. (2011). Doing Bayesian Data Analysis. Amsterdam, Netherlands:
Elsevier.
Lange, K.L., Little, R.J.A., & Taylor, J.M.G. (1989). Robust Statistical Modeling Using
the t Distribution. Journal of the American Statistical Association, 84, 881–896.
Liu, Q. & Pierce, D.A. (1994). A Note on Gauss-Hermite Quadrature. Biometrika, 81,
624–629.
Lynch, S.M. (2010). Introduction to Applied Bayesian Statistics and Estimation for Social
Scientists. New York: Springer.
Pinheiro, J., Liu, C., & Wu, Y.N. (2001). Efficient Algorithms for Robust Estimation in
Linear Mixed-Effects Models Using the Multivariate t Distribution. Journal of
Computational and Graphical Statistics, 10, 249–276.
Potthoff, R.F. & Roy, S.N. (1964). A Generalized Multivariate Analysis of Variance
Model Useful Especially for Growth Curve Problems. Biometrika, 51(3–4), 313–326.
R Development Core Team (2012). R: A Language and Environment for Statistical
Computing. Vienna, Austria: R Foundation for Statistical Computing.
Rogers, W.H. & Tukey, J.W. (1972). Understanding Some Long-Tailed Symmetrical
Distributions. Statistica Neerlandica, 26(3), 211–226.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. New York: Springer.
Schall, R. (1991). Estimation in Generalized Linear Models with Random Effects.
Biometrika, 78, 719–727.
Schelldorfer, J., Bühlmann, P., & van de Geer, S. (2011). Estimation for High-
Dimensional Linear Mixed-Effects Models Using l1-Penalization. Scandinavian
Journal of Statistics, 38, 197–214.
Snijders, T. & Bosker, R. (1999). Multilevel Analysis: An Introduction to Basic and
Advanced Multilevel Modeling, 1st edition. Thousand Oaks, CA: Sage.
Snijders, T. & Bosker, R. (2012). Multilevel Analysis: An Introduction to Basic and Advanced
Multilevel Modeling, 2nd edition. Thousand Oaks, CA: Sage.
Song, P.X.-K., Zhang, P., & Qu, A. (2007). Maximum Likelihood Inference in Robust
Linear Mixed-Effects Models Using Multivariate t Distributions. Statistica
Sinica, 17, 929–943.
Staudenmayer, J., Lake, E.E., & Wand, M.P. (2009). Robustness for General Design
Mixed Models Using the t-Distribution. Statistical Modeling, 9, 235–255.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society, Series B, 58, 267–288.
Tong, X. & Zhang, Z. (2012). Diagnostics of Robust Growth Curve Modeling Using
Student’s t Distribution. Multivariate Behavioral Research, 47(4), 493–518.
Tu, Y.-K., Gunnell, D., & Gilthorpe, M.S. (2008). Simpson’s Paradox, Lord’s Paradox,
and Suppression Effects are the Same Phenomenon – The Reversal Paradox.
Emerging Themes in Epidemiology, 5(2), 1–9.
Tukey, J.W. (1949). Comparing Individual Means in the Analysis of Variance.
Biometrics, 5(2), 99–114.
Wang, Y. (1998). Mixed Effects Smoothing Spline Analysis of Variance. Journal of the
Royal Statistical Society, Series B, 60(1), 159–174.
Wang, J. & Genton, M.G. (2006). The Multivariate Skew-Slash Distribution. Journal of
Statistical Planning and Inference, 136, 209–220.
Welsh, A.H. & Richardson, A.M. (1997). Approaches to the Robust Estimation of
Mixed Models. In G. Maddala and C.R. Rao (Eds.). Handbook of Statistics, vol. 15,
pp. 343–384. Amsterdam: Elsevier Science B.V.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin,
1(6), 80–83.
Wolfinger, R. & O’Connell, M. (1993). Generalized Linear Mixed Models: A Pseudo-
Likelihood Approach. Journal of Statistical Computation and Simulation, 48,
233–243.
Wood, S.N. (2006). Generalized Additive Models: An Introduction with R. New York:
Chapman and Hall/CRC.
Wooldridge, J. (2004). Fixed Effects and Related Estimators for Correlated Random
Coefficient and Treatment Effect Panel Data Models. East Lansing: Department of
Economics, Michigan State University.
Yuan, K.-H. & Bentler, P.M. (1998). Structural Equation Modeling with Robust
Covariances. Sociological Methodology, 28, 363–396.
Yuan, K.-H., Bentler, P.M., & Chan, W. (2004). Structural Equation Modeling with
Heavy Tailed Distributions. Psychometrika, 69(3), 421–436.
Zhao, P. & Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of Machine
Learning Research, 7, 2541–2563.
Index
lm() function, 11, 14–16, 15, 16, 43, 45, 48, 104
lmmlasso R library, 211, 212, 213
Logistic regression, 149
    model, 116, 117
    for dichotomous outcome variable, 116–120
    for ordinal outcome variable, 120–123
    random coefficients, 137–139
    random intercept, 134–137, 143–146
Longitudinal data analysis and multilevel models, 75–83
    benefits, 82–83
    framework, 75–76
    person period data structure, 77–78
    using lme4 package, 78–82
MANOVA, see Multivariate analysis of variance (MANOVA)
Markov Chain Monte Carlo (MCMC), 159–161, 166, 188
Markov Chain Monte Carlo (MCMC)glmm
    for count dependent variable, 181–187
    for dichotomous dependent variable, 178–181
    including level-2 predictors with, 168–174
    for normally distributed response variable, 162–168
MASS library, 117, 130
Mathematics achievement test, 133, 134, 138, 139, 143–144, 178, 180
Maximum likelihood estimation (MLE), 35, 36, 57
Maximum likelihood (ML), 30, 34, 60, 143, 159, 202
MCMC, see Markov Chain Monte Carlo (MCMC)
MCMCglmm, see Markov Chain Monte Carlo (MCMC)glmm
MDFFITS measure, 192, 193, 198, 200, 201
Mean squared error, 211
MGLMs, see Multilevel generalized linear models (MGLMs)
Micro-macro model, 224–225, 231
MicroMacroMultilevel library, 225–226
ML, see Maximum likelihood (ML)
MLE, see Maximum likelihood estimation (MLE)
MLL, see Multilevel lasso (MLL)
MLM, see Multilevel linear models (MLM)
Model comparison statistics, 57
Model fit, 12–13, 57–58, 119–120, 123, 129, 136–137, 139, 158, 166, 177–178, 215, 219
Multilevel cumulative logits model, 143
Multilevel data structure, 23–41
    intraclass correlation, 24–28
    multilevel linear models (MLM), 29–40
        assumptions, 37
        centering, 34–35
        longitudinal designs and relationship, 40
        parameter estimation with, 35–37
        random intercept, 30–31
        random slopes, 31–34
        three-level, 39–40
        two-level, 37–38
    nested data and cluster sampling designs, 23–24
    pitfalls of ignoring, 29
Multilevel generalized additive models, 217–218
Multilevel generalized linear models (MGLMs), 133–158
    for count data, 146–158
        additional level-2 effects inclusion to multilevel Poisson regression model, 150–158
        random coefficient Poisson regression, 148–150
        random intercept Poisson regression, 147–148
    for dichotomous outcome variable, 133–143
        additional level-1 and level-2 effects, 139–143
        random coefficients logistic regression, 137–139
        random intercept logistic regression, 134–137
    for ordinal outcome variable, 143–146
        random intercept logistic regression, 143–146