Exercise Solutions for Experimental Design
For
Scott E. Maxwell
University of Notre Dame
Harold D. Delaney
University of New Mexico
Ken Kelley
University of Notre Dame
I Conceptual Bases of Experimental Design and Analysis
1. As discussed in the text, the Baconian view that the whole process of science can be purely objective
and empirical is flawed with regard to the following:
a. Data collection—Preexisting ideas of what is interesting and relevant necessarily influence the sci-
entist’s decisions about what to study.
b. Data analysis—Selecting the most appropriate means for summarizing the collected data involves
the judgment of the scientist, and although precise rules can be stated for certain steps of the process
once a procedure has been decided upon, critical preliminary decisions, such as what statistical
hypotheses should be tested and how, are certainly debatable.
c. Data interpretation—The task of discovering theoretical mechanisms appropriate for explaining any
particular phenomenon is not accomplished by following a specified set of logical rules.
3. The logical problem in the suggestion that a materialist monism is necessitated by empirical findings
can be stated succinctly: one does not prove the nonexistence of nonempirical entities by empirical
methods. To suggest that a materialistic monist position is necessitated by empirical findings is to fall
into the logical error of begging the question—that is, of using as a premise of an argument the conclu-
sion the argument is trying to prove. In the present case, the erroneous argument takes the following
form as a solution to the problem of determining what exists:
1. One can use empirical observation of material entities to determine what exists.
2. With these methods, one observes only material entities.
3. Therefore, all that exists is material.
The argument rests on the premise stated in the first proposition, which is valid only if the conclusion
stated in the third proposition is correct. Examples of this kind of argument in the history of psychology
are given by Robinson (1995, Chapter 9).
For our purposes, the general methodological lesson to be learned concerns the relationship between
assumptions and conclusions. One must presuppose certain principles—for example, regarding the
uniformity of nature—in order to do science; the validity of the conclusions one draws from data will
rest on the validity of those presuppositions, but one’s findings will not ensure or necessitate the valid-
ity of those presuppositions. Similarly, within statistics, conclusions reached are valid under certain
assumptions. For example, under the assumption of homogeneity of within-group variances across
groups, a statistical test may suggest that the means of two groups are different, but the test of means
will say nothing about the validity of the homogeneity of variance assumption. As another example,
as discussed in the “Introduction to the Fisher Tradition” portion of Chapter 1, the probability value
associated with a test statistic assumes that a particular hypothesis is true, but it does not inform you of
the probability that the hypothesis presupposed is true (cf. the answer to Exercise 12).
7. The contrapositive of an assertion, in general, is that the negation of the conclusion of the assertion
implies the negation of the antecedent condition assumed in the assertion. Thus, the contraposi-
tive of the learning theorist’s assertion is, “If partially reinforced animals do not persist longer in
responding during extinction than continuously reinforced animals, then frustration theory is not
correct.”
12. False. The p value is the probability of the observed (or more extreme) results, given that you assume
that the results are due to chance. The question, on the other hand, asserts that the p value is the prob-
ability that “chance,” or the null hypothesis, is the correct explanation of the results. The distinction is
an important one.
The point can be underscored by using conditional probability notation. The p value is a probability
of the form Pr (data | null hypothesis)—that is, the probability that data having particular characteristics
will occur, assuming that the null hypothesis is true. However, p values are frequently misunderstood
as indicating Pr (null hypothesis | data)—that is, the probability that the null hypothesis is true, given
the data (see Bakan, 1966). Arriving at such a probability requires far more knowledge than is typically
available in a scientific investigation—for example, what are the alternative hypotheses that are pos-
sible, and for each, what is the prior probability that it is true and the probability of obtaining the data
if it were true? Thus, although one may wish that the probability of the truth of a particular hypothesis
could be determined on the basis of the results of a study, that is not the information yielded by a Fish-
erian hypothesis test.
15. This statement, like that in Exercise 12, is a version of the inverse probability misinterpretation of a p
value. That is, a statement about the chance a particular decision is wrong is an assertion about the prob-
ability of a state of the world given a particular decision. Such a statement could be represented as a condi-
tional probability of the form Pr (H0 | Decided to reject H0). There are different ways of making clear that
this does not have to be less than .05. First, consider what would be the case if the null hypothesis were
true and you decided to reject it. In such a case, the chance your decision is wrong is 100%, not less than
5% (see Greenland et al., 2016 for further discussion of this point). Second, one might think about how to
estimate Pr (H0 | Decided to reject H0) in a situation where many null hypotheses are being tested, most of
which are true. For example, in testing whether a particular gene is relevant to a given disease, for every
gene tested that is relevant, one might be testing 100 that are irrelevant. Even in a situation where one has
a very high sensitivity for rejecting the null hypothesis given the gene is relevant (i.e., essentially one out
of every one relevant gene tested will lead to a rejection) and one has exactly .05 probability of obtain-
ing data leading to a rejection of the null hypothesis when it is in fact true (i.e., essentially 5 out of every
100 irrelevant genes tested will lead to a rejection), the probability of a false positive given the test was
rejected would in the long run approach five out of six, not be less than .05. We will discuss this scenario
in more detail in Chapter 5 in introducing the idea of the false discovery rate.
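The arithmetic of this screening scenario is easy to verify; a minimal sketch in Python, using only the illustrative proportions described above (the counts are hypothetical, not data from an actual study):

```python
# Hypothetical gene-screening scenario: 1 relevant gene per 100 irrelevant ones,
# near-perfect sensitivity, and alpha = .05 for each test.
n_relevant, n_irrelevant = 1, 100
sensitivity, alpha = 1.0, 0.05

true_positives = n_relevant * sensitivity      # rejections among relevant genes
false_positives = n_irrelevant * alpha         # rejections among irrelevant genes

# Probability the null hypothesis is true given that it was rejected:
p_h0_given_reject = false_positives / (false_positives + true_positives)
print(round(p_h0_given_reject, 3))             # 5/6 ≈ .833, far larger than .05
```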
17. Decisions about whether the staff member’s performance is significantly different from chance can be
made by carrying out Fisher’s exact test. To this end, it is convenient to summarize the obtained results
as a 2 × 2 table, where the columns correspond to the actual categories of patients and the rows indicate
the staff member’s judgments about who was released early. Thus, “five out of six correct” would be
indicated by the following table:
                              Actual
                       Released   Not Released
Judged   Released          5            1          6
         Not Released      1            5          6
                           6            6         12
Following the logic used in the tea-tasting example, there are ${}_{6}C_{5} \cdot {}_{6}C_{1}$ ways of choosing five out of the six actually "released" patients and one out of the six actually "not released" patients, or $6 \cdot 6 = 36$ ways of correctly identifying five out of six early release patients. This number must be considered relative to the ${}_{12}C_{6} = (12 \cdot 11 \cdot 10 \cdot 9 \cdot 8 \cdot 7)/(6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1) = 924$ different ways of selecting 6 patients out of the total group of 12. Thus, given that the staff member knew six patients were released, the probability that he would identify five of those six correctly just by guessing is
$$\Pr(5 \text{ of } 6 \text{ correct}) = \frac{{}_{6}C_{5} \cdot {}_{6}C_{1}}{{}_{12}C_{6}} = \frac{36}{924} = .039.$$
Notice that, with the table arrayed as shown above with the actual categories corresponding to the
columns, the combinations involved in the probability are the number of ways of choosing the number
indicated in the first row out of the column total. Thus, the denominator of the probability is the num-
ber of ways of choosing the number of patients judged to have been released—that is, the marginal
total for the first row, or 6, out of the total for the table, 12. Similarly, the numerator involves the
product of the number of combinations of ways of choosing the number indicated in the first row cells
from the corresponding column totals. Notice also that the numbers chosen (5 and 1) and the sizes of
the subgroups from which they are chosen (6 and 6) in the numerator must sum to, respectively, the
number chosen (6) and the total sample size (12) in the denominator.
To determine a significance level, one needs to compute the probability not only of the obtained
results but also of every other outcome that provides as much or more evidence of association (cf. Hays,
1994). Clearly, getting all six correct would be stronger evidence of an association between the actual
categories and the staff member’s judgments than that obtained. The probability of this occurring would
be computed similarly:
$$\Pr(6 \text{ of } 6 \text{ correct}) = \frac{{}_{6}C_{6} \cdot {}_{6}C_{0}}{{}_{12}C_{6}} = \frac{1 \cdot 1}{924} = \frac{1}{924} = .001.$$
The problem also requests that a two-tailed test be performed. To carry out a two-tailed test,
one needs to consider the possibility of judgments that are predominantly incorrect but that are as
strongly indicative of an association between the actual and judged classifications (albeit in the oppo-
site direction) as the obtained results. (In actual practice, neither the staff member nor you might be
persuaded that you owe him money if his judgments were surprisingly worse than would be expected
by chance. Nonetheless, carrying out two-tailed tests here allows an important pedagogical point to
be made about when the two tails of the distribution used in Fisher’s exact test will be symmetrical.)
It turns out that, when either the column totals are equal to each other or the row totals are equal to
each other, the problem is perfectly symmetrical. For example, the probability of getting five of six
incorrect is the same as the probability of getting five of six correct. Thus, the probability of results as
good as or better than those obtained can be doubled to obtain a final answer that may be interpreted
as having a two-tailed significance. To illustrate this and to summarize the answer for the current example, we have
$$\Pr(5 \text{ or } 6 \text{ of } 6 \text{ correct}) = .039 + .001 = .040, \qquad p_{\text{two-tailed}} = 2(.040) = .08.$$
Because the significance level of .08 is greater than the specified alpha level of .05, you conclude the results are not significant. You do not owe the staff member any money.
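As a check on the hand computations, the hypergeometric probabilities above can be reproduced with SciPy; a minimal sketch using the margins of the 12-patient table:

```python
from scipy import stats

# 12 patients: 6 actually released, 6 not; the staff member judges 6 as released.
M, K, n_drawn = 12, 6, 6          # population size, # released, # judged released

p5 = stats.hypergeom.pmf(5, M, K, n_drawn)   # Pr(5 of 6 correct) = 36/924 ≈ .039
p6 = stats.hypergeom.pmf(6, M, K, n_drawn)   # Pr(6 of 6 correct) = 1/924 ≈ .001
print(round(p5 + p6, 3), round(2 * (p5 + p6), 3))   # one-tailed ≈ .040, two-tailed ≈ .08

# Fisher's exact test on the observed table gives the same one-tailed value.
_, p_one = stats.fisher_exact([[5, 1], [1, 5]], alternative="greater")
print(round(p_one, 3))
```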
If the staff member had identified 5 of 6 early release patients out of a total set of 15, the computed prob-
ability would, of course, be different. The obtained results in such a case could be summarized as follows:
                              Actual
                       Released   Not Released
Judged   Released          5            1          6
         Not Released      1            8          9
                           6            9         15
$$\Pr(5 \text{ of } 6 \text{ correct, out of } 15) = \frac{{}_{6}C_{5} \cdot {}_{9}C_{1}}{{}_{15}C_{6}} = \frac{6 \cdot 9}{5005} = \frac{54}{5005} = .0108,$$
$$\Pr(6 \text{ of } 6 \text{ correct, out of } 15) = \frac{{}_{6}C_{6} \cdot {}_{9}C_{0}}{{}_{15}C_{6}} = \frac{1}{5005} = .0002.$$
Thus, the probability of results as good as or better than those obtained is
$$\Pr(5 \text{ of } 6 \text{ or } 6 \text{ of } 6, \text{ out of } 15) = \frac{54 + 1}{5005} = .0110.$$
Now, since both row totals are unequal and column totals are unequal, one cannot simply double
probabilities to get a significance level but must examine probabilities of predominantly incorrect clas-
sifications to see if they are as extreme as the probabilities of these predominantly correct classifica-
tions. Again treating the marginal totals as fixed, we might first consider the likelihood of getting five
out of six incorrect, as we did before. The corresponding table would be as follows:
                              Actual
                       Released   Not Released
Judged   Released          1            5          6
         Not Released      5            4          9
                           6            9         15
$$\Pr(1 \text{ of } 6 \text{ correct, out of } 15) = \frac{{}_{6}C_{1} \cdot {}_{9}C_{5}}{{}_{15}C_{6}} = \frac{6 \cdot 126}{5005} = \frac{756}{5005} = .1510.$$
This outcome is far more likely than the observed outcome of five of six correct. The most extreme outcome in the incorrect direction, missing all six, has probability
$$\Pr(0 \text{ of } 6 \text{ correct, out of } 15) = \frac{{}_{6}C_{0} \cdot {}_{9}C_{6}}{{}_{15}C_{6}} = \frac{1 \cdot 84}{5005} = \frac{84}{5005} = .0168.$$
Thus the probability of missing them all is also a more likely outcome in this situation than getting five of six correct by chance alone.
This means that the only other chance outcome that is as extremely unlikely as or more extremely
unlikely than the observed outcome is correctly identifying all the early release patients. Thus, in this
case the two-tailed probability associated with the observed results turns out to be the same as the one-
tailed probability—namely, .0110. We would conclude that 5 of 6 correct identifications out of a set of
15 is compelling evidence for the staff member’s claim. In this case, he could collect on his bet.
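The 15-patient version can be checked the same way; with unequal margins the two tails are no longer mirror images, and SciPy's two-sided rule (summing all tables no more probable than the observed one) reproduces the conclusion above:

```python
from scipy import stats

# 15 patients: 6 actually released, 9 not; 6 judged released.
p5 = stats.hypergeom.pmf(5, 15, 6, 6)    # 54/5005 ≈ .0108
p6 = stats.hypergeom.pmf(6, 15, 6, 6)    # 1/5005  ≈ .0002
print(round(p5 + p6, 4))                 # one-tailed ≈ .0110

# Two-sided Fisher exact test: no table in the other tail is as improbable as the
# observed one, so the two-tailed p value equals the one-tailed value, ≈ .0110.
_, p_two = stats.fisher_exact([[5, 1], [1, 8]], alternative="two-sided")
print(round(p_two, 4))
```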
21. a. The observed sum of differences is 372—that is, 22 + 34 + 38 + 38 + 12 + 3 + 55 + 29 + 76 + 23
– 17 + 39 = 372.
b. 212 or 4096 assignments of signs to differences are possible.
c. (i) When all 12 differences are assigned positive values, the largest possible sum of 406 results; that is, reversing the sign of the single negative difference, −17, adds 2(17) = 34 to the observed sum: 372 + 34 = 406.
(ii) If either or both of the absolute differences that are less than 17 were negative, one would obtain
a sum in between the maximal sum of 406 and the observed sum of 372. There are three such
sums: one when 3 has the only negative sign, one when 12 has the only negative sign, and one
when both 3 and 12 are negative:
Case                          Sum
"3" negative                  406 − 2(3) = 400
"12" negative                 406 − 2(12) = 382
"3" and "12" negative         406 − 2(3) − 2(12) = 376
(iii) We have enumerated the 5 assignments of signs that result in sums greater than or equal to the
observed sum of 372. If all 12 signs were reversed in each of these 5 assignments, the 5 most
extremely negative sums possible would be obtained—namely, –372, –406, –400, –382, and
–376. Thus, 10 of the 4096 possible assignments of signs to the obtained differences result in
sums at least as large in absolute value as the obtained sum.
Since the probability of obtaining a sum at least as extreme as that observed is only 10/4096
or .0024, one would reject the null hypothesis that it was equally likely that sums would be
preceded by negative as by positive signs. Thus, it appears on the basis of this experiment that
the enriched environment caused an increase in the size of the cortex.
22. a. Several issues regarding Darwin’s design could be noted. The most important relate to his basic
design strategy, which was to attempt to achieve an unbiased and precise experiment by comparing
each cross-fertilized plant against a matched self-fertilized plant under conditions controlled to be
as equal as possible. As Darwin was well aware, many factors, besides the independent variable
encapsulated in the two seeds for a pair, would influence the height the plants ultimately achieved.
Thus he attempted to achieve a valid experiment by trying to ensure that the plants experienced equal
soil fertility, illumination, and watering. That he would be unable to achieve exact equality in such
conditions, even within a pair of plants, is evident not only logically but also in the data we have to
analyze, as we shall see.
One can think of the environmental conditions for each potential plant site as being predeter-
mined. Certainly, Darwin was aware of some of the relevant factors, such as amount of watering, and
just as certainly, there were factors he could not assess, such as air currents around the plants. The
basic validity of the experiment would have been ensured had Darwin, once the plots were divided
into 15 pairs of locations where environmental conditions were thought to be similar within a pair of
locations, randomly assigned the cross-fertilized plant to one location and the matched self-fertilized
plant to the other paired location. As it was, the validity of the experiment has to rest on the presump-
tion that Darwin knew enough not to bias conditions in favor of the cross-fertilized plants by the
particular set of sites to which he assigned them. Lack of perfect knowledge of the relevant causes
virtually ensures some inadvertent biasing of the true difference between the plant types. In Chapter 2,
we will refer to this basic sort of validity as the “internal” validity of a study and distinguish it from
other kinds of validity.
b. (i) As shown in the table of data in the answer to part (b) (ii), 13 of the 15 differences in columns II
and III favor the cross-fertilized plant. If cross-fertilization had no effect, each of the 15 differ-
ences would have a .5 probability of being positive, and one would expect .5(15) = 7.5 differ-
ences on the average to favor cross-fertilization. One can determine if the observed number of
differences favoring cross-fertilization is significantly different from this expected number by
using the binomial probability formula—namely,
$$\Pr(r \text{ successes}) = {}_{n}C_{r}\, p^{r}(1-p)^{n-r},$$
where r is the required number of successes, n is the number of trials, and p is the probability of
success. Here a success is defined as a difference favoring cross-fertilization, and each difference
constitutes a trial or an opportunity for a success. The statistical significance here is determined
by the “sign test,” which simply computes the binomial probability of results at least as extreme
as those observed. The probability of 13 or more differences favoring cross-fertilization is
$$\Pr(13 \text{ or more successes}) = {}_{15}C_{13}\left(\tfrac{1}{2}\right)^{13}\left(\tfrac{1}{2}\right)^{2} + {}_{15}C_{14}\left(\tfrac{1}{2}\right)^{14}\left(\tfrac{1}{2}\right)^{1} + {}_{15}C_{15}\left(\tfrac{1}{2}\right)^{15}\left(\tfrac{1}{2}\right)^{0}$$
$$= \left[\frac{15!}{13!\,2!} + \frac{15!}{14!\,1!} + \frac{15!}{15!\,0!}\right]\left(\frac{1}{2}\right)^{15} = \frac{105 + 15 + 1}{32{,}768} = \frac{121}{32{,}768} = .00369.$$
Given the symmetry of the distribution under the null hypothesis that p = .5, the probability of two or fewer successes is also .00369. Thus, the significance of the observed number of differences equals the probability of the observed or more extreme results:
$$p = .00369 + .00369 = .0074.$$
Since this is considerably less than .05, we would reject the null hypothesis that the probability
of a difference favoring cross-fertilization is .5.
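The binomial arithmetic can be confirmed directly with SciPy; a minimal sketch:

```python
from scipy import stats

# Sign test: 13 of 15 differences favor cross-fertilization under H0: p = .5.
p_upper = stats.binom.sf(12, 15, 0.5)             # Pr(13 or more successes) ≈ .00369
print(round(p_upper, 5), round(2 * p_upper, 4))   # one- and two-tailed values

# SciPy's exact binomial test gives the same two-tailed result.
print(stats.binomtest(13, 15, 0.5, alternative="two-sided").pvalue)
```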
(ii) The simplest possible parametric test appropriate for these data is a matched-pairs t test. The test
is carried out as a one-sample t test on the differences shown next.
(The table of the 15 paired heights and their differences, given with the exercise, is not reproduced here.)
Mean difference, $\bar{D}$ = 2.617
Standard deviation, $s_D$ = 4.718
We wish to test the null hypothesis that the population mean difference µD is 0. We can do so by
using the test statistic (see Weinberg & Abramowitz, 2002, p. 302)
$$t = \frac{\bar{D} - \mu_D}{s_{\bar{D}}},$$
where $\bar{D}$ is the mean difference score and $s_{\bar{D}}$ is the estimated standard error of the mean difference, defined as
$$s_{\bar{D}} = \frac{s_D}{\sqrt{n}}.$$
Here we have
$$s_{\bar{D}} = \frac{4.718}{\sqrt{15}} = \frac{4.718}{3.873} = 1.218,$$
and
$$t = \frac{\bar{D} - \mu_D}{s_{\bar{D}}} = \frac{2.617 - 0}{1.218} = 2.148,$$
which just exceeds the critical t value of 2.145 for a two-tailed test with 14 degrees of freedom and α of .05. In particular, the p value associated with an observed t of 2.148 is .0497. The p value for the parametric t test is in this case considerably less extreme than the p value for the nonparametric sign test of part (b)(i). This will not in general be the case. The sign test ignores
the magnitude of the differences and thus in effect treats them all as equal. The t test of course
uses a mean that reflects all the differences, and in these data, the two negative differences hap-
pen to be two of the largest differences in the set and thus bring down the mean more than they
would if they were only average in absolute value. So, since the two negative differences are
large ones, the evidence in favor of cross-fertilization is less compelling when appraised using
a procedure that considers the magnitude of the disconfirming evidence in these two cases. In a
sense, the sign test makes the evidence appear stronger than it really is by not taking into account
the fact that the differences that do favor self-fertilization in these data are large ones.
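A short sketch of the matched-pairs t test computed from the summary values above (the mean difference, its standard deviation, and n), with the p value taken from the t distribution rather than the table:

```python
from math import sqrt
from scipy import stats

n, d_bar, s_d = 15, 2.617, 4.718            # summary values from the text
se = s_d / sqrt(n)                           # estimated standard error ≈ 1.218
t = (d_bar - 0) / se                         # ≈ 2.148
p_two = 2 * stats.t.sf(abs(t), df=n - 1)     # ≈ .0497
print(round(se, 3), round(t, 3), round(p_two, 4))
```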
(iii) The assumption required for the sign test is only that the trials or replications or pairs of plants
included in the experiment are independent of each other. In the case of the t test, one not only
assumes that the differences are independent of each other but also that, over replications of the
study, the differences would be distributed as a normally distributed random variable.
(iv) Carrying out a randomization test for these data requires consideration of the mean difference
resulting from each of the 32,768 possible assignments of signs to the 15 observed differences.
The significance level for a two-tailed randomization test is simply twice the proportion of these
mean differences that are equal to or greater than the observed mean of 2.617. Although only
two of the observed differences were negative, because these were large scores, there are many
combinations of signs (including some with as many as six negative signs being assigned to
small differences) that result in mean differences larger than 2.617. Enumeration of these is very
tedious, but Fisher (1935/1971, p. 46) reports that 863 of the 32,768 mean differences are at least
this large and positive, and so the significance level of the randomization test is 2(863/32,768) =
2(.02634) = .0527.
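The full enumeration is easy to automate. A minimal sketch, assuming the fifteen paired differences (in eighths of an inch) are those tabulated by Fisher (1935) for Darwin's Zea mays data; if the exercise's values differ, the counts will differ accordingly:

```python
from itertools import product

# Cross- minus self-fertilized differences, in eighths of an inch (Fisher, 1935).
diffs = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
observed_total = sum(diffs)              # 314 eighths = 2.617 inches per pair

# Enumerate all 2**15 = 32,768 ways of attaching signs to the absolute differences
# and count those whose total is at least as large as the observed total.
count = sum(
    1
    for signs in product((1, -1), repeat=len(diffs))
    if sum(s * abs(d) for s, d in zip(signs, diffs)) >= observed_total
)
print(count, round(2 * count / 2 ** len(diffs), 4))   # 863 and ≈ .0527
```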
The only assumptions required by the randomization test are that the differences are inde-
pendent of each other and that not only their signs but also their magnitudes are meaningful.
Notice that all three of the tests make the assumption of independence, and we are not assured by
Darwin’s procedures that this was achieved. Because of the lack of random assignment, it is con-
ceivable that some factor was inadvertently confounded with the type of fertilization. Any causal
factor that Darwin’s procedures made systematically different across levels of the independent
variable would thus invalidate all three tests. To take a blatant example, if all the cross-fertilized
plants were given a more southerly exposure than the self-fertilized member of the pair, and if
southerly exposure is good for plant growth, then the difference scores would not be independent.
The hypothesis tested by the randomization test is that the observed set of signed differences
arose by a process of assigning a + or – sign with equal likelihood to each of the absolute differ-
ences observed. Because of the lack of random assignment, we cannot be assured that this test
or the others are valid. However, assuming the test’s validity, we have only marginal evidence
for the difference between the two kinds of fertilization, since the results fail to reach the con-
ventional .05 level. These data would need to be combined with other similar results to make the
case compelling.
c. Although the mean of Galton’s differences in column VIII is necessarily the same as the mean of the
differences between the original pairs of plants in columns II and III, the mean difference is made
to appear far too reliable by the re-pairing of the data. In essence, the differences in column VIII
make it look as though Darwin had far greater control over his experimental material than he did. In
particular, the data as arranged in columns VI and VII imply that Darwin knew enough to control the
factors affecting the height of plants to such an extent that, if one member of a pair were the tallest of
its type, the other member of the pair would also be the tallest in its series. Despite his best efforts to
achieve homogeneous pairs of plants, Darwin in fact was not able to approach this degree of control.
Rather than the correlation approaching +1 as implied by columns VI and VII, the correlation of the
heights of the original pairs was actually negative, r = –.338. Perhaps because of competition for
resources between plants planted close together, all of Darwin’s efforts to achieve homogeneous
plants within pairs were not sufficient in that the taller one member of a pair in his data, the shorter
the other member is expected to be.
Carrying out the computations for a matched-pairs t test on the rearranged data, we find that the standard deviation $s_D$ of the differences in column VIII is only 1.9597. This yields a standard error of .5060 and a t value of
$$t = \frac{2.617}{.506} = 5.171.$$
The probability of a result this large occurring, given that the null hypothesis is true, is less than 2
in 10,000 or .0002. Similarly, when a randomization test is performed on the differences in column VIII, the significance level is extreme, with only 46 of the 32,768 mean differences being as large as or larger in absolute value than the one obtained, implying p = .0014.
Thus, the mean difference by Galton’s approach is made to appear so reliable that it would occur
only on the order of 10 times in 10,000 replications if there were no effect of cross-fertilization. In
fact, results as extreme as those obtained would occur 5 times in 100 by chance alone, or on the order
of 500 times in 10,000 replications. Galton’s rearrangement in effect makes the evidence appear 50
times more compelling than it really is.
2
2. Because students in the United States are not randomly assigned to public versus Catholic high schools,
you should not conclude from a difference between means on a mathematics achievement test that it is
the education provided by the Catholic high schools that caused the scores of the Catholic students to
be higher. One would be concerned as to whether there was a selection bias operating. The attribution of
the cause of the difference to the nature of the high school education would be made more compelling
if other information were to be presented showing that public and Catholic students were comparable
on other background variables that could reasonably be viewed as contributing causes of mathematics
achievement. In fact, it turns out, as Wolfle (1987) reports, that there are large preexisting differences
between the two groups of students on variables that would predict higher mathematics achievement for
the Catholic students even if the mathematics instruction they received was the same as that received
by the public school students. For example, the mothers and fathers of the Catholic students had higher
levels of education and socio-economic status, on the average, than the parents of the public school
students. Within each group of students, these variables were related to students’ achievement and
fathers’ educational level in particular was predictive of higher mathematics achievement. This kind of
information about selection factors operating makes the effectiveness of the high school education per
se a less compelling explanation of the 3-point difference in mathematics achievement scores.
6. To make more compelling that both the memory reconsolidation task and the Tetris game were required
to reduce intrusive memories, a subsequent study should include two additional expanded control
groups—namely, a group that experienced only the memory reconsolidation task but not Tetris and a
group that experienced the Tetris game without memory reconsolidation. In fact, the study described in
Exercise 6 was Experiment 1 in James et al. (2015), and the study including the two groups from the
initial study plus the two expanded control groups constituted Experiment 2 of James et al. (2015).
9. The correlation described in this exercise does not necessarily imply causation and, in particular, does
not necessarily imply that watching television violence causes violent behavior. One other possible
explanation is that certain children are predisposed toward violent behavior and that predisposition
leads them to prefer violent television programs. The inference of the alleged causal relationship could
be strengthened by showing that the children who watch more violent television are essentially equiva-
lent on relevant background variables to the children who watch less. Doing so presumes of course that
you know enough to know which background variables are relevant and have access to their values.
Equivalence (in the long run) could only be assured by random assignment to television-watching
conditions.
II Model Comparisons for Between-Subjects Designs
3
5. False. MSB is an estimate of the variance of the individual scores in the population. It is, however,
based on the variability among the sample means. In particular, in an equal-n design, MSB is n times
the variance of the sample means, or n times the quantity that estimates the variance of the sampling
distribution of sample means.
7. False. Although in one-way designs it is most often the case that ER will equal SSTotal, this will not always
be the case. ER equals SSTotal when the restriction being tested is that the means of all groups are equal.
In other cases, ER could be either larger or smaller than SSTotal, depending on the restriction of interest.
8. The loss function used to solve estimation problems in this book is to summarize the errors by squaring
each one and summing them over all observations. Parameters are estimated so as to minimize such
losses—that is, to satisfy the least squares criterion.
10. a. Although 24 animals were used in Experiment 2, because they represent 12 pairs of littermates
whose cortex weights are expected to be positively correlated, there are only 12 independent obser-
vations. The information from each pair can be summarized as a difference score that we may denote
Di, as shown next. (The error scores on the right are used in the answer to part (d).)
(The table for Experiment 2 lists, for each of the 12 pairs, the Experimental and Control cortex weights, the difference $D_i$, the full-model errors $e_{iF} = D_i - \bar{D}$ and their squares $e_{iF}^2$, and the restricted-model squared errors $e_{iR}^2 = D_i^2$; it is not reproduced here.)
b. The full model expresses each difference score as the population mean difference plus error:
$$D_i = \mu_D + \varepsilon_i.$$
c. Here we want to test the null hypothesis (restriction) that the mean difference score µD is zero. The
restricted model incorporating this constraint is
Di = 0 + εi .
d. The estimated parameter value for the full model is the sample mean of the difference scores, $\bar{D}$, which here is 45. Subtracting this from the observed differences, we obtain the error scores $e_{iF}$ for the full model shown in the data table. Squaring these errors and summing yields $E_F = 10{,}690$. Alternatively, if one were computing $E_F$ using a hand calculator having a single-key standard deviation function, one could compute $E_F$ as $(n-1)s_D^2$. Here $n = 12$ and $s_D = 31.174$; consequently,
$$E_F = (12-1)(31.174)^2 = 11(971.818) = 10{,}690.$$
Since there are no parameters to estimate in the restricted model, the errors are the differences
between the observed difference scores and 0, or simply the observed difference scores themselves.
2
Thus, the values of ei R are as shown in the rightmost column of the table in part (a) and sum to 34,990.
Alternatively, one could obtain $E_R - E_F$ as the sum over all observations of the squared differences of the predictions of the two models (see Equation 3.57)—that is,
$$E_R - E_F = \sum_{i=1}^{n}(\bar{D} - 0)^2 = n\bar{D}^2 = 12(45)^2 = 24{,}300,$$
and so
$$E_R = E_F + 24{,}300 = 10{,}690 + 24{,}300 = 34{,}990.$$
e. The full model requires estimation of one parameter, and the restricted model none, so $df_F = n - 1 = 12 - 1 = 11$ and $df_R = n = 12$. Thus we have
$$F = \frac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F} = \frac{24{,}300/1}{10{,}690/11} = \frac{24{,}300}{971.818} = 25.005,$$
which exceeds the critical value of 19.7 from Appendix Table 2 for an F with 1 and 11 degrees of freedom at α = .001.
f. On the basis of the test in part (e), we reject the restricted model. That is, we conclude that it is not
reasonable to presume that the population mean cortex weight is the same for experimental as for
control animals. Or stated positively, we conclude that being raised in an enriched (as opposed to a
deprived) environment results in rats that have heavier cortexes.
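A sketch of the same model comparison carried out from the summary quantities above, with the p value computed from the F distribution rather than the table:

```python
from scipy import stats

E_F, df_F = 10_690, 11    # full model:       D_i = mu_D + e_i
E_R, df_R = 34_990, 12    # restricted model: D_i = 0 + e_i

F = ((E_R - E_F) / (df_R - df_F)) / (E_F / df_F)
p = stats.f.sf(F, df_R - df_F, df_F)
print(round(F, 3), round(p, 4))    # F(1, 11) ≈ 25.005, p ≈ .0004
```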
g. The data from the three experiments can again be summarized by using difference scores. We will use these differences to compare the following two models:
Full: $D_{ij} = \mu_j + \varepsilon_{ij}$;  Restricted: $D_{ij} = \mu + \varepsilon_{ij}$.
The parameter estimates for the full model are the mean difference scores for the three experiments,
D1 = 26, D2 = 45, and D3 = 31. Subtracting these from the observed differences yields the errors and
squared errors shown next.
Equivalently, we could use the standard deviations (see Equation 3.55), $E_F = \sum_{j=1}^{a}(n_j - 1)s_j^2$, which would enable us to obtain the same result within rounding error.
The numerator sum of squares for our test statistic may be obtained by taking n times the sum of squared deviations of group means around the grand mean (see Equation 3.51). Because the grand mean $\bar{D}$, the estimate of the restricted model's population mean, is here $(26 + 45 + 31)/3 = 34$, we have
$$E_R - E_F = n\sum_{j=1}^{a}(\bar{D}_j - \bar{D})^2 = 12[(26-34)^2 + (45-34)^2 + (31-34)^2] = 12(194) = 2{,}328.$$
The full model with its three parameters has N – a = 36 – 3 = 33 degrees of freedom. The restricted model requires estimation of only one mean, so $df_R = N - 1 = 36 - 1 = 35$. Thus, our test statistic may be computed as
$$F = \frac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F} = \frac{2{,}328/2}{E_F/33}.$$
Because the resulting F falls short of the critical value at α = .05 (which is 3.32 and 3.23 for 30 and 40 denominator degrees of freedom, respectively), we cannot reject the restricted model. The time of year when the experiments were run did not seem to affect the magnitude of the effect of the environment on the cortex weights.
11. a. Designating the experimental group as group 1 and the control group as group 2, we wish to test the restriction $\mu_1 = \mu_2$ by comparing the following models:
Full: $Y_{ij} = \mu_j + \varepsilon_{ij}$;  Restricted: $Y_{ij} = \mu + \varepsilon_{ij}$.
The parameter estimates for the full model are the group sample means, that is, $\hat{\mu}_1 = \bar{Y}_1 = 702$ and $\hat{\mu}_2 = \bar{Y}_2 = 657$, whereas for the restricted model the single population mean is estimated by the grand mean—that is, $\hat{\mu} = \bar{Y} = 679.5$. Summing the squared deviations of the individual scores from their group means yields
$$E_F = 18{,}038,$$
and the difference in the error sums of squares of the two models is
$$E_R - E_F = n\sum_{j=1}^{a}(\bar{Y}_j - \bar{Y})^2 = 12[(702 - 679.5)^2 + (657 - 679.5)^2] = 12(1{,}012.5) = 12{,}150,$$
which means $E_R = E_F + 12{,}150 = 18{,}038 + 12{,}150 = 30{,}188$. Because at the moment we are acting as if the observations within a pair are independent, we have $df_F = N - a = 24 - 2 = 22$ and $df_R = N - 1 = 24 - 1 = 23$. Thus, our test statistic is
$$F = \frac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F} = \frac{12{,}150/1}{18{,}038/22} = \frac{12{,}150}{819.909} = 14.819.$$
b. Certainly, we have strong evidence ( p < .001) against the null hypothesis in both the independent
groups analysis conducted for the current problem and the matched-pairs analysis conducted in
parts (a)–(f) of the previous problem. However, closer inspection reveals that there is slightly less
evidence against the restricted model here. If p values were computed by a statistical computer
program, we would see that, in the current analysis, we have p = .0009 associated with F(1, 22) = 14.819 here, whereas p = .0004 for F(1, 11) = 25.005 in Exercise 10(e). There are two main dif-
ferences in the analyses that are relevant. The more important difference is in the magnitude of
EF , which appears in the denominator of our test statistic. EF in the independent groups analysis is
68.7% larger than EF in the matched-pairs analysis (18,038 vs. 10,690) and is responsible for the F
being 68.7% larger in the matched-pairs analysis than in the independent groups analysis. This sum
of squared errors will be smaller (and hence the F larger) in the matched-pairs analysis whenever
the pairs of scores are positively correlated, which they are here, r = .4285. This within-pair pre-
dictability of scores allows us to reduce our errors and results in more sensitive tests, as we discuss
extensively in Part III of the text.
A secondary difference in the analysis is in the denominator degrees of freedom, which deter-
mine the particular F distribution used to determine the significance of the observed F. As df F
increases, the critical value required to declare a result significant at a given p value decreases.
While for larger values of n this decrease is trivial, for relatively small n the difference in criti-
cal F’s is noticeable, particularly for very small p values. For example, for p = .05, the critical
F of 4.84 for 1 and 11 degrees of freedom is 13% larger than the critical F of 4.30 for 1 and
22 degrees of freedom, but for p = .001, the critical F is 37% larger in the matched-pairs case
(19.7 vs. 14.4). Even so, the matched-pairs analysis is more compelling here because the cost
of having fewer degrees of freedom is more than outweighed by the benefit of reduced error
variability.
12. a. To standardize the observed difference between means of 702 – 657 = 45 mg, we need to determine
an estimate of the within-group standard deviation as indicated in Equations 3.81 and 3.85. The
corresponding pooled estimate of the within-group variance was determined in the answer to the
preceding question to be 819.909. Thus, the estimated standard deviation is
$$s_p = \sqrt{E_F/df_F} = \sqrt{819.909} = 28.634.$$
This implies that the difference in mean cortex weights is more than one-and-a-half standard deviations:
$$d = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p} = \frac{45}{28.634} = 1.572.$$
b. The proportion of the total sum of squares accounted for by the between-group differences is, from
Equation 3.94 and the results of the preceding problem,
$$R^2 = \frac{E_R - E_F}{E_R} = \frac{12{,}150}{30{,}188} = .4025.$$
A corrected estimate of the proportion of variability in the population accounted for by group mem-
bership is provided by ω̂ 2 as defined in Equation 96:
( ER − EF ) − (a −1)( EF / dfF )
ωˆ 2 =
E R + ( E F / dfF )
(3.96)
12,150 −(1)(819.909) 11, 330.091
= = = .3654
30,188 + 819.909 31, 007.909
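The standardized effect measures can be computed directly from the error sums of squares reported above; a minimal sketch:

```python
from math import sqrt

E_F, df_F, E_R = 18_038, 22, 30_188   # from the independent-groups analysis
a, mean_diff = 2, 45.0                 # 702 - 657

s_p = sqrt(E_F / df_F)                                    # ≈ 28.634
d = mean_diff / s_p                                       # ≈ 1.572
r_squared = (E_R - E_F) / E_R                             # ≈ .4025
omega_sq = ((E_R - E_F) - (a - 1) * (E_F / df_F)) / (E_R + E_F / df_F)  # ≈ .3654
print(round(s_p, 3), round(d, 3), round(r_squared, 4), round(omega_sq, 4))
```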
14. Given µ1 = 21, µ2 = 24, µ3 = 30, µ4 = 45, and σε = 20, we need to compute σm and f, as indicated in
Equations 3.90 and 3.91, where μ = (Σ μj)/a = (21 + 24 + 30 + 45)/4 = 120/4 = 30:
$$\sigma_m = \sqrt{\frac{\sum_j(\mu_j - \bar{\mu})^2}{a}} = \sqrt{\frac{(21-30)^2 + (24-30)^2 + (30-30)^2 + (45-30)^2}{4}} = \sqrt{\frac{81 + 36 + 0 + 225}{4}} = \sqrt{85.5} = 9.2466,$$
$$f = \frac{\sigma_m}{\sigma_\varepsilon} = \frac{9.2466}{20} = .4623.$$
From this, we may obtain φ for trial values of n, using Equation 3.106, and determine the resulting power from Appendix Table 11. With n = 9 per group, we would have
$$\phi = f\sqrt{n} = .4623\sqrt{9} = 1.39,$$
which would result in $df_F = 4(9-1) = 4 \cdot 8 = 32$. However, the vertical line above a φ of 1.4 for the chart with 3 numerator degrees of freedom intersects the power curve for 30 denominator degrees of freedom at a height corresponding to just less than a power of .60. Thus, we need to try a larger value for n. If we were to increase n to 16, we would have
$$\phi = f\sqrt{n} = .4623\sqrt{16} = 1.85,$$
which for $df_F = 4(16-1) = 60$ would result in a power of over .86. Because this is more than the required power, we would decrease n. A computation with n = 14 yields
$$\phi = f\sqrt{n} = .4623\sqrt{14} = 1.73, \text{ with } df_F = 4(14-1) = 52.$$
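Instead of reading power from the charts in Appendix Table 11, the same quantities can be computed from the noncentral F distribution, using the relation that the noncentrality parameter is λ = a·n·f² (so that φ = f√n); a sketch:

```python
from scipy import stats

a, f, alpha = 4, 0.4623, 0.05

def anova_power(n_per_group):
    df1, df2 = a - 1, a * (n_per_group - 1)
    lam = a * n_per_group * f ** 2                # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, df1, df2)
    return 1 - stats.ncf.cdf(f_crit, df1, df2, lam)

for n in (9, 14, 16):
    print(n, round(anova_power(n), 3))   # compare with the chart-based readings above
```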
Next, noting the equal n in the two groups, we can compute $MS_W$ using Equation 3.56 simply as the average of the two variances:
$$MS_W = \frac{E_F}{df_F} = \frac{\sum_j s_j^2}{a} = \frac{(1.57)^2 + (2.55)^2}{2} = \frac{2.46 + 6.50}{2} = 4.48.$$
Thus, the value of the test statistic can be computed as
$$F = \frac{(E_R - E_F)/(df_R - df_F)}{E_F/df_F} = \frac{MS_B}{MS_W} = \frac{N\hat{\alpha}_1^2/(a-1)}{\sum_j s_j^2/a} = \frac{13.225/1}{4.48} = 2.95, \quad p > .05.$$
To compute the effect size d we need to compute the pooled estimate of the within-group standard
deviation as the square root of MSW:
$$s_p = \sqrt{\frac{E_F}{df_F}} = \sqrt{MS_W} = \sqrt{4.48} = 2.117,$$
$$d = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p} = \frac{3.65 - 4.80}{2.117} = -.543.$$
Similarly, if the value of $f = \sigma_m/\sigma_\varepsilon$ were computed by using sample values of the effect parameters and within-group standard deviation to compute $\hat{f}_{\text{obs}}$ as in Equation 3.107, we would have
$$\hat{f}_{\text{obs}} = \sqrt{\frac{\sum_j \hat{\alpha}_j^2/a}{MS_W}} = \sqrt{\frac{[(-.575)^2 + (.575)^2]/2}{4.48}} = \frac{.575}{2.117} = .272.$$
Thus, we note that $\hat{f}_{\text{obs}}$ here is one-half the absolute value of d, and since $\hat{f}_{\text{obs}} > .25$, it is a medium-sized effect.
b. To determine the sample size required to achieve a power of .80, using $f = \hat{f}_{\text{obs}} = .272$, we would compute values of φ for various values of n and then read the power value off the chart in Appendix Table 11. Given that the results were not significant at α = .05 with 20 subjects per group, we might begin by trying a larger value of n. For example, if n were to be 40, we would compute $\phi = f\sqrt{n} = .272\sqrt{40} = 1.72$, which would result in a power of approximately .68. By trying still larger values of n, we eventually achieve a power of .80 by using n = 55, which yields $\phi = f\sqrt{n} = .272\sqrt{55} = 2.01$.
c. An effect size of f.50 denotes the lower bound of a one-sided 50% confidence interval for f but this
also corresponds to the estimated median fˆMed . This value can be computed using the Shiny App
at designingexperiments.com for the confidence interval for the square root of the signal-to-noise
ratio by specifying a confidence level of 0. The other values required are the observed F ratio from
part a of 2.9495, the numerator degrees of freedom of 1, the denominator degrees of freedom of 38,
and the total sample size of 40. The Shiny App returns the value of f.5 = f Med = .26958. This effect
size can then be used to determine power by hand, as illustrated in the answer to part b, or by using
software. To illustrate the use of software, using the SPSS syntax in footnote 22 of Chapter 3, one
could set up a data file with various values of group size, starting at n = 55, and using a = 2 and fef-
fect = .26958, computing the non-centrality parameter, critical F value, and power using the lines
of syntax
about 27% smaller than the obvious estimate of f and is a plausible value given the uncertainty in
the estimation of the effect size, if this smaller value is the true effect size, 182% more subjects than
suggested by the previous power analysis would be required to achieve 80% power.
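The SPSS syntax referenced above is not reproduced here; as a rough analogue (not the book's syntax), the median effect size $f_{.50}$ can be recovered by solving for the noncentrality parameter whose median equals the observed F, under the usual relation λ = N·f². A sketch:

```python
from math import sqrt
from scipy import stats
from scipy.optimize import brentq

F_obs, df1, df2, N_total = 2.9495, 1, 38, 40

# Solve for the noncentrality parameter at which the observed F is the median of
# the noncentral F distribution, then convert to f via lambda = N * f**2.
lam_med = brentq(lambda lam: stats.ncf.cdf(F_obs, df1, df2, lam) - 0.50, 1e-6, 50.0)
f_med = sqrt(lam_med / N_total)
print(round(f_med, 5))    # ≈ .2696, matching the value reported above
```

The resulting value can then be fed into a power computation like the one sketched for the preceding exercise.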
19. a. An omnibus test of the Condition factor fails to reach significance, F (3,151) = 2.506, p = .061. Thus,
such an approach would mean that one could not reject the null hypothesis that the assigned perspec-
tive did not affect the implicit anger score.
b. The two Immersed perspective conditions resulted in similar mean levels of implicit anger, 3.42 in
the What group and 3.69 in the Why group. A two-group one-way ANOVA comparing these group
means did not approach significance, F (1, 75) = .62, p = .434. In contrast, the two-group ANOVA
comparing the What and Why groups in the Distanced perspective did (barely) reach statistical sig-
nificance at α = .05, F (1, 76) = 3.969, p = .04994.
c. The omnibus approach of part a has the advantage of utilizing all the data collected in a single
analysis. This approach has the advantage of an error term based on more degrees of freedom
and hence a very slightly smaller critical F value would be required to achieve significance than
in the separate two-group ANOVAs of part b. The tests in part b have the important advantage of
corresponding more directly to the investigators’ hypotheses about where the difference among
groups would lie and did allow the test of the expected difference to be declared significant at
α = .05. The next two chapters will deal with methods of analysis that allow the advantages of
these two approaches to be combined. That is, tests of pre-planned comparisons of means could be
conducted that would base estimates of denominator error variance on observations in all groups,
but would allow tests to be conducted of the particular comparisons of interest. (Incidentally,
Kross et al. (2005, Study 1) did conduct tests of a complex four-group comparison but reported
erroneous degrees of freedom for such tests.) Chapter 5 will also deal at length with alternative
methods to avoid inflating the probability of a Type I error as a result of having conducted mul-
tiple tests. We will see that recommended methods would result in a more conservative test that
would mean the comparison of the What and Why Distanced conditions would not be declared
significant. We will consider this data set again in one of the Exercises in Chapter 9, which makes
use of information on a covariate collected by Kross et al. that allows for a more sensitive test of
the hypotheses of interest.
3 Extension
1. False. F* can be more powerful than F when sample sizes are unequal and population variances are also
unequal.
7. False. Simulation studies have shown that the Kruskal-Wallis test is sensitive to heterogeneity of vari-
ance when sample sizes are unequal. F* and W are usually better alternatives in this situation.
8. a.
$$F = \frac{\sum_{j=1}^{a} n_j(\bar{Y}_j - \bar{Y})^2/(a-1)}{\sum_{j=1}^{a}(n_j - 1)s_j^2/(N-a)},$$
so
$$\sum_{j=1}^{a} n_j(\bar{Y}_j - \bar{Y})^2 = 20[(10 - 12.6667)^2] + 20[(12 - 12.6667)^2] + 50[(14 - 12.6667)^2] = 240.00.$$
In addition,
$$\sum_{j=1}^{a}(n_j - 1)s_j^2/(N - a) = \frac{19(10) + 19(10) + 49(50)}{87} = 32.5287.$$
Thus,
$$F = \frac{240.00/2}{32.5287} = 3.69.$$
b.
$$F^* = \frac{\sum_{j=1}^{a} n_j(\bar{Y}_j - \bar{Y})^2}{\sum_{j=1}^{a}\left(1 - \dfrac{n_j}{N}\right)s_j^2}.$$
From part (a),
$$\sum_{j=1}^{a} n_j(\bar{Y}_j - \bar{Y})^2 = 240.00,$$
and
$$\sum_{j=1}^{a}\left(1 - \frac{n_j}{N}\right)s_j^2 = \left(1 - \frac{20}{90}\right)(10) + \left(1 - \frac{20}{90}\right)(10) + \left(1 - \frac{50}{90}\right)(50) = 7.7778 + 7.7778 + 22.2222 = 37.7778.$$
Thus,
$$F^* = \frac{240.00}{37.7778} = 6.35.$$
c.
$$W = \frac{\sum_{j=1}^{a} w_j(\bar{Y}_j - \tilde{Y})^2/(a-1)}{1 + \tfrac{2}{3}(a-2)\Lambda},$$
where
$$w_j = \frac{n_j}{s_j^2}, \qquad \tilde{Y} = \frac{\sum_{j=1}^{a} w_j\bar{Y}_j}{\sum_{j=1}^{a} w_j}, \qquad \Lambda = \frac{3\sum_{j=1}^{a}\left(1 - \dfrac{w_j}{\sum_j w_j}\right)^{\!2}\Big/(n_j - 1)}{a^2 - 1}.$$
Here
w1 = 20 / 10 = 2,
w2 = 20 / 10 = 2,
w3 = 50 / 50 = 1.
Further,
$$\tilde{Y} = \frac{2(10) + 2(12) + 1(14)}{2 + 2 + 1} = \frac{58}{5} = 11.6,$$
and
$$\Lambda = \frac{3\left[\dfrac{(1 - 2/5)^2}{19} + \dfrac{(1 - 2/5)^2}{19} + \dfrac{(1 - 1/5)^2}{49}\right]}{3^2 - 1} = \frac{3(.0510)}{8} = .0191.$$
Thus,
$$W = \frac{\left[2(10 - 11.6)^2 + 2(12 - 11.6)^2 + 1(14 - 11.6)^2\right]/2}{1 + \tfrac{2}{3}(1)(.0191)} = \frac{5.60}{1.0127} = 5.53.$$
d. Yes, the F value obtained in part (a) is substantially lower than either F* from part (b) or W from part
(c). When large samples are paired with large sample variances, as in this example, F will be smaller
than F* or W. When this pattern holds for population variances, F will tend to be conservative, while
F* and W will tend to be robust.
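A sketch that reproduces all three statistics from the group summary values used in the computations above (means, variances, and sample sizes):

```python
import numpy as np

ybar = np.array([10.0, 12.0, 14.0])   # group means
s2 = np.array([10.0, 10.0, 50.0])     # group variances
n = np.array([20, 20, 50])
a, N = len(n), n.sum()

grand = (n * ybar).sum() / N
F = ((n * (ybar - grand) ** 2).sum() / (a - 1)) / (((n - 1) * s2).sum() / (N - a))

# Brown-Forsythe F*:
F_star = (n * (ybar - grand) ** 2).sum() / ((1 - n / N) * s2).sum()

# Welch's W:
w = n / s2
y_tilde = (w * ybar).sum() / w.sum()
lam = 3 * ((1 - w / w.sum()) ** 2 / (n - 1)).sum() / (a ** 2 - 1)
W = ((w * (ybar - y_tilde) ** 2).sum() / (a - 1)) / (1 + (2 / 3) * (a - 2) * lam)

print(round(F, 2), round(F_star, 2), round(W, 2))   # ≈ 3.69, 6.35, and 5.53
```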
10. The Kruskal-Wallis test provides a nonparametric analysis of these data. The first step in applying the
test is to rank order all observations in the entire set of N subjects. Replacing each score with its rank
(where 1 = lowest and 18 = highest) for these data yields mean ranks $\bar{R}_j$ of 7.4167, 7.4167, and 13.6667 for the three groups of $n_j = 6$ observations each. The Kruskal–Wallis statistic is then
$$H = \frac{12}{N(N+1)}\sum_{j=1}^{a} n_j\left(\bar{R}_j - \frac{N+1}{2}\right)^{\!2} \qquad (3E2.7)$$
$$= \frac{12}{18(19)}\left[6(7.4167 - 9.5000)^2 + 6(7.4167 - 9.5000)^2 + 6(13.6667 - 9.5000)^2\right]$$
$$= (.03509)(26.0408 + 26.0408 + 104.1683) = 5.48.$$
Because there are tied observations, the correction factor T should be applied, where
$$T = 1 - \frac{\sum_{i=1}^{G}(t_i^3 - t_i)}{N^3 - N}.$$
There are three sets of tied scores (at 7.5, 11.5, and 13.5), so G = 3. In each case, there are two observations tied at the value, so $t_1 = 2$, $t_2 = 2$, and $t_3 = 2$. Thus, the correction factor for these data is
$$T = 1 - \frac{3(2^3 - 2)}{18^3 - 18} = 1 - \frac{18}{5{,}814} = .9969,$$
and
$$H' = H/T = 5.48/.9969 = 5.50.$$
The critical value is a chi-square with a – 1, or 2, degrees of freedom. At the .05 level, the critical
value is 5.99 (see Appendix Table 9), so the null hypothesis cannot be rejected.
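The same arithmetic can be reproduced from the rank summaries and the tie pattern (with the raw scores one would simply call scipy.stats.kruskal, which applies the tie correction automatically); a minimal sketch:

```python
from scipy import stats

N, n_j = 18, [6, 6, 6]
mean_ranks = [7.4167, 7.4167, 13.6667]

H = 12 / (N * (N + 1)) * sum(n * (r - (N + 1) / 2) ** 2
                             for n, r in zip(n_j, mean_ranks))

ties = [2, 2, 2]                                     # three pairs of tied scores
T = 1 - sum(t ** 3 - t for t in ties) / (N ** 3 - N)
H_prime = H / T

crit = stats.chi2.ppf(0.95, df=len(n_j) - 1)
print(round(H, 2), round(H_prime, 2), round(crit, 2))   # ≈ 5.48, 5.50, 5.99
```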
4
1. a. c1 = 1, c2 = − 1, c3 = 0, c4 = 0
b. c1 = 1, c2 = −.5, c3 = −.5, c4 = 0
c. c1 = 0, c 2 = 1, c3 = 0, c 4 = −1
d. $c_1 = -\tfrac{1}{3}$, $c_2 = -\tfrac{1}{3}$, $c_3 = -\tfrac{1}{3}$, $c_4 = 1$
3. a. Testing the contrast for statistical significance involves a four-step process. First, $\hat{\psi}$ must be found from Equation 4.31:
$$\hat{\psi} = \sum_{j=1}^{a} c_j\bar{Y}_j = .5(12) + .5(10) - 1(6) = 5. \qquad (4.31)$$
Second, the sum of squares associated with the contrast is determined from Equation 4.30:
$$E_R - E_F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{(5)^2}{(.5)^2/10 + (.5)^2/10 + (-1)^2/10} = \frac{25}{.15} = 166.67. \qquad (4.30)$$
Third, the F value for the contrast can be determined from Equation 4.32:
$$F = \frac{(\hat{\psi})^2}{MS_W\sum_{j=1}^{a}(c_j^2/n_j)}. \qquad (4.32)$$
We are told that $MS_W = 25$, so the F value for the contrast is given by
$$F = \frac{166.67}{25} = 6.67.$$
Fourth, this F value must be compared to a critical F value. The critical F here has 1 numerator and
27 (i.e., 30 – 3) denominator degrees of freedom. Appendix Table 2 shows the critical F for 1 and 26
degrees of freedom or for 1 and 28 degrees of freedom, but not for 1 and 27 degrees of freedom. To
be slightly conservative, we will choose the value for 1 and 26 degrees of freedom, which is 4.23.
The observed value of 6.67 exceeds the critical value, so the null hypothesis is rejected. Thus, there
is a statistically significant difference between the group 3 mean and the average of the group 1 and
2 means.
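A sketch of the four-step contrast computation, assuming equal group sizes of n = 10 (consistent with the 30 total subjects and the sums of squares above):

```python
import numpy as np
from scipy import stats

means = np.array([12.0, 10.0, 6.0])
c = np.array([0.5, 0.5, -1.0])
n = np.array([10, 10, 10])       # assumed equal n (N = 30)
MSW, df_error = 25.0, 27

psi_hat = (c * means).sum()                       # 5.0
ss_contrast = psi_hat ** 2 / (c ** 2 / n).sum()   # E_R - E_F = 166.67
F = ss_contrast / MSW                             # 6.67
p = stats.f.sf(F, 1, df_error)
print(round(ss_contrast, 2), round(F, 2), round(p, 4))
```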
b. Once again, the same four steps as in part (a) provide a test of the contrast. First, from Equation 4.31,
$$\hat{\psi} = \sum_{j=1}^{a} c_j\bar{Y}_j = 1(12) + 1(10) - 2(6) = 10. \qquad (4.31)$$
Second, from Equation 4.30,
$$E_R - E_F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{(10)^2}{(1)^2/10 + (1)^2/10 + (-2)^2/10} = \frac{100}{.6} = 166.67. \qquad (4.30)$$
Third, from Equation 4.32,
$$F = \frac{(\hat{\psi})^2}{MS_W\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{166.67}{25} = 6.67. \qquad (4.32)$$
Fourth, as in part (a), the critical F value is 4.23. Thus, the null hypothesis is rejected.
c. In part (a), $(\hat{\psi})^2 = 25$. In part (b), $(\hat{\psi})^2 = 100$. Thus, $(\hat{\psi})^2$ is four times larger in part (b) than in part (a). Similarly, in part (a),
$$\sum_{j=1}^{a} c_j^2 = (.5)^2 + (.5)^2 + (-1)^2 = 1.5,$$
whereas in part (b),
$$\sum_{j=1}^{a} c_j^2 = (1)^2 + (1)^2 + (-2)^2 = 1 + 1 + 4 = 6.$$
Thus, $\sum_{j=1}^{a} c_j^2$ is four times larger in part (b) than in part (a). The inclusion of the $\sum_{j=1}^{a} c_j^2/n_j$ term in Equation 4.32 guarantees that the F test of a contrast will not be affected by the absolute magnitude of the contrast coefficients. As our result for part (b) shows, all of the contrast coefficients can be multiplied by 2 (or any other constant) without changing the F value for the contrast.
8. a. Because this is a pairwise comparison, Equation 4.18 provides the simplest expression for the F value for the contrast:
$$F = \frac{n_1 n_2(\bar{Y}_1 - \bar{Y}_2)^2}{(n_1 + n_2)MS_W}. \qquad (4.18)$$
We know that $n_1 = 20$, $n_2 = 20$, $\bar{Y}_1 = 6.0$, and $\bar{Y}_2 = 4.0$. To find $MS_W$, recall that, during the discussion of pooled and separate error terms, it was stated that
$$MS_W = \frac{\sum_{j=1}^{a}(n_j - 1)s_j^2}{\sum_{j=1}^{a}(n_j - 1)},$$
which for these data equals 9.62. Substituting into the formula for F yields
$$F = \frac{(20)(20)(6.0 - 4.0)^2}{(20 + 20)(9.62)} = 4.16.$$
The critical F here has 1 numerator and 47 (i.e., 50 – 3) denominator degrees of freedom. Appendix
Table 2 shows the critical F for 1 and 40 degrees of freedom or for 1 and 60 degrees of freedom, but
not for 1 and 47 degrees of freedom. To be slightly conservative, we will choose the value for 1 and
40 degrees of freedom, which is 4.08. The observed F value of 4.16 exceeds the critical value, so the
null hypothesis is rejected. Thus, there is a statistically significant difference between the means of
the cognitive and the behavioral groups.
b. Once again, this is a pairwise comparison, so Equation 4.18 can be used:
$$F = \frac{n_1 n_2(\bar{Y}_1 - \bar{Y}_2)^2}{(n_1 + n_2)MS_W}, \qquad (4.18)$$
where the "1" subscript refers to the cognitive group, and the "2" subscript refers to the control group. Substituting into the formula for F yields
$$F = \frac{(20)(10)(6.0 - 3.8)^2}{(20 + 10)(9.62)} = 3.35.$$
As in part (a), the critical F value is 4.08. The observed F value is less than the critical value, so
the null hypothesis cannot be rejected. The difference between the means of the cognitive and the
control groups is not statistically significant.
c. The mean difference between the cognitive and behavioral groups is 2.0; the mean differ-
ence between the cognitive and control groups is 2.2. However, the smaller mean difference is
Individual Comparisons of Means 29
statistically significant, while the larger mean difference is not. The reason for this discrepancy is
that the smaller mean difference is based on larger samples (viz., samples of 20 and 20, instead
of 20 and 10). As a result, the mean difference of 2.0 is based on a more precise estimate than the
mean difference of 2.2.
13. a. Using $MS_W$:
$$\hat{\psi} = \bar{Y}_3 - \bar{Y}_4 = 6 - 8 = -2,$$
$$E_R - E_F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{(-2)^2}{(1)^2/5 + (-1)^2/5} = \frac{4}{.4} = 10.00,$$
$$MS_W = \frac{\sum_{j=1}^{a} s_j^2}{a} = \frac{1 + 1 + 9 + 9}{4} = 5.00.$$
Thus,
$$F = \frac{(\hat{\psi})^2}{MS_W\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{E_R - E_F}{MS_W} = \frac{10.00}{5.00} = 2.00.$$
From Appendix Table 2, the critical F value for 1 numerator and 16 denominator degrees of freedom
is 4.49. Thus, the null hypothesis cannot be rejected.
Using a Separate Error Term:
The F statistic is now given by
$$F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)s_j^2} = \frac{(-2)^2}{\dfrac{(0)^2}{5}(1) + \dfrac{(0)^2}{5}(1) + \dfrac{(1)^2}{5}(9) + \dfrac{(-1)^2}{5}(9)} = \frac{4.00}{3.60} = 1.11.$$
The denominator degrees of freedom of the critical value are given by Equation 4.34:
$$df = \frac{\left[\sum_{j=1}^{a} c_j^2 s_j^2/n_j\right]^2}{\sum_{j=1}^{a}\dfrac{(c_j^2 s_j^2/n_j)^2}{n_j - 1}} \qquad (4.34)$$
$$= \frac{\left[\dfrac{(0)^2(1)}{5} + \dfrac{(0)^2(1)}{5} + \dfrac{(1)^2(9)}{5} + \dfrac{(-1)^2(9)}{5}\right]^2}{\dfrac{[(0)^2(1)/5]^2}{5-1} + \dfrac{[(0)^2(1)/5]^2}{5-1} + \dfrac{[(1)^2(9)/5]^2}{5-1} + \dfrac{[(-1)^2(9)/5]^2}{5-1}} = \frac{(3.60)^2}{0 + 0 + .81 + .81} = 8.00.$$
The corresponding critical value is 5.32. Thus, the separate error term approach here produces an
appreciably smaller observed F and a somewhat larger critical F.
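A sketch contrasting the pooled and separate error-term computations, including the Equation 4.34 degrees of freedom, using the group summaries from this exercise:

```python
import numpy as np

means = np.array([4.0, 6.0, 6.0, 8.0])
s2 = np.array([1.0, 1.0, 9.0, 9.0])
n = np.array([5, 5, 5, 5])
c = np.array([0.0, 0.0, 1.0, -1.0])     # psi = mu3 - mu4

psi_hat = (c * means).sum()

# Pooled error term (MSW) with equal n:
MSW = s2.mean()
F_pooled = psi_hat ** 2 / (MSW * (c ** 2 / n).sum())            # 2.00

# Separate error term and Equation 4.34 degrees of freedom:
terms = c ** 2 * s2 / n
F_separate = psi_hat ** 2 / terms.sum()                         # ≈ 1.11
df_separate = terms.sum() ** 2 / (terms ** 2 / (n - 1)).sum()   # 8.0

print(round(F_pooled, 2), round(F_separate, 2), round(df_separate, 1))
```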
b. Using $MS_W$:
$$\hat{\psi} = \bar{Y}_1 - \bar{Y}_2 = 4 - 6 = -2,$$
$$E_R - E_F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{(-2)^2}{(1)^2/5 + (-1)^2/5} = \frac{4}{.4} = 10.00,$$
$$MS_W = \frac{\sum_{j=1}^{a} s_j^2}{a} = \frac{1 + 1 + 9 + 9}{4} = 5.00.$$
Thus,
$$F = \frac{(\hat{\psi})^2}{MS_W\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{E_R - E_F}{MS_W} = \frac{10.00}{5.00} = 2.00.$$
As in part (a), the critical F value is 4.49, so the null hypothesis cannot be rejected.
Using a Separate Error Term:
$$F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)s_j^2} = \frac{(-2)^2}{\dfrac{(1)^2}{5}(1) + \dfrac{(-1)^2}{5}(1) + \dfrac{(0)^2}{5}(9) + \dfrac{(0)^2}{5}(9)} = \frac{4.00}{0.40} = 10.00.$$
The denominator degrees of freedom of the critical value are given by Equation 34:
2
a 2 2
c s n
∑ j j j
j =1
df = 2
∑ (c s
j =1
2 2
j j n j )2 ( n j − 1)
2
(1)2 (1) (−1)2 (1) (0)2 (9) (0)2 (9)
+ + +
5 5 5 5
= 2 2 2 2
(4.34)
(1)2 (−1) (1)2 (−1) (0)2 (9) (0)2 (9)
5 5 5 5
+ + +
5 −1 5 −1 5 −1 5 −1
2
(.40 )
=
.01 + .01 + .01 + .01
= 8.00.
The corresponding critical value is 5.32, so the null hypothesis is rejected. Unlike part (a), the sepa-
rate error term approach in part (b) produced a substantially larger F value than was obtained using
MSW. The reason is that the separate error term approach takes into account that, in these data, the Y1
and Y2 estimates are much more precise than are Y3 and Y4 because of the large differences in within-
group variances. Although the separate error term approach necessarily has a larger critical value
than does the pooled error term approach, in these data, the much larger F value associated with the
separate error term approach overrides the slight increase in critical value.
c. Using $MS_W$:
$$\hat{\psi} = \bar{Y}_1 + \bar{Y}_2 - \bar{Y}_3 - \bar{Y}_4 = 4 + 6 - 6 - 8 = -4,$$
$$E_R - E_F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{(-4)^2}{(1)^2/5 + (1)^2/5 + (-1)^2/5 + (-1)^2/5} = \frac{16}{.8} = 20.00,$$
$$MS_W = \frac{\sum_{j=1}^{a} s_j^2}{a} = \frac{1 + 1 + 9 + 9}{4} = 5.00.$$
Thus,
$$F = \frac{(\hat{\psi})^2}{MS_W\sum_{j=1}^{a}(c_j^2/n_j)} = \frac{E_R - E_F}{MS_W} = \frac{20.00}{5.00} = 4.00.$$
As in parts (a) and (b), the critical F value is 4.49, so the null hypothesis cannot be rejected.
Using a Separate Error Term:
The F statistic is now given by
$$F = \frac{(\hat{\psi})^2}{\sum_{j=1}^{a}(c_j^2/n_j)s_j^2} = \frac{(-4)^2}{\dfrac{(1)^2}{5}(1) + \dfrac{(1)^2}{5}(1) + \dfrac{(-1)^2}{5}(9) + \dfrac{(-1)^2}{5}(9)} = \frac{16.00}{4.00} = 4.00.$$
The denominator degrees of freedom of the critical value are given by Equation 4.34:
$$df = \frac{\left[\sum_{j=1}^{a} c_j^2 s_j^2/n_j\right]^2}{\sum_{j=1}^{a}\dfrac{(c_j^2 s_j^2/n_j)^2}{n_j - 1}} \qquad (4.34)$$
$$= \frac{\left[\dfrac{(1)^2(1)}{5} + \dfrac{(1)^2(1)}{5} + \dfrac{(-1)^2(9)}{5} + \dfrac{(-1)^2(9)}{5}\right]^2}{\dfrac{[(1)^2(1)/5]^2}{5-1} + \dfrac{[(1)^2(1)/5]^2}{5-1} + \dfrac{[(-1)^2(9)/5]^2}{5-1} + \dfrac{[(-1)^2(9)/5]^2}{5-1}} = \frac{(4.00)^2}{.01 + .01 + .81 + .81} = 9.76.$$
The critical value for 1 numerator and 9 denominator degrees of freedom is 5.12, so the null hypoth-
esis cannot be rejected. Notice that the only difference between the two approaches here is that the
critical value is larger for the separate error term approach. In particular, the observed F values are
identical for the two approaches. Such equivalence will occur whenever sample sizes are equal and
all contrast coefficients are either 1 or –1.
5
2. a. The Bonferroni procedure should be used because all comparisons are planned, but not every pos-
sible pairwise comparison will be tested (see Figure 5.1). In addition, from Table 5.9, it is clear that
the Bonferroni critical value is less than the Scheffé critical value, since the number of contrasts to
be tested is less than eight.
b. With 13 subjects per group, the denominator degrees of freedom equal 60 (i.e., 65 – 5). From Appen-
dix Table 3, we find that the critical Bonferroni F value for testing three comparisons at an experi-
mentwise alpha level of .05 with 60 denominator degrees of freedom is 6.07.
c. Because the comparison of µ3 versus µ4 has been chosen post hoc, but all comparisons to be tested are
still pairwise, Tukey’s method must be used to maintain the experimentwise alpha level (see Figure 5.1).
d. From Appendix Table 4, we find that the critical q value for 5 groups, 60 denominator degrees of freedom, and $\alpha_{EW} = .05$ equals 3.98. The corresponding critical F value is $(3.98)^2/2$, which equals 7.92.
e. The Bonferroni critical value for testing three planned comparisons is substantially lower than the
Tukey critical value for testing all pairwise comparisons. Thus, the price to be paid for revising
planned comparisons after having examined the data is an increase in the critical value, which will
lead to a decrease in power for each individual comparison.
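The tabled critical values can be reproduced with SciPy's studentized range distribution (available in SciPy 1.7 and later); a minimal sketch:

```python
from scipy import stats

a, df_error = 5, 60

q_crit = stats.studentized_range.ppf(0.95, a, df_error)   # ≈ 3.98
print(round(q_crit, 2), round(q_crit ** 2 / 2, 2))         # Tukey critical F ≈ 7.92

# Bonferroni critical F for three planned contrasts at alpha_EW = .05:
print(round(stats.f.ppf(1 - 0.05 / 3, 1, df_error), 2))    # ≈ 6.07
```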
3. a. With equal n, the F statistic for this pairwise comparison is given by
$$F = \frac{n(\bar{Y}_2 - \bar{Y}_4)^2}{2MS_W}. \qquad (4.19)$$
Further, with equal n, $MS_W$ is the unweighted average of the within-group variances:
$$MS_W = \frac{\sum_{j=1}^{a} s_j^2}{a} = \frac{96 + 112 + 94 + 98}{4} = 100. \qquad (3.56)$$
Substituting $n = 25$, $\bar{Y}_2 = 46$, $\bar{Y}_4 = 54$, and $MS_W = 100$ into Equation 4.19 yields
$$F = \frac{25(46 - 54)^2}{2(100)} = 8.00.$$
For these data, from Appendix Table 4, the Tukey critical value equals (3.74)2/2, or 6.99. The
observed F exceeds the critical F, so the mean difference can be declared statistically significant
with Tukey’s method. This exercise illustrates the fact that Tukey’s method is more powerful than
Scheffé’s method for testing pairwise comparisons.
5. a. Using MSW:
We saw in Chapter 4 that the observed F equals 2.00 for this comparison (see the answer to Problem
13(a) in Chapter 4). Also, using a pooled error term, there are 16 denominator degrees of freedom.
The critical value for testing all pairwise comparisons is given by
$$F_{\text{crit}} = \frac{(q_{.05;\,a,\,df})^2}{2}.$$
For these data, from Appendix Table 4, $q_{.05;\,4,16} = 4.05$, so the appropriate Tukey critical value is $(4.05)^2/2 = 8.20$. Because the observed F value is less than the critical F, the contrast is nonsignificant.
Using a Separate Error Term:
We saw in the Chapter 4 answers that the observed F now equals 1.11, and there are now only eight denominator degrees of freedom. The critical value is now given by $q_{.05;\,4,8} = 4.53$, so the appropriate Tukey critical value is $(4.53)^2/2 = 10.26$.
The observed F value is now lower than it was with MSW as the error term, and the critical value is
now higher. Thus, the contrast is also nonsignificant using a separate error term.
b. Using MSW:
We saw in the Chapter 4 answers that the observed F value for this contrast is 2.00. Again, there
are 16 denominator degrees of freedom, so the Tukey critical value is 8.20, and this contrast is also
nonsignificant.
Using a Separate Error Term:
We saw in the Chapter 4 answers that the observed F value is now 10.00. With eight denominator
degrees of freedom, the Tukey critical value is again 10.26, so the contrast just misses being statisti-
cally significant.
c. The separate error term seems more appropriate, given the wide disparity in sample values. In par-
ticular, there is more evidence that µ1 and µ2 are different from one another than that µ3 and µ4 are.
The separate error term reflects this fact, unlike the pooled error term, which regards these two dif-
ferences as equally significant.
d. This exercise illustrates the fact that a separate error term can increase statistical power. Research-
ers’ beliefs that a separate error term will necessarily reduce power may result from the fact that the
separate error term will always have a larger critical value than a pooled error term because of lower
degrees of freedom. However, the separate error term may also have a smaller denominator, which
can more than compensate for the larger critical value. We should also acknowledge that the separate
error term might be larger than the pooled error term, resulting in a smaller observed test statistic, so
there is no guarantee that the separate error term will yield greater power.
7. a. It is possible to perform the test of the omnibus null hypothesis. Because this set of contrasts is
orthogonal, the sums of squares attributable to the three contrasts are additive. As a result, the
between-group sum of squares is given by
SS_B = SS_ψ1 + SS_ψ2 + SS_ψ3, where each contrast's sum of squares is
SS_ψ = (ψ̂)² / ∑_{j=1}^{a}(c_j²/n_j),
and each individual contrast is tested with
F = (ψ̂)² / [MS_W ∑_{j=1}^{a}(c_j²/n_j)]. (4.32)
c. The omnibus observed F value of part (a) equaled 5.00. The three observed F values of part
(b) equaled 3.00, 7.00, and 5.00. Thus, the omnibus F value equals the average (i.e., the mean) of the
F values for the three contrasts. In general, the omnibus F value can be conceptualized as an average
F value, averaging over a set of a – 1 orthogonal contrasts. That this is true can be seen from the
following algebraic equivalence:
F_omnibus = MS_B / MS_W
= [SS_B/(a − 1)] / MS_W
= [∑_{j=1}^{a−1} SS_ψj / (a − 1)] / MS_W   (where the contrasts are orthogonal)
= ∑_{j=1}^{a−1} (SS_ψj / MS_W) / (a − 1)
= ∑_{j=1}^{a−1} F_ψj / (a − 1).
For a related perspective, see Exercise 9 at the end of Chapter 3 and Exercise 18 at the end of Chapter 4.
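The identity can be illustrated with one line of Python using the F values quoted in this answer.

# The omnibus F equals the mean of the F values for a full set of a - 1 orthogonal contrasts.
contrast_f_values = [3.00, 7.00, 5.00]
omnibus_f = sum(contrast_f_values) / len(contrast_f_values)
print(omnibus_f)   # 5.0, matching the omnibus F of part (a)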
11. a. In general, the value of the observed F is
F = [(E_R − E_F)/(df_R − df_F)] / (E_F/df_F).
With equal n, this becomes
F = [n∑_{j=1}^{a}(Ȳ_j − Ȳ)²/(a − 1)] / [∑_{j=1}^{a} s_j²/a].
For these data, n = 11, Y1 = 10, Y2 = 10, Y3 = 22, Y = 14, a = 3, s12 = 100, s22 = 196, and s32 = 154.
Substituting these values into the expression for the F statistic yields
F = [11(16 + 16 + 64)/2] / 150 = 528/150 = 3.52.
From Appendix Table 2, we find that the critical F value for 2 numerator and 30 denominator degrees
of freedom at the .05 level is 3.32. Thus, the professor is correct that the null hypothesis can be
rejected.
b. With equal n, Equation 4.19 can be used to test pairwise comparisons. For example, the F statistic
for comparing the means of groups 1 and 3 is given by
n(Y1 − Y3 ) 2
F= .
2 MS W
For these data, n = 11, Y1 = 10, Y3 = 22, and MSW = 150 (from part (a)). Thus, the observed F equals
11(10 − 22) 2
F=
2(150)
= 5.28.
From Appendix Table 4, we find that the critical value for 3 groups, 30 denominator degrees of
freedom, and αEW = .05 is 3.49. The corresponding critical F value is (3.49)²/2, or 6.09. Thus, the
means of groups 1 and 3 are not significantly different from one another. The exact same conclusion
applies to a comparison of groups 2 and 3, and the difference between groups 1 and 2 is obviously
nonsignificant. Thus, the professor is correct that none of the pairwise differences are significant.
c. As we pointed out in the chapter, a statistically significant omnibus test result does not necessar-
ily imply that a significant pairwise difference can be found. Instead, a significant omnibus result
implies that there is at least one comparison that will be declared significant with Scheffé’s method.
However, that comparison need not be pairwise. Indeed, using a pooled error term, in this numerical
example, the contrast that produces Fmaximum is a complex comparison with coefficients of 1, 1, and
–2. This comparison is statistically significant, even using Scheffé’s method to maintain αEW = .05
for all possible contrasts.
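A short Python sketch, using only the summary statistics of part (a), verifies this claim about the 1, 1, −2 contrast under Scheffé's criterion.

import scipy.stats as st

# Three groups of n = 11 with means 10, 10, 22 and MSW = 150, as in part (a).
n, means, ms_within = 11, [10, 10, 22], 150.0
a = len(means)
df_within = a * (n - 1)                      # 30

coefs = [1, 1, -2]                           # the complex comparison named above
psi_hat = sum(c * m for c, m in zip(coefs, means))          # -24
f_contrast = psi_hat**2 / (ms_within * sum(c**2 / n for c in coefs))

# Scheffe criterion for any contrast: (a - 1) times the omnibus critical F
scheffe_crit = (a - 1) * st.f.ppf(0.95, a - 1, df_within)

print(round(f_contrast, 2), round(scheffe_crit, 2))
# About 7.04 versus about 6.63 (6.64 using the tabled critical F of 3.32), so the
# complex contrast is significant even though no pairwise contrast reaches the Tukey criterion.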
6
Trend Analysis
2. a. With equal n,
SS_linear = E_R − E_F = n(ψ̂_linear)² / ∑_{j=1}^{a} c_j². (6.11)
The linear contrast coefficients are c₁ = −3, c₂ = −1, c₃ = 1, and c₄ = 3 (see Appendix Table 10). For these means, ψ̂_linear = 100, and
∑_{j=1}^{a} c_j² = (−3)² + (−1)² + (1)² + (3)² = 20.
Thus,
SS_linear = 10(100)²/20 = 5000.
b.
SS_quadratic = n(ψ̂_quadratic)² / ∑_{j=1}^{a} c_j².
From Appendix Table 10, the quadratic contrast coefficients are c₁ = 1, c₂ = −1, c₃ = −1, and c₄ = 1. For these means, ψ̂_quadratic = 0, and
∑_{j=1}^{a} c_j² = (1)² + (−1)² + (−1)² + (1)² = 4.
Thus,
SS_quadratic = 10(0)²/4 = 0.
c.
SS_cubic = n(ψ̂_cubic)² / ∑_{j=1}^{a} c_j².
From Appendix Table 10, the cubic contrast coefficients are c₁ = −1, c₂ = 3, c₃ = −3, and c₄ = 1. For these means, ψ̂_cubic = 0, and
∑_{j=1}^{a} c_j² = (−1)² + (3)² + (−3)² + (1)² = 20.
Thus,
SS_cubic = 10(0)²/20 = 0.
d. Yes, Figure 6.3(a) reflects a pure linear trend, because SSlinear is nonzero but SSquadratic and SScubic are
both zero (that is, all the other a – 1 trends are zero).
6. a.
F = SS_linear / MS_W.
We are told that MS_W = 150, but we need to calculate SS_linear, which we can do from
SS_linear = n(ψ̂_linear)² / ∑_{j=1}^{a} c_j².
For five groups, the linear contrast coefficients are c1 = –2, c2 = –1, c3 = 0, c4 = 1, and c5 = 2 (see
Appendix Table 10). For these means, ψ̂_linear = −2(80) − 1(83) + 0(87) + 1(89) + 2(91) = 28, and
∑_{j=1}^{a} c_j² = (−2)² + (−1)² + (0)² + (1)² + (2)² = 10.
Thus,
SS_linear = 15(28)²/10 = 1176.
The value of the F test statistic for the linear trend then equals
F = SS_linear / MS_W = 1176/150 = 7.84.
The critical F value has 1 and 70 degrees of freedom. From Appendix Table 2, we find that the criti-
cal F with 1 and 60 degrees of freedom (rounding downward) is 4.00 at the .05 level. Thus, the linear
trend is statistically significant at the .05 level.
b. To test the omnibus null hypothesis, we must calculate an observed F of the form
F = MS_B / MS_W.
We are told that MSW = 150, but we must find MSB, which is given by
MS_B = SS_B/(a − 1)
= n∑_{j=1}^{a}(Ȳ_j − Ȳ)²/(a − 1)
= 15[(80 − 86)² + (83 − 86)² + (87 − 86)² + (89 − 86)² + (91 − 86)²]/4
= 1200/4
= 300.
F = MS_B / MS_W = 300/150 = 2.00.
The critical F value has 4 and 70 degrees of freedom. From Appendix Table 2, we find that the
critical F with 4 and 60 degrees of freedom (rounding downward) is 2.53 at the .05 level. Thus, the
omnibus null hypothesis cannot be rejected.
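The contrast between the two tests in parts (a) and (b) can be reproduced with a short Python sketch built only from the group means, n, and MS_W given in this problem.

# Linear trend versus omnibus test for means 80, 83, 87, 89, 91 with n = 15 and MSW = 150.
means = [80, 83, 87, 89, 91]
n, ms_within = 15, 150.0
a = len(means)

coefs = [-2, -1, 0, 1, 2]                                   # linear coefficients, Appendix Table 10
psi_hat = sum(c * m for c, m in zip(coefs, means))          # 28
ss_linear = n * psi_hat**2 / sum(c**2 for c in coefs)       # 1176
f_linear = ss_linear / ms_within                            # 7.84

grand_mean = sum(means) / a
ss_between = n * sum((m - grand_mean) ** 2 for m in means)  # 1200
f_omnibus = (ss_between / (a - 1)) / ms_within              # 2.00

print(f"F(linear) = {f_linear:.2f}, F(omnibus) = {f_omnibus:.2f}")
# 7.84 versus 2.00: the focused 1-df trend test detects what the 4-df omnibus test misses.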
c. The F statistic for the linear trend is
F = SS_linear / MS_W,
whereas the omnibus F statistic is
F = [SS_between/4] / MS_W.
For these data, the linear trend accounts for 98% of the between-group sum of squares (that is, 1176
out of 1,200), so that SSlinear is almost as large as SSbetween. However, the linear trend is based on 1
degree of freedom, whereas the omnibus test is based on 4 degrees of freedom. In other words, with
one parameter, the linear trend model decreases the sum of squared errors by 1,176 relative to a
restricted model of the form
Yij = µ + εij .
On the other hand, the “cell means” model (that is, the full model of the omnibus test) decreases the
sum of squared errors by 1,200 relative to the same restricted model, but requires three more param-
eters than does the restricted model to accomplish this reduction. As a consequence, although SSlinear
is slightly smaller than SSbetween, MSlinear is almost four times larger than MSbetween. The same ratio
applies to the observed F values.
d. If in fact the true difference in population means is entirely linear, the observed F value for the linear
trend will likely be appreciably larger than the omnibus F value. Thus, statistical power is substan-
tially increased in this situation by planning to test the linear trend. Of course, if the true trend is
nonlinear, the planned test of the linear trend may be sorely lacking in power. This same effect may
occur for any planned comparison—not just for a linear trend.
e. Yes, because the omnibus test need not even be performed when planned comparisons have been
formulated.
9. The estimated slope of 2.35 is not accurate. Using Equation 7 to calculate the slope requires that linear
trend coefficients be defined as
c_j = X_j − X̄, (6.8)
and the slope is then estimated as
β̂₁ = ψ̂_linear / ∑_{j=1}^{a} c_j². (6.7)
Thus, the correct estimated slope is .78, which is one-third (within rounding error) of the claimed value
of 2.35. Now, with a slope of .78, we would expect 10-year-olds to outperform 4-year-olds by approxi-
mately 4.68 units, which is virtually identical to the observed difference of 4.70 units.
11. a. The linear trend coefficients shown in Appendix Table 10 for testing a linear trend with four groups
are –3, –1, 1, and 3. However, Equation 7 requires that the contrast coefficients be defined as
cj = X j − X ,
b. The test of the linear trend requires
SS_linear = n(ψ̂_linear)² / ∑_{j=1}^{a} c_j² (see 6.11)
= 5(−2.60)²/5.00
= 6.760,
and
MS_W = ∑_{j=1}^{a} s_j²/a = .283, (3.56)
so the observed F is 6.760/.283 = 23.89.
From Appendix Table 2, the critical F value with 1 and 16 degrees of freedom is 4.49, so the linear
trend is statistically significant at the .05 level.
c. The least squares estimate of the slope parameter is β̂₁ = −.52, which is identical to the value obtained
in part (a).
d. The F value of 7.28 (which is the square of t = –2.698) is considerably less than the F value of 23.89
obtained in part (b).
e. Yes, in both cases, the sum of squares attributable to the linear trend equals 6.760. Thus, the value of
ER – EF is the same in the two analyses.
f. The denominator of the F statistic for testing the linear contrast in part (b) was MSW, which has
a value of .283 for these data. Because dfF = 16 for this analysis, the corresponding error sum of
squares is 4.528 (16 × .283). However, the error sum of squares for the regression approach is
16.720, obviously a much larger value. The associated degrees of freedom equal 18, so the denomi-
nator of the F statistic using this approach is .929 (16.720 ÷ 18). The larger denominator of the
regression approach (relative to testing the contrast, as in part (b)) produces a lower F value.
g. As stated in the text, the “cell means” model is mathematically equivalent to a model that includes
all a – 1 trends, such as
when a = 4. However, the full model of the regression approach excludes the quadratic and cubic
terms. As a result, any quadratic and cubic effects contribute to the error of this model. Specifically,
Thus,
7
Two-Way Between-Subjects
Factorial Designs
α j = µ j ⋅ − µ.. . (7.10)
Similarly,
µ₂⋅ = (12 + 12 + 12)/3 = 12
and
µ⋅⋅ = ∑_{k=1}^{b} ∑_{j=1}^{a} µ_jk / ab
= ∑_{k=1}^{3} ∑_{j=1}^{3} µ_jk / 9 (7.9)
= (10 + 12 + 17 + 10 + 12 + 17 + 10 + 12 + 17)/9
= 13.
α1 = µ1⋅ − µ⋅⋅
= 10 − 13
= −3,
α2 = µ2⋅ − µ⋅⋅
= 12 − 13
= −1,
α3 = µ3⋅ − µ⋅⋅
= 17 − 13
= 4.
(2) The β parameters are obtained in a similar manner, except that now we must focus on column
marginal means, instead of row marginal means:
Thus, the values of the column main effect β parameters are given by
β1 = µ⋅1 − µ⋅⋅
= 13 − 13
= 0,
β2 = µ⋅2 − µ⋅⋅
= 13 − 13
= 0,
β3 = µ⋅3 − µ⋅⋅
= 13 − 13
= 0.
(3) The interaction parameters are defined in terms of the cell means and the other effect parameters:
For example,
(4) Only the A main effect is nonzero in the population. For these data, simple visual inspection of
the cell means shows that the rows differ from one another but the columns do not. In addition,
the row differences are identical in each column. The αj, βk, and (αβ)jk parameter values confirm
that the B main effect and AB interactions are null in the population.
c. (1) From Equation 10, α j is defined to be
α j = µ j ⋅ − µ⋅⋅ . (7.10)
and
µ⋅⋅ = ∑_{k=1}^{b} ∑_{j=1}^{a} µ_jk / ab
= ∑_{k=1}^{3} ∑_{j=1}^{3} µ_jk / 9
= (26 + 22 + 21 + 23 + 19 + 18 + 17 + 13 + 12)/9
= 19.
α1 = µ1⋅ − µ⋅⋅
= 23 − 19
= 4,
α2 = µ2⋅ − µ⋅⋅
= 20 − 19
= 1,
α3 = µ3⋅ − µ⋅⋅
= 14 − 19
= − 5.
(2) The β parameters are obtained in a similar manner, except that now we must focus on column
marginal means, instead of row marginal means:
Thus, the values of the column main effect β parameters are given by
β1 = µ⋅1 − µ⋅⋅
= 22 − 19
= 3,
β2 = µ⋅2 − µ⋅⋅
= 18 − 19
= −1,
β3 = µ⋅3 − µ⋅⋅
= 17 − 19
= −2.
(3) The interaction parameters are defined in terms of the cell means and the other effect parameters:
For example,
For these data, it turns out that all nine interaction parameters have a value of zero.
(4) The A main effect and the B main effect are nonzero in the population. The nonzero αj and
βk parameters corroborate the visual impression that rows differ from one another and so do
columns. However, the row differences are the same in all three columns (or, conversely, the
column differences are the same in all three rows), which is why the interaction is null in the
population.
4. a. From Equation 16, the general form of an estimated row main effect parameter is
αˆ j = Y j⋅ − Y.. . (7.16)
b. SS_A = ∑_{all obs} α̂_j².
Notice that with equal n, a total of n b subjects have an α̂ j value of α̂1 , another n b subjects have an
α̂ j value of α̂2 , and so forth. For these data, then, 24 subjects have an α̂ j value of α̂1, 24 have α̂2 ,
and 24 have α̂3 . Thus,
SSA = 24αˆ12 + 24αˆ 22 + 24αˆ 32
= 24(0)2 + 24(0)2 + 24(0)2
= 0.
c. From Equation 17, the general form of an estimated column main effect parameter is
β̂_k = Ȳ_⋅k − Ȳ_⋅⋅. (7.17)
d. SS_B = ∑_{all obs} β̂_k².
As in part (b), 24 subjects have a β̂k value of β̂1 , 24 have β̂2 , and 24 have β̂3 . Thus,
SS_B = 24β̂₁² + 24β̂₂² + 24β̂₃²
= 24(−1)² + 24(0)² + 24(1)²
= 48.
e. The estimated interaction parameters are given by
(αβ̂)_jk = Ȳ_jk − (Ȳ_j⋅ + Ȳ_⋅k − Ȳ_⋅⋅).
For these data,
(αβ̂)₁₁ = 8 − (11 + 10 − 11) = −2,
(αβ̂)₁₂ = 10 − (11 + 11 − 11) = −1,
(αβ̂)₁₃ = 15 − (11 + 12 − 11) = 3,
(αβ̂)₂₁ = 9 − (11 + 10 − 11) = −1,
(αβ̂)₂₂ = 14 − (11 + 11 − 11) = 3,
(αβ̂)₂₃ = 10 − (11 + 12 − 11) = −2,
(αβ̂)₃₁ = 13 − (11 + 10 − 11) = 3,
(αβ̂)₃₂ = 9 − (11 + 11 − 11) = −2,
(αβ̂)₃₃ = 11 − (11 + 12 − 11) = −1.
f. SS_AB = ∑_{all obs} (αβ̂_jk)².
In general, with equal n, a total of n subjects will have an (αβ̂)_jk value of (αβ̂)₁₁, n will have a value of (αβ̂)₁₂, and so forth. Thus,
SS_AB = 8[(−2)² + (−1)² + (3)² + (−1)² + (3)² + (−2)² + (3)² + (−2)² + (−1)²]
= 336.
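The whole decomposition can be checked with a short Python sketch. The 3 × 3 table of cell means below is reconstructed from the computations just shown, and the per-cell n of 8 comes from the SS_AB expression; the value printed for SS_B is simply what these cell means imply.

import numpy as np

# Effect estimates and sums of squares for this exercise from the cell means.
cell_means = np.array([[ 8, 10, 15],
                       [ 9, 14, 10],
                       [13,  9, 11]], dtype=float)
n = 8

grand = cell_means.mean()
alpha = cell_means.mean(axis=1) - grand                  # row (A) effects: all zero
beta = cell_means.mean(axis=0) - grand                   # column (B) effects: -1, 0, 1
interaction = (cell_means - grand) - alpha[:, None] - beta[None, :]

ss_a = n * cell_means.shape[1] * np.sum(alpha**2)        # 0
ss_b = n * cell_means.shape[0] * np.sum(beta**2)         # 48, implied by these cell means
ss_ab = n * np.sum(interaction**2)                       # 336

print(ss_a, ss_b, ss_ab)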
5. a.
SS_A = E_R − E_F = nb∑_{j=1}^{a}(Ȳ_j⋅ − Ȳ_⋅⋅)². (7.25)
b.
SS_A = E_R − E_F = n∑_{j=1}^{a}(Ȳ_j − Ȳ)². (7.26)
c. The answers to parts (a) and (b) are the same. This equivalence provides an empirical demonstra-
tion of the assertion in Chapter 7 that in equal-n designs, the sum of squares for the A main effect
in a factorial design equals the sum of squares due to A in a single factor design when the data are
analyzed as if the B factor never existed.
φ = .40√n.
Assuming n = 10,
φ = .40√10 = 1.26.
The chart for dfnumerator = 1 must be consulted. Also, notice that dfdenominator = 18. With α = .05, the
power appears to be approximately .40.
b. With five groups and n = 10, φ again equals 1.26. Now, however, the chart for dfnumerator = 4 must be
consulted with dfdenominator = 45. With α = .05, the power appears to be approximately .53.
c. In the factorial design, φ is calculated from
φ = .40 √[df_denominator/(df_effect + 1) + 1].
For a main effect in a 2 × 2 design with n = 10, df_denominator = 36 and df_effect = 1, so
φ = .40 √(36/2 + 1) = 1.74.
The chart for dfnumerator = 1 must be consulted with dfdenominator = 36. With α = .05, the power appears
to be approximately .67.
d. The power will be the same as in part (c), because dfeffect = 1 for the interaction effect in a 2 × 2 design
(see Equation 7.34).
e. Once again, φ is calculated from
φ = .40 √[df_denominator/(df_effect + 1) + 1].
For a main effect in a 3 × 3 design with n = 10, df_denominator = 81 and df_effect = 2, so
φ = .40 √[81/(2 + 1) + 1] = 2.12.
The chart for dfnumerator = 2 must be consulted with dfdenominator = 81. With α = .05, the power appears
to be approximately .91.
f. Yet again, φ is calculated from
φ = .40 √[df_denominator/(df_effect + 1) + 1].
For the interaction in the same 3 × 3 design, df_effect = 4 and df_denominator = 81, so
φ = .40 √(81/5 + 1) = 1.66.
The chart for dfnumerator = 4 must be consulted with dfdenominator = 81. With α = .05, the power appears
to be approximately .85.
g. Only two of the six effects in parts (a) through (f) would have a power as high as .8 for detecting a
large effect with 10 subjects per cell.
h. Two comments are pertinent here. First, the power of a test is a function not only of the sample size
and the effect size (that is, small, medium, or large) but also of the type of design and the type of
effect to be tested in that design. Thus, n = 10 may be sufficient for some effects in some designs, but
not for others. Second, in many cases, n = 10 per cell may be too few subjects to have a power of .8
to detect even a large effect, much less a medium or small effect.
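The chart readings above can be checked against exact noncentral-F power. The sketch below assumes the usual Pearson–Hartley convention that the noncentrality parameter is λ = φ²(df_effect + 1); it is shown for parts (a) and (c), and the same function applies to the other parts.

from scipy.stats import f, ncf

def power(phi, df_effect, df_denom, alpha=0.05):
    # Exact power of the F test for a given Pearson-Hartley phi.
    nc = phi**2 * (df_effect + 1)
    f_crit = f.ppf(1 - alpha, df_effect, df_denom)
    return ncf.sf(f_crit, df_effect, df_denom, nc)

print(round(power(1.26, 1, 18), 2))   # part (a): about .40
print(round(power(1.74, 1, 36), 2))   # part (c): about .67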
11. a. Notice that this student has performed tests of the simple effect of therapy for females and males
separately. In each case, the sum of squares for therapy can be found from
SS_simple effect = n∑_{j=1}^{a}(Ȳ_j − Ȳ)², (7.52)
where the sample means refer to the means for the specific individuals under consideration. For
females, Y1 = 60, Y2 = 80, and n = 10. Thus, for females,
As a result,
F = (SS_therapy/df_therapy) / MS_W = (4000/1)/800 = 5.00.
As in part (a), the critical F with 1 and 36 degrees of freedom is approximately 4.17. Now, however, the
difference between the therapies is statistically significant at the .05 level. Incidentally, for later parts of
this problem, it is helpful to note that the interaction sum of squares for these data is exactly zero.
d. As we saw in Chapter 3, the t test can be regarded as a comparison of models of the following form:
Group membership is solely a function of form of therapy, so any effects due to sex appear in the
error term of both models. An F test to compare these models would be
F = [(E_R − E_F)/(df_R − df_F)] / (E_F/df_F).
In this situation,
df F = N − a
= 40 − 2
= 38
and
df R − df F = 1.
Further,
E_R − E_F = n∑_{j=1}^{a}(Ȳ_j − Ȳ)².
To calculate the observed F, we must find the sum of squared errors of the full model. However, the
errors of this model will include the within-cell errors of the 2 × 2 factorial design as well as any
effects due to sex. Specifically,
Thus,
The critical F value with 1 and 38 degrees of freedom is approximately 4.17 (see Appendix Table 2),
so the difference between the therapies is significant at the .05 level. Alternatively, as a t-test, the
observed t value of 2.15 exceeds the critical t of approximately 2.04.
e. Testing the therapy main effect in the 2 × 2 design produced the largest F value for these data, reflect-
ing the fact that this approach will often provide the most powerful test of the difference between the
therapies. Tests of simple effects are less powerful than the test of the main effect when there is no
interaction, so, generally speaking, main effects should be tested instead of simple effects when the
interaction is nonsignificant (see Figure 7.2). In general, it is true that
SS A within B1 + SS A within B2 = SS A + SS AB .
In our example,
Specifically,
Because there is no interaction here, the sum of squares for each simple effect is only one-half as
large as the sum of squares for the main effect. The same ratio occurs for the F values, making the
main effect test considerably more powerful than the simple effects tests. It should also be noted that
the test of part (d) will be less powerful than the main effect test to the extent that the other factor
(in this case, sex) has any effect on the dependent variable. This tendency is illustrated in these data,
where the observed F value for the therapy main effect is 5.00, but the observed F corresponding to
the t-test approach is 4.63.
17. a. Unweighted marginal means would reflect personality type effects for individuals at a particular
stress level. The unweighted row marginal means for these data are
Ȳ₁⋅(U) = ∑_{k=1}^{b} Ȳ₁k / b = (170 + 150)/2 = 160,
Ȳ₂⋅(U) = ∑_{k=1}^{b} Ȳ₂k / b = (140 + 120)/2 = 130.
b. If the effect of stress is not taken into account, personality type effects are reflected in weighted
marginal means. The weighted row marginal means for these data are
Ȳ₁⋅(W) = ∑_{k=1}^{b} n₁k Ȳ₁k / ∑_{k=1}^{b} n₁k,
Ȳ₂⋅(W) = ∑_{k=1}^{b} n₂k Ȳ₂k / ∑_{k=1}^{b} n₂k.
Thus, the estimated magnitude of the mean blood pressure difference between personality types,
ignoring level of stress, is 38 units. We saw in part (a) that the comparable difference when the effect
of stress is taken into account is 30 units. Thus, taking the effect of stress into account lowers the esti-
mated difference between personality types. The reason is that Type A individuals are predominantly
found in high-stress environments, while Type B’s are more likely to be in low-stress environments,
so some of the 38-unit difference found overall between A’s and B’s may reflect differences in their
environments.
20. a. Table 5.17 (in Chapter 5) provides formulas for forming a confidence interval for a contrast. We can
conceptualize the current problem as one of planning to test a single contrast, so that C = 1. In this
situation, Table 5.17 shows that a 95% confidence interval for ψ has the form
ψ̂ ± √(F_.05; 1, N−a) √[MS_W ∑_{j=1}^{a}(c_j²/n_j)].
Then,
Substituting these values, along with MSW = 19 (which we were given in the problem), into the
formula for the confidence interval yields
which reduces to
4.00 ± 3.93.
Equivalently, we can be 95% confident that the population difference in unweighted marginal means
is between 0.07 and 7.93. Given an equal number of females and males, we are 95% confident that
CBT is between 0.07 and 7.93 points better than CCT.
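The ±3.93 margin can be reproduced with the short Python sketch below. It assumes the pairing of cell sizes with cells implied by part (b) (CBT: 6 females and 4 males; CCT: 8 females and 5 males), although with all coefficients equal to ±1/2 the margin does not depend on that pairing.

from math import sqrt
from scipy.stats import f

coefs = [0.5, 0.5, -0.5, -0.5]      # unweighted (Type III) contrast of CBT versus CCT
cell_ns = [6, 4, 8, 5]              # CBT-female, CBT-male, CCT-female, CCT-male
ms_within = 19.0

n_total, n_cells = sum(cell_ns), len(cell_ns)
f_crit = f.ppf(0.95, 1, n_total - n_cells)          # F(.05; 1, 19)
margin = sqrt(f_crit * ms_within * sum(c**2 / n for c, n in zip(coefs, cell_ns)))

print(round(margin, 2))   # about 3.93, so the interval is 4.00 +/- 3.93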
b. We can use the same formula as in part (a), but the contrast coefficients are now defined to be
c₁ = (n₁₁n₂₁/n₊₁) / (n₁₁n₂₁/n₊₁ + n₁₂n₂₂/n₊₂)
= [(6)(8)/14] / [(6)(8)/14 + (4)(5)/9]
= .61,
c₃ = −(n₁₁n₂₁/n₊₁) / (n₁₁n₂₁/n₊₁ + n₁₂n₂₂/n₊₂)
= −[(6)(8)/14] / [(6)(8)/14 + (4)(5)/9]
= −.61,
Substituting these values along with MSW = 19 into the formula for the confidence interval yields
which reduces to
4.00 ± 3.84.
We can be 95% confident that CBT is between 0.16 and 7.84 points better than CCT in the popula-
tion, if we are willing to assume that the difference is the same for females as for males.
c. The contrast corresponding to the Type II sum of squares can be estimated slightly more precisely
than the contrast corresponding to the Type III sum of squares. This advantage is the reason the Type II
sum of squares is preferable to Type III if there is known to be no interaction in the population.
d. Once again, the confidence interval has the form
ψ̂ ± √(F_.05; 1, N−a) √[MS_W ∑_{j=1}^{a}(c_j²/n_j)].
ψ̂ is given by
Substituting these values, along with MSW = 19, into the formula for the confidence interval yields
Thus, we can be 95% confident that CBT is between 0.75 units worse and 8.75 units better than CCT.
e. The interval computed in part (d) is considerably wider than the intervals we found in parts (a) and
(b). In particular, based on the equal-n approach, we could not confidently rule out the possibility
that CBT is worse than CCT, as we could with the two nonorthogonal approaches. Randomly delet-
ing observations decreases precision and hence lowers the power to detect a true effect.
8
Higher-Order Between-Subjects
Factorial Designs
There is a Drug main effect in the population, since mean blood pressure is lower when the Drug is
present than when it is absent.
8. The correct answer is (c). Notice that the contrast coefficients for ψ represent an AB interaction, because
the A effect at B1 is compared to the A effect at B2 (see Equation 7.1 for a reminder). Specifically, ψ
equals the difference between A1 and A2 at B1 minus the difference between A1 and A2 at B2. Thus, the
fact that the estimated value of ψ at C1 is –8 implies that A1 minus A2 is smaller at B1 than at B2, for the
first level of C. However, because the estimated value of ψ at C2 is +8, A1 minus A2 is larger at B1 than at
B2, for the second level of C. As a result, the AB interaction at C1 differs from the AB interaction at C2,
suggesting the possibility of a three-way ABC interaction (see Table 8.8). In contrast, the AB interaction
would average ψ at C1 and ψ at C2 (see Table 8.8), resulting in a value of 0. Thus, there is no evidence
here for an AB interaction. Finally, it is impossible to tell whether the simple two-way interactions of A
and B at the two levels of C would be significant, without knowing the sample size and MS_W.
10. a. From Table 8.9,
α j = µ j ⋅⋅ − µ,
so
α̂ j = Y j⋅⋅ − Y⋅⋅⋅ .
so
F = [SS_A/(a − 1)] / MS_W.
From Table 8.11,
SS_A = ∑_{j=1}^{a} ∑_{k=1}^{b} ∑_{l=1}^{c} ∑_{i=1}^{n} α̂_j²
= bcn ∑_{j=1}^{a} α̂_j²
= (3)(2)(10)[(2.5)² + (−2.5)²]
= 750.0.
We know that a = 2, so a − 1 = 1, and
MS_W = SS_W/(N − abc) = 86,400/(120 − 12) = 800.
Thus, the observed F for the A main effect is (750.0/1)/800 = 0.94.
The critical F with 1 and 108 degrees of freedom is approximately 4.00 (see Appendix Table 2), so
the A main effect is nonsignificant at the .05 level.
The observed F value for the B main effect is given by
F = [SS_B/(b − 1)] / MS_W.
We know that b = 3, so b − 1 = 2, and we found that MS_W = 800. Thus, the F value for the B main effect is
F = (8000/2)/800 = 5.00.
The critical F with 2 and 108 degrees of freedom is approximately 3.15 (see Appendix Table 2), so
the B main effect is statistically significant at the .05 level.
c. The following plot is probably the clearest way to picture the three-way interaction:
This plot reveals that the AC interaction is the same at every level of B, so there is no evidence of a
three-way interaction. To see this explicitly, we will consider the magnitude of the AC interaction at
each level of B. Specifically, the following table shows the magnitude of the C1 mean minus the C2
mean for both A1 and A2, separately for each level of B:
Level of B   C1 − C2 at A1   C1 − C2 at A2   Difference between C1 − C2 at A1 and C1 − C2 at A2
B1                 5               35               −30
B2                15               45               −30
B3                −5               25               −30
As the rightmost column shows, the AC interaction is the same at each level of B, so there is no three-
way interaction.
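A trivial Python sketch of the same "difference of differences" check, built from the simple differences tabled above:

c1_minus_c2_at_a1 = {"B1": 5, "B2": 15, "B3": -5}
c1_minus_c2_at_a2 = {"B1": 35, "B2": 45, "B3": 25}

ac_interaction_at_b = {b: c1_minus_c2_at_a1[b] - c1_minus_c2_at_a2[b]
                       for b in c1_minus_c2_at_a1}
print(ac_interaction_at_b)   # {'B1': -30, 'B2': -30, 'B3': -30}
# The AC interaction is -30 at every level of B, so there is no three-way interaction.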
11. a.
ψC1 = µ111 − µ121 − µ211 + µ221
ψC 2 = µ112 − µ122 − µ212 + µ222
b.
ψB1 = µ111 − µ112 − µ211 + µ212
ψB 2 = µ121 − µ122 − µ221 + µ222
c.
ψA1 = µ111 − µ112 − µ121 + µ122
ψA 2 = µ211 − µ212 − µ221 + µ222
d. The contrast coefficients of parts (a), (b), and (c) are identical to one another. This implies that the
three interpretations of a three-way interaction are indeed equivalent to one another.
12. a. To find interaction contrasts, corresponding coefficients must be multiplied by one another.
AB:
1 1 0 0 –1 –1 –1 –1 0 0 1 1
0 0 1 1 –1 –1 0 0 –1 –1 1 1
AC:
1 –1 1 –1 1 –1 –1 1 –1 1 –1 1
BC:
1 –1 0 0 –1 1 1 –1 0 0 –1 1
0 0 1 –1 –1 1 0 0 1 –1 –1 1
The three-way interaction contrasts can be found in any of several equivalent ways. For example, the
AB contrast coefficients can be multiplied by the C contrast coefficients.
ABC:
1 –1 0 0 –1 1 –1 1 0 0 1 –1
0 0 1 –1 –1 1 0 0 –1 1 1 –1
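This "multiply corresponding coefficients" rule is a Kronecker product of main-effect contrast vectors, and the rows above can be regenerated that way. The sketch below assumes the 12 cells are ordered with A slowest, then B, then C, and uses main-effect coefficients (A: 1, −1; B: 1, 0, −1 and 0, 1, −1; C: 1, −1) inferred from the rows listed above.

import numpy as np

A  = np.array([1, -1])
B1 = np.array([1, 0, -1])
C  = np.array([1, -1])
onesA, onesB, onesC = np.ones(2, int), np.ones(3, int), np.ones(2, int)

ab1  = np.kron(np.kron(A, B1), onesC)     # 1 1 0 0 -1 -1 -1 -1 0 0 1 1
ac   = np.kron(np.kron(A, onesB), C)      # 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1
bc1  = np.kron(np.kron(onesA, B1), C)     # 1 -1 0 0 -1 1 1 -1 0 0 -1 1
abc1 = np.kron(np.kron(A, B1), C)         # 1 -1 0 0 -1 1 -1 1 0 0 1 -1

for name, row in [("AB(1)", ab1), ("AC", ac), ("BC(1)", bc1), ("ABC(1)", abc1)]:
    print(name, row)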
b. (i) AB at C1
1 0 0 0 –1 0 –1 0 0 0 1 0
0 0 1 0 –1 0 0 0 –1 0 1 0
AB at C2
0 1 0 0 0 –1 0 –1 0 0 0 1
0 0 0 1 0 –1 0 0 0 –1 0 1
SS AB at C 1 + SS AB at C 2 = SS AB + SS ABC .
Thus, the four contrasts for AB at C1 and AB at C2 would replace the four contrasts for AB and ABC.
9
Designs with Covariates
2. The primary considerations are (1) that the covariate correlate with the dependent variable and (2) that
the covariate be independent of the treatment factor(s). The first consideration is critical inasmuch as
the covariate is being used to reduce within-cell variability and the strength of the correlation deter-
mines the extent of error reduction. The second consideration is important for facilitating interpretation.
With the covariate and treatment factor independent, one is assured that the estimate of the treatment
effect in a randomized study is unbiased. When the treatment and covariate happen to be correlated,
one cannot generally know if the extent of the adjustment for differences on the covariate is too large,
too small, or just right.
Two other, secondary considerations that make a covariate desirable are (1) the covariate can be
obtained easily and economically, and (2) the process of obtaining the covariate scores does not affect
the scores on the dependent variable. The latter possible effect of “testing,” if it occurs, can limit the
external validity of the study but does not threaten the internal validity of the research.
4. a. As indicated in Equations 9.1 and 9.2, the models being compared are
Full: Yij = µ + α j + β X ij + εij
Restricted: Yij = µ + β X ij + εij
b.
As shown in the preceding plot, there is a very strong positive relationship between the pretest and
posttest within each of the two groups. Further, the group with the higher mean on the pretest also
has the higher mean on the posttest. Thus, the data follow the pattern discussed in the text (see Fig-
ure 9.5(a)) where an apparent treatment effect is due primarily to preexisting differences. It appears
in fact that a single regression line would fit all the data nearly as well as separate regression lines
for the two groups. Because group membership does not add much to the pretest as a predictor of
posttest scores, it appears that the ANCOVA test of the treatment effect would not be significant.
c. Designating Groups C and T as groups 1 and 2, and the pretest and posttest as variables X and Y,
respectively, we may determine the slope for a group by using Equation 9.11. That is, for each group,
we compute the ratio of the sum of cross products of deviations from the variables’ means to the sum
of squared deviations on the pretest.
Group C
X_i1   X_i1 − X̄₁   Y_i1   Y_i1 − Ȳ₁   (X_i1 − X̄₁)(Y_i1 − Ȳ₁)   (X_i1 − X̄₁)²
 1        −1         5        0                 0                    1
 3         1         8        3                 3                    1
 3         1         7        2                 2                    1
 1        −1         2       −3                 3                    1
 2         0         3       −2                 0                    0
X̄₁ = 2              Ȳ₁ = 5                    Σ = 8                 Σ = 4
b₁ = ∑_i(X_i1 − X̄₁)(Y_i1 − Ȳ₁) / ∑_i(X_i1 − X̄₁)² = 8/4 = 2.
Group T
X_i2   X_i2 − X̄₂   Y_i2   Y_i2 − Ȳ₂   (X_i2 − X̄₂)(Y_i2 − Ȳ₂)   (X_i2 − X̄₂)²
 5        −1        14        0                 0                    1
 7         1        17        3                 3                    1
 7         1        16        2                 2                    1
 5        −1        11       −3                 3                    1
 6         0        12       −2                 0                    0
X̄₂ = 6              Ȳ₂ = 14                   Σ = 8                 Σ = 4
Since the deviations from the mean in Group T are identical here to those in Group C, the slopes necessarily are the same:
b₂ = ∑_i(X_i2 − X̄₂)(Y_i2 − Ȳ₂) / ∑_i(X_i2 − X̄₂)² = 8/4 = 2.
Although normally these slopes would differ, in this particular example, because b1 = b2 = 2, it fol-
lows that their weighted average bW must also be 2 (see Equation 9.12). Thus, the intercept for each
group’s regression line in this case will be the same as the intercept computed using the common
within-group slope—that is,
a1 = Y1 − b1 X 1 = Y1 − bW X 1 = 5 − 2(2) = 5 − 4 = 1,
a2 = Y2 − b2 X 2 = Y2 − bW X 2 = 14 − 2(6) = 14 − 12 = 2.
The pooled within-group slope we have computed is the estimate of the population slope β in the full
model, and the intercepts are the estimates of the combination of parameters μ + αj for that group (see
Equation 9.18)—that is,
βˆ = bW = 2,
µˆ + αˆ1 = a1 = 1,
µˆ + αˆ 2 = a2 = 2.
Note here that μ is the mean of the intercepts, so µ̂ = (1+ 2)/2 = 1.5 and that αj is the effect of the
treatment as indicated by the vertical displacement of the regression line. Here, group 1 results in the
regression line’s intercept being .5 units lower than the average of the intercepts, and group 2 results
in its regression line’s intercept being .5 units above the average of the intercepts—that is, α̂1 = –.5
and α̂2 = +.5.
d. We can use our parameter estimates to form the prediction equation for our full model:
That is,
Yˆ1 = a1 + bW X ij = 1 + 2X ij ,
Yˆ2 = a2 + bW X ij = 2 + 2X ij .
Substituting the observed values of X, we obtain the following predictions and errors of prediction,
eij = Yij – Yˆij :
X Y Yˆ e e2
1 5 3 2 4
3 8 7 1 1
3 7 7 0 0
1 2 3 –1 1
2 3 5 –2 4
5 14 12 2 4
7 17 16 1 1
7 16 16 0 0
5 11 12 –1 1
6 12 14 –2 4
Σ = 20 = EF
e. As indicated in Equation 3, the predictions of the restricted model are just a linear transformation of
the X scores:
Yˆij = µˆ + βˆ X ij . (9.3)
Thus, the overall correlation between Yij and Xij will be identical to the correlation between Yij and
the prediction of the restricted model. This implies that the proportion of variance accounted for by
the restricted model is
R²_R = r²_XY = (.95905)² = .91978.
f. We can readily perform the ANCOVA test of treatment effects, using these results for the errors of
our models and dfR = N – 2 = 10 – 2 = 8 and dfF = N – (a + 1) = 10 – 3 = 7. Thus our test statistic is
Clearly, this is nonsignificant, and we conclude we cannot reject the null hypothesis of no treatment
effect here, once we take the pretest scores into account.
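The model comparison can be carried out directly on the pretest and posttest scores listed above; a minimal Python sketch follows. It fits the full model (separate intercepts, common slope) and the restricted model (single intercept and slope) by least squares and forms the F statistic from their error sums of squares.

import numpy as np

x = np.array([1, 3, 3, 1, 2,  5, 7, 7, 5, 6], dtype=float)      # pretest
y = np.array([5, 8, 7, 2, 3, 14, 17, 16, 11, 12], dtype=float)  # posttest
group = np.array([0]*5 + [1]*5)                                  # 0 = C, 1 = T

def sse(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

full = np.column_stack([group == 0, group == 1, x]).astype(float)  # two intercepts + common slope
restricted = np.column_stack([np.ones_like(x), x])                 # one intercept + slope

e_f, e_r = sse(full, y), sse(restricted, y)
df_f, df_r = len(y) - 3, len(y) - 2
f_stat = ((e_r - e_f) / (df_r - df_f)) / (e_f / df_f)

print(round(e_f, 2), round(e_r, 2), round(f_stat, 2))
# E_F = 20.0 as found above; E_R is only slightly larger, so F is tiny and nonsignificant.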
9. Power is of course related to the absolute magnitude of the effect you are trying to detect, and this in
turn is indicated by the standard deviation of the population means (the numerator of the formula for
φ given in the problem). Given the population group means provided, the population grand mean is 20,
and we have the following standard deviation of means σm:
σ_m = √[∑_j(µ_j − µ)²/a] = √(∑_j α_j²/a) = 8.165.
This fixed characteristic of the population will be the same regardless of the method used to analyze
the data. In addition, the degrees of freedom are practically the same for the three approaches. The
numerator degrees of freedom are a – 1 = 3 – 1 = 2 in each case. The denominator degrees of free-
dom are N – a = 30 – 3 = 27 for the posttest-only and gain-score analyses, and N – a – 1 = 26 for
ANCOVA, which requires estimation of a slope parameter as well as a parameter for each of the a
groups.
What can vary across the analyses, depending on the correlation between the pretest and the post-
test, is the error variance, as indicated in the problem. The error variance in the posttest-only analy-
sis, σε2 = 202 = 400, is unaffected by this correlation, since information about the pretest is ignored.
However, the error variance in the other approaches can be quite different from that in the posttest-
only approach and in some cases can in fact be larger. For example, when ρ = 0, as in part (a), the error variance in the gain-score analysis is
σ²_post + σ²_pre − 2ρσ_pre σ_post = 400 + 400 − 0 = 800.
On the other hand, the error variance in ANCOVA will be no larger than that in the posttest-only analysis and will generally be smaller. For example, when ρ = .7, the error variance in ANCOVA is
σ²(1 − ρ²) = 400(1 − .49) = 204.
Carrying out these calculations for the error variances for the various analyses and values of the
pretest-posttest correlation yields the following values.
Given these error variances, we can readily calculate the values of φ that we need to determine power
using the Pearson-Hartley chart in Table A-11 for dfnum = 2. For example, for the posttest-only design
we have
φ = √(∑α_j²/a) / (σ_ε/√n) = 8.165/√(400/10) = 8.165/(20/3.162) = (.408)(3.162) = 1.291.
Going up vertically from the point on the horizontal axis corresponding to φ = 1.291 for α = .05 in
the chart, we see that the lines for dfdenom = 20 and dfdenom = 30 are around a height of .45 for this φ.
Thus, power is .45 in the posttest-only design for ρ = 0; and because σ_ε² is not affected by ρ for this
analysis, this is the power for all values of ρ.
In the gain-score analysis for ρ = 0, the inflated error variance results in a smaller φ and hence in
less power. Specifically,
φ = √(∑α_j²/a) / (σ_ε/√n) = 8.165/√(800/10) = 8.165/(28.284/3.162) = 8.165/8.944 = .913.
This value of φ is so small that it does not appear on the Pearson-Hartley charts for α = .05. However,
at the smallest value that does appear of φ = 1.0, the power for 30 denominator degrees of freedom
is only barely above .30. Thus, the power for this smaller value of φ with 27 denominator degrees of freedom would be even less, although projecting the power curve out to φ = .9 indicates that the power is still only a little below .3.
At the other extreme, when the correlation is .7 and an ANCOVA approach to analysis is used, the
relatively small value of σε2 translates into a large φ value and high power:
φ = √(∑α_j²/a) / (σ_ε/√n) = 8.165/√(204/10) = 8.165/(14.283/3.162) = 8.165/4.517 = 1.808.
Visually interpolating between the curves for 20 and 30 degrees of freedom results in an estimate of
.75 power for df = 26.
Using these same methods for the other values of ρ yields the following values of φ and corre-
sponding estimates of power.
e. There are two principal conclusions suggested by these power results. First, the power of the
ANCOVA approach is in general larger than that of the posttest-only approach, with the extent of the
power advantage increasing with the pretest-posttest correlation. Second, the gain-score analysis is
in general less powerful than the ANCOVA approach and requires a pretest-posttest correlation of .5
to be as powerful as a procedure that ignores the pretest entirely. As the correlation increases beyond
+.5, the power of the gain-score analysis exceeds that of the posttest-only approach and approaches
that of the ANCOVA approach.
Though not suggested by these results, there are two extreme cases where minor exceptions
to these general principles arise. First, if there is no pretest-posttest correlation, the posttest-only
approach is slightly more powerful than the ANCOVA approach, as a result of having one more
degree of freedom. Second, if the pretest-posttest correlation is approximately +1.0, the gain-score
analysis can be slightly more powerful than ANCOVA, again by virtue of having an additional
degree of freedom for error.
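A small Python sketch makes the pattern in these conclusions concrete, using only the quantities in this answer (σ = 20, n = 10 per group, and √(∑α_j²/a) = 8.165) and the values of ρ discussed above.

from math import sqrt

sigma2, n, effect_sd = 400.0, 10, 8.165

def phi(error_variance):
    return effect_sd / sqrt(error_variance / n)

for rho in (0.0, 0.5, 0.7):
    var_gain = 2 * sigma2 * (1 - rho)        # gain-score error variance
    var_ancova = sigma2 * (1 - rho**2)       # ANCOVA error variance
    print(f"rho = {rho:.1f}: phi(post-only) = {phi(sigma2):.2f}, "
          f"phi(gain) = {phi(var_gain):.2f}, phi(ANCOVA) = {phi(var_ancova):.2f}")
# The posttest-only phi stays at 1.29; the gain-score phi only catches up near rho = .5,
# while the ANCOVA phi grows steadily with rho (1.81 at rho = .7).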
CHAPTER 9 EXTENSION
2. a. An ANOVA of hours spent drinking per week at follow-up did not approach significance, F(1,166)
= .149, p = .700. If this were the only test performed, one would conclude that the group assignment
did not affect time spent drinking.
b. (i) An ANOVA of change in hours spent drinking, in contrast to the analysis in part a, would clearly
be statistically significant, F(1,166) = 5.710, p = .018. If this were the only test performed, one
would conclude that the group assignment did affect change in time spent drinking.
(ii) An ANCOVA of hours spent drinking, assuming homogeneity of regression, would not approach
significance, F(1,165) = 2.292, p = .132. The adjusted group means would be 3.52244 for the
control group and 2.70256 in the experimental group, for a difference of .8199. This would lead
one to conclude that, if the two groups had been spending the same time drinking at baseline,
the treatments would not have resulted in different mean hours spent drinking at the one-month
follow-up.
c. As discussed in Chapter 9, stratifying on a relevant covariate has clear benefits including assuring
groups will be quite similar in mean levels on the covariate. This in turn implies that the precision
of estimates of differences in adjusted means will be greater than with simple random assignment.
Although random assignment assures you that in the long run there would be no differences across
groups in the population means on the covariate, it is possible that groups can be different in sample
means on the covariate in any one replication. In the current data, if one were to test for differences
at baseline in hours spent drinking, the result would be statistically significant, F(1,166) = 4.085, p =
.045. Although one knows that over replications with random assignment the means would not differ
across groups, the fact that there is a moderately large (d = .312) difference at baseline implies that
the groups were starting at substantially different levels, which constrained the amount of change
that was possible. For example, if participants in both groups had stopped drinking entirely after
the baseline assessment, nonetheless a test of the change in amount of drinking would have been
significant.
d. (i) There is a strong relationship overall between hours spent drinking at baseline and follow-up,
as evidenced by the test of regression in the standard ANCOVA of part b(2), F(1,165) = 48.645,
p < .001. This corresponds to a within-group correlation of .477 (which here could be computed
as the partial correlation of baseline and follow-up hours, controlling for group). However, this
means the best single weight of the baseline measure is bW, the within-group regression coeffi-
cient for an ANCOVA assuming homogeneity of regression, which here is much less than 1 and
in fact is .456. The benefit in terms of precision of using this weight rather than 1 is indicated
by comparing the ANCOVA mean square error of 11.978 to that for the ANOVA of gain scores
of 16.917 (the latter value is in fact slightly larger than the mean square error of 16.909 for an
ANOVA of follow-up hours ignoring the baseline because the within-group correlation is a little
less than .5 here).
(ii) The test of heterogeneity of regression approaches statistical significance, F(1,164) = 3.700, p =
.056. A scatterplot of the data in the two groups with the separate regression lines for the control
and experimental groups is shown below:
As the plot indicates, the slope of the regression line in the control group is considerably steeper
than in the experimental group. One reason this might obtain here is that, given the two groups
ended up with similar means on the dependent variable, the higher mean in the treatment group
at baseline tends to reduce the slope of the best fitting line in that group.
(iii) As argued in the Chapter 9 extension, in such a situation it would be preferable to allow for
heterogeneity of regression in order to have a more precise test of the treatment effect and to be
able to examine it at different points on the scale of the covariate.
e. Estimates of the expected mean on the dependent variable for different values of the covariate
in the two groups may be easily obtained using the separate regression equations, as indicated in
Equation E.21:
Yˆij = a j + b j X ij . (E.21)
For the control group (group 1) and the experimental group (group 2), these equations here are approximately Ŷ_i1 = 1.113 + .619X_i1 and Ŷ_i2 = 1.307 + .360X_i2.
Determining the standard error of each prediction is also relatively straightforward once one has esti-
mates of the residual variance and has computed the within-group sum of squares of the covariate—
e.g., by performing an ANOVA on the covariate. As suggested by Equation E.15, the variance of the
prediction at a particular value of the covariate, X p, for group j may be estimated as
σ̂²_Ŷpj = σ̂²[1/n_j + (X_p − X̄_j)²/∑_i(X_ij − X̄_j)²].
For the current data, using the mean square error for the ANCOHET model as the estimate of σ 2
and the sum of squares within each group on the covariate as the denominator of the second term
within brackets in the above equation, these squared standard errors of the predictions at Xp may be
computed for the two groups as
σ̂²_Ŷp1 = 11.785[1/89 + (X_p − 3.449)²/1035.522],
σ̂²_Ŷp2 = 11.785[1/79 + (X_p − 4.734)²/1771.416].
The standard error of the difference in predictions needed to carry out the test at a particular point
may be obtained using Equation E.26:
σ̂_{Ŷp1 − Ŷp2} = s_het [1/n₁ + 1/n₂ + (X_p − X̄₁)²/∑_i(X_i1 − X̄₁)² + (X_p − X̄₂)²/∑_i(X_i2 − X̄₂)²]^{1/2}. (E.26)
Alternatively, if one has the standard error of the prediction for each of the groups, the standard error
of the difference in predictions could be obtained by using the principle that the variance of the dif-
ference of two independent random variables is equal to the sum of the variances of the individual
variables, which here implies
(i) The grand mean on the baseline measure was 4.05357 hours drinking per week. Thus, the pre-
dicted hours drinking at follow-up for participants starting at this grand mean in the two groups
would be
Thus, the expectation is that someone whose hours drinking at baseline was at the grand mean
would be spending 3.6209 – 2.7674 = .8535 more hours drinking at follow-up if in the control group
instead of the experimental group. The squared standard errors of these two predictions here are
σ̂²_Ŷp1 = 11.785[1/89 + (4.0536 − 3.449)²/1035.522] = 11.785[.01124 + .00035] = 0.1366,
σ̂²_Ŷp2 = 11.785[1/79 + (4.0536 − 4.734)²/1771.416] = 11.785[.01266 + .00026] = 0.1523.
That is, the standard error of the prediction for the control group is √.1366 = .3696 and for the
experimental group is √.1523 = .3902. This in turn implies that the squared standard error of the
difference in predictions for participants at the grand mean on the covariate would be:
Thus, we see that the test of the difference in hours drinking post for those at the grand mean on
the covariate would yield, using Equation E.25
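A minimal Python sketch that combines the quantities just computed: the squared standard error of the difference is the sum of the two squared standard errors, and the test statistic is the difference in predictions divided by that standard error (equivalently, Equation E.25's F with 1 and 164 degrees of freedom is the square of this t).

from math import sqrt
from scipy.stats import t

diff = 3.6209 - 2.7674            # .8535
se_diff = sqrt(0.1366 + 0.1523)   # about .5374
t_stat = diff / se_diff
p_value = 2 * t.sf(abs(t_stat), df=164)

print(round(se_diff, 4), round(t_stat, 2), round(p_value, 3))
# SE about .5374, t(164) about 1.59, p about .11: the difference for someone at the
# grand mean on the covariate is not significant.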
Fortunately, the separate predictions and their standard errors can be obtained easily from stan-
dard statistical packages. In SPSS, the predictions and standard errors in this case could be
obtained even without using syntax by specifying a model including both the main effect of
group, the covariate, and the group by covariate interaction, and requesting the option of display-
ing expected marginal means for group, which by default are computed at the grand mean on the
covariate. This would generate the following output, confirming our previous computations (to
get four digits of accuracy in SPSS as in our previous computations, one can double-click on the
table in the SPSS output viewer, highlight the cells of interest, right-click, choose “Cell Proper-
ties”> “Format Value” > Category “Number,” and set Decimals to “4”):
(SPSS output omitted: Estimated Marginal Means for Group, dependent variable the follow-up assessment of total hours spent drinking during a typical week.)
Given this, one might be tempted to think that the test of Group in the accompanying source
table that follows would be the test of the difference in these expected marginal means.
(SPSS output omitted: Tests of Between-Subjects Effects source table, with Type III sums of squares, df, mean squares, F, and significance values for the follow-up drinking measure.)
However, the test of Group shown here should be thought of as the test of the intercepts of the
two regression lines (i.e., 1.113 in the control group vs. 1.307 in the experimental group) or
comparing the estimated hours drinking post for participants who had zero hours drinking in the
typical week at baseline. Given all participants reported some drinking in the previous month
(even if a very few reported no drinking in a “typical” week), the test of Group shown above
is certainly not representative of the expected difference across groups for a typical subject. If
one wants to rely on default tests such as this to test the difference in drinking post for a typical
subject it would be imperative to express the covariate in “centered” or deviation score form.
(ii) and (iii) The center of accuracy is, as noted in the text, a weighted average of the group means on the covariate, where the weight given a group's mean is the within-group sum of squares on the covariate for the other group. This may be expressed as follows:
C_a = [∑_i(X_i2 − X̄₂)² X̄₁ + ∑_i(X_i1 − X̄₁)² X̄₂] / ∑_j ∑_i (X_ij − X̄_j)².
The computation for the current data yields a value somewhat closer to the mean of group 1 than to that of group 2. The variability at baseline was somewhat greater in group 2 than in group 1, so the slope in group 1 is slightly less precisely estimated than the slope in group 2; the greatest precision for the difference is therefore achieved by estimating the group difference at a point somewhat closer to group 1's mean. The center of accuracy here is
C_a = [∑_i(X_i2 − X̄₂)² X̄₁ + ∑_i(X_i1 − X̄₁)² X̄₂] / ∑_j ∑_i (X_ij − X̄_j)²
= [1771.416(3.449) + 1035.522(4.734)] / (1035.522 + 1771.416)
= 3.923.
To obtain predictions and standard errors at covariate values other than at the grand mean in
SPSS, one needs to use syntax. This can be done most easily by requesting expected marginal
means at particular values of the covariate, as shown in the following:
This generates the following three sets of expected marginal means (to six digits of accuracy):
Using the estimated means and standard errors from the second table, which were computed at the center of accuracy (i.e., at 3.92 hours drinking at baseline), the test of group differences at follow-up would be
Note that the change in precision from that at the grand mean on the covariate is rather trivial (the
standard error decreased from .5374 in part e(1) to .5372 here). This typically will be the case, given that the only terms in Equation E.26 that change are the final two, which have the sums of squares on the covariate in their denominators; these terms will generally be very small when X_p is at the grand mean or at the center of accuracy. However, the numerator is somewhat smaller than in part e(1), because the shift away from the grand mean in this case is toward the point where the regression lines intersect. Note also that the difference in the predicted scores at the center of accuracy, that is, 3.5404 − 2.7205 = .8199, is equal to the difference in adjusted means found in the standard ANCOVA assuming homogeneity of regression, as in part b(2).
It is possible to have SPSS not only compute predictions and their standard errors but also
carry out the pick-a-point tests. This requires use of the LMATRIX subcommand in UNI-
ANOVA, which is less straightforward than simply requesting expected marginal means. The
syntax needed to carry out the tests and also produce estimates of predictions for each group for
the three covariate values requested in part e of this exercise would be:
/CRITERIA=ALPHA(.05)
/LMATRIX “test for diff at x=xbar = 4.05357 “ Group 1-1
Group*BaseHrsDrkTypWk 4.0535714 -4.0535714
/LMATRIX “test for diff at x=Ca” Group 1-1
Group*BaseHrsDrkTypWk 3.923382 -3.923382
/LMATRIX “test for diff at x=xbar + ~1 SD = 8.0” Group 1 -1
Group*BaseHrsDrkTypWk 8.0 -8.0
/LMATRIX “estimate y for control group at x_bar =4.05357” intercept 1
Group 1 0 Group*BaseHrsDrkTypWk 4.0535714 0
/LMATRIX “estimate y for treatment group at x_bar = 4.05357” intercept 1
Group 0 1 Group*BaseHrsDrkTypWk 0 4.0535714
/LMATRIX “estimate y for control group at x=Ca” intercept 1 Group 1 0
Group*BaseHrsDrkTypWk 3.923382 0
/LMATRIX “estimate y for treatment group at x=Ca” intercept 1 Group 0 1
Group*BaseHrsDrkTypWk 0 3.923382
/LMATRIX “estimate y for control group at x=8.0” intercept 1 Group 1 0
Group*BaseHrsDrkTypWk 8.0 0
/LMATRIX “estimate y for treatment group at x=8.0” intercept 1 Group 0 1
Group*BaseHrsDrkTypWk 0 8.0
/DESIGN=Group Group*BaseHrsDrkTypWk.
Note that the DESIGN statement now omits the main effect for the covariate, and the coeffi-
cients after the effects in the LMATRIX commands are the coefficients of the parameters in the
overparameterized ANOVA model (cf. Green et al., 1999). The first three LMATRIX commands
request the test of the difference in predictions at the points specified, and the next six LMATRIX
commands request the estimate of the value of the prediction for each group separately at each
of the specified covariate values. Output is shown next just for the test where the baseline hours
is 8.0 (generated by the third LMATRIX command above), and, subsequently, the corresponding
predictions are displayed (generated by the eighth and ninth LMATRIX commands noted above).
(SPSS output omitted: Contrast Results (K matrix) and Test Results tables for the test of the group difference at 8.0 baseline hours of drinking per week.)
The test shows that for participants who were at 8 hours drinking per week at baseline, the two
groups would in fact be expected to differ significantly in hours spent drinking at follow-up:
t(164) = 1.873/.767 = 2.442, p = .016.
Although the standard error at this point is considerably larger than those at or near the grand mean on
the covariate, because the regression lines are diverging, the expected difference between groups is at
this point large enough to claim statistical significance. (This presumes one is content to use an unad-
justed alpha level for the test, as requested in this exercise; to obtain a more conservative test, one could
compare the observed t against an adjusted critical value of W, as explained in the text, permitting
tests at an arbitrarily large number of covariate values. Here W = 2 F (2,164) = 2 3.06 = 3.50,
meaning that the more conservative adjusted test would be nonsignificant.)
As shown in the output that follows, the expected difference of 1.873 hours drinking post for
those spending 8 hours drinking at baseline is a consequence of the fact that such participants in the
control group would be expected to be drinking 6.062 hours a week whereas those in the experi-
mental group would be expected to be drinking 4.189 hours a week, i.e. 6.062 – 4.189 = 1.873.
(SPSS output omitted: Contrast Results and Test Results tables for the separate group predictions at 8.0 baseline hours; the estimate for the treatment group at x = 8.0 is 4.189, 95% confidence interval 3.263 to 5.116.)
f. The preferred approach to analyzing these data would be one that covaries the hours spent drinking
at baseline but which allows for heterogeneity of regression. The ANOVA of part a is less desirable
because it ignores both individual differences among participants in hours spent drinking at baseline
and the preexisting differences across groups on this variable. The analysis of change scores and a
standard ANCOVA both consider the individual differences at baseline. However, the change scores
analysis that found a significant difference might be said to present an overly optimistic view of
the effect of the treatment here, and the ANCOVA, which found a nonsignificant difference, might
be said to present an overly pessimistic view of the treatment effect. In fact, the substantial het-
erogeneity of regression, although nonsignificant, clearly suggests that the effect of the treatment
does depend on how much time the participant was spending drinking at baseline. For the typical
participant who was spending only about 4 hours a week drinking pre, the expected outcome would
be about the same in the two groups. However, for more frequent drinkers, such as those who were
spending 8 hours a week drinking at baseline, which was about one standard deviation above the
mean, the expected outcome would be significantly less time spent drinking post if assigned to the
goal-setting condition as opposed to the control condition.
10
Designs with Random or Nested Factors
7. a. Combining the 15 scores for each therapy method into one group, we obtain the following means,
standard deviations, and sums of squared deviations from group means.
Method     Ȳ_j      s_j      ∑_i(Y_ij − Ȳ_j)²
RET        40      3.3166        154
CCT        42      3.1396        138
BMOD       44      3.0938        134
Thus, the grand mean Y is 42, and the sum of squares for the method (A) effect here is
SS_A = ∑_j ∑_{i=1}^{15} (Ȳ_j − Ȳ)² = 15∑_j(Ȳ_j − Ȳ)² = 15[(−2)² + 0² + 2²]
= 15[4 + 0 + 4] = 120.
The degrees of freedom for the method effect is a − 1 = 3 − 1 = 2. The sum of squares within (or error of the full model) is here
SS_W = 154 + 138 + 134 = 426
and is based on a(n − 1) = 3(15 − 1) = 3·14 = 42 degrees of freedom. Thus, the F for the method
effect, analyzed as a one-way design, is
F = (SS_A/df_A) / (SS_W/df_W) = (120/2)/(426/42) = 60/10.1429 = 5.9155.
Comparing against the critical F for α = .01, F(2, 42) = 5.16, the results are declared significant.
b. Approaching the data as a two-way, fixed-effects design, we have the following cell means, standard
deviations, and sums of squared deviations from cell means.
                               Therapist
                      1          2          3        Ȳ_j⋅
RET    Ȳ_jk          38         42         40         40
       s_jk        2.9155     2.9155     3.3912
       ∑_i(Y_ijk − Ȳ_jk)²  34         34         46
CCT    Ȳ_jk          41         44         41         42
       s_jk        2.4495     3.3912     3.0822
       ∑_i(Y_ijk − Ȳ_jk)²  24         46         38
BMOD   Ȳ_jk          46         44         42         44
       s_jk        2.3452     3.5355     2.3452
       ∑_i(Y_ijk − Ȳ_jk)²  22         50         22
df A = a − 1 = 3 − 1 = 2,
SS W = ∑ ∑ ∑ (Yijk − Y jk ) 2 = 34 + 34 + 46 + 24 + 46 + 38 + 22 + 50 + 22 = 316,
i j k
Combining these values to compute the F for the method effect, we have
F_A = (SS_A/df_A) / (SS_W/df_W) = (120/2)/(316/36) = 60/8.7778 = 6.8354.
We compare this against a critical F for α = .01, F(2, 36) = 5.26 and again would declare the result
statistically significant.
c. Treating the therapist factor as random would imply that the method effect should be compared with
a denominator error term corresponding to the method × therapist interaction. The method × thera-
pist interaction sum of squares may be computed from the effect parameters calculated as
(αβ̂)_jk = Ȳ_jk − Ȳ_j⋅ − Ȳ_⋅k + Ȳ_⋅⋅.
For example, (αβ̂)₁₁ = 38 − 40 − 41.6667 + 42 = −1.6667. The sum of squares for the method × therapist effect is then
SS_AB = ∑_i ∑_j ∑_k (αβ̂)_jk² = n ∑_j ∑_k (Ȳ_jk − Ȳ_j⋅ − Ȳ_⋅k + Ȳ_⋅⋅)² = 5(13.3333) = 66.6666.
This is based on (a – 1)(b – 1) = (3 – 1)(3 – 1) = 4 degrees of freedom. Thus the test of the method
effect, analyzing the data as a two-factor mixed design, yields
F_A = (SS_A/df_A) / (SS_AB/df_AB) = (120/2) / (66.6667/4) = 60/16.6667 = 3.60.
However, this is now compared against a critical F with only 2 and 4 degrees of freedom. For α =
.05, the critical F is 6.94; for .01, it would be 18.00. Thus, treating therapists as a random factor, the
method effect does not approach significance.
d. The sum of squares for the method effect was 120 in each of the three analyses.
e. The denominator mean square error terms were 10.1429 for the one-way approach, 8.7778 for the
two-way fixed-effects approach, and 16.6667 for the two-way mixed-effects approach. The sum of
squares within for the two-way approach removes from the sum of squares within for the one-way
approach any variability that can be attributed to B or AB effects. In fact, it is the case that
Mean square error term in (a) = (SS_B + SS_AB + SS_W(part b)) / (df_B + df_AB + df_W(part b))
                              = (43.3316 + 66.6666 + 316) / (2 + 4 + 36) = 425.9982/42 = 10.1428.
The error term for the random effects approach uses the mean square for the AB interaction, which
here happens to be larger than either MSW value.
f. It is reasonable that Kurtosis obtained different results than Skewed. Kurtosis obtained a smaller
F for the method effect yet had to compare it against a larger critical F value since she had fewer
degrees of freedom for her error term. In addition, the rationale for evaluating the method effect is
quite different in the mixed-effects case than in the fixed effect case. In the mixed-effects case, the
question is, Is the effect of methods large relative to the variability that we would expect to result
from randomly selecting therapists for whom the methods effects differ? Because the magnitude of
the method effect varies somewhat over therapists (from an 8-point mean difference for therapist 1
to a 2-point mean difference for therapist 3), it is reasonable to conclude that the variability among
the marginal means for methods may just be the result of which therapists happened to be used in the
study. However, it is the case (as Skewed found) that, if outcomes with these three therapists are the
only ones to which we want to generalize, we can conclude that the three methods would result in
different mean outcomes for the population of possible subjects.
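As a cross-check, here is a minimal Python sketch (my addition, not part of the original solution) that rebuilds the three F ratios from the cell means and within-cell sums of squares in the table for part (b); the array layout and variable names are mine.

import numpy as np

# Cell means (rows = RET, CCT, BMOD; columns = therapists 1-3), n = 5 per cell,
# and within-cell sums of squared deviations, from the table in part (b).
cell_means = np.array([[38, 42, 40],
                       [41, 44, 41],
                       [46, 44, 42]], dtype=float)
cell_ss = np.array([[34, 34, 46],
                    [24, 46, 38],
                    [22, 50, 22]], dtype=float)
n = 5
a, b = cell_means.shape

grand = cell_means.mean()
method_means = cell_means.mean(axis=1)
therapist_means = cell_means.mean(axis=0)

ss_a = n * b * np.sum((method_means - grand) ** 2)             # 120
ss_b = n * a * np.sum((therapist_means - grand) ** 2)          # about 43.33
ss_ab = n * np.sum((cell_means - method_means[:, None]
                    - therapist_means[None, :] + grand) ** 2)  # about 66.67
ss_w = cell_ss.sum()                                           # 316

# (a) one-way analysis: B and AB variability folded into the error term
f_oneway = (ss_a / (a - 1)) / ((ss_b + ss_ab + ss_w)
                               / ((b - 1) + (a - 1) * (b - 1) + a * b * (n - 1)))
# (b) two-way fixed effects: MS_W as the error term
f_fixed = (ss_a / (a - 1)) / (ss_w / (a * b * (n - 1)))
# (c) therapists random: MS_AB as the error term
f_mixed = (ss_a / (a - 1)) / (ss_ab / ((a - 1) * (b - 1)))

print(round(f_oneway, 4), round(f_fixed, 4), round(f_mixed, 2))  # about 5.92, 6.84, 3.60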
10. a. (1) The design may be diagrammed as follows, where C designates cleaning and F filling.
The factor of dental procedure (P) is crossed with the random factor of the specific tape (t),
which in turn is nested within levels of the factor of kind of tape (K). “Subjects” of course
constitutes a random factor that is nested within combinations of levels of all the other factors.
Thus, the basic structure of the design may be labeled as Ρ × t/K.
(2) With three factors, if the factors were completely crossed, there would be seven effects that could
be tested (three main effects, three two-way interactions, and one three-way interaction). However,
because here t is nested within K, the t main effect and the tK and PtK interactions cannot be tested.
Instead, we can examine the simple effects of factor t within levels of K (that is, t / K ) and the
simple interactions of factors Ρ and t within levels of K (that is, Pt / K ). Thus the testable effects are
Effect        Verbal Label        Testable Status
The degrees of freedom associated with the main effects of the crossed factors are as usual
one less than the number of levels of each factor, and the degrees of freedom for their interac-
tion is equal to the product of the degrees of freedom for the main effects of the factors involved
in the interaction. Thus, if Ρ and K are crossed factors with p levels of factor Ρ and k levels of
factor K, the main effect of Ρ would have p – 1 degrees of freedom, the main effect of K would
have k – 1 degrees of freedom, and the PK interaction would have (p – 1)(k – 1) degrees of free-
dom. The tests of the nested factor are carried out as pooled tests of simple effects. For example,
carrying out the test of the simple main effects of a factor having t levels nested within each of
the k levels of factor K is like carrying out k one-way ANOVAs, each of which would have t – 1
degrees of freedom. Thus the pooled test of the simple main effects of factor t within each level
of factor K, or the test of the t / K effect, has k(t – 1) degrees of freedom. Similarly, in consider-
ing the Pt / K effect, one is pooling k simple interaction effects each of which has (p – 1)(t – 1)
degrees of freedom, and so the Pt / K effect has k(p – 1)(t – 1) degrees of freedom.
The error terms for testing these effects can be determined by reference to the preceding dia-
gram and the flowchart in Figure 10.7. Considering first the Ρ main effect, there are no random
factors nested under levels of P; but since Ρ is crossed with t, there is an interaction of factor Ρ
with the random factor t within each level of K. There is only one such effect, so Pt / K suffices
as the error term for testing P. For the main effect of kind of tape, factor K, there is a random
factor nested within its levels at the next lowest level of the hierarchy, and thus t/K is selected as
the error term. In considering the KP interaction, as explained in the discussion in the main text
of the Figure 10.7 flowchart’s rule (i), Pt / K is considered to be an interaction of the effect to be
tested with a random factor and thus is selected as the error term. Both effects t / K and Pt / K
involve all the random factors and so the flowchart implies MSW is the correct error term. MSW
in a design of this structure is the average of the ptk within-group variances, each of which is
based on n – 1 degrees of freedom, and so has ptk(n – 1) degrees of freedom.
Thus, the summary of effects, error terms, and degrees of freedom is as follows:

Effect     df_Effect             Denominator Error Term     df_Error
P          p − 1                 Pt/K                       k(p − 1)(t − 1)
K          k − 1                 t/K                        k(t − 1)
PK         (p − 1)(k − 1)        Pt/K                       k(p − 1)(t − 1)
t/K        k(t − 1)              MS_W                       ptk(n − 1)
Pt/K       k(p − 1)(t − 1)       MS_W                       ptk(n − 1)
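The bookkeeping above lends itself to a small helper. The following Python sketch is my addition, not the text's; it tabulates the degrees of freedom and error terms for a P × t/K design with arbitrary p, k, t, and n, and the example values in the call are hypothetical.

def nested_design_dfs(p, k, t, n):
    """Degrees of freedom and error terms for a P x t/K design:
    P crossed with tapes (t), tapes nested within kind of tape (K),
    n subjects per cell. A sketch of the rules summarized above."""
    df_w = p * t * k * (n - 1)
    return {
        "P":    {"df": p - 1,                 "error": "Pt/K", "df_error": k * (p - 1) * (t - 1)},
        "K":    {"df": k - 1,                 "error": "t/K",  "df_error": k * (t - 1)},
        "PK":   {"df": (p - 1) * (k - 1),     "error": "Pt/K", "df_error": k * (p - 1) * (t - 1)},
        "t/K":  {"df": k * (t - 1),           "error": "MS_W", "df_error": df_w},
        "Pt/K": {"df": k * (p - 1) * (t - 1), "error": "MS_W", "df_error": df_w},
    }

# Hypothetical example: 2 procedures, 2 kinds of tape, 3 tapes per kind, 5 subjects per cell
for effect, info in nested_design_dfs(p=2, k=2, t=3, n=5).items():
    print(effect, info)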
b. Although usually increasing the number of subjects will increase df for the denominator, that is not
the case for testing the effect of the kind of tape here. The denominator term for testing the main
effect of factor K is t / K and its df depends solely on the number of levels of factors t and K. To
increase df for testing the effect of kind of tape, one would need to increase the number of specific
tapes of each kind, not the number of subjects. However, increasing n will result in more precise
estimates of the means of the levels of K and will thus cause power to increase, even though the criti-
cal F value is not affected.
12. a. The type of Feedback (factor F), immediate or delayed, is a fixed factor crossed with the other fac-
tors in the design. The type of Concept (factor C), either disjunctive or conjunctive, is also a fixed
factor and is crossed with factor F. The specific problem (factor p) is a random factor nested within
levels of factor C but crossed with factor F. That is, each problem appears together with only one
type of Concept but each problem appears together with all types of feedback. Thus, the basic struc-
ture of the design may be labeled F × p/C.
b. Following the same logic as was explained in the answer to Problem 10, we arrive at the following
five testable effects in this design, and the logic of the flowchart of Figure 10.7 leads to the error
terms indicated.

Testable Effect     Error Term
F                   Fp/C
C                   p/C
FC                  Fp/C
p/C                 MS_W
Fp/C                MS_W
c. Let us designate specific levels of the factor of Feedback by j = 1, 2, specific levels of the Concept
factor by k = 1, 2, and the specific problems within a type of Concept by l = 1, 2, 3, 4. Thus, we would
have the following table of cell and marginal means:
The effects of the completely crossed factors of Feedback and Concept can be handled by esti-
mating effect parameters as discussed in Chapters 7 and 8, squaring them, and summing over all
observations. Applying this logic to the effects involving the nested factor of problems will require
estimating a different set of effects for each of the levels of the Concept factor within which it is
nested. Letting αj, βk, and (αβ)jk refer to the main effects of Feedback, Concept, and their interaction,
respectively, we can refer to the simple effect of problems within the kth level of Concepts by γl / k
and the simple interaction of Feedback by problems within the kth level of Concepts by αγ jl / k . Thus
we can state the full model here as
Y_ijkl = μ + α_j + β_k + (αβ)_jk + γ_l/k + (αγ)_jl/k + ε_ijkl .
Let us begin our computations with the most straightforward effects of Feedback and Concept. The
sums of squares for these completely crossed effects can be obtained very easily from the following
summary table of marginal means for the Feedback and Concept factors.
From that table, these computations yield SS_F = 18, SS_C = 32, and SS_FC = 2, with the interaction effects estimated as

(α̂β̂)_jk = Ȳ_jk⋅ − (μ̂ + α̂_j + β̂_k).
The sum of squares for the pooled simple effects of the nested factor can be obtained by first
estimating the effect parameters for this factor separately at each level of the factor within which
it is nested (see Equation 28). We will in essence be computing the sum of squares for a one-way
ANOVA of the effects of problem at each level of the Concept factor, and then pooling these sums
of squares—i.e.,
SS p / C = SS p at C1 + SS p at C 2 .
The first step is to compute, at each level of the Concept factor, the estimated effects of the nested
factor as the difference between the mean for each problem and the marginal mean for that kind of
Concept. We will thus be computing
γˆ l /1 = Y⋅1l − Y⋅1⋅
and
γˆ l / 2 = Y⋅2 l − Y⋅2⋅
Disjunctive Conjunctive
Y⋅1l γˆ l /1 Y⋅2l γˆ l / 2
3 –1 1.5 –.5
4 0 1.0 –1.0
4 0 4.5 2.5
5 +1 1.0 –1.0
Y⋅1⋅ = 4 Y⋅2⋅ = 2
Then we can compute the sum of squares for the pooled simple effects of our nested factor of prob-
lems as
SS_p/C = Σ_l Σ_k Σ_j Σ_i γ̂²_l/k = Σ_l Σ_k (2·2) γ̂²_l/k
       = 4[(−1)² + 0² + 0² + 1²] + 4[(−.5)² + (−1)² + 2.5² + (−1)²]
       = 4(1 + 1) + 4(.25 + 1 + 6.25 + 1) = 4(2) + 4(8.5)
       = 8 + 34 = SS_p at C1 + SS_p at C2
       = 42 = SS_p/C .
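A short Python sketch (my addition, not from the original solution) reproduces this pooled sum of squares from the table of problem means above; it uses n = 2 subjects per cell, as in the MS_W degrees of freedom computed later, so that each problem mean is based on four observations.

import numpy as np

# Problem means within each type of Concept, from the table above
disjunctive = np.array([3.0, 4.0, 4.0, 5.0])
conjunctive = np.array([1.5, 1.0, 4.5, 1.0])

obs_per_problem = 2 * 2   # 2 feedback levels x 2 subjects per cell

ss_p_at_c1 = obs_per_problem * np.sum((disjunctive - disjunctive.mean()) ** 2)
ss_p_at_c2 = obs_per_problem * np.sum((conjunctive - conjunctive.mean()) ** 2)
print(ss_p_at_c1, ss_p_at_c2, ss_p_at_c1 + ss_p_at_c2)   # 8.0  34.0  42.0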
Similarly, the simple interaction effects of Feedback × Problem are estimated separately at each level
of the Concept factor. That is, we compute
(α̂γ̂)_jl/1 = Ȳ_j1l − Ȳ_j1⋅ − Ȳ_⋅1l + Ȳ_⋅1⋅

and

(α̂γ̂)_jl/2 = Ȳ_j2l − Ȳ_j2⋅ − Ȳ_⋅2l + Ȳ_⋅2⋅ .
Estimates of these simple interaction effects are shown in the table below next to the corresponding
cell means:
Ȳ_11l   (α̂γ̂)_1l/1   Ȳ_21l   (α̂γ̂)_2l/1   Ȳ_12l   (α̂γ̂)_1l/2   Ȳ_22l   (α̂γ̂)_2l/2
Then we can compute the pooled sum of squares for the simple interactions of Feedback with Prob-
lems as follows:
SS_Fp/C = Σ_i Σ_j Σ_k Σ_l (α̂γ̂)²_jl/k = Σ_j Σ_k Σ_l Σ_{i=1}^{2} [Ȳ_jkl − (Ȳ_jk⋅ + Ȳ_⋅kl − Ȳ_⋅k⋅)]²
        = 2 Σ_j Σ_k Σ_l [Ȳ_jkl − (Ȳ_jk⋅ + Ȳ_⋅kl − Ȳ_⋅k⋅)]²
        = 2[(3 − 2)² + (3 − 4)² + (3 − 3)² + (5 − 5)² + (2 − 3)² + (6 − 5)² + (4 − 4)² + (6 − 6)²]
          + 2[(1 − 1)² + (2 − 2)² + (1 − .5)² + (1 − 1.5)² + (4 − 4)² + (5 − 5)² + (0 − .5)² + (2 − 1.5)²]
        = 2[1 + 1 + 0 + 0 + 1 + 1 + 0 + 0] + 2[0 + 0 + .25 + .25 + 0 + 0 + .25 + .25]
        = 2(4) + 2(1) = 8 + 2 = SS_Fp at C1 + SS_Fp at C2 = 10 = SS_Fp/C .
To carry out the tests against the error terms outlined in part (b), we only need to determine the
degrees of freedom for the various effects. Using lowercase letters to designate the number of levels
of a factor, we have the following values:
Source    df
F         (f − 1) = (2 − 1) = 1
C         (c − 1) = (2 − 1) = 1
FC        (f − 1)(c − 1) = (2 − 1)(2 − 1) = 1
p/C       c(p − 1) = 2(4 − 1) = 2·3 = 6
Fp/C      c(f − 1)(p − 1) = 2(2 − 1)(4 − 1) = 2·1·3 = 6
MS_W      fcp(n − 1) = 2·2·4(2 − 1) = 16(1) = 16.
Thus, we have the following test statistics and critical values at α = .05 for our five testable effects:
F_F = (SS_F/df_F) / (SS_Fp/C/df_Fp/C) = (18/1) / (10/6) = 10.8;   F_crit = F(1, 6) = 5.99;   p < .05

F_C = (SS_C/df_C) / (SS_p/C/df_p/C) = (32/1) / (42/6) = 32/7 = 4.57;   F_crit = F(1, 6) = 5.99;   n.s.

F_FC = (SS_FC/df_FC) / (SS_Fp/C/df_Fp/C) = (2/1) / (10/6) = 1.2;   F_crit = F(1, 6) = 5.99;   n.s.

F_p/C = (SS_p/C/df_p/C) / MS_W = (42/6)/5 = 7/5 = 1.4;   F_crit = F(6, 16) = 2.74;   n.s.

F_Fp/C = (SS_Fp/C/df_Fp/C) / MS_W = (10/6)/5 = .33;   F_crit = F(6, 16) = 2.74;   n.s.
It follows that the only effect for which we have grounds for rejecting the null hypothesis is the main
effect of Feedback. Delayed feedback results in more errors being required to reach the criterion
performance than immediate feedback.
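For readers who want to verify the arithmetic, here is a short Python sketch (my addition) that assembles the five F ratios from the sums of squares and degrees of freedom just computed; scipy is used only for the critical values, and the dictionary layout is mine.

from scipy import stats

ss = {"F": 18, "C": 32, "FC": 2, "p/C": 42, "Fp/C": 10}
df = {"F": 1, "C": 1, "FC": 1, "p/C": 6, "Fp/C": 6}
ms_w, df_w = 5.0, 16   # within-cell mean square and its df

# effect -> (error mean square, error df), following the error terms chosen in part (b)
error = {"F":    (ss["Fp/C"] / df["Fp/C"], df["Fp/C"]),
         "C":    (ss["p/C"]  / df["p/C"],  df["p/C"]),
         "FC":   (ss["Fp/C"] / df["Fp/C"], df["Fp/C"]),
         "p/C":  (ms_w, df_w),
         "Fp/C": (ms_w, df_w)}

for eff in ss:
    ms_error, df_error = error[eff]
    f_obs = (ss[eff] / df[eff]) / ms_error
    f_crit = stats.f.ppf(0.95, df[eff], df_error)
    print(f"{eff:5s} F = {f_obs:5.2f}  critical F = {f_crit:4.2f}")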
III
Model Comparisons for Designs
Involving Within-Subjects Factors
11
3. a. From Equation 24, predicted scores for the full model are of the form

Ŷ_ij = Ȳ_⋅j + Ȳ_i⋅ − Ȳ_⋅⋅ .

For example,
Ŷ_11 = (3 + 4 + 2 + 4 + 7)/5 + (3 + 6 + 4 + 5)/4 − 4.5 = 4 + 4.5 − 4.5 = 4,
Ŷ_12 = (6 + 7 + 1 + 5 + 6)/5 + (3 + 6 + 4 + 5)/4 − 4.5 = 5 + 4.5 − 4.5 = 5,
Ŷ_21 = (3 + 4 + 2 + 4 + 7)/5 + (4 + 7 + 4 + 8)/4 − 4.5 = 4 + 5.75 − 4.5 = 5.25,
and so forth. Completing similar calculations for all other rows and columns, we find that the pre-
dicted scores for the full model are as follows.
Location
Subject       1       2       3       4
1           4.00    5.00    3.00    6.00
2           5.25    6.25    4.25    7.25
3           1.25    2.25    0.25    3.25
4           3.25    4.25    2.25    5.25
5           6.25    7.25    5.25    8.25
b. The errors for the full model, Y_ij − Ŷ_ij, are as follows.

Location
Subject       1       2       3       4
1 –1.00 1.00 1.00 –1.00
2 –1.25 .75 –.25 .75
3 .75 –1.25 .75 –.25
4 .75 .75 –1.25 –.25
5 .75 –1.25 –.25 .75
c. From Equation 25, predicted scores for the restricted model are of the form

Ŷ_ij = Ȳ_i⋅ .

For example,
Ŷ_11 = Ȳ_1⋅ = (3 + 6 + 4 + 5)/4 = 4.50,
Ŷ_12 = Ȳ_1⋅ = (3 + 6 + 4 + 5)/4 = 4.50,
Ŷ_21 = Ȳ_2⋅ = (4 + 7 + 4 + 8)/4 = 5.75,
and so forth. Completing similar calculations for all other rows and columns, we find that the pre-
dicted scores for the restricted model are as follows.
Location
Subject 1 2 3 4
1 4.50 4.50 4.50 4.50
2 5.75 5.75 5.75 5.75
3 1.75 1.75 1.75 1.75
4 3.75 3.75 3.75 3.75
5 6.75 6.75 6.75 6.75
d. The errors for the restricted model, Y_ij − Ŷ_ij = Y_ij − Ȳ_i⋅, are as follows.

Location
Subject       1       2       3       4
1 –1.50 1.50 –.50 .50
2 –1.75 1.25 –1.75 2.25
3 .25 –.75 –.75 1.25
4 .25 1.25 –2.75 1.25
5 .25 –.75 –1.75 2.25
e. The test statistic is

F = [(E_R − E_F)/(df_R − df_F)] / (E_F/df_F),

where
df R − df F = a − 1
= 4 −1
= 3,
df F = (n − 1)(a − 1)
= (5 − 1)(4 − 1)
= 12.
In addition,
F = [(40.00 − 15.00)/3] / (15.00/12) = 6.67.
For the unadjusted test, there are 3 numerator and 12 denominator degrees of freedom. The critical
value at α = .05 is 3.49 (see Appendix Table 2), so there is a statistically significant difference among
the locations.
f. The degrees of freedom for the Geisser-Greenhouse lower bound correction are
df_num = 1,
df_den = n − 1 = 5 − 1 = 4.    (11.33)
The critical F value for 1 numerator and 4 denominator degrees of freedom is 7.71 at the .05 level.
Thus, the difference among locations is nonsignificant with the Geisser-Greenhouse lower bound
correction.
g. Obviously, the simplest way to obtain the value of ε̂ is to rely on a computer program. Neverthe-
less, we will illustrate its calculation here. The first step is to calculate the covariance matrix for
the data. If we let Yij represent the score of subject i in condition j (i.e., the score in row i and
column j of the original data matrix), the element in row j and column k of the covariance matrix
is given by
E_jk = Σ_{i=1}^{n} (Y_ij − Ȳ_⋅j)(Y_ik − Ȳ_⋅k) / n.
Performing this calculation for our data yields the following matrix:

        2.8   2.0   1.8   3.2
        2.0   4.4   2.6   3.6
        1.8   2.6   2.8   3.0
        3.2   3.6   3.0   4.8
ε̂ = a²(Ē_jj − Ē_⋅⋅)² / {(a − 1)[(Σ_j Σ_k E²_jk) − (2a Σ_j Ē²_j⋅) + (a² Ē²_⋅⋅)]},

where

a = 4,
Ē_jj = (2.8 + 4.4 + 2.8 + 4.8)/4 = 3.70,
Ē_⋅⋅ = (2.8 + 2.0 + 1.8 + 3.2 + 2.0 + 4.4 + ⋯ + 3.0 + 4.8)/16 = 2.95,
Σ_j Σ_k E²_jk = (2.8)² + (2.0)² + (1.8)² + ⋯ + (3.0)² + (4.8)² = 150.48,
Σ_j Ē²_j⋅ = (2.45)² + (3.15)² + (2.55)² + (3.65)² = 35.75,
so that

ε̂ = 4²(3.70 − 2.95)² / {3[150.48 − 2(4)(35.75) + 4²(2.95)²]} = 9/11.16 = .81.

The ε̂-adjusted degrees of freedom are then ε̂(a − 1) = .81(3) = 2.42 for the numerator and ε̂(n − 1)(a − 1) = .81(12) = 9.68 for the denominator.
Rounding down (to be conservative), the critical F value with 2 and 9 degrees of freedom is 4.26 at the .05
level. Thus, the difference among locations is statistically significant at the .05 level with the ε̂-adjusted
test, just as it was with the unadjusted test. Using a computer program such as SPSS, we can find that the
p value for the ε̂-adjusted test is .0125, corroborating the statistical significance at the .05 level.
h. Now that we have calculated ε̂, the value of ε̃ follows easily from Equation 34:

ε̃ = [n(a − 1)ε̂ − 2] / {(a − 1)[n − 1 − (a − 1)ε̂]}    (11.34)
  = [5(4 − 1)(.81) − 2] / {(4 − 1)[5 − 1 − (4 − 1)(.81)]}
  = 10.15/4.71
  = 2.15.

Because ε̃ exceeds 1.00 for these data, it is shrunk back to 1.00. As a consequence, the use of the ε̃
adjustment simply duplicates the unadjusted test for these data.
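The following Python sketch (my addition, not part of the original answer) reproduces ε̂ and ε̃ from the covariance matrix shown in part (g); the variable names are mine.

import numpy as np

# Covariance matrix of the a = 4 conditions, from part (g)
E = np.array([[2.8, 2.0, 1.8, 3.2],
              [2.0, 4.4, 2.6, 3.6],
              [1.8, 2.6, 2.8, 3.0],
              [3.2, 3.6, 3.0, 4.8]])
a = E.shape[0]
n = 5

mean_diag = np.trace(E) / a        # 3.70
mean_all = E.mean()                # 2.95
row_means = E.mean(axis=1)         # 2.45, 3.15, 2.55, 3.65

# Greenhouse-Geisser epsilon-hat
num = a**2 * (mean_diag - mean_all)**2
den = (a - 1) * (np.sum(E**2) - 2 * a * np.sum(row_means**2) + a**2 * mean_all**2)
eps_hat = num / den

# Huynh-Feldt epsilon-tilde (Equation 34), truncated at 1.0
eps_tilde = min(1.0, (n * (a - 1) * eps_hat - 2)
                     / ((a - 1) * (n - 1 - (a - 1) * eps_hat)))

print(round(eps_hat, 4), round(eps_tilde, 2))   # about 0.8065 and 1.0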
4. To calculate SSA and SSA × S, it is helpful to represent the data as in Table 11.3, adding row and column
marginal means, in which case we have the following values.
Subject            1        2      Marginal Mean
1                  8       10         9.00
2                  3        6         4.50
3                 12       13        12.50
4                  5        9         7.00
5                  7        8         7.50
6                 13       14        13.50
Marginal Mean   8.00    10.00         9.00
Then

SS_A = Σ_{j=1}^{a} n(Ȳ_⋅j − Ȳ_⋅⋅)² = 6[(8 − 9)² + (10 − 9)²] = 12.00,

SS_A×S = Σ_{i=1}^{n} Σ_{j=1}^{a} (Y_ij − Ȳ_⋅j − Ȳ_i⋅ + Ȳ_⋅⋅)² = 4.00,

and

F = [SS_A/(a − 1)] / [SS_A×S/((n − 1)(a − 1))] = (12/1) / (4/((5)(1))) = 15.00.
Subject           1     2     3    Marginal Mean
1                10    12    14        12
2                 2     5     5         4
3                 5     6    10         7
4                12    15    18        15
5                16    17    18        17
Marginal Mean     9    11    13        11
SS_A = Σ_{j=1}^{a} n(Ȳ_⋅j − Ȳ_⋅⋅)² = 5[(9 − 11)² + (11 − 11)² + (13 − 11)²] = 40,

SS_S = Σ_{i=1}^{n} a(Ȳ_i⋅ − Ȳ_⋅⋅)² = 3[(12 − 11)² + (4 − 11)² + (7 − 11)² + (15 − 11)² + (17 − 11)²] = 354,

SS_A×S = Σ_{i=1}^{n} Σ_{j=1}^{a} (Y_ij − Ȳ_⋅j − Ȳ_i⋅ + Ȳ_⋅⋅)² = 8,

and

F = [SS_A/(a − 1)] / [SS_A×S/((n − 1)(a − 1))] = (40/2) / (8/((4)(2))) = 20.00.
The critical F value at α = .05 with 2 numerator and 8 denominator degrees of freedom is 4.46. Thus,
the null hypothesis can be rejected.
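A brief Python sketch (my addition, assuming the 5 × 3 data table above) reproduces the sums of squares, the within-subjects F of 20.00, and the between-subjects F of 0.66 discussed in the parts that follow.

import numpy as np

# Scores for 5 subjects (rows) in 3 conditions (columns), from the table above
Y = np.array([[10, 12, 14],
              [ 2,  5,  5],
              [ 5,  6, 10],
              [12, 15, 18],
              [16, 17, 18]], dtype=float)
n, a = Y.shape
grand = Y.mean()
col_means = Y.mean(axis=0)
row_means = Y.mean(axis=1)

ss_a = n * np.sum((col_means - grand)**2)                          # 40
ss_s = a * np.sum((row_means - grand)**2)                          # 354
ss_axs = np.sum((Y - col_means - row_means[:, None] + grand)**2)   # 8

f_within = (ss_a / (a - 1)) / (ss_axs / ((n - 1) * (a - 1)))       # 20.0
f_between = (ss_a / (a - 1)) / ((ss_s + ss_axs) / (n * a - a))     # about 0.66
print(ss_a, ss_s, ss_axs, round(f_within, 2), round(f_between, 2))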
c. If these data came from a between-subject design, the between-group sum of squares would be cal-
culated as
SS_A = n Σ_{j=1}^{a} (Ȳ_j − Ȳ)².    (3.51)
First, notice that the main effect sum of squares SSA is the same whether the data come from a within-
subjects design or from a between-subjects design. However, the within-group sum of squares SSW
in the between-subjects design equals the sum of SSS and SSA × S in the within-subjects design. In
general, it is true that
SS W = SSS + SS A×S.
362 = 354 + 8.
F = [SS_A/(a − 1)] / [SS_W/(N − a)] = (40/2) / (362/12) = 0.66.
The critical F value would be 3.89 (see Appendix Table 2, for 2 numerator degrees of freedom, 12
denominator degrees of freedom, and α = .05), so the null hypothesis could not be rejected.
e. The consistent individual differences among subjects are captured by SSS in the within-subjects
design. This source of variance does not contribute to the error term, as it would in a between-subjects
design. As a result, the within-subjects design provides appreciably greater power than the between-
subjects design, when large individual differences exist. Notice that, in this numerical example, the
observed F value of 20.00 in the within-subjects design was drastically reduced to a mere 0.66 in the
between-subjects design.
6. a. The new scores on the adjusted dependent variable Yij – Yi⋅ are as follows.
30 36 42 48
–1 –13 1 13
–17 –3 7 13
–8 3 2 3
–6 –5 2 9
–3 4 4 –5
9 6 –5 –10
2 1 –4 1
–7 –13 4 16
–10 10 6 –6
–5 –1 2 4
–4 5 –4 3
–10 –6 9 7
Mean (Ȳ_j)                      –5.00    –1.00     2.00     4.00
Σ_{i=1}^{n} (Y_ij − Ȳ_j)²      474.00   584.00   220.00   728.00
The observed F value for a one-way, between-subjects ANOVA on these data would be
F = [(E_R − E_F)/(df_R − df_F)] / (E_F/df_F).
In this design,
E_R − E_F = n Σ_{j=1}^{a} (Ȳ_j − Ȳ)²    (3.51)
          = 12[(−5 − 0)² + (−1 − 0)² + (2 − 0)² + (4 − 0)²]
          = 552,

E_F = Σ_{j=1}^{a} Σ_{i=1}^{n} (Y_ij − Ȳ_j)² = 474 + 584 + 220 + 728 = 2006,
df R = N − 1,
df F = N − a,
so
df R − df F = a −1.
df F = 48 − 4
= 44,
df R − df F = 4 − 1
= 3.
F = (552/3) / (2006/44) = 4.04.
With 3 numerator and 44 denominator degrees of freedom, the critical F value at α = .05 would be
approximately 2.84 (see Appendix Table 2), so the null hypothesis would be rejected.
b. The F value of 4.04 in part (a) is larger than the F value of 3.03 obtained from Equation 28 in the
text. However, notice from Tables 11.8 and 11.9 that, when we regarded these data as coming from
a within-subjects design, the error sums of squares were E_F = 2006 and E_R = 2558, so

E_R − E_F = 552.
Thus, EF and ER – EF as calculated in the within-subjects design are identical to the value obtained
in part (a). However, in the within-subjects design, the degrees of freedom for the restricted and full
models are
df R = n(a − 1)
= 12(4 − 1)
= 36,
df F = (n − 1)(a − 1)
= (12 − 1)(4 − 1)
= 33.
Thus, dfR – dfF = 3 in both approaches, but dfF = 33 for the within-subjects design, whereas dfF = 44
in part (a). We can resolve this apparent inconsistency by realizing that the first step in part (a) was
to subtract each subject’s row marginal mean from each original score. In effect, we have calculated
a new dependent variable of the form
Yij − Yi⋅ ,
which equals
Yij − πˆ i .
However, there are n – 1 independent πi parameters, so we must increase the number of estimated
parameters by n – 1. In part (a), we had said
df R = N − 1,
df F = N − a.
However, if we count the n – 1 additional independent πi parameters we estimated, the new degrees
of freedom become
df R = N − 1 − (n − 1)
= N −n
= an − n
= n(a − 1),
df F = N − a − (n − 1)
= N − a − n +1
= an − a − n + 1
= (a − 1)(n − 1).
df R = n(a − 1)
= 12(4 − 1)
= 36,
df F = (a − 1)(n − 1)
= (4 − 1)(12 − 1)
= 33.
As a result, dfR – dfF = 3 and dfF = 33. Applying these adjusted degrees of freedom in part (a) would
have given us
F = (552/3) / (2006/33) = 3.03
in agreement with the F value calculated in the within-subjects design. Thus, the within-subjects
ANOVA is identical to a between-subjects ANOVA on Yij – Yi once we make the proper adjustment
in degrees of freedom.
c. The answers to parts (a) and (b) show that the within-subjects ANOVA can be duplicated by per-
forming a between-subjects ANOVA on Yij – Yi .. However, by subtracting Yi. from each score, the
new scores treat each subject’s average score as a baseline. Each new score reflects a subject’s per-
formance at level j compared to his or her average performance. In this sense, each subject serves as
his or her own control.
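As an illustration (my addition, not from the text), the following Python sketch reproduces both F values from the column summaries of the adjusted scores in Problem 6, first with the naive between-subjects degrees of freedom and then with the corrected degrees of freedom.

import numpy as np

# Column means and within-column sums of squares of the adjusted scores
# (Y_ij minus each subject's mean), from the table in Problem 6a
col_means = np.array([-5.0, -1.0, 2.0, 4.0])
col_ss = np.array([474.0, 584.0, 220.0, 728.0])
n, a = 12, 4
N = n * a

er_minus_ef = n * np.sum((col_means - col_means.mean())**2)   # 552
ef = col_ss.sum()                                             # 2006

# Naive between-subjects df: treats the 48 centered scores as independent
f_naive = (er_minus_ef / (a - 1)) / (ef / (N - a))                # about 4.04
# Corrected df: charge the n - 1 estimated subject effects
f_within = (er_minus_ef / (a - 1)) / (ef / ((n - 1) * (a - 1)))   # about 3.03
print(round(f_naive, 2), round(f_within, 2))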
9. a. The theoretical minimum value of ε is 1/(a – 1). For a = 3, this minimum value is .50.
b. Minimum ε = 1/(a – 1)
= 1/(4 − 1)
= .33
c. Minimum ε = 1/(a – 1)
= 1/(5 – 1)
= .25
d. Minimum ε = 1/(a – 1)
= 1/(2 – 1)
= 1.00
11. Unadjusted Adjusted (see 11.33)
a. F.05;2,28 = 3.34 F.05; 1,14 = 4.60
b. F.05; 3,33 = 2.92 (for dfden = 30) F.05; 1,11 = 4.84
c. F.05; 4,60 = 2.53 F.05; 1,15 = 4.54
d. F.05; 1,9 = 5.12 F.05; 1,9 = 5.12
12
6.
Testable Effects     Error Term
A                    MS_A×S
B                    MS_B×S
A × B                MS_A×B×S
Notice that all three of the error terms have the general form MSeffect × S (see Equation 12.5 as well as
12.2, 12.3, and 12.4). The degrees of freedom for the three effects are a – 1 for A, b – 1 for B, and (a – 1)
(b – 1) for A × B. The degrees of freedom for each error term equal n – 1 times the degrees of freedom
of the effect to be tested. In general, the minimum theoretical value of ε for an effect is
Minimum ε = 1/df_effect.
These theoretical minimum values follow from the dimensions of the covariance matrices used to test
these different effects (for example, see Tables 12.8, 12.10, and 12.12).
8. Tests of comparisons using a separate error term always have n – 1 denominator degrees of freedom,
whether the comparison involves marginal means, cell means, or an interaction contrast (see Equations
12.8 and 12.12, and the subsequent discussion of statistical packages). Thus, with 15 subjects, the
denominator degrees of freedom for testing a contrast with a separate error term will equal 14. Thus,
the correct answers here are as follows.
a. 14
b. 14
c. 14
d. 14
12. a. This source table would be appropriate for a 2 × 2 design with 80 subjects, where both factors were
between-subjects. The actual design, however, is a “split-plot” design, where 40 subjects have each
been tested twice.
b. The proper sources, error terms, and degrees of freedom should be as follows (see Tables 12.16 and
12.19).
Source df
Between-Subjects
Mood (A) 1
S/A 38
Within-Subjects
Difficulty (B) 1
Mood × Difficulty 1
Β × S/A 38
Notice that a = 2, b = 2, and N = 40 (see Table 12.16). Thus, the total degrees of freedom sum to 79,
as shown in the student’s table, but the student’s “Within” term fails to distinguish S/A from Β × S/A.
In addition, MS_S/A is the proper error term for testing the A main effect, while MS_B×S/A is the proper
error term for testing the B main effect and the A × B interaction.
c. The sums of squares for the effects shared in common by the two designs will be the same. Thus, the
sums of squares for Mood, Difficulty, and Mood × Difficulty are all correct (presuming, of course,
that they were calculated correctly in the between-subjects design). Further, it is true that SSwithin as
calculated by the student equals the sum of SSS/A and SSB×S/A:
However, it is impossible to tell from the student’s analysis the magnitude of either SSS/A or SSB×S/A
individually. Thus, F values cannot be calculated for any of the effects.
13. a. Mean reaction time scores are as follows.
Younger    Older
   530      560
   470      710
   620      560
   610      690
   600      590
   420      750
   610      730
   650      590
   610      780
   570      670
b. The F statistic for testing the difference between the mean of the younger subjects and the mean of
the older subjects is
F = [SS_between/(a − 1)] / MS_W.
SS_between = n Σ_{j=1}^{a} (Ȳ_j − Ȳ)²    (see 3.51)
           = 10[(569 − 616)² + (663 − 616)²]
           = 44,180,

MS_W = (Σ_{j=1}^{a} s²_j)/a    (see 3.56)
     = (5410.00 + 6734.44)/2
     = 6072.22.
F = (44,180/1) / 6072.22 = 7.28.
The critical value with 1 numerator and 18 denominator degrees of freedom is 4.41 (see Appendix
Table 2), so the Age difference is significant at the .05 level.
c. The two F values are identical (see Table 12.19).
d. Yes. The test of the between-subjects main effect in a “split-plot” design is equivalent to a between-
subjects ANOVA on mean scores (averaged over levels of the within-subjects factor). No sphericity
assumption need be made in purely between-subjects designs, so the F test here does not assume
sphericity.
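A minimal Python sketch (my addition) reproduces the test in part (b) from the subject means listed in part (a); the assignment of the two columns to the younger and older groups follows the table as laid out above.

import numpy as np

# Mean reaction time per subject (averaged over the within-subjects factor), from part (a)
young = np.array([530, 470, 620, 610, 600, 420, 610, 650, 610, 570], dtype=float)
old   = np.array([560, 710, 560, 690, 590, 750, 730, 590, 780, 670], dtype=float)
groups = [young, old]
n, a = 10, 2

grand = np.mean([g.mean() for g in groups])
ss_between = n * sum((g.mean() - grand)**2 for g in groups)   # 44,180
ms_w = np.mean([g.var(ddof=1) for g in groups])               # about 6072.22

F = (ss_between / (a - 1)) / ms_w
print(round(F, 2))                                            # about 7.28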
14. a. The source table that results from this analysis is as follows.
Source                 SS       df        MS          F
Angle               435,090      2     217,545     109.48
Angle × Subjects     75,510     38       1,987.11
Of course, the observed F of 109.48 for the Angle main effect is statistically significant at the .05
level.
b. The F value in part (a), although still large, is only about three-fourths as large as the F value
reported in Table 12.19 for the within-subjects main effect.
c. Both numerator sums of squares equal 435,090.
d. The error sum of squares in part (a) equals 75,510. The value reported in Table 12.19 is 54,420.
The difference in these two values is 21,090, which equals the sum of squares for the Age by Angle
interaction.
e. No. The degrees of freedom for the Angle by Subjects interaction is 38, whereas the degrees of
freedom for Angle by Subjects within Age is 36. The difference equals the degrees of freedom for
the Age by Angle interaction. (It also equals the degrees of freedom for the Angle main effect here,
because there are only two levels of Age.)
f. No. The F value for the within-subjects main effect in a “split-plot” design cannot be obtained
by simply ignoring the between-subjects factor and then performing a one-way, within-subjects
ANOVA. This latter approach does yield the proper numerator sum of squares for the main effect.
However, the denominator sum of squares is not properly calculated with this approach. The reason
is that the denominator sum of squares in the “split-plot” design represents the inconsistency of
subjects across treatments within each group. Ignoring the between-subjects factor, SSA×S represents
inconsistencies across treatments of subjects within groups and between the groups. As we saw in
part (d),
SS A × S = SS B × S/A + SS A × B .
To the extent that the between-subjects and the within-subjects factors interact, SSA×S of the one-way
repeated measures design will overestimate the proper measure of inconsistency for assessing the
main effect of the within-subjects factor. Instead, the Β × S/A effect of the “split-plot” design will
generally provide the proper error term.
15. The following answers can be found in Table 12.20.
a. Yes, sphericity is assumed.
b. No, sphericity is not assumed.
c. Yes, sphericity is assumed.
d. No, sphericity is not assumed.
e. Yes, sphericity is assumed.
f. No, sphericity is not assumed.
20. a. The data are as follows.
Subject     1     2     3    Marginal Mean
1           6     9     3        6.00
2          18     6    12       12.00
3          15     5    12       10.67
4          11     8    14       11.00
5          17     9     9       11.67
6           7     7     7        7.00
F = [SS_A/(a − 1)] / [SS_A×S/((n − 1)(a − 1))],

where

SS_A = Σ_{j=1}^{a} n(Ȳ_⋅j − Ȳ_⋅⋅)²,
SS_A×S = Σ_{i=1}^{n} Σ_{j=1}^{a} (Y_ij − Ȳ_⋅j − Ȳ_i⋅ + Ȳ_⋅⋅)².

Thus,

F = (75.44/2) / (127.89/((5)(2))) = 2.95.
The critical F value with 2 numerator and 10 denominator degrees of freedom is 4.10 (see Appendix
Table 2), so the null hypothesis cannot be rejected at the .05 level.
b. The F value in part (a) is considerably smaller than the F value of 7.46 reported in the chapter.
c. These values are identical.
d. The denominator sum of squares in part (a) equals 127.89, as compared to a value of only 40.44 in
the Latin square analysis.
e. The two denominator sums of squares in part (d) differ by 87.45, which is the same (except for
rounding error) as the sum of squares for the time main effect:

SS_A×S = SS_Time + SS_error(Latin square), that is, 127.89 = 87.45 + 40.44,

or, equivalently,

SS_error(Latin square) = SS_A×S − SS_Time, that is, 40.44 = 127.89 − 87.45.
f. As the last equation for part (e) shows, the sum of squared errors for the Latin square analysis will
be smaller than the sum of squared errors for the ordinary repeated measures design to the extent
that time (that is, sequence) has an effect on subjects’ scores. Indeed, the general purpose of a Latin
square design and analysis is to control for such effects of time. In the numerical example, the
increased statistical power of the Latin square analysis produces a statistically significant treatment
effect that would have gone undetected in an ordinary repeated measures analysis.
13
7. a.
Subject D1 D2 D3
1 3 1 2
2 3 0 4
3 –1 –1 1
4 1 –3 1
5 –1 –2 2
b. In the full model, the predicted score for each subject on a D variable is the mean of that variable.
Thus, here we have
D̂_1i = D̄_1 = 1,
D̂_2i = D̄_2 = −1,
D̂_3i = D̄_3 = 2.

The errors for the full model are therefore

e_1i(F) = D_1i − D̄_1,
e_2i(F) = D_2i − D̄_2,
e_3i(F) = D_3i − D̄_3.
The following table presents the errors, squared errors, and cross products for each subject.
Subject    e_1    e_2    e_3    e²_1    e²_2    e²_3    e_1e_2    e_1e_3    e_2e_3
1            2      2      0      4       4       0        4         0         0
2            2      1      2      4       1       4        2         4         2
3           –2      0     –1      4       0       1        0         2         0
4            0     –2     –1      0       4       1        0         0         2
5           –2     –1      0      4       1       0        2         0         0
Sum          0      0      0     16      10       6        8         6         4
c. In the restricted model, the predicted score for each subject on a D variable is zero. Thus, the error
for a variable is just the score itself:
e1i ( R ) = D1i ,
e2i ( R ) = D2i ,
e3i ( R) = D3i .
The following table presents the errors, squared errors, and cross products for each subject.
Subject    e_1    e_2    e_3    e²_1    e²_2    e²_3    e_1e_2    e_1e_3    e_2e_3
1            3      1      2      9       1       4        3         6         2
2            3      0      4      9       0      16        0        12         0
3           –1     –1      1      1       1       1        1        –1        –1
4            1     –3      1      1       9       1       –3         1        –3
5           –1     –2      2      1       4       4        2        –2        –4
Sum          5     –5     10     21      15      26        3        16        –6
d. To find the determinant of E(F), we first write E(F) in the form of a matrix:

        16    8    6
E(F) =   8   10    4
         6    4    6

so that |E(F)| = 344. Similarly,

        21    3   16
E(R) =   3   15   −6
        16   −6   26

and |E(R)| = 2784,
so
F = [(|E(R)| − |E(F)|)/(a − 1)] / [|E(F)|/(n − a + 1)]    (13.24)
  = [(2784 − 344)/(4 − 1)] / [344/(5 − 4 + 1)]
  = 4.73.
The critical F value with 3 numerator and 2 denominator degrees of freedom is 19.2 (see Appendix
Table 2), so the null hypothesis cannot be rejected at the .05 level.
f. (i) From part (b),
Σ_{i=1}^{n} e²_1i = 16.

(ii) (1 − r²_{e1e2}) Σ_{i=1}^{n} e²_i2 = (1.00 − .40)(10) = 6.

(iii) (1 − R²_{e3⋅e1,e2}) Σ_{i=1}^{n} e²_i3 = (1 − .402778)(6) = 3.58333.
(iv) The value of the determinant |E(F)| equals (except for rounding error) the product of the three
values computed in (i), (ii), and (iii):
|E(F)| = 344 = (16)(6)(3.58333)
       = Σ_{i=1}^{n} e²_1i × (1 − r²_{e1e2}) Σ_{i=1}^{n} e²_i2 × (1 − R²_{e3⋅e1,e2}) Σ_{i=1}^{n} e²_i3 .
The determinant reflects simultaneously the extent to which the full model fails to explain scores
on D1, D2, and D3. Specifically, the determinant equals the product of three sum of squared error
terms:
(a) The sum of squared errors for D1
(b) The unexplained sum of squared errors for D2 predicted from D1
(c) The unexplained sum of squared errors for D3 predicted from D1 and D2
In this way, the determinant takes into account the correlations among D1, D2, and D3, and avoids
overcounting areas of overlap (see Figure 13.1), in arriving at an index of error for the model.
(v) The value of the determinant |E(R)| equals (except for rounding error) the product of the analogous
quantities computed from the errors of the restricted model:

|E(R)| = 2784 = (21)(14.57143)(9.09804).
Thus, the same type of relationship holds for the restricted model as for the full model. As a result,
the determinant serves the same purpose for representing the overall magnitude of error in the
restricted model as it does in the full model.
g. Equation 6 provides an appropriate test statistic for testing a comparison:
F = nD 2 / sD2 . (13.6)
The D3 variable we formed earlier compares Locations 1 and 4. From the table we constructed in
part (a), we can see that D3 = 2 and sD2 = 1.5. Thus, the observed F value is
F = (5)(2) 2 / 1.5
= 13.33.
If this is the only planned comparison to be tested, an appropriate critical F value can be found in
Appendix Table 2. With 1 numerator and 4 denominator degrees of freedom, the critical F value is
7.71, so the mean difference between EEG activity at Locations 1 and 4 is statistically significant at
the .05 level.
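For completeness, here is a Python sketch (my addition, not part of the original solution) that forms E(F) and E(R) from the D scores in part (a) and reproduces the omnibus multivariate F of 4.73; numpy's determinant routine stands in for the hand calculation.

import numpy as np

# D scores for the 5 subjects, from part (a)
D = np.array([[ 3,  1,  2],
              [ 3,  0,  4],
              [-1, -1,  1],
              [ 1, -3,  1],
              [-1, -2,  2]], dtype=float)
n, p = D.shape
a = p + 1                      # number of within-subjects conditions

e_full = D - D.mean(axis=0)    # errors of the full model
E_F = e_full.T @ e_full        # sums of squares and cross products, full model
E_R = D.T @ D                  # restricted model: predicted scores are zero

det_F, det_R = np.linalg.det(E_F), np.linalg.det(E_R)          # 344 and 2784
F = ((det_R - det_F) / (a - 1)) / (det_F / (n - a + 1))
print(round(det_F, 1), round(det_R, 1), round(F, 2))           # 344.0  2784.0  4.73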
8. a. From Equation 24, the test statistic for the omnibus null hypothesis is
F = [(|E(R)| − |E(F)|)/(a − 1)] / [|E(F)|/(n − a + 1)].    (13.24)
We are told that n = 12. The fact that the E(F) and E(R) matrices have two rows and columns implies
that a – 1 = 2—that is, a = 3. Thus, the determinants of |E(F)| and E(R) are
Substituting these values into the formula for the F statistic yields
The critical F value with 2 numerator and 10 denominator degrees of freedom is 4.10 (see Appendix
Table 2), so the null hypothesis can be rejected at the .05 level.
b. Given orthonormal contrasts, the mixed-model F can be written as
F = [(tr(E*(R)) − tr(E*(F)))/(a − 1)] / [tr(E*(F))/((n − 1)(a − 1))].    (13.34)

tr(E*(R)) = 2784 + 1136 = 3920,
tr(E*(F)) = 1584 + 704 = 2288.

Substituting these values, along with a = 3 and n = 12, into Equation 34 yields

F = [(3920 − 2288)/2] / [2288/((11)(2))] = 7.85.
The critical F value for 2 numerator and 22 denominator degrees of freedom is 3.44, so the null
hypothesis is rejected at the .05 level, using the mixed-model approach.
c. The test statistic for testing a single D variable is given by
F = [(E_R − E_F)/(df_R − df_F)] / (E_F/df_F).
df R = n,
df F = n − 1,
so
df R − df F = 1.
Further, ER and EF are the entries in row 1 and column 1 of the E(R) and E(F) matrices, respectively.
For these data,
F = [(2784 − 1584)/1] / (1584/11) = 8.33.
The critical F value for αPC = .05 with 1 numerator and 11 denominator degrees of freedom is 4.84,
so the null hypothesis can be rejected.
14. a. The observed F value using the multivariate approach is 7.19. The associated p value is .010, so the
null hypothesis is rejected at the .05 level.
b. The observed F value using the mixed-model approach is 3.23. The associated p value is .057, so the
null hypothesis cannot be rejected at the .05 level.
c. As discussed at the end of the chapter, the multivariate approach may be more powerful than the
mixed-model approach when the homogeneity assumption is violated. It is possible for the mixed-
model test to be liberal if the null hypothesis is true, and yet the mixed-model test can be less power-
ful than the multivariate test when the null hypothesis is false.
16. a. Equation 6 provides the test statistic for testing this contrast:
F = nD 2 / sD2 . (13.6)
To work this problem by hand, it is necessary to calculate a D score for each subject. For example,
D for subject 1 is
D = .56(2) − .54(4) − .02(7)
= −1.18.
Using the same formula for all 13 subjects yields the following scores:
–1.18, .58, –1.64, –1.06, 0, –.10, –2.72, –.52, –1.18, –1.58, –1.56, –2.82, –.52.
The mean of these 13 scores is D = –1.10, and the estimated population variance is sD2 = 1.003.
Thus, the observed F value is
F = (13)(−1.10) 2 / 1.003
= 15.68.
Thus, the null hypothesis can be rejected for this contrast, as we know it should, since this is the
maximum contrast, and the omnibus null hypothesis was rejected with the multivariate approach.
b. This contrast is essentially a comparison of Time 1 versus Time 2. In fact, we might want to test a
contrast with coefficients of 1, –1, and 0, to enhance interpretability.
c. No. We saw in Problem 12 that the mixed-model omnibus test is nonsignificant for these data. This
result would seem to suggest that it would be fruitless to search for a post hoc contrast to test. In fact,
however, we saw in part (a) that it is possible to find a statistically significant post hoc contrast by
using a separate error term. Thus, we cannot necessarily trust the mixed-model test to inform us as
to whether we should pursue tests of post hoc contrasts if we use a separate error term. However, the
multivariate test will be statistically significant if and only if a significant contrast exists when we
use a separate error term (remember from Problem 14 that the multivariate test was significant for
these data). This agreement (or “coherence”) between the multivariate test and the use of a separate
error term is a major reason for preferring the multivariate approach to the mixed-model approach.
14
5. a. The omnibus effects are the A main effect, the B main effect, and the Α × B interaction.
b. A main effect requires a – 1 D variables, or 2 D variables in this particular design. B main effect requires
b – 1 D variables, or 3 D variables in this particular design. Α × Β interaction requires (a – 1)(b – 1)
D variables, or 6 D variables in this particular design.
c.
General Form of Degrees of Freedom        Degrees of Freedom in This Design
8. a. The appropriate multiple comparison procedure for testing all pairwise comparisons of a within-
subjects factor is the Bonferroni method. With 3 levels of A, there are 3 pairwise comparisons of the
marginal means, so C = 3. The denominator degrees of freedom equal n – 1, or 19. From Appendix
Table 3, the value of the critical Bonferroni F is 6.89.
b. If post hoc complex comparisons were also to be tested, the Roy-Bose procedure would be used, in
which case the critical value would be
Notice that the larger critical value here than for the Bonferroni procedure in part (a) reflects the
greater protection needed for testing complex comparisons.
c. Equation 33 provides the appropriate critical value for testing a post hoc interaction contrast:
CV = (n − 1)(a − 1)(b − 1) F_{αFW; (a−1)(b−1), n−[(a−1)(b−1)]} / (n − [(a − 1)(b − 1)])    (14.33)
   = (20 − 1)(3 − 1)(4 − 1) F_{.05; (3−1)(4−1), 20−[(3−1)(4−1)]} / (20 − [(3 − 1)(4 − 1)])
   = (19)(2)(3) F_{.05; 6,14} / 14
   = (19)(2)(3)(2.85)/14
   = 23.21.
9. a. At first glance, the answer might seem to be “yes,” because the multivariate and mixed-model
approaches can yield the same answer for tests involving 1 numerator degree of freedom. However,
this agreement occurs only when the error term of the mixed-model approach is MSeffect × S . In this
problem, the use of this error term would lead to 14 denominator degrees of freedom (that is, (2 – 1)
times (15 – 1)), the same as the multivariate approach. However, the F value reported by the com-
puter program has 98 denominator degrees of freedom. In all likelihood, the computer has used an
error term of the form
which indeed leads to 98 denominator degrees of freedom. However, this form of error term is not
generally recommended, because it requires a stringent sphericity assumption, even for single degree
of freedom tests (see the discussion of Equation 13 in Chapter 12 for further information). The
important practical point here is that the multivariate test will give a somewhat different result from
the reported result, and the multivariate test is generally to be preferred. In general, then, the mixed-
model test will differ from the multivariate test unless the numerator degrees of freedom equal 1 and
the denominator degrees of freedom equal n – 1.
11. a. They will always be the same, since the A main effect is a between-subjects effect.
b. They will necessarily be the same only when b = 2, because then there is a single D variable, so the
multivariate approach yields a univariate test.
c. The answer here is the same as for part (b). Once again, when b = 2, the multivariate approach yields
a univariate test, and the two approaches yield identical answers.
13. a. The test statistic for the A main effect is given by Equation 40:
F = [Σ_{j=1}^{a} n_j(M̄_j − M̄)²/(a − 1)] / [Σ_{j=1}^{a} Σ_{i=1}^{n_j} (M_ij − M̄_j)²/(N − a)].    (14.40)
For these data, we know that a = 3 and n1 = n2 = n3 = 20. Further, the group means on the M variable are
M 1 = (10 + 12) / 2
= 11
M 2 = (16 + 20) / 2
= 18
M 3 = (16 + 16) / 2
= 16,
so
M = (11 + 18 + 16) / 3
= 15.
Σ_{j=1}^{a} Σ_{i=1}^{n} (M_ij − M̄_j)²/(N − a) = (Σ_{j=1}^{a} s²_j)/a.    (see 3.56)
Making the appropriate substitutions into the formula for the F statistic yields
The critical F value with 2 numerator and 57 denominator degrees of freedom is approximately 3.23
(see Appendix Table 2), so the A main effect is significant at the .05 level.
b. The test statistic for the B main effect is given by Equation 48:
F = N D̄² / [Σ_{j=1}^{a} Σ_{i=1}^{n_j} (D_ij − D̄_j)²/(N − a)].    (14.48)
For these data, N = n_1 + n_2 + n_3 = 20 + 20 + 20 = 60.
D̄ = (D̄_1 + D̄_2 + D̄_3)/3 = [(12 − 10) + (20 − 16) + (16 − 16)]/3 = 2.00.
Σ_{j=1}^{a} Σ_{i=1}^{n} (D_ij − D̄_j)²/(N − a) = (Σ_{j=1}^{a} s²_j)/a    (see 3.56)
                                              = [(6)² + (4)² + (4)²]/3
                                              = 22.67.
F = 60(2.00)²/22.67 = 10.59.
The critical F value with 1 numerator and 57 denominator degrees of freedom is approximately 4.08
(see Appendix Table 2), so the B main effect is significant at the .05 level.
c. The test statistic for the Α × B interaction is given by
F = [Σ_{j=1}^{a} n_j(D̄_j − D̄)²/(a − 1)] / [Σ_{j=1}^{a} Σ_{i=1}^{n_j} (D_ij − D̄_j)²/(N − a)].
For these data, we know that a = 3 and that n1 = n2 = n3 = 20. Further, as we saw in part (b),
D1 = 12 − 10
= 2,
D2 = 20 − 16
= 4,
D3 = 16 − 16
= 0,
so
D = ( 2 + 4 + 0) / 3
= 2.
Σ_{j=1}^{a} Σ_{i=1}^{n} (D_ij − D̄_j)²/(N − a) = (Σ_{j=1}^{a} s²_j)/a    (see 3.56)
                                              = [(6)² + (4)² + (4)²]/3
                                              = 22.67.
Substituting these values into the formula for the F statistic yields
F = {20[(2 − 2)² + (4 − 2)² + (0 − 2)²]/(3 − 1)} / 22.67 = 3.53.
The critical F value with 2 numerator and 57 denominator degrees of freedom is approximately 3.23
(see Appendix Table 2), so the Α × Β interaction is significant at the .05 level.
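The B and A × B tests can be verified with a short Python sketch (my addition), which uses only the group means and standard deviations of D given above; the A main effect works the same way using the corresponding M summaries, which are not reproduced here.

import numpy as np

# Group summaries for the difference scores D (a = 3 groups, n = 20 each), from above
n_j = np.array([20, 20, 20])
D_j = np.array([2.0, 4.0, 0.0])     # group means of D
s_D = np.array([6.0, 4.0, 4.0])     # group standard deviations of D

N, a = n_j.sum(), len(n_j)
mse_D = np.mean(s_D**2)             # pooled error term, about 22.67

D_bar = np.average(D_j, weights=n_j)
F_B = N * D_bar**2 / mse_D                                   # B main effect, about 10.59
F_AB = np.sum(n_j * (D_j - D_bar)**2) / (a - 1) / mse_D      # A x B interaction, about 3.53
print(round(F_B, 2), round(F_AB, 2))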
18. It is necessary to realize several facts in order to arrive at the proper critical value. First, this contrast is a
within-subjects comparison of cell means. As such, the error term can either be based on those particular
cells (see Equation 67), or pooled over levels of the between-subjects factor (see Equation 68). The student
decided to pool over levels of the between-subjects factor, so Equation 68 was used to calculate an observed
F value. Second, the appropriate critical value to accompany Equation 68 is given by Equation 70:
From Problem 17, we know that N = 45, since the student had 15 subjects in each of his 3 groups.
We also know that a = 3 and b = 4. Finally, we know that αFW = .05 here, since he wants to maintain
his alpha level at .05 within this level of A. Making these substitutions into Equation 70 yields
Because the observed F value of 4.13 is less than the critical value of 8.95, the contrast is
nonsignificant.
20. a. The three-way interaction requires the formation of difference variables for the within-subjects fac-
tor. With 4 levels of the within-subjects factor, there will be 3 such D (that is, difference) variables.
b. Suppose that we label the first D variable as D1 and that we represent the score for subject i at level
j of the first between-subjects factor and level k of the second between-subjects factor as D1ijk. Then
the full model can be written as

D1_ijk = μ_1 + α_1j + β_1k + (αβ)_1jk + ε_1ijk.

The three-way interaction is tested by restricting the two-way (αβ)_1jk parameters for each within-
subjects difference variable to be equal to zero. As a consequence, the restricted model for D1 is
given by

D1_ijk = μ_1 + α_1j + β_1k + ε_1ijk.
c. From Table 14.14, the numerator degrees of freedom will equal pdH, where p is the number of depen-
dent variables, and dH is the number of independent restricted parameters per dependent variable.
From part (a), we know that there are three dependent variables, so p = 3. From part (b), the restricted
model omitted the (αβ)1jk parameters. With three levels of one factor and two levels of the other, the
number of independent restricted interaction parameters is (3 – 1)(2 – 1), or 2. Thus, dH = 2. The
numerator degrees of freedom equal (3)(2), or 6. Notice that this is the same value we would obtain
if all three factors were between-subjects, or if all three were within-subjects, or any other combina-
tion. Although the denominator degrees of freedom will depend on the particular design (that is, the
specific combination of between- and within- subjects factors), the numerator degrees of freedom will
be the same regardless of which factors are between-subjects and which are within-subjects.
d. From Table 14.14, we find that the denominator degrees of freedom for the three-way interaction
will equal
df den = mq − .5 pd H + 1,
where m is defined as
m = N − g + d H − .5( p + d H + 1),
and q is defined as
q = √{[(pd_H)² − 4] / (p² + d_H² − 5)}.
We know that p = 3 and dH = 2 (from part (c)), the total number of subjects is N = 60 (that is, 10
subjects for each of the 3 × 2 cells), and the number of distinct groups of subjects is g = 6 (that is,
3 × 2). Making the appropriate substitutions yields
m = N − g + d H − .5( p + d H + 1)
= 60 − 6 + 2 − .5(3 + 2 + 1)
= 53,
q = √{[(pd_H)² − 4] / (p² + d_H² − 5)}
  = √{([(3)(2)]² − 4) / [(3)² + (2)² − 5]}
  = √(32/8)
  = 2.
df den = mq − .5 pd H + 1
= (53)(2) − .5(3)(2) + 1
= 104.
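A small Python helper (mine, not the text's) collects the formulas for m, q, and the denominator degrees of freedom; note the square root in q, which is easy to drop by accident in the hand calculation.

import math

def multivariate_denominator_df(p, d_h, N, g):
    """Denominator df for the multivariate test (see Table 14.14):
    p = number of D variables, d_h = restricted parameters per variable,
    N = total number of subjects, g = number of distinct groups."""
    m = N - g + d_h - 0.5 * (p + d_h + 1)
    q = math.sqrt(((p * d_h)**2 - 4) / (p**2 + d_h**2 - 5))
    return m * q - 0.5 * p * d_h + 1

print(multivariate_denominator_df(p=3, d_h=2, N=60, g=6))   # 104.0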
IV
Mixed-Effects Models
15
An Introduction to
Mixed-Effects Models
Within-Subjects Designs
4. A model whose only random effect is an intercept term is consistent with a belief that if errors of mea-
surement could be eliminated, each individual's trajectory would be parallel to all other individuals'
trajectories. In other words, this model implies that everyone is changing at the same rate, because
there are no true individual differences in slope (or any other aspect of change). When error variances
are assumed to be equal at each time point, this conceptualization also corresponds to the statistical
assumption of compound symmetry, which in turn is a special case of sphericity. Thus, from this per-
spective, the traditional mixed-model analysis methods of Chapters 11 and 12 can be viewed as appro-
priate when it is plausible that each individual is changing at the same rate over time.
8. The text states that when random as well as fixed trends are included in the model, we generally need
two more time points than the order of the highest trend. A linear trend is a first order polynomial,
which implies that we need three time points. As the text explains, two time points would allow us to
estimate an intercept and a slope for each person, but our estimates would perfectly fit the observed
data. While at first glance this might sound like a benefit, the problem is that a straight line will fit any
two data points regardless of the form of the true underlying function. In particular, two data points
would not allow us to estimate the error in our model, which may reflect measurement error and/or
error in our specification of the true functional form. In any case, without an estimate of error, we can-
not test hypotheses and form confidence intervals. Because we are almost always interested in testing
hypotheses and/or forming confidence intervals, we need three time points even for a simple case of
straight-line growth.
11. The multivariate approach usually works well when (a) we have no missing data, (b) all individuals are
measured at the same time points, and (c) sample size is not too small. However, if any of these condi-
tions do not hold, other approaches such as those described in this chapter can offer significant benefits.
In addition, the multilevel framework we introduce in this chapter also provides a foundation for more
complex designs, some of which extend beyond the standard multivariate approaches of Chapters 13
and 14.
13. a. No, the difference in means is nonsignificant, F(3, 44) = .94, which corresponds to a p value of .43.
b. No, once again the difference in means is nonsignificant, F(3, 44) = .94, which again corresponds to
a p value of .43.
c. The results in parts a and b are identical to one another. By excluding any random effects in our
model in part a, we are not allowing for any systematic individual differences between subjects at
all. In other words, we are assuming that obtaining four measurements from one child is no different
from obtaining one measurement from each of four different children. In particular, in our data set,
we are acting as if we had obtained scores from 48 distinct children. Notice, however, that this is
exactly what we would have if the design had been a between-subjects design. Thus, it is not surpris-
ing that the results of parts a and b are identical. In general, including random effects in models for
longitudinal data allows for systematic individual differences between persons.
d. The graph would be exactly the same for all 12 individuals. In other words, the model in part a
assumes that every person’s trajectory is the same (except for error of measurement) as every other
person’s. Instead of having 12 distinct trajectories, this model assumes that there is only 1 trajectory,
and this trajectory is correct for everyone. Notice then that not only are we assuming that everyone
is changing at the same rate but in addition that everyone has the same intercept.
e. This type of graph and thus this type of model rarely seem plausible for longitudinal data in the
behavioral and social sciences. Instead, it is virtually guaranteed that systematic individual differ-
ences exist between persons, which is typically one of our main reasons for adopting a longitudinal
(or within-subjects) design in the first place. Thus, we need to be certain that our statistical models
allow for systematic individual differences. A major strength of multilevel models is that their flex-
ibility allows researchers to choose from a variety of ways of modeling such systematic individual
differences.
16. a. Fitting a model allowing each day to have its own mean but allowing only a random intercept term
yields F(3, 39) = 1.84, p = .16. Thus, the difference between days is nonsignificant at the .05 level.
This is exactly the same F value we obtained using the unadjusted univariate approach to repeated
measures in Chapter 11.
b. When no data are missing and every subject is measured at the same time points, the analysis in part
a is always identical to the unadjusted approach to repeated measures as described in Chapter 11.
c. Fitting this model to our data yields F(3, 26) = .80, p = .51. The observed F from this model is
smaller than the observed F when we allowed only a random intercept term. These two types of
analyses will often compare to one another in this way. When we allow the slope as well as the inter-
cept to differ from subject to subject, we are admitting that there may be another source contributing
to variability of scores at each time point. To the extent that changes really vary across subjects, we
will need stronger evidence to be relatively certain that the changes we observe in our sample will
generalize to an entire population of individuals. From a complementary perspective, we know that
the F value from the random intercept model will tend to be inflated when individuals differ beyond
their intercepts. By also including a random slope term in the model, we reduce the risk that our F
value is inflated.
d. Both the Hotelling-Lawley-McKeon and Hotelling-Lawley-Pillai-Samson statistics yield F(3,11) =
.81, p = .52. This is the same value we obtained using the multivariate approach to repeated measures
as described in Chapter 13. This equivalence will hold as long as there are no missing data and each
person is measured at the same time points.
e. These models differ from one another in how they model the covariances among scores over days.
However, the models all specify the same fixed effects. Thus, we can use information criteria to
help inform the best model for our data. Here we find that the model with a random intercept and
slope has the lowest AIC value—namely, 253.7—and thus on this basis is the preferred model. In
contrast, the AIC for the model specifying an unstructured covariance matrix is 261.5, and the AIC
for the random intercept model is 294.3. The model with a random intercept and slope also has the
lowest BIC value of these three models. Blind adherence to these criteria is not advisable, because
the criteria themselves are prone to sampling error. For this reason, it also makes sense to incorporate
theoretical knowledge into our judgment of which model provides the best fit to our data. Of course,
that is difficult to do in a hypothetical example, but in many longitudinal studies in the behavioral
sciences, it is sensible to expect at least random intercept and slope effects.
f. This model yields F(1, 13) = 1.96, p = .18 for the test that mean weight is changing over days. Notice
that this result agrees, as it must (when the design is balanced and no data are missing), with the test
of the linear trend we obtained from the multivariate approach of Chapter 13. Thus, it is plausible
that the population mean weight is not changing. From the model, our best estimate is that the mean
weight gain per day is .5643 ounces. The estimated variance of the slope random effect is 2.1897,
which implies a standard deviation of 1.48. Thus, our best guess is that slopes have a mean of .56 and
a standard deviation of 1.48. To find an estimated percentage of infants whose weights are increas-
ing, we need to calculate the z score corresponding to a value of 0. In a distribution with a mean of
.56 and a standard deviation of 1.48, a raw score of 0 corresponds to z score of –.38. If we assume a
normal distribution, approximately 65% of scores are above a z score of –.38. Thus, our best estimate
is that just under two-thirds of infants are gaining weight. Of course, in such a small sample, we need
to realize that there is considerable uncertainty associated with this estimate.
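As a hedged illustration of the kind of models described in this problem (not the text's own code or data), the following Python sketch fits a random-intercept model and a random-intercept-plus-slope model with statsmodels' MixedLM; the file name and the column names weight, day, and infant are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per infant per day,
# with columns 'weight', 'day', and 'infant' (not the text's data set)
df = pd.read_csv("infant_weights.csv")

# Random-intercept model: individuals differ only in overall level
m_int = smf.mixedlm("weight ~ day", df, groups=df["infant"]).fit(reml=True)

# Random intercept and slope: individuals may also differ in rate of change
m_slope = smf.mixedlm("weight ~ day", df, groups=df["infant"],
                      re_formula="~day").fit(reml=True)

print(m_int.summary())
print(m_slope.summary())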
16
An Introduction to Mixed-Effects
Models
Nested Designs
2. a. In principle, “plant” should be regarded as a fixed effect when the organizational psychologist’s
research question pertains specifically to the two plants included in the study. In this case, the reason
for statistical inference (e.g., significance tests and confidence intervals) is that the researcher wants
to use the observed sample of workers to infer what would be true for a population of workers. The
key point is that the population in this fixed-effects design would refer to a population of workers
at these two plants. On the other hand, “plant” should be regarded as a random effect when the psy-
chologist wants to make an inference to a broader population of plants—i.e., the desired conclusions
apply not just to the two plants in question, but to a broader representation of plants. In the latter
case, sampling of plants becomes relevant, because of the desire to use the sample of plants actually
included in the study as the basis for inference to a population of plants.
b. No, this is not a good design. Including only one plant in each condition renders it impossible to
estimate variability between plants within each condition. In other words, any difference we observe
between the conditions may in reality reflect a preexisting difference between the plants. Ran-
dom assignment might seem to solve this problem. It would largely solve the problem (at least with
regard to internal validity) if individual workers were randomly assigned to each plant. But more
likely in this situation is that at best the plants themselves are randomly assigned to condition. As a
result, with a total of only two plants, there is essentially only one “subject” per condition. In par-
ticular, recall the F statistic for testing condition effects in a nested design:
F = MS_A / MS_B/A .    (16.1)
If we have only one plant per condition, it is impossible to calculate MS B/A . As a consequence, it is
impossible to test the condition effect with only one plant per condition.
c. All other things being equal, obtaining a larger sample of workers at each plant will increase power
and precision (of course it is important that these additional workers be sampled at random from
the population). However, as we have seen in parts a and b, the critical issue in this design is not
the number of workers, but instead the number of plants. If inference is intended to refer to these
two specific plants, the study can proceed and including additional workers in the sample would be
helpful. However, if inference is intended to refer to a population of plants, increasing the number of
workers sampled from these two plants fails to address the problem. The only solution in this case is
to sample from more than two plants.
7. As of this writing, the following SAS PROC MIXED syntax provides the desired analysis:
proc mixed;
  class room cond;
  model induct = cond cskill cond*cskill / s;
  random int / subject = room(cond);
  estimate 'condition' cond 1 -1;