FPCR Jasa
FPCR Jasa
Regression of a scalar response on signal predictors, such as near-infrared (NIR) spectra of chemical samples, presents a major challenge
when, as is typically the case, the dimension of the signals far exceeds their number. Most solutions to this problem reduce the dimen-
sion of the predictors either by regressing on components [e.g., principal component regression (PCR) and partial least squares (PLS)] or
by smoothing methods, which restrict the coefficient function to the span of a spline basis. This article introduces functional versions of
PCR and PLS, which combine both of the foregoing dimension-reduction approaches. Two versions of functional PCR are developed, both
using B-splines and roughness penalties. The regularized-components version applies such a penalty to the construction of the principal
components (i.e., it uses functional principal components), whereas the regularized-regression version incorporates a penalty in the regres-
sion. For the latter form of functional PCR, the penalty parameter may be selected by generalized cross-validation, restricted maximum
likelihood (REML), or a minimum mean integrated squared error criterion. Proceeding similarly, we develop two versions of functional
PLS. Asymptotic convergence properties of regularized-regression functional PCR are demonstrated. A simulation study and split-sample
validation with several NIR spectroscopy data sets indicate that functional PCR and functional PLS, especially the regularized-regression
versions with REML, offer advantages over existing methods in terms of both estimation of the coefficient function and prediction of future
observations.
KEY WORDS: B-splines; Functional linear model; Linear mixed model; Multivariate calibration; Signal regression; SIMPLS.
984
Reiss and Ogden: Functional PCR and PLS 985
(Sec. 3) and PLS (Sec. 4). Section 5 discusses how smoothing elements of D in descending order. Let UA , DA , and VA be
parameters are selected, and Section 6 gives asymptotic conver- the truncated-at-A versions of the three SVD matrices; that is,
gence results for a version of functional PCR. Section 7 com- UA and VA consist of the first A columns of U and V, whereas
pares the performance of the various methods, and Section 8 DA is the A × A upper left submatrix of D, which we assume to
concludes with some discussion and directions for future re- be nonsingular. The columns of VA are the first A eigenvectors
search. of the covariance matrix XT X, that is, the loadings of the first A
PCs of the signal data. Thus if we restrict ω in (1) to the column
2. BUILDING BLOCKS space of VA or, equivalently, find the unconstrained minimizing
2.1 Penalized B –Spline Expansion ζ ∈ RA of
Marx and Eilers (1999) proposed overcoming the multi- "y − α1 − XVA ζ "2 , (3)
collinearity problem by projecting ω onto a B-spline basis and then our new design matrix XVA has as its columns the first A
adding a roughness penalty to the criterion to be minimized, PCs of the original regressors. The unconstrained minimization
which then becomes of (3) is therefore referred to as A-component PCR.
"y − α1 − XBβ"2 + λβ T PT Pβ. (2) The A-component PCR model that we have described in-
cludes the A PCs with the largest variances—what is sometimes
Here B is an N × K B-spline basis design matrix with (i, j ) called choosing PCs “from the top.” Thus PCs are chosen with-
entry Bj (ti ), where Bj is the j th basis function, ti is the ith in- out regard to how well they predict the response. Such a prac-
dex value or site, P is a full-rank r × K matrix with r ≤ K tice might be justified on the grounds that “fortunately, there
such that β T PT Pβ provides a measure of the roughness of is often a tendency. . . for the components with the largest vari-
the function ω = Bβ, and λ is a parameter controlling the ex- ances to best explain the dependent variables” (Mardia, Kent,
tent to which roughness is penalized (i.e., the higher the value and Bibby 1979). Nevertheless, many authors have been un-
of λ, the smoother the fitted function). Marx and Eilers took P willing to assume that matters will work out so fortuitously and
to be a (K − d) × K differencing matrix Pd such that Pd w have proposed ways to take the responses into account when
gives the dth-order differences of w. The resulting difference choosing which PCs to include in the regression. One such
penalty yields a method that Marx and Eilers termed P -spline method (Massy 1965) includes those PCs that are most highly
signal regression
! (PSR). Alternatively, if P is chosen such that correlated with the response and thus explain the most variation
PT P = ( Bi&& (t)Bj&& (t) dt)1≤i,j ≤K (or at least a discrete approx- therein. Such alternatives to choosing PCs from the top are not
imation to that integral), then β T PT Pβ equals (approximately) considered further here.
the integrated squared second derivative of the weight function
2.3 Partial Least Squares
ω = Bβ. The latter roughness penalty was used by, for exam-
ple, Cardot, Ferraty, and Sarda (2003) and is used herein. We PLS regression may be thought of as a counterpart method
refer to the resulting model as the penalized B-spline expan- to PCR that seeks to improve on the latter. A potential draw-
sion (PBSE). back of PCR is that the regressed-on components are chosen
Whereas PBSE yields consistent estimates of the true coef- based solely on how much of the predictor variance they ex-
ficient function (Cardot et al. 2003), Marx and Eilers (2002) plain, without reference to how well they explain the responses.
reported that their original version of PSR (i.e., PBSE with a PLS, in contrast, seeks components that are most relevant to
difference penalty) yielded coefficient function estimates that predicting the outcome.
were “too enthusiastic, with magnitude too large for success- PLS is often presented as an iterative algorithm that approx-
ful stability.” They consequently added a ridge penalty to their imately decomposes both X and y in terms of latent variables
model to shrink the coefficients toward 0. Although this appears or score vectors ta . The following version of the algorithm was
to improve the performance of PSR, it might be argued that un- given by Goutis and Fearn (1996). Suppose that we have an
like the squared second derivative, this ridge term has no natural n × N predictor matrix X and an n-dimensional response vec-
interpretation as a measure of roughness, and necessitates op- tor y, both of which are centered. Set E0 = X and f0 = y, and
timizing over an additional continuous parameter—something let M+ denote the Moore–Penrose inverse of matrix M. For
that we might wish to avoid, especially with very large data a = 1, . . . , A, let
sets. The methods developed in Sections 3 and 4, in contrast, ETa−1 fa−1
require optimizing over one discrete parameter and one contin- pa = ,
uous parameter. "ETa−1 fa−1 "
ta = Ea−1 pa ,
2.2 Principal Component Regression
Ea = Ea−1 − ta pTa ,
In this section and the following one, we consider the two
major component-selection approaches to minimizing "y − and
α1 − Xω"2 over the N -dimensional weight function ω. The
fa = fa−1 − ta tTa (Ea−1 ETa−1 )+ fa−1 .
first such approach, PCR (Massy 1965), regresses not on the
N columns of X, but rather on a small number of regressors From the loadings pa and scores ta (a = 1, . . . , A), we can
accounting for most of the variability of the signal data. derive weight vectors r1 , . . . , rA of unit length, such that
To give this a more precise formulation, we consider UDVT , cov(y, Xri ) is maximized successively subject to orthogonal-
the singular value decomposition (SVD) of X, with the diagonal ity of the Xri . The PLS solution with A components then
986 Journal of the American Statistical Association, September 2007
minimizes "y − α1 − XRA ζ "2 over ζ ∈ RA [cf. (3)], where The two methods can be thought of as differing in terms of
RA = (r1 , . . . , rA ). The SIMPLS algorithm of de Jong (1993), what the adjective “functional” modifies. In the first method,
which is equivalent to PLS (for univariate y, the only case that the principal components are “functional” in the sense that their
we consider here) but more computationally efficient, derives construction uses a roughness penalty (as in Silverman 1996).
the weight vectors r1 , . . . , rA directly. In the second method, it is the regression that incorporates a
roughness penalty and thus shares the “functional” character
3. FUNCTIONAL PRINCIPAL COMPONENT of Ramsay and Silverman’s (2005, chap. 15) functional regres-
REGRESSION WITH B –SPLINES sion. Hence the two methods correspond to two ways of pars-
ing the term “functional principal component regression.” The
3.1 XB as the Design Matrix; B –Spline Principal
first can be viewed as {functional principal component} regres-
Component Regression
sion; the second, as functional {principal component regres-
Although PCR often alleviates the ill-posed nature of the sion}. In what follows we denote the two methods by FPCRC
original regression problem, it is invariant under permutation and FPCRR , to indicate, respectively, application of a rough-
of the regressors—in other words, it fails to take their ordering ness penalty to the components and to the regression.
into account—and produces a nonsmooth weight function esti-
mate ω̂. A proposal of Cardot et al. (2003) works around these 3.3 FPCRC
difficulties by projecting ω̂ onto a B-spline basis, that is, replac- 3.3.1 Functional Principal Components. The first FPCR
ing it with B(BT B)−1 BT ω̂, where B is as in (2). Whereas this method, FPCRC , incorporates the roughness penalty into the
device does produce an at least minimally smooth weight func- construction of the PCs. We seek to minimize "y − α1 −
tion, it assumes that the eigendecomposition or SVD used for XBṼA ζ "2 , which is the same as criterion (4) except with VA
PCR is trustworthy despite the large number N of predictors. replaced by a new matrix, ṼA , the construction of which we
Yet part of the reason for smoothing is our reluctance to trust now describe.
such a decomposition for large N . Indeed, Cardot et al. (2003) In (4), the columns of VA are the loadings of the first A
found this method somewhat less effective than PBSE. PCs of the regressor data XB. Equivalently, they are the first A
Instead of projecting ω̂ onto a B-spline basis only after PCR eigenvalues of the covariance matrix BT XT XB and thus are the
fitting, we might consider fitting PCR after projection, that is, successive maximizers of vT BT XT XBv/vT v subject to ortho-
with design matrix XB taking the place of X. (Note that the ap- normality. The columns of ṼA , on the other hand, are functional
plication of PCR with design matrix XB makes sense because, PC loading vectors (Silverman 1996; Ramsay and Silverman
assuming the size of the basis to be large, this new design matrix 2005). These are the orthonormal vectors v that successively
again presents an ill-posed problem, and moreover it is readily maximize
seen to have mean-0 columns.) The “B-spline PCR” solution
with A components will minimize vT BT XT XBv
,
vT (I + λPT P)v
"y − α1 − XBVA ζ "2 (4)
where P is chosen so that vT PT Pv measures the roughness of v
over ζ ∈ RA , where VA is now derived from the SVD XB = [cf. (2)].
UDVT . 3.3.2 Choice of λ. In their discussion of functional princi-
pal component analysis, Ramsay and Silverman (2005,
3.2 Two Approaches to Functional Principal sec. 9.3.3) proposed choosing the smoothing parameter λ by
Component Regression cross-validation (CV). Essentially, given A, we minimize the
The degree of smoothing achieved by the foregoing B-spline sum of squared L2 distances between the signals xi (i =
PCR method depends on the richness of the B-spline basis and 1, . . . , n) and the projections of each signal on the first A func-
thus may be quite minimal. We would prefer a method that lets tional PCs that would be formed if the signal matrix X were
the data choose the level of smoothing through an appropriate replaced with X(−i) , the (n − 1) × N matrix of all signals ex-
roughness penalty. This goal accords with the functional data cept the ith.
analysis paradigm of Ramsay and Silverman (2005), and thus The foregoing CV procedure was proposed in the context of
such a method might be called functional PCR (FPCR). Such an functional principal component analysis, and accordingly does
approach—first projecting onto a B-spline basis, then smooth- not relate to prediction of external variables. For principal com-
ing further through truncation by PCs and the use of a roughness ponent regression, we might instead choose λ based on how
penalty—can be carried out in at least two ways: well the PCs that it determines predict the outcome of our re-
gression model. " To do this, we would choose λ to minimize the
1. Apply a roughness penalty to the PCs themselves, then CV criterion ni=1 (yi − ŷ(i) )2 , where ŷ(i) is the predicted value
choose an appropriate number of these functional princi- for the ith response, based on a model constructed with all but
pal components (Silverman 1996; Ramsay and Silverman the ith observation. Our implementation uses the more compu-
2005) for the weight function. tationally efficient m-fold CV, in which the data are divided into
2. Add a roughness penalty to criterion (4) and minimize the m equal parts, each of which takes a turn serving as a validation
resulting penalized sum of squares. set for the model derived from the rest of the data.
Reiss and Ogden: Functional PCR and PLS 987
methodology for mixed models. In particular, we can estimate estimate of ω. Thus it seems appropriate to plug the minimally
λ through restricted maximum likelihood (REML) estimation smooth estimate ω̂0 into (8). An explicit equality defining the
of the variance parameters. Moreover, because the variance ma- plug-in value of λ was given by Reiss (2006).
trix of the random-effect coefficient vector u is a multiple of the
6. ASYMPTOTIC RESULTS FOR FPCRR
identity, this estimation can be carried out with standard mixed-
model software; compare model 3 of Wang (1998), which sim- To derive asymptotic results for FPCRR , we begin with sig-
ilarly reduces the curve fitting problem to a mixed model with nals x∗i = (xi1 , . . . , xiN )T , i = 1, 2, . . . , which are iid ran-
dom vectors with E(x1j ∗ ) = 0 and E(x ∗4 ) < ∞ for each
variance proportional to I. 1j
If n ≥ K, then, as noted in Section 3.4, PBSE can be seen j = 1, . . . , N . Throughout this section, we assume that the
as FPCRR with all K components. Thus the foregoing mixed- outcomes are generated by the model y = α1 + X∗ Bβ + ε,
model formulation can be used to find λ for PBSE. (See Reiss where X∗ = (x∗1 , . . . , x∗n )T ; ε is a vector of iid errors with
2006 for a mixed-model formulation for PBSE that remains mean 0 and finite variance, independent of X∗ ; and B is a
valid even for n < K.) Finally, by considering the SVD of fixed N × K B-spline basis matrix. For estimation, we replace
RTA PT PRA rather than that of VTA PT PVA , we can use the same X∗ with X = (I − n−1 11T )X∗ , which has mean-0 columns as
argument to find λ for FPLS by REML estimation. before. For given n, λ, and A, the FPCRR estimate of β is
β̂ n = BVA (VTA BT XT XBVA + λVTA PT PVA )−1 VTA BT XT y. In
5.4 The Minimum Mean Integrated Squared Error this section we use the notation λn to emphasize that λ may
Value of λ vary with n.
Let V∗A be the K × A matrix whose columns are the eigen-
For PBSE and FPCRR , we can alternatively choose λ by
expressing the mean integrated squared error (MISE) of ω̂, vectors of E(BT x1 xT1 B) corresponding to its leading eigenval-
"N ues ξ1 > · · · > ξA > 0 (i.e., we assume the first A eigenvalues
E[( ω̂ − ωi )2 ], as a function of λ and finding the min-
to be distinct and positive). Then V∗A can be seen as a popu-
i=1 i
imizer of this function. The following proposition, proved by
lation version of VA . Let %A = diag(ξ1 , . . . , ξA ). Some theory
Reiss (2006), gives a general formula encompassing PBSE and
on eigenvalues and eigenvectors of random matrices leads to
FPCRR as special cases.
the following result. Detailed proofs of this and the two theo-
Proposition 1. Suppose that y − ȳ1 ∈ Rn has mean Xω and rems that follow are available on the American Statistical As-
variance σ 2 I, where 1T X = 0, and that ω is estimated by sociation website at [Link]
supplemental_materials.
ω̂ = ω̂(λ) = GT−1
λ G X y,
T T
(7)
Theorem 1. Suppose that
where G is an N × A matrix of rank A < N and Tλ =
GT XT XG + λQ for some A × A symmetric penalty matrix Q, β = V∗A ζ for some ζ = (ζ1 , . . . , ζA )T . (9)
with G and Q not depending on y. Then the MISE of ω̂ attains If β̂ n denotes an A-component FPCRR estimate for which λn
a critical point at any value of λ satisfying d
is chosen to be oP (n1/2 ), then n1/2 (β̂ n − β) → Z1 + Z2 , where
−1
ωT [(I − WTλ )(Wλ − W2λ )]ω = σ 2 tr[GT−1 2
λ G (Wλ − Wλ )],
T Z1 ∼ NK (0, σ 2 V∗A %A V∗T A ) and Z2 ∼ NK (0, W) for a K × K
matrix W not depending on σ 2 .
(8)
The variance of Z1 in Theorem 1, σ 2 V∗A %−1 ∗T
A VA , is what
where Wλ = GT−1
λ G X X.
T T
the asymptotic variance of n1/2 (β̂ n − β) would be if we could
The critical point referred to in Proposition 1 is not necessar- substitute the population-based V∗A for the sample-based VA
ily the global minimum. Ordinarily, however, we would expect in calculating the estimate. The added variance associated with
the MISE to be either a U-shaped function of λ (in which case Z2 represents the price of having to use VA rather than V∗A .
the proposition allows us to find the global minimum) or an Note that because W does not depend on σ 2 , this price be-
increasing function of λ ∈ [0, ∞) (so that it is minimized by comes smaller (in a relative sense) as σ 2 increases. More details
taking λ = 0). about Z2 are given in the proof of Theorem 1.
Proposition 1 applies to PBSE with G = B, Q = PT P, and Under mild assumptions, the λn = oP (n1/2 ) condition im-
to FPCRR with G = BVA , Q = VTA PT PVA . In simulation set- posed in Theorem 1 is met if λn is chosen by GCV or REML;
tings for which the true ω and σ 2 are known, we can find the indeed, a stronger condition then holds.
MISE-minimizing λ by simply solving (8). We can then use Theorem 2. Let λn be the GCV or REML value associ-
this λ to construct an “oracle” estimator of ω against which ated with the A-component FPCRR estimate. Assume that
data-dependent estimators can be compared. V∗T ∗ ∗T
A P PVA is nonsingular and that VA β -= 0. Then there ex-
T
In real data settings, if it is reasonable to assume that ω̂0 , the ists M > 0 such that P (λn > M) → 0 as n → ∞.
estimate derived by setting λ = 0, is close to the true ω, then
Next, consider the choice of the number of components by
Proposition 1 allows us to construct a plug-in estimate of ω by
any multifold CV scheme in which we form Dn divisions of
(a) calculating ω̂0 , (b) estimating the error variance by σ̂02 =
the n observations into training and validation sets of size nt
"y − ȳ1 − Xω̂0 "2 /(n − A − 1), (c) substituting ω̂0 , σ̂02 into (8), and nv = n − nt ; sum (over the Dn divisions) the prediction
and (d) inserting the resulting root λ̂ into (7). Using ω̂0 as a errors obtained by applying each training-set model to the cor-
“stand-in” for the true ω is motivated by the empirical result responding validation set; and choose the number of compo-
that when an estimate of ω and the associated estimate of σ 2 nents yielding the smallest sum. Using Theorem 1 to derive the
are substituted into (8), the resulting λ̂ gives rise to a smoother asymptotic prediction error leads to the following result.
Reiss and Ogden: Functional PCR and PLS 989
Theorem 3. Assume that (9) holds with ζ A -= 0. Suppose that 7.1 Simulation Studies
the number of FPCRR components is chosen by multifold CV,
Both the simulations and the real data validation described
as described in the previous paragraph, and that λn = oP (n1/2 ).
later were based on spectroscopic data sets described by Kali-
If nt , nv → ∞ and Dn = oP [(min{nt , nv })1/2 ], then for any
vas (1997) and publicly available at Phil Hopke’s ftp site
positive integer A1 < A, the A-component model will be cho-
([Link] ). The
sen over the A1 -component model with probability tending to 1
wheat data set consists of NIR spectra of 100 wheat samples,
as n → ∞.
measured in 2-nm intervals from 1,100 nm to 2,500 nm, and two
This result ensures that a “too-small” model will not be cho- response variables: the samples’ moisture content and protein
sen in the limit, but leaves open the possibility of a “larger- content. The gasoline data set consists of spectra of 60 gasoline
than-needed” model (cf. Shao 1993). We would argue that in samples, measured in 2-nm intervals from 900 to 1,700 nm, and
the context of presenting FPCRR as an alternative to the PBSE a response variable, octane number, available for each sample.
model, which is equivalent to FPCRR with all components (see To correct for a baseline shift observed in the wheat spectra
Sec. 5.3), ruling out “too-small” models is the more pressing [Fig. 1(a)], we used the once-differenced spectra [Fig. 1(b)].
concern. As is common practice for PCR/PLS with predictor data (such
Informally, the foregoing theorems say that FPCRR using as signals) with uniform units of measurement, the predictors
GCV or REML produces a consistent estimate of the spline were not scaled to unit variance.
coefficients if enough components are used, and that the latter Each set of simulations was conducted four times, with
condition will be met in the limit if the number of components each of the two data sets and each of two true coefficient
is chosen by multifold CV. functions. The two true coefficient functions were chosen to
represent different degrees of roughness. The first of these
7. COMPARISON OF MODELS
was obtained from the relatively smooth function f1 (t) =
Three sets of models were tested: (a) PBSE models with λ 2 sin(.5πt) + 4 sin(1.5πt) + 5 sin(2.5πt), t ∈ [0, 1], used in
chosen by GCV, with λ chosen by REML and with MISE- the simulations of Cardot et al. (2003), by transforming its do-
minimizing (oracle) λ (the plug-in model was excluded due to main to that of the spectra. The second, more “bumpy” func-
computational problems and very poor performance); (b) PCR tion was obtained
" by appropriately transforming the domain
models: ordinary (unsmoothed) PCR, the unpenalized B-spline of f2 (t) = 4j =1 aj exp[bj (t − cj )2 ], t ∈ [0, 1], a sum of four
PCR of Section 3.1, FPCRC , and FPCRR with λ chosen by Gaussian curves differing significantly from 0 on disjoint do-
GCV, with λ chosen by REML, and with true (oracle) and esti- mains (similar to a function used for simulations in Cardot
mated (plug-in) MISE-minimizing λ; and (c) PLS models: or- 2002). This function was constructed to have two peaks in re-
dinary PLS; unpenalized B-spline PLS, FPLSC , and FPLSR gions where the variance was high for the gasoline signals and
with λ set by GCV and by REML. low for the wheat signals and two troughs in regions where the
(a) (b)
(c) (d)
Figure 1. Wheat spectra and estimated ω with moisture as outcome. (a) Raw wheat spectra; (b) differenced wheat spectra. Plots (c) and (d)
overlay the estimates obtained for the five training data sets by PCR and FPCRR –REML, respectively.
990 Journal of the American Statistical Association, September 2007
reverse was true. This was intended to facilitate comparisons of Table 2. Mean of L2 norm of error (root integrated squared error)
estimation accuracy at wavelengths of high versus low signal in estimating ω
variance.
Smooth function Bumpy function
To test the methods with both high and low signal-to-noise
ratios, two sets of responses, y = Xω + ε, were created in each Wheat Gasoline Wheat Gasoline
simulation by first generating iid standard normal error vectors .9 .6 .9 .6 .9 .6 .9 .6
and then multiplying these by error standard deviations σε cho-
PBSE–GCV .97 2.04 1.07 2.17 2.87 4.34 2.54 3.90
sen so that R 2 = var(Xω)/(var(Xω) + σε2 ) (i.e., the squared
PBSE–REML .65 .72 .36 .59 2.45 3.10 1.17 1.21
multiple correlation coefficient of the true model) would equal PBSE-oracle .42 .67 .33 .48 2.17 2.51 1.06 1.07
.9 and .6.
PCR 1.01 1.16 .81 1.01 .92 1.24 1.00 1.10
For each combination of the three factors (data set, true co-
B-spline PCR .98 1.14 .77 1.27 .91 1.36 1.20 1.63
efficient function, and R 2 ), we carried out 300 simulations FPCRC 1.08 1.43 1.57 2.76 1.71 2.82 2.03 3.36
for each method except FPLSC (by far the slowest method), FPCRR –GCV .80 1.01 .53 .73 1.08 1.50 1.12 1.19
for which 100 simulations were done. Spline-based methods FPCRR –REML .66 .75 .29 .42 .98 1.06 .95 .98
used cubic B-splines with 40 equally spaced internal knots. FPCRR -oracle .85 .86 .54 .65 .66 .72 .92 .94
For PCR- and PLS-type methods, the candidates for number FPCRR -plug-in .94 1.07 .69 1.06 .80 1.03 1.09 1.40
of components were 1–10, 12, and 15–40 at intervals of 5. PLS 1.01 1.18 .78 .94 .92 1.18 .99 1.12
We chose the number of components by eightfold CV. For the B-spline PLS .94 1.04 .74 1.15 .83 1.08 1.18 1.52
plug-in version of FPCRR , however, we simply set the number FPLSC 1.11 1.51 .82 1.46 1.00 1.22 1.36 1.80
of components to the number chosen by unpenalized B-spline FPLSR –GCV .87 1.04 .59 1.05 1.14 1.56 1.24 1.54
PCR, on the assumption that ω̂0 for this number of components FPLSR –REML .68 .74 .31 .44 .97 1.06 .95 .98
should serve as a reasonable surrogate for the true ω. The sim- NOTE: Scaled by the L2 norm of the true coefficient function.
ulations were programmed in R version 2.1.0 (R Development
Core Team 2005).
performers, except with the wheat spectra and smooth ω, for
7.2 Simulation Results which they were bested by PBSE–REML. The latter method
also did very well with the gasoline spectra and smooth ω, but
7.2.1 Prediction. Table 1 presents the average over all sim-
only moderately well with bumpy ω. FPCRR and FPLSR with
ulations of the empirical mean squared error of prediction
λ chosen by GCV did less well than the REML variants but
(1/n)(ω̂ − ω)T XT X(ω̂ − ω) for each method, true ω, data set,
consistently outperformed FPCRC and FPLSC . PCR and PLS
and signal-to-noise ratio. To make the columns comparable,
were always among the six worst nonoracle methods.
each was scaled by the appropriate value of var(Xω). Not sur-
prisingly, both oracle methods performed well. The PBSE ora- 7.2.2 Estimation. Table 2 presents the mean L2 norm of
cle method consistently had the lowest value with smooth ω, but the difference between the true and estimated coefficient func-
did poorly with bumpy ω. The FPCRR oracle method had one tions (mean root integrated squared error), scaled by the L2
1 "M
of the five lowest values in each column. Among the nonoracle norm of the true function, that is, M m=1 ((ω̂ − ω)T (ω̂ −
methods, FPCRR and FPLSR with REML were always the top ω)/(ωT ω))1/2 , where M is the number of simulations. Rela-
tive MISE (the foregoing expression without taking the square than the peaks. In this case, the plug-in version of FPCRR was
root) is sometimes used for comparisons of this type, but we the only method shown whose 90% confidence limits surround
took square roots to reduce the skewness of the distributions. the troughs quite closely.
The PBSE oracle method did very well with the smooth ω, and These figures indicate that all of the methods have difficulty
the FPCRR oracle method was always the best method with with estimation. Sections 8.2 and 8.3 provide more discussion
the bumpy ω. That the latter method did better at estimation focusing on PBSE and FPCRR .
than at prediction (see Table 1) is unsurprising, because it is ex-
pressly designed for optimal estimation. Among the 13 nonor-
7.3 Application to Real Data
acle methods, FPCRR –REML and FPLSR –REML appeared to
do well most consistently; they were always among the best 7.3.1 Split-Sample Validation. Split-sample validation was
three, except for the wheat spectra with the bumpy ω, for which conducted for each of the nonoracle methods with the gasoline
they fell in the middle. As shown in Table 1, PBSE–REML did and wheat signals and associated outcome measures. The in-
very well with the smooth coefficient function but less well with dices of the samples were divided into five sets of equal size
the bumpy one. FPCRC and FPLSC were generally relatively (samples 1, 6, 11, . . . ; samples 2, 7, 12, . . . ; and so on). For
unsuccessful, as were PCR and PLS. each such set V and each method, the sum of squared errors
Figures 2 and 3 display 90% empirical pointwise confidence "
of prediction i∈V (yi − yˆi )2 was calculated based on a model
intervals, based on the 300 simulations with the wheat spec-
fitted with the remaining samples as a training set. This quan-
tra, for four methods: PBSE–REML, ordinary PCR, FPCRR –
tity was then averaged over the five validation sets. The results
REML, and FPCRR with plug-in estimate of λ. These figures
are displayed in Table 3; to facilitate comparisons, each entry is
clearly show the instability of the PCR estimates. Whereas Ta-
bles 1 and 2 show that PBSE–REML fared relatively well for divided by the column minimum.
the wheat spectra with smooth function, the confidence inter- For the gasoline data, the REML versions of FPCRR and
vals (CIs) in Figure 2 do not indicate consistently accurate es- FPLSR had the best results. For prediction of protein content
timation of this function. For the bumpy ω, the other three from the wheat spectra, the top performers were B-spline PLS
methods did a much better job of estimating the bumps than and FPLSR –GCV. The protein values are known to have poor
PBSE–REML. As noted earlier, the bumpy ω has two troughs precision (Centner et al. 2000) and to be less closely related
in regions of high variance and two peaks in regions of low vari- to the spectra compared with the moisture values (Brenchley,
ance for the wheat spectra. Accordingly, as shown in Figure 3, Hörchner, and Kalivas 1997). These properties evidently made
the methods were much more effective at detecting the troughs it difficult to model protein by any of the methods.
Figure 2. Estimates of smooth ω based on simulations with wheat data, R 2 = .9 (—–, true ω; - - - - - -, empirical median; · · · · · ·, pointwise
90% CIs).
992 Journal of the American Statistical Association, September 2007
Figure 3. Estimates of bumpy ω based on simulations with wheat data, R 2 = .9 (—–, true ω; - - - - - -, empirical median; · · · · · ·, pointwise
90% CIs).
7.3.2 Moisture Content of Wheat. Because high moisture Section 7.1—in contrast to the other two validations, for which
content can lead to storage problems for wheat, the ability to FPCRR –REML markedly outperformed PCR—Figure 1 illus-
predict moisture in a wheat sample by spectroscopic meth- trates why FPCRR –REML represents an advance over PCR
ods, as opposed to the much more time-consuming methods of even for the moisture analysis. Figures 1(c) and 1(d) display
traditional “wet” chemistry, is particularly valuable. Whereas overlaid plots of the five training set estimates of ω for standard
FPCRR –REML is seen in Table 3 to have 9% higher prediction PCR [Fig. 1(c)] and for FPCRR –REML [Fig. 1(d)]. FPCRR
error than ordinary PCR for the moisture outcome described in produced estimates of ω that are more stable across training sets
and also more interpretable; a trough around 2,040 nm, which
Table 3. Split-sample validation results
seems to coincide with one of the minor peaks appearing in the
differenced spectra in Figure 1(b), emerges clearly as the coef-
Wheat–Moisture Wheat–Protein Gasoline–Octane ficient function’s most prominent feature. FPCRR may be said
to attain a more parsimonious representation for ω, in that 9-
PBSE–GCV 1.11 1.12
to 12-component models are selected for the five training sets,
PBSE–REML 1.24 2.35 1.12
versus 30- to 40-component models for PCR.
PCR 1 2.02 1.44
B-spline PCR 1.35 1.08 1.40 8. DISCUSSION
FPCRC 1.36 1.30 1.31
FPCRR –GCV 1.29 1.08 1.19 8.1 Regularized Components versus
FPCRR –REML 1.09 1.66 1 Regularized Regression
FPCRR -plug-in 2.66 2.76 1.29 A major goal of this study was to evaluate the relative
PLS 1.09 1.90 1.44 merits of regularized-component versus regularized-regression
B-spline PLS 1.24 1 1.14 versions of FPCR and FPLS. Regularized-component meth-
FPLSC 1.09 1.13 1.16 ods tend to represent ω̂ more parsimoniously in the sense of
FPLSR –GCV 1.27 1 1.17
choosing fewer components. However, this advantage appears
FPLSR –REML 1.11 1.52 1.06
to be offset by several advantages of regularized-regression
NOTE: Each data set was split into five equal subsets; for each, SSE of prediction was methods. The latter are faster, because λ can be chosen with-
computed based on a model fit with the remaining data. The mean SSE values are shown,
expressed as ratios with respect to the column minimum. The PBSE–GCV prediction er- out recourse to the double-cross (see Sec. 5.1). Moreover,
ror for moisture is based on only four training set models, because for one training set, regularized-regression methods offer improved performance for
the GCV criterion chose λ = 0, resulting in a computational singularity. The same error
occurred for all five training sets for PBSE–GCV with the protein data; hence the missing both estimation of ω and prediction of y, especially with λ cho-
entry. sen by REML. Thus, although FPCRC is related to the trun-
Reiss and Ogden: Functional PCR and PLS 993
cated Karhunen–Loève expansion estimator, for which conver- hand, FPCRR /FPLSR failed to consistently improve on PBSE
gence rates have been derived for both prediction error (Cai for smooth ω; evidently, the span of the leading components
and Hall 2006) and estimation error (Hall and Horowitz 2007), was not always sufficiently rich to approximate the smooth
FPCRC appears to not be the most successful method for our function well.
small samples. Figures 4 and 5 provide some insight into the variation in
8.2 Building on the Penalized B –Spline Expansion relative performance of PBSE and FPCRR . These plots com-
pare the true coefficient functions used in the simulations with
Cardot et al. (2003) showed that under reasonable assump- oracle estimates—by FPCRR with various numbers of compo-
tions, the MISE of the PBSE estimator is of order n−2p/(4p+1) nents and by PBSE—derived from a set of random outcomes.
in probability, where p depends on the specific assumptions but Also shown are the projections of the true functions on the span
is bounded above by the degree of the B-splines used. One of of the FPCR components, or on the span of the B-spline basis
the assumptions is that λ grows at a certain rate as n → ∞.
for PBSE. Such a projection represents the most accurate pos-
In our simulations, however, λ often was chosen to be essen-
sible estimate given the subspace of RN to which the estimate
tially infinite, in the sense that the resulting estimator was in-
is restricted.
distinguishable from a straight line. This occurred more often
with REML than with GCV, but the latter method’s tendency to For the bumpy function, Figure 4 shows that the troughs are
undersmooth in some cases caused it to fare less well overall. well estimated with as few as two components because, as men-
Improved methods for the choice of λ (perhaps along the lines tioned earlier, these occur in regions with high variation in the
of Kou and Efron 2002) may help optimize the performance of signals. The peaks, occurring in regions with little variation in
PBSE. the signals, are recovered to some extent only with a large num-
FPCRR /FPLSR with REML performed better than PBSE in ber of components or by PBSE, which comes at the price of less
the simulations for bumpy ω. Intuitively, this may be because accurate estimation of the troughs. But evidently, because the
PBSE with λ = 0 leads to an extremely bumpy function, so that trough regions are more important for prediction, estimating the
if some bumps are real, then the roughness penalty has diffi- troughs well is preferable to estimating both the troughs and the
culty distinguishing these from spurious bumps. On the other peaks somewhat well. On the other hand, Figure 5 shows that a
Figure 4. Estimating the bumpy coefficient function by PBSE and FPCRR . The solid line in each plot represents the true coefficient function;
the dashed line is the projection of this function on the span of the FPCR components (or of the B-spline basis, for PBSE); and the dotted line
is the oracle estimate (i.e., the estimate using the MISE-minimizing value of λ) based on a set of random outcomes generated using the wheat
spectra with R 2 = .9.
994 Journal of the American Statistical Association, September 2007
Figure 5. Estimating the smooth coefficient function by PBSE and FPCRR . See Figure 4 for an explanation.
large number of components is needed to estimate the smooth smooth ω), and correspondingly, FPCRR –REML outperformed
function well, but apparently, exceeding the required number PBSE-oracle. However, for the wheat data with smooth ω,
of components (as PBSE does) necessitates a very conservative FPCRR –REML usually chose a small number of components,
amount of smoothing to counter the risk of overfitting, resulting notwithstanding the large-sample result of Theorem 3. This
in inferior estimation. suboptimal choice may explain why FPCRR –REML did less
Figure 6 shows plots of f (x) = MISE(FPCRR with x compo- well than PBSE-oracle in this case.
nents)/MISE(PBSE-oracle) where λ for FPCRR was chosen The idea that CV-type criteria cannot always be counted on
to give the same degrees of freedom [trace of the hat matrix to choose the optimal number of components is reinforced by
defined in (6)] as the PBSE-oracle fit. (By definition, the ora- some preliminary findings with a positron emission tomogra-
cle choice of λ for FPCRR would be more advantageous for phy (PET) data set. Parsey et al. (2006) measured binding po-
FPCRR , but the equal-degrees-of-freedom choice allows for a tential (BP) of serotonin 1A receptors, using PET studies with
cleaner comparison.) Within each of the subfigures, f is plot- the radioligand [carbonyl-11 C]WAY 100635, in 28 depressed
ted for R 2 = .3, .6, and .9. The shape of f seems to depend subjects and 43 controls. BP is an index of the density of sero-
primarily on ω, secondarily on the data set, and least on R 2 . In tonin receptors, which are believed to play a key role in de-
agreement with Figures 4 and 5, these plots suggest that with pression. It is of interest to use such BP maps as predictors of
bumpy (smooth) ω, FPCRR tends to do best with a small (large) depression-related outcomes, such as the Hamilton depression
number of components. score. Marx and Eilers (2005) have extended their implementa-
tion of PBSE to multidimensional signals or images. With this
8.3 Cross-Validation-Type Criteria May Choose Too
data set, the number of images (n) is much smaller than the
Few Components
number of basis elements (K) needed to capture a reasonable
Although in practice the degrees of freedom are not the level of detail, whereas the PBSE convergence result of Car-
same for FPCRR –REML as for PBSE-oracle, Figure 6 may dot et al. (2003) assumes that K/n → 0. Partly for this reason,
shed some light on the varying performance of FPCRR –REML we expected FPCRR –REML to be more suitable than PBSE–
in the simulations. For three of the four data set/true coef- REML. To test this expectation, we carried out a simulation
ficient function combinations, CV usually chose the number study with outcomes generated using two-dimensional slices
of components well in the aforementioned sense for FPCRR – obtained from 68 of the 71 BP maps along with the true co-
REML (a small number for bumpy ω, and a large number for efficient function described by Reiss (2006). With the number
Reiss and Ogden: Functional PCR and PLS 995
Figure 6. Comparing MISE for PBSE-oracle and FPCRR . The MISE for FPCRR divided by the MISE for PBSE-oracle is plotted as a
function of number of components used for FPCRR , with R 2 = .3 (1), .6 (!), and .9 (P).
of components chosen by GCV rather than by CV, FPCRR re- Cardot, H. (2002), “Local Roughness Penalties for Regression Splines,” Com-
quired only about 25% more computation time than PBSE. putational Statistics, 17, 89–102.
Cardot, H., Ferraty, F., and Sarda, P. (2003), “Spline Estimators for the Func-
The relative performance of signal regression methods de- tional Linear Model,” Statistica Sinica, 13, 571–591.
pends in a nontrivial way on the eigenstructure of the signals Centner, V., Verdú-Andrés, J., Walczak, B., Jouan-Rimbaud, D., Despagne, F.,
Pasti, L., Poppi, R., Massart, D.-L., and de Noord, O. E. (2000), “Comparison
(Cardot et al. 2003; Hall and Horowitz 2007). In view of this, a of Multivariate Calibration Techniques Applied to Experimental NIR Data
key difference between the spectra studied earlier and the PET Sets,” Applied Spectroscopy, 54, 608–623.
images is that for the latter, a much larger number of PCs is de Jong, S. (1993), “SIMPLS: An Alternative Approach to Partial Least Squares
Regression,” Chemometrics and Intelligent Laboratory Systems, 18, 251–263.
needed to account for most of the variation. Thus we would ex- Goutis, C., and Fearn, T. (1996), “Partial Least Squares Regression on Smooth
pect that FPCRR would need a large number of components to Factors,” Journal of the American Statistical Association, 91, 627–632.
work well; but nevertheless, GCV often chose a small number Hall, P., and Horowitz, J. L. (2007), “Methodology and Convergence Rates for
Functional Linear Regression,” The Annals of Statistics, 35, 70–91.
of components. Imposing a minimum of 30 components im- Kalivas, J. H. (1997), “Two Data Sets of Near-Infrared Spectra,” Chemometrics
proved the results. Thus FPCRR had lower prediction error than and Intelligent Laboratory Systems, 37, 255–259.
PBSE in 135 out of 200 simulations, but with the 30-component Kou, S., and Efron, B. (2002), “Smoothers and the Cp , GML, and EE Criteria:
A Geometric Approach,” Journal of the American Statistical Association, 97,
minimum, this number increased to 156. Similarly, FPCRR had 766–782.
lower estimation error than PBSE in 160 of the simulations Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, New
York: Academic Press.
without, and 186 with, the 30-component minimum. Marx, B. D., and Eilers, P. H. C. (1999), “Generalized Linear Regression on
We conclude that FPCRR /FPLSR with REML may often Sampled Signals and Curves: A P –Spline Approach,” Technometrics, 41,
outperform not only other forms of FPCR/FPLS, but also ex- 1–13.
(2002), “Multivariate Calibration Stability: A Comparison of Meth-
isting approaches to signal regression, such as PBSE and un- ods,” Journal of Chemometrics, 16, 129–140.
smoothed PCR/PLS. At the same time, further research is (2005), “Multidimensional Penalized Signal Regression,” Technomet-
needed on the optimal choice of both the number of compo- rics, 47, 13–22.
Massy, W. F. (1965), “Principal Components Regression in Exploratory Statis-
nents and the smoothing parameter. tical Research,” Journal of the American Statistical Association, 60, 234–256.
Parsey, R. V., Oquendo, M. A., Ogden, R. T., Olvet, D. M., Simpson, N., Huang,
[Received October 2005. Revised March 2007.] Y., Van Heertum, R. L., Arango, V., and Mann, J. J. (2006), “Altered Serotonin
1A Binding in Major Depression: A [Carbonyl-C-11]WAY100635 Positron
REFERENCES Emission Tomography Study,” Biological Psychiatry, 59, 106–113.
R Development Core Team (2005), R: A Language and Environment for Sta-
Brenchley, J. M., Hörchner, U., and Kalivas, J. H. (1997), “Wavelength Selec- tistical Computing, Vienna, Austria: R Foundation for Statistical Computing,
tion Characterization for NIR Spectra,” Applied Spectroscopy, 51, 689–699. [Link]
Cai, T. T., and Hall, P. (2006), “Prediction in Functional Linear Regression,” Ramsay, J. O., and Silverman, B. W. (2005), Functional Data Analysis
The Annals of Statistics, 34, 2159–2179. (2nd ed.), New York: Springer-Verlag.
996 Journal of the American Statistical Association, September 2007
Reiss, P. T. (2006), “Regression With Signals and Images as Predictors,” un- Stone, M. (1974), “Cross-Validatory Choice and Assessment of Statistical Pre-
published doctoral dissertation, Columbia University, Dept. of Biostatistics. dictions,” Journal of the Royal Statistical Society, Ser. B, 36, 111–147.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: Society
Cambridge, U.K.: Cambridge University Press. for Industrial and Applied Mathematics.
Shao, J. (1993), “Linear Model Selection by Cross-Validation,” Journal of the Wand, M. P. (1999), “On the Optimal Amount of Smoothing in Penalised Spline
American Statistical Association, 88, 486–494. Regression,” Biometrika, 86, 936–940.
Silverman, B. W. (1996), “Smoothed Functional Principal Components Analy- Wang, Y. (1998), “Smoothing Spline Models With Correlated Random Errors,”
sis by Choice of Norm,” The Annals of Statistics, 24, 1–24. Journal of the American Statistical Association, 93, 341–348.