Key Wavelength Selection via CARS Method
Article info

Article history:
Received 12 February 2009
Received in revised form 18 June 2009
Accepted 18 June 2009
Available online 24 June 2009

Keywords:
Wavelength selection
Monte Carlo
Adaptive reweighted sampling
Model sampling
Near infrared
Multivariate calibration

Abstract

By employing the simple but effective principle ‘survival of the fittest’ on which Darwin’s Evolution Theory is based, a novel strategy for selecting an optimal combination of key wavelengths of multi-component spectral data, named competitive adaptive reweighted sampling (CARS), is developed. Key wavelengths are defined as the wavelengths with large absolute coefficients in a multivariate linear regression model, such as partial least squares (PLS). In the present work, the absolute values of the regression coefficients of the PLS model are used as an index for evaluating the importance of each wavelength. Then, based on the importance level of each wavelength, CARS sequentially selects N subsets of wavelengths from N Monte Carlo (MC) sampling runs in an iterative and competitive manner. In each sampling run, a fixed ratio (e.g. 80%) of samples is first randomly selected to establish a calibration model. Next, based on the regression coefficients, a two-step procedure comprising exponentially decreasing function (EDF) based enforced wavelength selection and adaptive reweighted sampling (ARS) based competitive wavelength selection is adopted to select the key wavelengths. Finally, cross validation (CV) is applied to choose the subset with the lowest root mean square error of CV (RMSECV). The performance of the proposed procedure is evaluated using one simulated dataset together with one near infrared dataset of two properties. The results reveal an outstanding characteristic of CARS: it can usually locate an optimal combination of a few key wavelengths that are interpretable in terms of the chemical property of interest. Additionally, our study shows that better prediction is obtained by CARS when compared to full spectrum PLS modeling, Monte Carlo uninformative variable elimination (MC-UVE) and moving window partial least squares regression (MWPLSR).

© 2009 Elsevier B.V. All rights reserved.
doi:10.1016/[Link].2009.06.046
78 H. Li et al. / Analytica Chimica Acta 648 (2009) 77–84
combination optimization [17,18], simulated annealing (SA) [16,19], and ant colony optimization (ACO) [20] have been applied to select the optimal subset of wavelengths. All these studies suggest that better prediction can be obtained using the selected wavelengths rather than the full spectrum, which is an indication of the importance of wavelength selection. One should note, however, that methods of this kind, being based on optimization, are usually computationally intensive and sensitive to the initial solution.

Besides, a series of more direct methods have been proposed to conduct wavelength selection, such as iterative partial least squares (iPLS) [21], uninformative variable elimination (UVE) [22], Monte Carlo based UVE (MC-UVE) [23,24], moving window partial least squares (MWPLS) [25], successive projection [26,27], Bayesian linear regression (BLR) [28] and so on.

In essence, the reduced model obtained by wavelength selection is much more interpretable, which offers scientific insight into the relationship between the digitized spectra and the property to be investigated, e.g. concentration. The underlying assumption behind wavelength selection may be that the regression model will be biased from the ‘true’ one due to the distortion caused by wavelengths that are irrelevant to the property under investigation. Based on the reports [5–25,28–33], one can conclude that wavelength selection is a key factor for constructing a reliable and interpretable calibration model with good prediction accuracy.

In this study, we present a new strategy, termed competitive adaptive reweighted sampling (CARS), which has the potential to select an optimal combination of the wavelengths existing in the full spectrum, coupled with partial least squares regression, by using the simple but effective principle ‘survival of the fittest’ on which Darwin’s Evolution Theory is based. With applications to one simulated dataset and one real NIR spectral dataset of two properties, CARS proves to be a promising procedure to conduct wavelength selection for building a high performance calibration model. Additionally, it should be pointed out that CARS is not designed for spectral data only. It is a general strategy and thus can be used for variable selection of other kinds of data, such as genomic, proteomic and metabolomic data. Moreover, it can also be coupled with discriminant analysis for biomarker discovery.

2. Theory and algorithms

2.1. Notation

The data matrix X contains m samples in rows and p variables in columns. Vector y with order m × 1 denotes the measured property of interest. The superscript T denotes vector or matrix transpose. When modeling, both X and y are mean-centered.

Suppose the number of MC sampling runs of CARS is set to N. With this setting, CARS will sequentially select N subsets of wavelengths. Briefly speaking, in each sampling run, CARS works in four successive steps: (1) Monte Carlo model sampling; (2) enforced wavelength selection by an exponentially decreasing function (EDF); (3) competitive wavelength selection by adaptive reweighted sampling (ARS); and (4) evaluation of the subset by cross validation [34–37]. CARS will be discussed in detail in the following sections.

2.2. Monte Carlo for model sampling

Like uninformative variable elimination [22,23], in each sampling run of CARS, a PLS model is built using randomly selected samples (usually 80–90% of the calibration set) rather than all the samples in the calibration set. From the point of view of sampling, this process can be regarded as sampling in the model space combined with the Monte Carlo strategy. We intend to select the variables that are of high adaptability regardless of the variation of training samples.

2.3. PLS and weights of variables

PLS is a widely used procedure for modeling the linear relationship between X and y based on latent variables (LVs). Suppose that the scores matrix is denoted by T, which is a linear combination of X with W as combination coefficients [38], and c is the regression coefficient vector of y against T by least squares. Thus we have the following formulas:

$$T = XW \tag{1}$$

$$y = Tc + e = XWc + e = Xb + e \tag{2}$$

where e is the prediction error and $b = Wc = [b_1, b_2, \ldots, b_p]^T$ is the p-dimensional coefficient vector. The absolute value of the ith element in b, denoted $|b_i|$ ($1 \le i \le p$), reflects the ith wavelength’s contribution to y. Thus, it is natural to say that the larger $|b_i|$ is, the more important the ith variable is. For evaluating the importance of each wavelength, we define a normalized weight as:

$$w_i = \frac{|b_i|}{\sum_{i=1}^{p} |b_i|}, \quad i = 1, 2, 3, \ldots, p \tag{3}$$

Note that the weights of the wavelengths eliminated by CARS are manually set to zero so that the weight vector w is always p-dimensional.

2.4. Exponentially decreasing function

Suppose the full spectrum contains p wavelengths and N sampling runs are performed in CARS. As mentioned before, the wavelength selection in CARS consists of two steps. In the first step, EDF is utilized to forcibly remove the wavelengths with relatively small absolute regression coefficients. In the ith sampling run, the ratio of wavelengths to be kept is computed using an EDF defined as:

$$r_i = a e^{-ki} \tag{4}$$

where a and k are two constants determined by the following two conditions: (I) in the first sampling run, all the p wavelengths are taken for modeling, which means that $r_1 = 1$; (II) in the Nth sampling run, only two wavelengths are reserved, such that $r_N = 2/p$. With the two conditions, a and k can be calculated as:

$$a = \left(\frac{p}{2}\right)^{1/(N-1)} \tag{5}$$

$$k = \frac{\ln(p/2)}{N-1} \tag{6}$$

where ln denotes the natural logarithm.

Fig. 1 illustrates an example of EDF. As can be seen clearly, the process of wavelength reduction can be roughly divided into two stages. In the first stage, wavelengths are eliminated rapidly, which amounts to a ‘fast selection’, whereas in the second stage, wavelengths are reduced in a very gentle manner, which is instead called a ‘refined selection’ stage in our study. Therefore, wavelengths of little or no information in a full spectrum can be removed in a stepwise and efficient way because of the advantage of EDF. That is the reason why we choose EDF. Its advantage will be demonstrated by our experiments in the following sections.

2.5. Adaptive reweighted sampling

Following EDF-based enforced wavelength reduction, adaptive reweighted sampling (ARS) is employed in CARS to further eliminate wavelengths in a competitive way. This step mimics the
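The Monte Carlo model sampling of Section 2.2 can be illustrated with a minimal sketch. It only draws the indices of the calibration samples used in one sampling run; the function name `monte_carlo_subset`, the default ratio of 0.8 and the rounding of the subset size are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def monte_carlo_subset(n_samples, ratio=0.8, rng=None):
    """Return sorted indices of a random subset of the calibration samples.

    ratio is the fraction of samples kept for one sampling run
    (the text quotes 80-90%); rounding to an integer is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_pick = max(1, int(round(ratio * n_samples)))
    # Sampling without replacement: each sample is used at most once per run.
    return np.sort(rng.choice(n_samples, size=n_pick, replace=False))

# Hypothetical calibration set of 80 samples, seeded for reproducibility.
idx = monte_carlo_subset(80, ratio=0.8, rng=np.random.default_rng(0))
print(len(idx))
```

A PLS model for the run would then be fitted on the rows `X[idx]` and `y[idx]`; the fitting itself is outside this sketch.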
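As a small illustration of the normalized weight in Eq. (3), the sketch below converts a coefficient vector b into weights and zeroes the weights of eliminated wavelengths so that w stays p-dimensional. The function name `normalized_weights`, the optional `kept` argument and the toy coefficients are hypothetical; how b is obtained from a fitted PLS model is not shown.

```python
import numpy as np

def normalized_weights(b, kept=None):
    """Weights w_i = |b_i| / sum(|b_i|) as in Eq. (3).

    kept: optional indices of wavelengths still in the model; all other
    wavelengths get weight zero, so w always has length p.
    """
    abs_b = np.abs(np.asarray(b, dtype=float))
    if kept is not None:
        mask = np.zeros(abs_b.shape[0], dtype=bool)
        mask[np.asarray(kept, dtype=int)] = True
        abs_b = np.where(mask, abs_b, 0.0)
    total = abs_b.sum()
    if total == 0.0:
        raise ValueError("all coefficients are zero; weights are undefined")
    return abs_b / total

# Toy coefficient vector (illustrative values only).
b = np.array([0.5, -2.0, 0.0, 1.5])
w = normalized_weights(b)                     # all four wavelengths kept
w_kept = normalized_weights(b, kept=[1, 3])   # wavelengths 0 and 2 eliminated
print(w, w_kept)
```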
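The EDF of Eqs. (4)–(6) can be checked numerically with a short sketch. It computes a and k from p and N and confirms that the two boundary conditions (r_1 = 1 and r_N = 2/p) are met. The function names, the example values p = 700 and N = 50, and the rounding of r_i · p to an integer number of wavelengths are assumptions for illustration.

```python
import math

def edf_constants(p, n_runs):
    """Constants a and k of the EDF, Eqs. (5) and (6)."""
    if p < 2 or n_runs < 2:
        raise ValueError("need p >= 2 wavelengths and N >= 2 sampling runs")
    a = (p / 2.0) ** (1.0 / (n_runs - 1))
    k = math.log(p / 2.0) / (n_runs - 1)
    return a, k

def edf_ratio(i, p, n_runs):
    """Ratio of wavelengths kept in the ith sampling run, Eq. (4)."""
    a, k = edf_constants(p, n_runs)
    return a * math.exp(-k * i)

def n_kept(i, p, n_runs):
    """Number of wavelengths kept in run i (rounding is an assumption)."""
    return max(2, round(edf_ratio(i, p, n_runs) * p))

p, n_runs = 700, 50  # hypothetical spectrum size and number of runs
schedule = [n_kept(i, p, n_runs) for i in range(1, n_runs + 1)]
print(schedule[0], schedule[-1])
```

Because the ratio decays exponentially, most wavelengths are dropped in the early runs and only a few in the late runs, which matches the ‘fast selection’ and ‘refined selection’ stages described for the EDF.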
Table 1
The results on the simulated dataset.
Fig. 7. As illustrated, $\alpha_1$ denotes the angle between X1 and y, and $\alpha_2$ denotes the angle between X2 and y. $\beta$ denotes the angle between y and its projection on the space spanned by X1 and X2. $\beta$ is very small. The condition that $\alpha_1$ and $\alpha_2$ both exceed $\beta$ holds in this case.
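The geometry described in the Fig. 7 caption can be reproduced with a toy numerical example. The sketch below treats X1 and X2 as single vectors (an assumption; in the paper they may stand for whole bands) and uses made-up coordinates chosen so that y lies almost inside the plane spanned by X1 and X2. The helper names are hypothetical.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def projection_angle_deg(y, basis):
    """Angle between y and its least-squares projection onto span(basis)."""
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return angle_deg(y, basis @ coef)

# Toy vectors (illustrative only, not data from the paper).
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
y = np.array([1.0, 1.0, 0.05])

alpha1 = angle_deg(x1, y)
alpha2 = angle_deg(x2, y)
beta = projection_angle_deg(y, np.column_stack([x1, x2]))
print(alpha1, alpha2, beta)
```

In this toy case each vector alone is far from y, while their combination reproduces y almost exactly, which is the situation the caption describes.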
for water content, from the point of view of either RMSECV or band assignment to chemical bond. As is known, MWPLS is a procedure which takes a series of size-changing moving windows to identify and select a local wavelength band or several separate local bands in terms of the residuals and the number of latent variables. Thus it can only work well if the meaningful wavelength band exists in a narrow region. But in this case, Band 1 and Band 2 are so far away from each other that their combination cannot be detected by MWPLS. The results show that MWPLS cannot deal with this situation well.

As mentioned before, both MC-UVE and CARS adopt the Monte Carlo strategy to perform wavelength selection. Therefore, it is necessary to run the programmes many times to obtain statistically stable results. In our study, we run the MC-UVE and CARS programmes 500 times each. Both the mean and standard deviation are given in Table 2. The results demonstrate that better prediction is obtained by CARS combined with PLS. Moreover, the numbers of both latent variables and selected wavelengths are significantly lower, which may be seen as support for Occam’s Razor [44,45]. The reason why better prediction can be achieved using fewer wavelengths may be that wavelengths are heavily collinear and hence the model’s variance can be reduced with fewer wavelengths. More interestingly, for each run of CARS, both the wavelengths 1908 and 2108 nm are selected. Therefore, for this data, one can treat 1908 and 2108 nm, which have very large absolute regression coefficients in the calibration model, as the key wavelengths in terms of the selection of CARS.

Fig. 9c shows the regression coefficient path of each wavelength from one execution of CARS with the number of sampling runs set to 100. It can be seen that, in the first sampling run, the absolute value of the regression coefficient of each wavelength is very small. But as the number of sampling runs increases, the coefficients of some wavelengths get larger and larger while others become smaller and smaller. In particular, the coefficients even drop to zero if the corresponding wavelengths are eliminated by CARS because of their incompetence. Thus, the larger the absolute coefficient is, the more likely the corresponding wavelength is to survive. This selection mechanism in CARS is somewhat like ‘survival of the fittest’ in Darwin’s Evolution Theory. Each wavelength can be treated as an individual, and all the other wavelengths are naturally seen as its ‘environment’. Based on this, the CARS algorithm realizes the process of selecting the fittest individual by utilizing the adaptive reweighted sampling technique. As Fig. 9c shows, the coefficients of wavelengths 1908 and 2108 nm grow first slowly, then quickly, and finally reach maximal absolute values above 100 (multiple runs of CARS lead to similar results, data not shown). These two wavelengths thus can be considered to be key wavelengths for this data. The optimal subset chosen by CARS can be extracted from the position denoted by the vertical asterisk line corresponding to the minimal 10-fold RMSECV value.

Further, we also compute the selection frequency of each wavelength by running CARS 500 times. The result is shown in Fig. 10a. From Fig. 10a, one can find that the wavelengths 1908 and 2108 nm are not selected by chance, because both are selected in all 500 runs, which further proves that these two wavelengths are key to the calibration model. Generally, CARS can select an opti-

Fig. 9. Plots a and b show the changing of the number of sampled wavelengths and 10-fold RMSECV values. Plot c records the regression coefficient path of each wavelength. The vertical asterisk line denotes the optimal point where 10-fold CV values achieve the lowest.

Table 2
The results on corn moisture data.

Methods        RMSECV             nLVs      nVAR
PLS (a)        0.0229             10        700
PLS (b)        0.2394             4         15
PLS (c)        0.2747             4         13
PLS (d)        0.0058             4         28
MC-UVE-PLS     0.0032 ± 0.0004    10 ± 0    55 ± 6
MWPLS (e)      0.0383             10        119
CARS-PLS       0.0006 ± 0.0008    3 ± 2     3 ± 3

(a) Results using full spectrum in the range 1100–2498 nm.
(b) Results using the range 1894–1922 nm (Band 1, in Fig. 7).
(c) Results using the range 2098–2122 nm (Band 2, in Fig. 7).
(d) Results using the combination of 1894–1922 and 2098–2122 nm (Band 1 + Band 2, in Fig. 7).
(e) Results from the combination of four regions 1378–1438, 1558–1598, 1828–1868 and 1988–2078 nm.
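The selection-frequency count over repeated CARS runs can be sketched as follows. Each run is represented only by the list of wavelengths it selected; the three toy runs and the function name `selection_frequency` are hypothetical and do not reproduce the paper's 500-run results.

```python
from collections import Counter

def selection_frequency(runs):
    """Count in how many runs each wavelength was selected."""
    counts = Counter()
    for selected in runs:
        # set() guards against a wavelength being counted twice in one run.
        counts.update(set(selected))
    return counts

# Toy output of three repeated runs (wavelengths in nm, illustrative only).
runs = [[1908, 2108], [1908, 2108, 1930], [2108, 1908]]
freq = selection_frequency(runs)

# Wavelengths selected in every run are candidates for key wavelengths.
always_selected = sorted(w for w, c in freq.items() if c == len(runs))
print(always_selected)
```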
Table 3
The results on corn protein data.
Acknowledgements
References