Encrypted Text Analysis
Abstract
Quantitative Structure-Property Relationship (QSPR) models based on molecular descriptors derived from molecular structures have been developed for the prediction of liquid heat capacity at 25 °C using a diverse set of 871 organic compounds. The molecular descriptors used to represent molecular structures include constitutional and topological indices and quantum chemical parameters. Forward stepwise regression and radial basis function neural networks (RBFNNs) were used to construct the QSPR models. The root mean square errors in liquid heat capacity predictions for the training, test and overall data sets are 16.857, 18.744 and 17.141 heat capacity units, respectively. The prediction results are in agreement with the experimental values, but the RBFNN model seems to be better than the stepwise regression method.
QSAR Comb. Sci. 22 (2003) WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim 1611-020X/03/0104-0029 $ 17.50+.50/0 29
X. Yao et al.
Although many reported QSPR methods have been successfully used to predict a diverse set of physicochemical properties, their use in predicting heat capacity is rather limited [9, 10]. Gakh et al. obtained a QSPR model for the heat capacity of alkanes using graph theory descriptors and neural networks with an RMS of 4.04 heat capacity units [9]. Liu et al. developed several QSPR models of alkanes using their molecular electronegative distance vector (MEDV). The RMS for the heat capacity of 134 alkanes was 3.81 heat capacity units [10]. However, their work is limited to a single family of compounds (alkanes). In our previous work we have successfully developed several QSPR models based on RBFNNs for the prediction of physicochemical properties [11–15] and HPLC retention indices of N-benzylideneanilines [16]. The goal of the present study is to extend our previous investigations in order to establish, for the first time, a QSPR model that can predict the liquid heat capacity at 25 °C for a diverse set of organic compounds from their molecular structures alone. MLR and RBFNNs are applied to establish quantitative linear and non-linear relationships, respectively, between heat capacity and molecular descriptors. The data set used in our work is more diverse, and the models developed are more general and practical, than the models reported earlier by other authors.

2 Materials and Methods

2.1 Data set

The data used in this investigation were collected from the literature [8]. The set consists of 871 organic compounds. The data set includes hydrocarbons, fluorocarbons, chlorocarbons, bromocarbons, iodocarbons, alcohols, acids, ketones, aldehydes, ethers, esters, amines, nitriles, etc. A complete list of the compound names and their corresponding experimental liquid heat capacities at 25 °C is given in Table 1. The data were randomly separated into two subsets: a training set of 746 compounds and a test set of 125 compounds. The training set was used to adjust the parameters of the RBFNNs and the test set was used to evaluate the predictive ability of the RBFNNs.

... logical descriptors. Constitutional descriptors are basically related to the number of atoms and bonds in each molecule. Topological descriptors include valence and non-valence molecular connectivity indices calculated from the hydrogen-suppressed formula of the molecule, encoding information about the size, composition and the degree of branching of a molecule. The quantum chemical descriptors include information about binding and formation energies, partial atom charge, dipole moment, and molecular orbital energy levels. After the calculation of molecular descriptors, the forward stepwise regression method was used to select the significant descriptors to develop a QSPR model of the liquid heat capacity.

2.3 Radial basis function neural networks

The theory of RBFNNs and its local modeling ability have been well described in a recent paper by Walczak and Massart [19]. RBFNNs can be described as a three-layer feedforward structure. As presented schematically in Figure 1, the RBFNNs consist of three layers: input layer, hidden layer and output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. The hidden layer of RBFNNs consists of a number of RBF units (nh) and bias (bk). Each hidden layer unit represents a single radial basis function. Each neuron on the hidden layer employs a radial basis function as a non-linear transfer function to operate on the input data. The most often used RBF is a Gaussian function that is characterized by a center position (cj) and width (rj). The RBF functions by measuring the Euclidean distance between the input vector (x) and the radial basis function center (cj) and performs the non-linear transformation in the hidden layer as given below:

$$h_j(x) = \exp\left(-\lVert x - c_j \rVert^2 / r_j^2\right) \qquad (1)$$
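The Gaussian hidden-layer transform of Eq. (1) can be sketched in a few lines. This is only an illustration in Python with NumPy, not the MATLAB routines used in the study; the function name `gaussian_rbf_layer` and the toy inputs are hypothetical, and a single shared width is assumed for all units.

```python
import numpy as np

def gaussian_rbf_layer(X, centers, width):
    """Return h_j(x) = exp(-||x - c_j||^2 / r^2) for every input row and center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance between each input vector and each center.
    diff = X[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # One column per RBF unit, one row per input vector.
    return np.exp(-sq_dist / width ** 2)

# Toy inputs (hypothetical values, not descriptors from the paper).
X = np.array([[0.0, 0.0], [1.0, 0.0]])
centers = np.array([[0.0, 0.0], [0.0, 2.0]])
H = gaussian_rbf_layer(X, centers, width=1.9)
print(H)
```

An input that coincides with a center gives an output of exactly 1, and the output decays toward 0 as the distance grows relative to the width.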
Table 1. The compounds and the predicted results of liquid heat capacities (J/(mol·K))
Number NAME HEATCAP MLR RBFNNs
1 Bromochlorodifluoromethane 127.81 106.55 127.23
2* Bromotrichloromethane 130.65 108.16 130.03
3 Bromotrifluoromethane 143.25 104.50 126.00
4 dibromodifluoromethane 129.94 109.53 129.03
5 dichlorodifluoromethane 121.07 105.12 125.43
6 phosgene 112.03 118.18 117.02
7 trichlorofluoromethane 68.65 108.15 128.95
8 carbontetrachloride 130.72 100.65 126.01
9* tribromomethane 129.94 141.93 136.20
10 chlorodifluoromethane 111.66 100.40 105.96
11 dichlorofluoromethane 107.20 104.37 106.08
12 chloroform 112.49 97.81 99.01
13 bromochloromethane 100.75 113.78 100.37
14 dibromomethane 105.11 129.51 116.57
15 dichloromethane 101.98 95.29 82.05
16* difluoromethane 107.05 101.92 94.33
17 diiodomethane 135.45 145.06 141.58
18 formic acid 98.40 98.60 106.48
19 methylbromide 85.16 119.67 81.05
20 methylchloride 82.33 104.69 61.86
21 methyliodide 82.91 128.28 98.97
22 nitromethane 104.22 111.55 127.14
23* methanol 79.93 123.08 94.04
24 Methylmercaptan 96.39 131.43 92.57
25 Methylamine 114.19 120.29 99.43
26 Bromotrifluoroethylene 135.33 135.29 140.92
27 1,2-dibromotetrafluoroethane 184.03 153.13 179.99
28 Chlorotrifluoroethylene 150.36 135.29 140.92
29 Chloropentafluoroethane 174.61 150.71 173.07
30* 1,2-dichlorotetrafluoroethan 110.13 150.31 174.06
31 1,1,2-trichlorotrifluoroetha 169.44 153.97 177.66
32 Tetrachloroethylene 146.93 134.64 140.24
33 trichloroacetylchloride 168.81 161.40 169.97
34 2-chloro-1,1-difluoroethylen 126.95 140.04 138.28
35 trichloroethylene 123.67 130.81 128.04
36 dichloroacetyl chloride 154.23 152.81 152.72
37* trichloroacetaldehyde 153.22 150.51 154.35
38 pentachloroethane 190.91 146.28 163.41
39 trifluoroacetic acid 170.15 155.99 160.88
40 pentafluoroethane 168.85 143.59 157.87
41 1,1,2,2-tetrabromoethane 168.43 157.56 156.62
42 1,1-dichloroethylene 112.43 122.69 121.29
43 cis-1,2-dichloroethylene 117.88 116.74 110.47
44* trans-1,2-dichloroethylene 118.37 115.54 109.88
45 chloroacetyl chloride 142.92 138.03 137.11
46 dichloroacetaldehyde 134.07 141.23 138.66
47 dichloroacetic acid 172.72 164.08 161.64
48 1,1,1-trichlorofluoroethane 151.42 147.95 154.60
49 1,1,1,2-tetrachloroethane 154.63 144.61 151.58
50 1,1,2,2-tetrachloroethane 159.39 133.99 139.18
51* 1,1,1,2-tetrafluoroethane 141.80 135.99 145.90
52 vinyl bromide 103.13 109.82 106.96
53 vinyl chloride 86.01 106.30 101.32
54 1-chloro-1,1-difluoroethane 132.37 127.27 140.17
55 acetyl chloride 115.02 114.12 121.16
56 chloroacetaldehyde 115.19 124.99 124.32
57 methyl chloroformate 141.03 114.12 121.16
58* 1,1,1-trichloroethane 145.56 137.87 147.19
59 1,1,2-trichloroethane 151.46 131.14 130.57
60 1,1,1-trifluoroethane 139.86 122.51 137.18
61 acetonitrile 91.94 99.29 88.93
62 1,1-dibromoethane 133.61 127.47 128.10
Table 1. (cont.)
In which hj is the notation for the output of the jth RBF unit. For the jth RBF, cj and rj are the center and width respectively. The operation of the output layer is linear, which is given in equation 2:

$$y_k(x) = \sum_{j=1}^{n_h} w_{kj}\, h_j(x) + b_k \qquad (2)$$

Where yk is the kth output unit for the input vector x, wkj is the weight connection between the kth output unit and the jth hidden layer unit and bk is the bias.

From Eq (1) and Eq (2), one can see that designing an RBFNN involves selecting centers and the corresponding number of hidden layer units, and determining width and weights. There are various ways of selecting the centers, such as random subset selection, K-means clustering, orthogonal least squares learning algorithm, RBF-PLS [19], etc. The widths of the radial basis functions can be chosen either as the same for all the units or as a different value for each unit. In this paper, considerations were limited to Gaussian functions with a constant width for all units. A forward subset selection routine [20, 21] was used to select the centers from training set samples. The adjustment of the connection weights between hidden layer and output layer is performed using a least-squares solution after the selection of centers and width of radial basis functions.

The overall performance of RBFNNs is evaluated in terms of root mean squared error (RMS) according to the following equation:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n_s} (y_k - \hat{y}_k)^2}{n_s}} \qquad (3)$$

Where yk is the desired output, ŷk is the actual output of the network, and ns is the number of compounds in the analyzed set.

2.4 Radial Basis Function Neural Networks implementation and computation environment

All calculation programs implementing RBFNNs were written as M-files based on the basic MATLAB scripts for radial basis function neural networks [20, 21]. All scripts were compiled using the MATCOM compiler running under the Redhat Linux 6.0 operating system on a Pentium 266 PC with 128 MB RAM.

3 Results and Discussion

A great number of molecular descriptors, which encode the constitutional, topological and quantum chemical features of the molecules, were calculated to describe the molecular structures. The forward stepwise regression routine implemented in SPSS was used to develop the linear model for the prediction of liquid heat capacity using the calculated molecular descriptors. The best linear model contains six molecular descriptors: one constitutional, three topological and two quantum chemical descriptors. The constitutional descriptor used is the sum of atomic Sanderson electronegativities, which is related to the constitution of a molecule. The topological descriptors used are connectivity index chi-1, connectivity index chi-2, and average connectivity index chi-0. They describe the size, branching and composition of a molecule and relate to the dispersion interaction among molecules. The two quantum chemical descriptors are maximum negative charge and mean absolute charge. These two descriptors are correlated with the electrostatic and hydrogen bonding interactions among molecules. The regression parameters of the best six-descriptor correlation model are gathered in Table 2. This model produced a standard error of 18.36 heat capacity units and a correlation coefficient of 0.986. The RMS (root mean square error) was calculated as 18.827 heat capacity units. Figure 2 shows the predicted vs. experimental liquid heat capacities.

After the establishment of a linear model, an RBFNN was used to develop a non-linear model based on the same subset of descriptors. The RBFNN has six inputs (a set of six molecular descriptors), one output layer unit (liquid heat capacity) and one hidden layer of nh units. Such an RBFNN can be designated as a 6-nh-1 net, indicating the number of units in the input layer, hidden layer and output layer respectively. An RBFNN is completely specified by choosing the following parameters:
* The number nh of radial basis functions
* The center cj and width rj of each radial basis function
* The connection weight wkj between the jth hidden layer unit and the kth output unit

The number of radial basis functions (the hidden layer units) nh greatly influences the performance of an RBFNN. If the number is too low, the network may not calculate a proper estimation of the data. On the other hand, if too many hidden layer units are used, the network tends to overfit the training data. In this paper, the radial basis functions were added one by one and terminated if no performance of the
Table 2. Descriptors, coefficients, standard errors, and t values for the linear model
Chemical meaning                              Descriptor   Coefficient   Std. error   t value
Intercept                                     (Constant)   180.939       17.486       10.348
sum of atomic Sanderson electronegativities   SE           6.509         0.243        26.755
connectivity index chi-1                      X1           33.924        1.876        18.079
maximum negative charge                       QMIN         191.878       14.433       13.294
average connectivity index chi-0              X0A          230.778       18.937       12.187
connectivity index chi-2                      X2           11.013        1.085        10.146
mean absolute charge                          QMEAN        157.903       18.165       8.693
R (correlation coefficient)                   0.986
F value                                       4738.46
Std                                           18.67
Figure 2. The predicted vs. experimental liquid heat capacity using MLR
networks was improved by adding a new basis function. The centers of RBFNNs are determined with the forward subset selection method proposed by Orr [20, 21]. The advantage of this method over other center selection methods is that it determines the number of hidden layer units simultaneously, so there is no need to fix the number of hidden layer units in advance. This method also has a tractable model order selection and goes through a process of selecting a subset of radial basis functions from a large set of candidates (training set samples). The model starts empty; the radial basis functions are added one by one by selecting a candidate from the sample list, followed by testing whether the added center reduces the sum of squared errors compared with other samples, until the sum of squared errors reaches its minimum and no longer changes. This process of adding hidden units and increasing the model complexity is continued until some criterion such as GCV stops decreasing. The selection criterion used here is an approximation of the leave-one-out (LOO) cross-validation method, according to the equation below:

$$\sigma^2_{LOO} = \frac{\hat{y}^{t} P \,(\mathrm{diag}(P))^{-2}\, P\, \hat{y}}{p} \qquad (4)$$

Where ŷ is the output of the network, P is the projection matrix, which can be computed by $P = I_p - ZZ^{t}$ from the outputs matrix Z of hidden layer units and the unit matrix Ip with dimension p, and p is the number of patterns in the training set. The LOO cross-validation method was used to prevent the network from overfitting.
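The forward subset selection described above can be sketched as a greedy loop that keeps adding training samples as centers while a leave-one-out error estimate keeps falling. This is a minimal Python/NumPy illustration under stated assumptions, not Orr's MATLAB routines: the function names, the toy sine data, the appended bias column and the small ridge term are all hypothetical choices, and the LOO estimate is evaluated on the training targets with the projection matrix built from the regularized normal equations.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian hidden-layer outputs, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / width ** 2)

def loo_error(H, y, ridge=1e-8):
    """LOO estimate y' P diag(P)^-2 P y / p (ridge is an added numerical safeguard)."""
    p, m = H.shape
    A = H.T @ H + ridge * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)  # projection matrix
    loo_resid = (P @ y) / np.diag(P)             # leave-one-out residuals
    return float(loo_resid @ loo_resid) / p

def forward_select(X, y, width, max_centers):
    """Greedily add training samples as centers while the LOO error decreases."""
    p = X.shape[0]
    candidates = rbf_design(X, X, width)  # every training sample is a candidate
    bias = np.ones((p, 1))
    chosen, best = [], np.inf
    while len(chosen) < max_centers:
        scores = {
            j: loo_error(np.hstack([candidates[:, chosen + [j]], bias]), y)
            for j in range(p) if j not in chosen
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:  # no further improvement: stop growing
            break
        chosen.append(j_best)
        best = scores[j_best]
    # Final output-layer weights (last entry is the bias) by linear least squares.
    H = np.hstack([candidates[:, chosen], bias])
    weights, *_ = np.linalg.lstsq(H, y, rcond=None)
    return chosen, weights, best

# Toy one-dimensional data (hypothetical, not the heat capacity data set).
X = np.linspace(0.0, 3.0, 30).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
chosen, weights, best = forward_select(X, y, width=1.0, max_centers=10)
pred = np.hstack([rbf_design(X, X[chosen], 1.0), np.ones((30, 1))]) @ weights
rms = float(np.sqrt(np.mean((y - pred) ** 2)))
print(len(chosen), best, rms)
```

Because the number of centers falls out of the stopping rule, only the width has to be scanned from outside, which is the property the text attributes to this selection method.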
Figure 3. The optimization of the width of RBFNNs Figure 4. The predicted vs. experimental liquid heat capacity
using RBFNNs
After the selection of the centers and number of hidden layer units, the connection weights can be easily calculated by linear least squares methods.

$$w = yZ'(ZZ')^{-1} \qquad (5)$$

Where y is the matrix of training example targets, Z is the matrix of hidden layer unit outputs, Z' is the transpose of matrix Z and w is the weight matrix connecting the hidden layer and the output layer.

The optimal width was determined by a number of trials, taking into account the model selection criterion: a width smaller than 1.5 gives poor prediction ability, and varying the width has little effect on the performance of RBFNNs if the width exceeds 3. We therefore selected the optimal width from the range 1.5 to 3 using an increment of 0.1. Each minimum error on LOO cross-validation was plotted versus the width (Figure 3) and the minimum was chosen as the optimal value. Finally, in our study the two parameters r and nh are equal to 1.9 and 19 respectively. The selected centers and their distribution among training samples are listed in Table 3. As can be seen, the selected centers correspond very well with the distribution of training set samples. With the best network, a test was carried out using the test set, and the results obtained are given in Table 1. The plot of predicted values versus experimental data is shown in Figure 4. The network gives an RMS of 17.141 for the whole set, 16.857 for the training set, and 18.744 for the prediction set. These results are generally better than those obtained with the MLR method. Analysis of the obtained
results indicates that the RBFNN model we proposed can represent well the structure-property relationships of these compounds, and that we can use only the parameters calculated from structures as molecular descriptors for predicting the liquid heat capacities of the studied compounds. Compared with previous work reported for alkanes [9, 10], the RMS values are larger in this study; this can be explained by the diversity of the data set used in our investigation, which is not limited to alkanes. The model obtained from this study covers almost all types of molecules. It is a generally applicable model for the prediction of heat capacity. The result is therefore very satisfactory.

Table 3. A full list of centers selected for RBFNNs
No    Name
858   dibutyl sebacate
832   dibutyl phthalate
865   methyl oleate
859   1-cyclopentyltridecane
864   n-tridecylbenzene
146   decafluorobutane
20    methylchloride
82    ethylamine
159   methacrylic acid
110   acetone
831   1-n-hexylnaphthalene
403   2,4-dichlorobenzotrifluoride
69    acetic acid
757   1,3-dimethylnaphthalene
857   oleic acid
97    methyl chloroacetate
85    hexafluoropropylene
469   5-methyl-1-hexanol
301   m-dichlorobenzene

4 Conclusion

QSPR models for the prediction of liquid heat capacity for a diverse set of organic compounds using RBFNNs based on descriptors calculated from molecular structure alone have
been developed. The models proposed could identify and provide some insight into what structural features are related to liquid heat capacities of these compounds. RBFNNs proved to be a useful tool in the prediction of heat capacity. The training procedure is also simple when using RBFNNs because there are fewer parameters to be optimized: the width of the radial basis function and the number of units in the hidden layer. Furthermore, the proposed approach can also be extended to other QSPR/QSAR investigations.

Acknowledgement

The authors thank the Association Franco-Chinoise pour la Recherche Scientifique & Technique (AFCRST) for supporting this study (Programme PRA SI 00-05). Special thanks are given to Prof. Todeschini and other members of the Milano Chemometrics and QSAR Research Group for providing the Dragon package for use in this research. The authors would also like to thank Mark J. Orr for providing his MATLAB routines to the scientific community.

References

[1] Katritzky, A. R., Maran, U., Lobanov, V. S., Karelson, M. Structurally Diverse Quantitative Structure-Property Relationship Correlations of Technologically Relevant Physical Properties. J. Chem. Inf. Comput. Sci. 40, 1–18 (2000).
[2] Katritzky, A. R., Petrukhin, R., Tatham, D., Basak, S., Benfenati, E., Karelson, M., Maran, U. Interpretation of Quantitative Structure-Property and Activity Relationships. J. Chem. Inf. Comput. Sci. 41, 679–685 (2001).
[3] Karelson, M. Molecular Descriptors in QSAR/QSPR; John Wiley & Sons: New York, 2000.
[4] Todeschini, R., Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000.
[5] Devillers, J., Balaban, A. T., Eds. Topological Indices and Related Descriptors in QSAR and QSPR; Gordon and Breach: Amsterdam, The Netherlands, 1999.
[6] Estrada, E., Uriarte, E. Recent advances on the role of topological indices in drug discovery research. Curr. Med. Chem. 8, 1699–1714 (2001).
[7] Karelson, M., Lobanov, V. S., Katritzky, A. R. Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev. 96, 1027–1043 (1996).
[8] Yaws, C. L. Chemical properties handbook; McGraw-Hill: New York, 1999.
[9] Gakh, A. A., Gakh, E. G., Sumpter, B. G., Noid, D. W. Neural network-Graph Theory approach to the prediction of the physical properties of organic compounds. J. Chem. Inf. Comput. Sci. 34, 832–839 (1994).
[10] Liu, S., Cai, S., Cao, C., Li, Z. Molecular electronegative distance vector (MEDV) related to 15 properties of alkanes. J. Chem. Inf. Comput. Sci. 40, 1337–1348 (2000).
[11] Yao, X. J., Zhang, X. Y., Zhang, R. S., Liu, M. C., Hu, Z. D., Fan, B. T. Prediction of enthalpy of alkanes by the use of radial basis function neural network. Comput. Chem. 25, 475–483 (2001).
[12] Yao, X. J., Zhang, X. Y., Zhang, R. S., Liu, M. C., Hu, Z. D., Fan, B. T. Radial basis function neural networks based QSPR for the prediction of critical pressure of substituted benzenes. Comput. Chem. 26, 159–169 (2002).
[13] Yao, X. J., Zhang, X. Y., Zhang, R. S., Liu, M. C., Hu, Z. D., Fan, B. T. Prediction of gas chromatography indices by the use of radial basis function neural network. Talanta 57, 297–306 (2002).
[14] Yao, X. J., Wang, Y., Zhang, X. Y., Zhang, R. S., Liu, M. C., Hu, Z. D., Fan, B. T. Radial basis function neural networks based QSPR for the prediction of critical temperatures. Chemometr. Intell. Lab. Syst. 62, 217–225 (2002).
[15] Yao, X. J., Liu, M. C., Zhang, X. Y., Hu, Z. D., Fan, B. T. Radial basis function neural networks based QSPR for the prediction of Henry's law constant. Anal. Chim. Acta 462, 101–117 (2002).
[16] Xiang, Y. H., Liu, M. C., Zhang, X. Y., Zhang, R. S., Hu, Z. D., Fan, B. T., Doucet, J. P., Panaye, A. Quantitative Prediction of Liquid Chromatography Retention of N-Benzylideneanilines Based on Quantum Chemical Parameters and Radial Basis Function Neural Network. J. Chem. Inf. Comput. Sci. 42, 592–597 (2002).
[17] HyperChem 4.0, Hypercube, Inc., 1994.
[18] Todeschini, R. Dragon software for the calculation of the molecular descriptors, Rel. 1.1 for Windows. Milano, 2000.
[19] Walczak, B., Massart, D. L. Local modeling with radial basis function networks. Chemometr. Intell. Lab. Syst. 50, 179–198 (2000).
[20] Orr, M. J. L. Introduction to Radial basis function networks, Centre for Cognitive Science, Edinburgh University, 1996.
[21] Orr, M. J. L. MATLAB routines for subset selection and ridge regression in linear neural networks, Centre for Cognitive Science, Edinburgh University, 1996.

Received on: June 14, 2002