An Advanced ACO Algorithm for Feature Subset Selection
Shima Kashef, Hossein Nezamabadi-pour
Neurocomputing 147 (2015) 271–279

Article history: Received 25 January 2014; Received in revised form 25 April 2014; Accepted 23 June 2014; Available online 6 July 2014. Communicated by Lijun Tang.

Keywords: Feature selection; Wrapper; Ant colony optimization (ACO); Binary ACO; Classification

Abstract

Feature selection is an important task for data analysis and information retrieval processing, pattern classification systems, and data mining applications. It reduces the number of features by removing noisy, irrelevant and redundant data. In this paper, a novel feature selection algorithm based on Ant Colony Optimization (ACO), called Advanced Binary ACO (ABACO), is presented. Features are treated as graph nodes to construct a graph model and are fully connected to each other. In this graph, each node has two sub-nodes, one for selecting and the other for deselecting the feature. The ant colony algorithm is used to select nodes, while ants must visit all features. The use of several statistical measures is examined as the heuristic function for the visibility of the edges in the graph. At the end of a tour, each ant has a binary vector with the same length as the number of features, where 1 implies selecting and 0 implies deselecting the corresponding feature. The performance of the proposed algorithm is compared to that of the Binary Genetic Algorithm (BGA), Binary Particle Swarm Optimization (BPSO), CatfishBPSO, the Improved Binary Gravitational Search Algorithm (IBGSA), and some prominent ACO-based algorithms on the task of feature selection on 12 well-known UCI datasets. Simulation results verify that the algorithm provides a suitable feature subset with good classification accuracy using a smaller feature set than competing feature selection methods.

http://dx.doi.org/10.1016/j.neucom.2014.06.067
© 2014 Elsevier B.V. All rights reserved.
initially used for solving the Traveling Salesman Problem (TSP) [26,27] and has then been successfully applied to a large number of NP-hard problems such as the Quadratic Assignment Problem (QAP), vehicle routing, system fault detection, scheduling, etc. [28]. In recent years, several ACO-based methods for feature selection have been reported. Al-Ani [25] introduces two terms, "update selection measure (USM)" and "local importance (LI)", in an ACO-based feature selection method. A hybrid of ACO and mutual information has been used for feature selection in a forecaster [29]. Aghdam et al. [30] proposed an ACO-based feature selection algorithm for text categorization. Basiri et al. [10] proposed an ACO algorithm for feature selection when predicting the post-synaptic activity of proteins; later, they hybridized ACO and genetic algorithms to combine the strengths of the two [31]. Vieira [1] presents an algorithm for feature selection based on two cooperative ant colonies, which minimizes two objectives: the number of features and the classification error. Two pheromone matrices and two different heuristics are used for these objectives. Xiong et al. [32] proposed a hybrid feature selection algorithm based on a dynamic ant colony algorithm, with mutual information taken as the heuristic function. Chen et al. proposed a new rough set approach to feature selection based on ACO, which adopts mutual-information-based feature significance as heuristic information [33].

There are two criteria for stopping the search through the space of feature subsets in these methods. Some of the methods ask the user to predefine the number of selected features, while others are based on an evaluation function: an optimal subset according to some evaluation strategy is obtained, and as soon as the stopping criterion is met, the search through the features is stopped [25]. Therefore, ants are only allowed a limited number of steps and cannot see all features. To solve this problem, reference [34] suggested, for the first time, using a binary form of ACO for feature selection, called BACO, in which ants visit all features one by one and are allowed to select each feature or not. Although this method solves the above problem of ACO, there is still a drawback to this approach: in the BACO algorithm, ants traverse a constant sequence of features. Therefore, they can only decide to select or deselect the subsequent feature and are not able to track any desired sequence of unseen features. This limitation reduces the exploration of the search and leads to non-optimal solutions, although the results are much better than those of ACO.

In our previous work [35], we proposed a new ACO-based FS algorithm, called ABACO. This algorithm is an advanced version of binary ant colony optimization, which attempts to solve the problems of the ACO and BACO algorithms by combining the two. It gives ants a comprehensive view of the features and helps them to select the most salient ones. In this paper, we extend this algorithm by adding heuristic desirability. At each step, ants are able to visit all unseen features, but they can decide whether or not to select a feature. This means there are two roads ending at each feature: one for selecting and the other for deselecting that feature. This is the difference between the proposed algorithm and traditional ACO algorithms, and it leads to better results compared to them and to other metaheuristic algorithms such as BGA and BPSO. The k-nearest neighbor (k-NN) classifier performance is used as the evaluation criterion of a feature subset, and the feature pheromone is then computed and updated according to the evaluation results. Experimental results show that the algorithm achieves high classification accuracy and can effectively reduce the number of features.

The paper is organized as follows. Section 2 summarizes the ant colony algorithm and its binary form. Our proposed ABACO is discussed in detail in Section 3. Section 4 presents the results of our experimental studies. Finally, Section 5 concludes the paper with a brief summary and a few remarks.

2. Ant colony optimization

Ant colony optimization is a metaheuristic algorithm inspired by the foraging behavior of real ants. ACO was introduced by Dorigo and his colleagues for the solution of hard combinatorial optimization (CO) problems in the early 1990s [23]. When a food source is found, ants lay some pheromone to mark the path. The quantity of the laid pheromone depends on the distance, quantity and quality of the food source. While an isolated ant moves essentially at random, an ant that encounters a previously laid trail can detect it and decide, with high probability, to follow it, thus reinforcing the trail with its own pheromone. The process is thus characterized by a positive feedback loop: the more ants follow a trail, the more attractive that trail becomes for being followed. This indirect communication between the ants via pheromone trails enables them to find the shortest path between the food source and their nest [24].

Artificial ants, also referred to as agents, imitate their natural counterparts and find optimal solutions to problems. They lay pheromone on the edges of a graph and choose their path with respect to probabilities that depend on the pheromone trails previously laid by other ants. These pheromone trails progressively decrease by evaporation.

Artificial ants also have some extra features that are not found in real ants. Each ant contains an internal memory, used to store its previous actions, and ants may have characteristics such as local search to improve the quality of the computed paths. Depending on the problem for which the algorithm is designed, daemon actions may be introduced into the algorithm to speed up convergence; an example of such an action is to deposit additional pheromone on the states of the global best solutions. Finally, the movement of ants is guided by two factors: the pheromone value and the problem-specific local heuristic [24].

2.1. Ant colony optimization for feature selection

As mentioned earlier, given the original set of size n, the feature selection problem is to find a minimal subset of salient features of size p (p < n) such that the classification accuracy is maximized. The optimization capability of ACO can be used to select features. Each of the original features is treated as a graph node to construct a graph G, and the feature subset is then searched for on this graph. Nodes are fully connected to allow any feature to be selected next.

ACO algorithms are stochastic algorithms that make probabilistic decisions in terms of the artificial pheromone trails (i.e. the history of previous successful moves) and the local heuristic information (expressing the desirability of the move, or visibility of the edge). These two factors are combined to form the so-called probabilistic transition rule

\[
P_{ij}^{k}(t)=\begin{cases}\dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l}\tau_{il}^{\alpha}\,\eta_{il}^{\beta}} & \text{if } j,l \in \text{admissible nodes}\\[4pt] 0 & \text{otherwise}\end{cases}\tag{1}
\]

where P_ij^k(t) denotes the transition probability from feature (node) i to j for the k-th ant at time step t, τ_ij is the amount of pheromone trail on edge (i,j) at time t, η_ij is the heuristic desirability or visibility of edge (i,j), and α and β are two parameters that control the relative importance of the pheromone value versus the heuristic information.

The solution space is initially empty and is expanded by adding a solution component at every probabilistic decision. The transition probability used by ACO is a balance between pheromone intensity (i.e. the history of previous successful moves), τ_ij, and heuristic information (expressing the desirability of the move), η_ij. This effectively balances the exploitation–exploration trade-off. The best balance between exploitation and exploration is achieved through proper selection of the parameters α and β. If α = 0, no pheromone information is used, i.e. previous search experience is neglected. If β = 0, the attractiveness (or visibility) of moves is neglected.
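For illustration, the following minimal Python sketch shows how the transition rule of Eq. (1) can be realized as a roulette-wheel selection over the admissible nodes. It is only a sketch under assumed data structures (the original experiments were run in Matlab); the names pheromone, heuristic and admissible are illustrative.

    import numpy as np

    def choose_next_node(pheromone, heuristic, admissible, alpha=1.0, beta=0.5, rng=None):
        """Roulette-wheel selection following the ACO transition rule of Eq. (1).

        pheromone, heuristic : 1-D arrays over nodes j (tau_ij and eta_ij for the current node i)
        admissible           : list of node indices the ant may still visit
        """
        rng = rng or np.random.default_rng()
        weights = pheromone[admissible] ** alpha * heuristic[admissible] ** beta
        probs = weights / weights.sum()              # P^k_ij(t) over the admissible nodes
        return rng.choice(admissible, p=probs)       # sampled next node j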
At the end of every iteration, ants that have found good solutions are made to mark their path by depositing pheromone on the edges they have chosen. The optimization algorithm may be designed such that either the iteration-best ant, the global-best ant, or both deposit pheromone. After all ants have completed their solutions, pheromone evaporation is triggered on all edges; this helps to avoid rapid convergence of the algorithm toward a sub-optimal region. The pheromone content of path (i,j) at time instance t+1 is given by (2)

\[
\tau_{ij}(\text{new}) = (1-\rho)\,\tau_{ij}(t) + \sum_{k=1}^{m}\Delta\tau_{ij}^{k}(t) + \Delta\tau_{ij}^{g}(t)\tag{2}
\]

\[
\Delta\tau_{ij}^{k} = \begin{cases}\dfrac{Q}{F_{k}} & \text{if the } k\text{-th ant traverses arc } (i,j) \text{ in } T_{k}\\[4pt] 0 & \text{otherwise}\end{cases}\tag{3}
\]

where ρ ∈ (0,1] is the evaporation rate, m is the number of ants, and Δτ^k_ij(t) and Δτ^g_ij(t) are, respectively, the amount of pheromone laid on edge (i,j) by the k-th ant and the amount of pheromone deposited by the global best ant g up to time instance t over the edge (i,j). Q is a constant, and F_k is the cost value of the solution found by the k-th ant in its current tour T_k (in this paper the classification error rate is considered as the cost function). The lower the classification error rate, the stronger the intensity of pheromone the ant is allowed to deposit in each iteration.

The proposed algorithm uses the max–min ant system, i.e., only the global best ant is allowed to update the pheromone trails, and the value of pheromone on each road is confined to [τ_min, τ_max]. Therefore, Eq. (2) is modified as follows:

\[
\tau_{ij}(\text{new}) = \Big[(1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}^{g}(t)\Big]_{\tau_{\min}}^{\tau_{\max}}\tag{4}
\]

The iterative process continues until the stopping criterion is reached. The stopping criterion may be either a number of iterations or a solution of desired quality [36].
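A compact sketch of this pheromone update (evaporation, deposit by the best ant, and max–min clamping, Eqs. (2)–(4)) is given below. The default constants follow the settings reported later in Section 4.2; the matrix layout and the constant Q are assumptions.

    import numpy as np

    def update_pheromone(tau, best_edges, best_error, rho=0.049, Q=1.0,
                         tau_min=0.1, tau_max=6.0):
        """Max-min pheromone update in the spirit of Eqs. (2)-(4).

        tau        : 2-D pheromone matrix over the graph edges
        best_edges : list of (i, j) edges traversed by the global best ant
        best_error : classification error of that ant's subset (the cost F)
        """
        tau *= (1.0 - rho)                           # evaporation term of Eq. (2)
        deposit = Q / max(best_error, 1e-12)         # smaller error -> larger deposit, Eq. (3)
        for i, j in best_edges:
            tau[i, j] += deposit
        np.clip(tau, tau_min, tau_max, out=tau)      # confine to [tau_min, tau_max], Eq. (4)
        return tau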
2.2. Binary ant colony optimization

Touring ant colony optimization (TACO) was initially designed in [37] for handling continuous variables (decoded as binary strings) in optimization problems, and was later used for different problems such as digital filter design [38,39]. In this algorithm, each solution is represented by a string of binary bits. Artificial ants search for the value of each bit in the string; in other words, they try to decide whether the value of a bit is 0 or 1. At the decision stage for the value of a bit, ants use only the pheromone information. After all ants in the colony have produced their solutions and the pheromone amount belonging to each solution has been calculated, the pheromone of the sub-paths (edges) between the bits is updated. This is carried out by evaporating the previous pheromone amounts and depositing the new pheromone amounts on the paths. The concept of the TACO algorithm is shown in Fig. 1.

Fig. 1. The concept of the touring ant colony optimization algorithm.

Using TACO for the feature selection problem was first introduced by Touhidi et al. [34]. They proposed three modifications of TACO, called elitism TACO (ETACO), rank-based TACO (RTACO) and binary ACO (BACO), of which the latter led to the best results. In this approach, only the global best ant is admissible to deposit pheromone, and a max–min ant system strategy is utilized for pheromone updating, i.e. the pheromone trail is limited to the interval [τ_min, τ_max], where τ_min and τ_max are arbitrary positive real numbers satisfying τ_min < τ_max. Later, Chen et al. [40] employed this type of ACO for feature selection, called ACOFS, using the F-score criterion as the heuristic value (visibility of edges) but with a different pheromone updating strategy. Here, nodes represent features, where each node contains two sub-nodes, 0 and 1. First, all ants are at the beginning of the bit string and then pass through these sub-nodes; an ant's tour ends when it passes the last feature. If an ant chooses sub-node 1 (or 0) of the i-th feature, it means this feature is selected (or deselected) by that ant. Assume that the probability of the sub-path between 0 and 1 (0→1) being preferred at a given stage is to be calculated. Then the following equation is used:

\[
P_{01}(t) = \frac{\tau_{01}}{\tau_{01} + \tau_{00}}\tag{5}
\]

where P_01 is the probability associated with the sub-path (0→1), and τ_00 and τ_01 are the artificial pheromones of the sub-paths (0→0 and 0→1).
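As a hedged illustration of this decision process, the sketch below walks a BACO bit string sub-node by sub-node, drawing each bit from the two sub-path pheromones as in Eq. (5); the indexing convention of the tau array is an assumption.

    import numpy as np

    def baco_tour(tau, rng=None):
        """Construct one BACO bit string by walking the chain of sub-nodes (Eq. (5)).

        tau : array of shape (n_features, 2, 2); tau[i, a, b] is the pheromone of the
              sub-path from sub-node a of bit i-1 (the start node is treated as sub-node 0)
              to sub-node b of bit i.
        """
        rng = rng or np.random.default_rng()
        bits, prev = [], 0
        for i in range(tau.shape[0]):
            p1 = tau[i, prev, 1] / (tau[i, prev, 0] + tau[i, prev, 1])   # e.g. P_01 of Eq. (5)
            prev = int(rng.random() < p1)            # 1 = select feature i, 0 = deselect it
            bits.append(prev)
        return np.array(bits)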
Although BACO is able to find near-optimal solutions in a short time, it suffers from a big problem, which is the limited view of the ants toward the features: at any time, each ant can only observe its next feature and is not able to see the other features. In the proposed algorithm, we try to solve this problem by combining conventional ACO and BACO.

3. Proposed feature selection algorithm

This section describes the proposed approach. It is a combination of the ACO and BACO algorithms designed to remove their limitations. In other words, there is no need to predefine the number of features to be selected (the limitation of ACO); this task is assigned to the algorithm, which can select feature subsets of arbitrary size. Besides, nodes are fully connected and ants are able to observe all features simultaneously (contrary to the BACO algorithm), and they can decide whether or not to select a feature (similar to BACO).

Fig. 2. ABACO algorithm representation.

Like the ACO-based FS approach, the problem is defined as a fully connected graph where nodes represent features, with the edges between them denoting the choice of the next feature. Similar to the BACO algorithm, there are two sub-nodes assigned to each feature in the graph, one for selecting and the other for deselecting the corresponding feature. Fig. 2 illustrates this structure. Based on the pheromone values, ants decide their next edge. In each iteration, all ants should visit all features, but can decide whether to select a feature or not. If an ant chooses sub-node 1 (or 0) of feature Fi, it means the feature is selected (or deselected) by that ant. Note that ants are only allowed to select one of the sub-nodes of each feature, sub-node 1 or 0. For the next step, the ant can see all the unvisited features. Again, there are two roads ending at each feature. Based on the pheromone values and heuristic information
of these edges, the ant chooses its next road, and the process continues until the ant has visited all features. A similar process is repeated for the other ants. At the end of each iteration, each ant has a solution path in the form of a binary vector with the same length as the number of features, where 1 means selecting and 0 means deselecting the corresponding feature. For updating the pheromone of the roads, only the best ant, i.e. the ant with the smallest classification error of the classifier, is allowed to deposit pheromone on the edges it has traversed. This process continues for all iterations and, at the end, the feature subset with the smallest classification error of the classifier is suggested as the best result.

With this idea, the FS problem can be modeled as a TSP, where ants start their tours from a random node and end them when all nodes have been visited. The main point of the proposed algorithm is that, by considering two sub-nodes with the values 0 and 1 for each node, there is no need to predetermine the number of features to be selected. Besides, as each node contains two sub-nodes, there are four edges between each pair of nodes instead of one. These edges are illustrated in Fig. 3 for features i and j. The expressions above each edge show the pheromone intensity and heuristic information of that edge; for example, τ_{i1,j0} and η_{i1,j0} are, respectively, the pheromone value and heuristic information of the edge which connects sub-node 1 of feature i to sub-node 0 of feature j.

Fig. 3. Edges between two nodes (features) in the ABACO algorithm.

The search for the optimal feature subset is the goal of the ants' traversal of the graph. Suppose an ant is currently at node F_{i,x} (i = 1,2,…,n and x = 0,1) and has to choose one path connecting it to F_{j,y} (j ∈ admissible nodes and y = 0,1) to pass through. A probabilistic transition function, denoting the probability that an ant at node F_{i,x} chooses the path to reach F_{j,y}, is designed by combining the heuristic desirability (visibility of edge (i,j)) and the pheromone density of the edge. The probability that ant k at sub-node F_{i,x} chooses the edge (ix, jy) at time t is

\[
P_{ix,jy}^{k}(t)=\begin{cases}\dfrac{\tau_{ix,jy}^{\alpha}\,\eta_{ix,jy}^{\beta}}{\sum_{l}\tau_{ix,l0}^{\alpha}\,\eta_{ix,l0}^{\beta}+\sum_{l}\tau_{ix,l1}^{\alpha}\,\eta_{ix,l1}^{\beta}} & \text{if } j,l \in \text{admissible nodes}\\[4pt] 0 & \text{otherwise}\end{cases}\tag{6}
\]

Here, τ_{ix,jy} is the pheromone on edge (ix, jy) between sub-nodes F_{i,x} and F_{j,y} at time t, which reflects the potential tendency of ants to follow this edge; η_{ix,jy} is the heuristic information reflecting the desirability of choosing edge (ix, jy); and α and β are two parameters that determine the relative importance of the pheromone value and the heuristic information.
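To make the tour construction concrete, the following Python sketch builds one ant's binary vector on the sub-node graph using the transition rule of Eq. (6). The four-dimensional layout of the pheromone and heuristic arrays is an assumed encoding of the four edges between every pair of features, and the sketch omits details such as tie handling.

    import numpy as np

    def abaco_tour(tau, eta, alpha=1.0, beta=0.5, rng=None):
        """Build one ant's binary feature vector on the ABACO sub-node graph (Eq. (6)).

        tau, eta : arrays of shape (n, 2, n, 2); entry [i, x, j, y] belongs to the edge
                   from sub-node x of feature i to sub-node y of feature j.
        """
        rng = rng or np.random.default_rng()
        n = tau.shape[0]
        selected = np.zeros(n, dtype=int)
        unvisited = list(range(n))
        i = int(rng.choice(unvisited))               # the tour starts from a random feature
        x = int(rng.random() < 0.5)                  # and one of its two sub-nodes
        selected[i] = x
        unvisited.remove(i)
        while unvisited:
            # weights of both sub-nodes (y = 0, 1) of every still-unvisited feature j
            w = np.array([[tau[i, x, j, y] ** alpha * eta[i, x, j, y] ** beta
                           for y in (0, 1)] for j in unvisited])
            probs = (w / w.sum()).ravel()
            k = int(rng.choice(len(probs), p=probs)) # sample the edge (i,x) -> (j,y)
            j, y = unvisited[k // 2], k % 2
            selected[j] = y                          # y = 1 selects feature j, y = 0 deselects it
            unvisited.remove(j)
            i, x = j, y
        return selected

The returned binary vector is exactly the solution path described above and can be handed to the wrapper classifier for evaluation.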
3.1. Heuristic information measurement

Adding heuristic desirability (visibility) to the edges enhances the ability to exploit the search space and makes it easier to find the optimum solution. A heuristic value, η, for each feature generally represents the attractiveness of the features, and can be any subset evaluation function, such as an entropy-based measure or a rough set dependency measure [41]. In the proposed algorithm, we tried four methods to determine the heuristic value, including the F-score and three different approaches based on correlation. In the following, the methods are described.

3.1.1. Method 1

The first three methods are based on correlation. Correlation is one of the most common and useful statistics describing the degree of relationship between two variables, and a number of criteria have been proposed in statistics to estimate it. In this work, ABACO uses the best-known Pearson product-moment correlation coefficient to measure the correlation between different features of a given training set. The correlation coefficient r_ij between two features i and j is

\[
r_{ij}=\frac{\sum_{h}(x_{i}-\bar{x}_{i})(x_{j}-\bar{x}_{j})}{\sqrt{\sum_{h}(x_{i}-\bar{x}_{i})^{2}}\,\sqrt{\sum_{h}(x_{j}-\bar{x}_{j})^{2}}}\tag{7}
\]

where x_i and x_j are the values of features i and j, respectively, and \bar{x}_i and \bar{x}_j are the mean values of x_i and x_j, averaged over the h samples. If features i and j are completely correlated, i.e., an exact linear dependency exists, then r_ij is 1 or −1; if i and j are completely uncorrelated, then r_ij is 0 [7].

Method 1 employs the idea of minimum redundancy, which tries to select the most distinct features. For illustration, suppose the correlation between features i and j is high; in this case, the two features are highly similar, so one of them is enough to describe the whole set. Hence, if one of them is selected, the probability of selecting/deselecting the other feature can be described as 1−r_ij / r_ij, which has a low/high value. Likewise, if the first feature is not selected, the presence of the other feature is not necessary either, because they are similar, so the probability of selecting/deselecting the other feature can again be described as 1−r_ij / r_ij. Eq. (8) expresses these statements:

\[
\eta_{i0,j0}=r_{ij},\qquad \eta_{i0,j1}=1-r_{ij},\qquad \eta_{i1,j0}=r_{ij},\qquad \eta_{i1,j1}=1-r_{ij}\tag{8}
\]
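For reference, the sketch below precomputes the Method 1 visibilities of Eq. (8) from a training matrix, using numpy's corrcoef for the Pearson coefficient of Eq. (7). Taking the absolute value of r_ij so that the visibilities stay non-negative is our assumption; Eq. (8) is written directly in terms of r_ij.

    import numpy as np

    def method1_heuristic(X):
        """Method 1 edge visibilities (Eqs. (7)-(8)) from a training matrix.

        X : array of shape (n_samples, n_features), one feature per column.
        Returns eta of shape (n, 2, n, 2), indexed as eta[i, x, j, y] in Eq. (6).
        """
        r = np.abs(np.corrcoef(X, rowvar=False))     # |r_ij|, Pearson coefficient of Eq. (7)
        n = X.shape[1]
        eta = np.empty((n, 2, n, 2))
        eta[:, 0, :, 0] = r                          # eta_{i0,j0} = r_ij
        eta[:, 0, :, 1] = 1.0 - r                    # eta_{i0,j1} = 1 - r_ij
        eta[:, 1, :, 0] = r                          # eta_{i1,j0} = r_ij
        eta[:, 1, :, 1] = 1.0 - r                    # eta_{i1,j1} = 1 - r_ij
        return eta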
3.1.2. Method 2

In this method, the idea of max-relevance and min-redundancy is used. Max-relevance is one of the most popular approaches to realizing max-dependency in feature selection, i.e. selecting the features with the highest relevance to the target class c. Relevance is usually characterized in terms of correlation or mutual information [42]. As Eq. (9) shows, the geometric mean of the two criteria (max-relevance and min-redundancy) is used to calculate the heuristic information of the edges. Here, cls_cor_j is the correlation between feature j and the class labels over all samples; the closer this value is to 1 for a feature, the more correlated the feature is to the class labels and thus the more important it is.

\[
\eta_{i0,j0}=\sqrt{|r_{ij}|\,(1-|cls\_cor_{j}|)},\quad \eta_{i0,j1}=\sqrt{(1-|r_{ij}|)\,|cls\_cor_{j}|},\quad \eta_{i1,j0}=\sqrt{|r_{ij}|\,(1-|cls\_cor_{j}|)},\quad \eta_{i1,j1}=\sqrt{(1-|r_{ij}|)\,|cls\_cor_{j}|}\tag{9}
\]

3.1.3. Method 3

The idea behind this method is to use the feature–feature correlation if feature i is selected, and the class–feature correlation in the case of deselecting feature i:

\[
\eta_{i0,j0}=1-cls\_cor_{j},\qquad \eta_{i0,j1}=cls\_cor_{j},\qquad \eta_{i1,j0}=r_{ij},\qquad \eta_{i1,j1}=1-r_{ij}\tag{10}
\]
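The visibilities of Methods 2 and 3 can be tabulated in the same way. The sketch below assumes r is the absolute feature–feature correlation matrix and cls_cor the vector of absolute feature–class correlations, both precomputed on the training set; these preprocessing choices are assumptions.

    import numpy as np

    def method2_heuristic(r, cls_cor):
        """Method 2 visibilities (Eq. (9)): geometric mean of redundancy and relevance terms."""
        r, cls_cor = np.asarray(r, float), np.asarray(cls_cor, float)
        n = cls_cor.size
        eta = np.empty((n, 2, n, 2))
        drop = np.sqrt(r * (1.0 - cls_cor)[None, :])     # favours deselecting feature j
        keep = np.sqrt((1.0 - r) * cls_cor[None, :])     # favours selecting feature j
        eta[:, 0, :, 0] = eta[:, 1, :, 0] = drop
        eta[:, 0, :, 1] = eta[:, 1, :, 1] = keep
        return eta

    def method3_heuristic(r, cls_cor):
        """Method 3 visibilities (Eq. (10)): class correlation after a deselection,
        feature-feature correlation after a selection."""
        r, cls_cor = np.asarray(r, float), np.asarray(cls_cor, float)
        n = cls_cor.size
        eta = np.empty((n, 2, n, 2))
        eta[:, 0, :, 0] = (1.0 - cls_cor)[None, :]       # eta_{i0,j0} = 1 - cls_cor_j
        eta[:, 0, :, 1] = cls_cor[None, :]               # eta_{i0,j1} = cls_cor_j
        eta[:, 1, :, 0] = r                              # eta_{i1,j0} = r_ij
        eta[:, 1, :, 1] = 1.0 - r                        # eta_{i1,j1} = 1 - r_ij
        return eta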
4. Experimental studies

A series of experiments is conducted to show the effectiveness of the proposed feature selection algorithm. All experiments were performed in Matlab on a laptop with a 2.40 GHz CPU and 4 GB of RAM. For the experimental studies, we have considered 12 datasets from the UCI (University of California, Irvine) machine learning repository [45], including Abalone, Glass, Iris, Letter, Shuttle, Spambase, Tae, Vehicle, Waveform, Wine, Wisconsin and Yeast. These datasets have been the subject of many studies in machine learning, covering examples of small, medium and high-dimensional datasets [16,44]. The characteristics of these datasets, summarized in Table 1, show a considerable diversity in the number of features, classes, and samples.

Table 1
Characteristics of the benchmark data sets.

Data set     No. of classes   No. of features   No. of samples
Abalone      11               8                 3842
Glass        6                9                 214
Iris         3                4                 150
Letter       26               16                20,000
Shuttle      7                9                 58,000
Spambase     2                57                4601
Tae          3                5                 151
Vehicle      4                18                846
Waveform     3                21                5000
Wine         3                13                178
Wisconsin    2                9                 683
Yeast        9                8                 1484

Fig. 4. The structure of the proposed ACO-based FS approach (ABACOH): generate m ants; for each ant, construct a subset using the transition rule; evaluate all constructed subsets; select the local best and global best subsets; update τ; if the termination criterion is not met, generate m new ants and repeat, otherwise return the best subset.
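The iteration structure of Fig. 4 can be written, in outline, as the following Python skeleton. It is a sketch rather than the authors' Matlab implementation; the three callables stand for the tour construction (Eq. (6)), the wrapper evaluation (1-NN error, see Section 4.2) and the pheromone update (Eqs. (2)–(4)).

    import numpy as np

    def abacoh(construct_tour, evaluate_error, update_trails, tau0, n_ants=50, n_iter=50):
        """Skeleton of the ABACOH loop of Fig. 4.

        construct_tour(tau)          -> binary feature mask built by one ant
        evaluate_error(mask)         -> classification error of the wrapper classifier
        update_trails(tau, mask, e)  -> pheromone structure after evaporation and deposit
        tau0                         -> initial pheromone structure
        """
        tau = tau0
        best_mask, best_err = None, np.inf
        for _ in range(n_iter):
            ants = [construct_tour(tau) for _ in range(n_ants)]      # generate m ants and subsets
            errors = [evaluate_error(mask) for mask in ants]         # evaluate all constructed subsets
            i_best = int(np.argmin(errors))                          # local (iteration) best
            if errors[i_best] < best_err:                            # keep the global best
                best_mask, best_err = ants[i_best], errors[i_best]
            tau = update_trails(tau, best_mask, best_err)            # update tau
        return best_mask, best_err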
To validate the results obtained by the proposed algorithm, it is compared with binary particle swarm optimization (BPSO) and CatfishBPSO [17], the binary genetic algorithm (BGA), improved binary gravitational search (IBGSA) [21], ant colony optimization with and without heuristic information (ACOH/ACO), binary ant colony optimization (BACO) [34], a modified binary ACO based feature selection algorithm presented in [40], denoted ACOFS, and ABACO [35], and the results obtained are reported.

4.1. Evaluation functions

The features selected by the proposed algorithms are evaluated with the well-known metrics precision, recall, accuracy and feature reduction. Precision is defined as the ratio of samples correctly assigned to category C to the total number of samples classified as category C, as in Eq. (13). Recall is the ratio of samples correctly assigned to category C to the total number of samples actually in category C, as in Eq. (14) [36]. Let TP_i, FP_i, TN_i, and FN_i indicate numbers of samples as follows:

TP_i – the number of test samples correctly classified under the i-th category (C_i).
FP_i – the number of test samples incorrectly classified under C_i.
TN_i – the number of test samples correctly classified under other categories.
FN_i – the number of test samples incorrectly classified under other categories.

\[
\text{Precision}_{i}=\frac{TP_{i}}{TP_{i}+FP_{i}}\tag{13}
\]

\[
\text{Recall}_{i}=\frac{TP_{i}}{TP_{i}+FN_{i}}\tag{14}
\]

In this paper, classification accuracy (CA) is used to define the quality function of a solution; it is the percentage of samples correctly classified, as evaluated in (15).

\[
\text{Accuracy}=\frac{\text{number of samples correctly classified}}{\text{total number of samples taken for experimentation}}\tag{15}
\]

Another parameter used for comparison is the average feature reduction Fr, which measures the rate of feature reduction:

\[
Fr=\frac{n-p}{n}\tag{16}
\]

where n is the total number of features and p is the number of features selected by the FS algorithm. The closer Fr is to 1, the more features are reduced and the lower the classifier complexity. The following section describes the implementation results.
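Before moving on, a direct Python transcription of Eqs. (13)–(16) is given for reference; the array-based inputs are assumptions.

    import numpy as np

    def evaluation_metrics(y_true, y_pred, category, n_total_features, n_selected):
        """Per-category precision and recall (Eqs. (13)-(14)), accuracy (Eq. (15))
        and feature reduction Fr (Eq. (16))."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == category) & (y_true == category))
        fp = np.sum((y_pred == category) & (y_true != category))
        fn = np.sum((y_pred != category) & (y_true == category))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = np.mean(y_pred == y_true)
        fr = (n_total_features - n_selected) / n_total_features
        return precision, recall, accuracy, fr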
4.2. Parameter setting

The population size for all algorithms and the maximum number of iterations are set to 50. In each experiment, 60% of the samples were chosen randomly for training and the remaining 40% were used for testing. Results are averaged over 20 independent runs on each data set and for every algorithm. The features selected by each method are classified using the k-nearest neighbor classifier (k-NN, k = 1), and the fitness function is defined as the classification accuracy.

For the ACO-based algorithms ABACOH, ABACO, BACO, ACOH and ACO, the evaporation coefficient ρ is 0.049, and the minimum and maximum pheromone intensities of each edge are set to 0.1 and 6, respectively. The initial pheromone intensity (τ_0) of each edge is also set to 0.1. Since α and β in ACO are two parameters that determine the relative importance of the pheromone and the heuristic information, we first fix the value of α at 1 and set the value of β in the range [0.2, 1]. The best parameters obtained are α = 1, β = 0.5. The parameters of ACOFS are set according to [40] (α = 1, β = 0.5, ρ = 0.049, τ_0 = 1).

For the GA-based FS method, we choose one-point cross-over with a probability of 0.9, and the mutation probability is set to 0.01. Finally, the parameters of BPSO and CatfishBPSO are set according to [17]: w = 1 and c1 = c2 = 2. In IBGSA, the parameters, as reported in [21], are set to k1 = 1 and k2 = 500, and the Hamming distance is used for distance calculation.
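The wrapper criterion of this protocol can be sketched as follows; scikit-learn is used here for brevity, which is an assumption, since the original experiments were carried out in Matlab.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def knn_fitness(X, y, mask, seed=0):
        """Fitness of a binary feature mask: 1-NN accuracy on a random 60/40 split,
        mirroring the protocol described above (a sketch, not the authors' code)."""
        cols = np.flatnonzero(mask)
        if cols.size == 0:                           # an empty subset gets the worst fitness
            return 0.0
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, cols], y, train_size=0.6, random_state=seed)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)                 # classification accuracy used as fitness

In a full run this function would be called once per ant per iteration, which is why, as noted in Section 4.3, the classifier dominates the overall running time of a wrapper method.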
4.3. Experimental results and analysis

To specify the best method for the heuristic information of the edges, the average classification accuracies of the proposed algorithm over 20 independent runs on the tested datasets are given in Table 2; the numbers in the last row of the table show the average classification accuracy over the datasets. According to this table, method 4 achieves the best result among the proposed methods, and it is therefore chosen for the heuristic information of the paths.

Table 2
Classification accuracy of each method on the tested data sets. The results are averaged over 20 independent runs. The number below each column of the table shows the average classification accuracy over the data sets.

To show the utility of the proposed algorithm, we compare it with IBGSA [21], CatfishBPSO [17], BGA, BPSO, ACO and two recent ACO-based FS algorithms [34,40], which are reported to be very strong algorithms for FS. Table 3 shows the mean classification accuracy (CA) results of every algorithm for each dataset. A comparison of the average precision, recall and Fr values of the competing algorithms on the datasets is displayed in Table 4.
Table 3
Classification accuracy for the tested datasets. The average of results over 20 independent runs is reported. The number in brackets in each table slot shows the ranking of
each algorithm.
Dataset ABACOH ABACO ACOFS BACO ACOH ACO BGA BPSO IBGSA CatfishBPSO
Abalone 0.244 (1) 0.241 (3) 0.243 (2) 0.243 (2) 0.241 (3) 0.238 (5) 0.241 (3) 0.241 (3) 0.241 (3) 0.240 (4)
Glass 0.758 (1) 0.747 (3) 0.750 (2) 0.743 (4) 0.728 (8) 0.713 (9) 0.733 (7) 0.734 (6) 0.738 (5) 0.733 (7)
Iris 0.976 (2) 0.974 (3) 0.977 (1) 0.967 (6) 0.971 (4) 0.963 (9) 0.967 (7) 0.965 (8) 0.969 (5) 0.968 (6)
Letter 0.861 (1) 0.856 (4) 0.859 (3) 0.859 (3) 0.856 (4) 0.806 (7) 0.837 (5) 0.825 (6) 0.86 (2) 0.856 (4)
Shuttle 0.998 (1) 0.998 (1) 0.998 (1) 0.998 (1) 0.998 (1) 0.998 (1) 0.997 (2) 0.998 (1) 0.998 (1) 0.998 (1)
Spambase 0.923 (2) 0.921 (4) 0.922 (3) 0.919 (5) 0.913 (6) 0.901 (8) 0.906 (7) 0.9 (9) 0.922 (3) 0.924 (1)
Tae 0.583 (1) 0.573 (3) 0.569 (4) 0.558 (8) 0.559 (7) 0.578 (2) 0.556 (9) 0.567 (5) 0.556 (9) 0.565 (6)
Vehicle 0.753 (1) 0.753 (1) 0.749 (2) 0.749 (2) 0.739 (5) 0.718 (8) 0.737 (6) 0.722 (7) 0.745 (4) 0.746 (3)
Waveform 0.798 (1) 0.795 (4) 0.797 (2) 0.793 (6) 0.796 (3) 0.768 (10) 0.776 (9) 0.792 (7) 0.794 (5) 0.79 (8)
Wine 0.969 (2) 0.969 (2) 0.964 (3) 0.964 (3) 0.958 (4) 0.938 (7) 0.957 (5) 0.941 (6) 0.978 (1) 0.969 (2)
Wisconsin 0.976 (2) 0.976 (2) 0.974 (4) 0.975 (3) 0.974 (4) 0.968 (5) 0.975 (3) 0.968 (5) 0.977 (1) 0.976 (2)
Yeast 0.524 (2) 0.529 (1) 0.516 (4) 0.515 (5) 0.512 (7) 0.509 (8) 0.513 (6) 0.507 (9) 0.515 (5) 0.522 (3)
Table 4
Comparison of performance (precision, recall and Fr) of the algorithms on 12 data sets.
Metrics Abalone Glass Iris Letter Shuttle Spambase Tae Vehicle Waveform Wine Wisconsin Yeast Sum
ABACOH Precision 0.222 0.743 0.978 0.867 0.912 0.92 0.589 0.755 0.799 0.972 0.973 0.529 9.259
Recall 0.223 0.736 0.975 0.862 0.927 0.92 0.585 0.757 0.799 0.973 0.976 0.521 9.254
Fr 0.337 0.299 0.437 0.343 0.541 0.431 0.357 0.436 0.585 0.484 0.384 0.063 4.697
ABACO Precision 0.221 0.693 0.974 0.863 0.912 0.919 0.585 0.755 0.797 0.97 0.973 0.504 9.166
Recall 0.219 0.676 0.974 0.861 0.907 0.916 0.576 0.762 0.797 0.973 0.976 0.51 9.147
Fr 0.386 0.361 0.375 0.335 0.453 0.439 0.389 0.428 0.272 0.549 0.348 0.076 4.411
ACOFS Precision 0.225 0.723 0.978 0.864 0.958 0.921 0.574 0.749 0.798 0.965 0.969 0.509 9.233
Recall 0.223 0.71 0.976 0.861 0.934 0.918 0.569 0.752 0.798 0.97 0.974 0.496 9.181
Fr 0.331 0.365 0.319 0.361 0.522 0.447 0.356 0.426 0.272 0.483 0.367 0.102 4.351
BACO Precision 0.223 0.73 0.967 0.865 0.944 0.917 0.568 0.749 0.794 0.965 0.973 0.509 9.204
Recall 0.222 0.695 0.967 0.863 0.93 0.917 0.562 0.754 0.794 0.972 0.974 0.507 9.157
Fr 0.387 0.347 0.459 0.332 0.565 0.465 0.363 0.439 0.261 0.494 0.387 0.089 4.588
ACOH Precision 0.221 0.677 0.971 0.861 0.933 0.91 0.568 0.739 0.796 0.959 0.971 0.476 9.082
Recall 0.219 0.682 0.971 0.857 0.921 0.91 0.573 0.745 0.796 0.964 0.973 0.485 9.096
Fr 0.344 0.24 0.313 0.313 0.4 0.461 0.41 0.397 0.238 0.478 0.328 0.113 4.035
ACO Precision 0.222 0.648 0.963 0.812 0.931 0.898 0.589 0.716 0.768 0.941 0.964 0.464 8.916
Recall 0.219 0.654 0.962 0.808 0.914 0.896 0.58 0.721 0.768 0.943 0.966 0.476 8.907
Fr 0.388 0.372 0.313 0.313 0.472 0.461 0.41 0.431 0.371 0.504 0.389 0.156 4.58
BGA Precision 0.225 0.719 0.969 0.843 0.925 0.903 0.567 0.734 0.777 0.958 0.971 0.48 9.071
Recall 0.222 0.682 0.968 0.839 0.894 0.901 0.556 0.74 0.777 0.964 0.975 0.476 8.994
Fr 0.373 0.339 0.388 0.341 0.561 0.454 0.441 0.454 0.324 0.49 0.368 0.141 4.674
BPSO Precision 0.221 0.708 0.965 0.863 0.97 0.896 0.578 0.718 0.795 0.944 0.963 0.466 9.087
Recall 0.217 0.673 0.965 0.858 0.91 0.895 0.571 0.725 0.795 0.946 0.966 0.464 8.985
Fr 0.438 0.367 0.45 0.348 0.531 0.477 0.38 0.478 0.283 0.531 0.406 0.206 4.895
IBGSA Precision 0.212 0.672 0.961 0.84 0.911 0.906 0.548 0.728 0.773 0.93 0.966 0.472 8.919
Recall 0.212 0.668 0.962 0.835 0.928 0.905 0.546 0.732 0.773 0.935 0.964 0.474 8.934
Fr 0.542 0.315 0.372 0.42 0.381 0.465 0.283 0.413 0.51 0.341 0.386 0.129 4.557
Catfish BPSO Precision 0.22 0.706 0.97 0.865 0.867 0.928 0.579 0.749 0.792 0.972 0.976 0.496 9.12
Recall 0.221 0.699 0.969 0.862 0.935 0.926 0.568 0.754 0.793 0.974 0.977 0.491 9.169
Fr 0.369 0.359 0.3 0.34 0.527 0.443 0.44 0.451 0.305 0.513 0.392 0.134 4.573
We can conclude from these tables that the proposed ABACO algorithm can obtain, in some cases, better classification accuracy using a smaller feature set compared to the other algorithms.

In Table 5, another comparison is made according to the sum of the ranks listed in Table 3 for each algorithm. A lower sum of ranks for an algorithm indicates better average results over all cases against the others. Although this quantity is of lower accuracy for reporting results in some cases, it is common in nonparametric statistics. As can be seen in this table, ABACOH gets the first rank, and both ABACO and ACOFS achieve the second rank. IBGSA and CatfishBPSO rank third and fourth, and BACO, ACOH, GA, BPSO and ACO take the fifth, sixth, seventh, eighth and ninth rankings, respectively.

Table 5
The sum of the relative ranks obtained on the 12 data sets for each of the algorithms.

ABACOH   ABACO    ACOFS    BACO     ACOH     ACO      GA       BPSO     IBGSA    CatfishBPSO
17 (1)   31 (2)   31 (2)   48 (5)   56 (6)   81 (9)   69 (7)   70 (8)   44 (3)   47 (4)

In practical engineering problems, solving the problem itself consumes most of the time, and the time spent by the metaheuristic algorithm operators is negligible. Since a wrapper method is employed for this particular problem, the classifier takes most of the time. Table 6 shows the average time of the algorithms on each dataset; the last row of the table shows the average latency of each method over the datasets.
Table 6
The average time of the algorithms on each dataset. The last row of the table shows the average latency of each method over the datasets.
Dataset ABACOH ABACO ACOFS BACO ACOH ACO BGA BPSO IBGSA CatfishBPSO
Iris 1.13 1.12 1.11 0.87 1.14 1.15 0.85 0.87 0.85 0.88
Glass 1.60 1.57 1.65 1.10 1.41 1.44 1.06 1.15 1.05 1.25
Vehicle 8.29 7.10 6.51 4.50 5.20 7.06 4.56 5.22 5.11 6.58
Wine 2.08 1.83 1.56 1.02 1.94 1.74 1.31 1.11 1.53 1.17
Abalone 8.07 7.69 7.94 5.22 7.57 6.56 6.27 6.08 6.94 6.58
Letter 16.40 15.77 10.14 10.70 9.99 13.04 11.99 8.08 8.41 11.29
Shuttle 20.93 22.84 19.20 15.20 23.94 28.74 19.95 16.99 17.64 25.94
Spambase 169.89 161.36 128.62 109.64 112.72 106.77 75.26 60.08 62.20 80.37
Tae 1.13 0.83 1.10 0.76 1.10 0.79 0.85 0.85 0.82 1.28
Waveform 91.84 67.06 88.52 75.94 73.78 66.67 49.37 50.45 56.14 81.41
Wisconsin 2.97 2.10 3.73 2.08 2.98 2.33 2.77 2.36 2.25 2.91
Yeast 4.03 2.97 3.42 3.35 4.81 3.38 3.44 3.33 2.61 4.82
mean 27.36 24.35 22.79 19.19 20.54 19.97 14.80 13.04 13.79 18.70
According to this table, the delay of ABACO is not much different from that of the other algorithms. The results obtained by ABACOH confirm that the proposed algorithm can be considered a worthwhile method for feature selection.

At this point, it should be mentioned that although there is no universal metaheuristic algorithm that can obtain the best results on all available benchmarks, the results obtained by ABACOH confirm that the proposed algorithm can be considered a worthwhile method for feature selection.

4.4. Discussion

This section briefly explains why the performance of ABACO was better than that of the other algorithms. Here are some salient characteristics of ABACO.

The first is that ABACO permits ants to explore all features, whereas in most ACO-based FS algorithms the search among features continues only until the stopping criterion is met; therefore, the ants do not have the opportunity to observe all features.

The second characteristic is the new search technique used in ABACO. Unlike other ACO-based FS algorithms, in which ants must select every feature they visit, in the ABACO algorithm ants have the authority to select or deselect the features they visit. This search technique is common to BACO, ACOFS and ABACO.

Third, the advantage of ABACO over BACO and ACOFS arises from the comprehensive view of the features that ants have in ABACO, compared to the limited view in BACO and ACOFS. Although in BACO and ACOFS every ant could visit all features and had the authority to select or deselect the next feature, at any time each ant could only observe its next feature and could not see the other features. Since in ABACO ants can see all of the unvisited features simultaneously, they can make better decisions about the next feature and are not forced to select or deselect a predefined feature.

Finally, compared to the initial version of ABACO introduced in [35], adding heuristic desirability increases the exploration of the search and guides the ants to more salient features.

5. Conclusion

Feature selection is an important task which can significantly affect the performance of classification and recognition. In this paper, we present a new feature selection technique based on Ant Colony Optimization (ACO), obtained by combining two models of ACO. The proposed algorithm has a strong search capability in the problem space and can effectively find a minimal feature subset. The algorithm is compared with several powerful algorithms, including IBGSA, CatfishBPSO, ACOFS, BACO, ACO with and without heuristic desirability, BGA and BPSO.

In order to evaluate the performance of these approaches, experiments were performed using twelve datasets from the UCI machine learning repository. The experimental results support our algorithm and provide clear evidence, allowing us to conclude that our method achieves a better feature set in terms of classification accuracy and number of selected features. Further investigation of the parameter values and testing the ABACO model with other heuristic functions are areas of future research.

References

[1] S.M. Vieira, J.M.C. Sousa, T.A. Runkler, Two cooperative ant colonies for feature selection using fuzzy models, Expert Syst. Appl. 37 (2010) 2714–2723.
[2] M. Pal, G.M. Foody, Feature selection for classification of hyperspectral data by SVM, IEEE Trans. Geosci. Remote Sens. 48 (5) (2010) 2297–2307.
[3] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng. 17 (4) (2005) 491–502.
[4] S. Ding, Feature selection based F-score and ACO algorithm in support vector machine, in: Proceedings of the 2nd International Symposium on Knowledge Acquisition and Modeling, 2009.
[5] L.T. Vinh, S. Lee, Y.-T. Park, B.J. d'Auriol, A novel feature selection method based on normalized mutual information, Appl. Intell. 37 (2010) 100–120.
[6] M.H. Aghdam, N. Ghasem-Aghaee, M.E. Basiri, Text feature selection using ant colony optimization, Expert Syst. Appl. 36 (2009) 6843–6853.
[7] M. Kabir, Md. Shahjahan, K. Murase, A new local search based hybrid genetic algorithm for feature selection, Neurocomputing 74 (2011) 2914–2928.
[8] D. Hu, P. Ronhvde, Z. Nussinov, Replica inference approach to unsupervised multiscale image segmentation, Phys. Rev. E 85 (2012) 016101.
[9] R. Kohavi, G. John, Wrappers for feature selection, Artif. Intell. 97 (1–2) (1997) 273–324.
[10] M.E. Basiri, N. Ghasem-Aghaee, M.H. Aghdam, Using ant colony optimization-based selected features for predicting post-synaptic activity in proteins, EvoBIO, in: Lecture Notes in Computer Science, vol. 4973, 2008, pp. 12–23, Italy.
[11] P. Bermejo, J.A. Gámez, J.M. Puerta, A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets, Pattern Recognit. Lett. 32 (2001) 701–711.
[12] J. Huang, Y. Cai, X. Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information, Pattern Recognit. Lett. 28 (2007) 1825–1844.
[13] R.K. Sivagaminathan, S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Syst. Appl. 33 (2007) 49–60.
[14] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[15] M. Dash, H. Liu, Feature selection for classification, Intell. Data Anal. 1 (1997) 131–156.
[16] L.Y. Chuang, C.H. Yang, J.C. Li, Chaotic maps based on binary particle swarm optimization for feature selection, Appl. Soft Comput. 11 (2011) 239–248.
[17] L.Y. Chuang, S.W. Tsai, C.H. Yang, Improved binary particle swarm optimization using catfish effect for feature selection, Expert Syst. Appl. 38 (2011) 12699–12707.
[18] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognit. Lett. 28 (4) (2007) 459–471.
[19] I. Oh, J.S. Lee, B.R. Moon, Hybrid genetic algorithm for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1424–1437.
[20] S. Sarafrazi, H. Nezamabadi-pour, Facing the classification of binary problems with a GSA–SVM hybrid system, Math. Comput. Model. 57 (1–2) (2013) 270–278.
[21] E. Rashedi, H. Nezamabadi-pour, Feature subset selection using improved binary gravitational search algorithm, J. Intell. Fuzzy Syst. 26 (3) (2014) 1211–1221.
[22] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, A simultaneous feature adaptation and feature selection method for content-based image retrieval systems, Knowl. Based Syst. 39 (2013) 85–94.
[23] M. Dorigo, G.D. Caro, Ant colony optimization: a new meta-heuristic, in: Proceedings of the IEEE Congress on Evolutionary Computing, 1999.
[24] M. Dorigo, V. Maniezzo, A. Colorni, The ant system: optimization by a colony of cooperative agents, IEEE Trans. Syst. Man Cybern. 26 (1) (1996) 1–13.
[25] A. Al-Ani, Feature subset selection using ant colony optimization, Int. J. Comput. Intell. 2 (1) (2005) 53–58.
[26] M. Dorigo, L.M. Gambardella, Ant colonies for the traveling salesman problem, BioSystems 43 (1997) 73–81.
[27] M. Dorigo, L.M. Gambardella, Ant colony system: a cooperative learning approach to the traveling salesman problem, IEEE Trans. Evolut. Comput. 1 (1) (1997) 53–66.
[28] C. Blum, Ant colony optimization: introduction and recent trends, Phys. Life Rev. 2 (2005) 353–373.
[29] C.K. Zhang, H. Hu, Feature selection using the hybrid of ant colony optimization and mutual information for the forecaster, in: Proceedings of the 4th International Conference on Machine Learning and Cybernetics, 2005, pp. 1728–1732.
[30] M.H. Aghdam, N. Ghasem-Aghaee, M.E. Basiri, Application of ant colony optimization for feature selection in text categorization, in: Proceedings of the 5th IEEE Congress on Evolutionary Computation, Hong Kong, 2008.
[31] S. Nemati, M.E. Basiri, N. Ghasem-Aghayee, M.H. Aghdam, A novel ACO–GA hybrid algorithm for feature selection in protein function prediction, Expert Syst. Appl. 36 (2009) 12086–12094.
[32] S.H. Xiong, J.Y. Wang, H. Lin, Hybrid feature selection algorithm based on dynamic weighted ant colony algorithm, in: Proceedings of the 9th International Conference on Machine Learning and Cybernetics, Qingdao, 2010.
[33] Y. Chen, D. Miao, R. Wang, A rough set approach to feature selection based on ant colony optimization, Pattern Recognit. Lett. 31 (2010) 226–233.
[34] H. Touhidi, H. Nezamabadi-pour, S. Saryazdi, Feature selection using binary ant algorithm, in: Proceedings of the First Joint Congress on Fuzzy and Intelligent Systems, Mashhad, Iran, August 2007 (in Farsi).
[35] S. Kashef, H. Nezamabadi-pour, A new feature selection algorithm based on binary ant colony optimization, in: Proceedings of the 5th Conference on Information and Knowledge Technology, IKT, Shiraz, Iran, 2013.
[36] M. Janaki Meena, K.R. Chandran, A. Karthik, A. Vijay Samuel, An enhanced ACO algorithm to select features for text categorization and its parallelization, Expert Syst. Appl. 39 (2012) 5861–5871.
[37] T. Hiroyasu, M. Miki, Y. One, Y. Minami, Ant Colony for Continuous Functions, The Science and Engineering, Doshisha University, Japan, 2000.
[38] N. Karaboga, A. Kalinli, D. Karaboga, Designing digital IIR filters using ant colony optimization algorithm, Eng. Appl. Artif. Intell. 17 (2004) 301–309.
[39] L. Ozbakir, A. Baykasoglu, S. Kulluk, H. Yapici, TACO-miner: an ant colony based algorithm for rule extraction from trained neural networks, Expert Syst. Appl. 36 (2009) 12295–12305.
[40] B. Chen, L. Chen, Y. Chen, Efficient ant colony optimization for image feature selection, Signal Process. 93 (2013) 1566–1576.
[41] R. Jensen, Combining Rough and Fuzzy Sets for Feature Selection (Ph.D. thesis), University of Edinburgh, 2005.
[42] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005).
[43] C.L. Huang, ACO-based hybrid classification system with feature subset selection and model parameters optimization, Neurocomputing 73 (2009) 438–448.
[44] C.T. Su, H.C. Lin, Applying electromagnetism-like mechanism for feature selection, Inf. Sci. 181 (2011) 972–986.
[45] UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems. 〈http://archieve.ics.uci.edu/ml/datasets.html〉.

Shima Kashef received her B.Sc. and M.Sc. degrees in Electrical Engineering from Shahid Bahonar University of Kerman, Iran, in 2011 and 2014, respectively. Her research interests include pattern recognition and evolutionary computation.

Hossein Nezamabadi-pour received his B.Sc. degree in Electrical Engineering from Shahid Bahonar University of Kerman in 1998, and his M.Sc. and Ph.D. degrees in Electrical Engineering from Tarbait Moderres University, Tehran, Iran, in 2000 and 2004, respectively. In 2004, he joined the Department of Electrical Engineering at Shahid Bahonar University of Kerman, Kerman, Iran, as an assistant Professor, and was promoted to full Professor in 2012. Dr. Nezamabadi-pour is the author and co-author of more than 300 peer reviewed journal and conference papers. His interests include image processing, pattern recognition, soft computing, and evolutionary computation.