
Applied Soft Computing 37 (2015) 751–762

Contents lists available at ScienceDirect

Applied Soft Computing


journal homepage: www.elsevier.com/locate/asoc

Multi-objective optimization of shared nearest neighbor similarity for feature selection

Partha Pratim Kundu ∗, Sushmita Mitra
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India

Article info

Article history:
Received 4 March 2014
Received in revised form 3 August 2015
Accepted 25 August 2015
Available online 18 September 2015

Keywords:
Nearest neighbor distance
Hubs
Multi-objective optimization
Sample similarity
Redundancy analysis

Abstract

A new unsupervised feature selection algorithm, based on the concept of shared nearest neighbor distance between pattern pairs, is developed. A multi-objective framework is employed for the preservation of sample similarity, along with dimensionality reduction of the feature space. A reduced set of samples, chosen to preserve sample similarity, serves to reduce the effect of outliers on the feature selection procedure while also decreasing computational complexity. Experimental results on six sets of publicly available data demonstrate the effectiveness of this feature selection strategy. A comparative study with related methods, based on different evaluation indices, has demonstrated the superiority of the proposed algorithm.

© 2015 Published by Elsevier B.V.

1. Introduction

Analysis of large data involves feature selection as an important stage. It is particularly used during preprocessing, for reducing dimensionality, removing irrelevant attributes, reducing storage requirements, increasing computational efficiency, and enhancing output comprehensibility. In the process it selects a minimum subset of features with cardinality d, from an original set of cardinality D (d < D), such that the feature space is optimally reduced according to certain predetermined evaluation criteria [18]. Feature selection has been widely applied to many fields such as text categorization [16], stock market analysis [19], wireless sensor network analysis [1], genomic analysis [3] and social media analysis [38,39].

Search is an important issue in feature selection, encompassing its starting point, direction, and strategy [32]. A search over a dataset involving D features needs to traverse a feature space of 2^D possibilities, where D is the cardinality of the original feature space. With larger values of D this space becomes huge; to deal with such situations, random search or related strategies are found to be useful. One also needs to evaluate the performance of the generated feature subsets [30,31].

Feature selection can be supervised, semi-supervised or unsupervised, depending on the availability of class or label information of patterns. The algorithms are typically categorized as filter, wrapper and embedded models [27,32], based on whether (or not) the learning methodology is used to select the feature subset and the stage at which the selection is made. Filter methods rank or evaluate feature subsets based on information content and/or intrinsic properties of the data [27]. The wrapper methods assess the selected feature subsets according to their usefulness toward a given predictor or classifier. However, selecting a good set of features is usually suboptimal for building a predictor, particularly in the presence of redundant variables. The embedded models use a learner with all features, and select the optimal set based on the structure of the learning algorithm [32]. Since finding the best feature subset is intractable or NP-hard [2], heuristic and non-deterministic strategies are often deemed practical.

Supervised feature selection is mostly dependent on the existence of a labeled dataset. On the other hand, in the absence of class information, the unsupervised techniques use some intrinsic property of the data [33]. Here no external information, like the class label of a pattern, is needed, and the selected feature subset is evaluated in terms of certain criteria. Related literature on some such evaluation measures includes the Category Utility score [11], Fisher's feature dependency measure [13,37], and entropy-based unsupervised feature ranking [8]. These algorithms select subset(s) of features while preserving inherent characteristics of the data.

Similarity measures based on distance are often sensitive to the dimensionality of the pattern space [6]. The relative contrast is found to decrease with increasing dimensionality, over a broad range of data distributions and distance measures. This, in turn, reduces the discriminatory ability of the measures [22]. As an alternative, researchers have devised a simple and common secondary similarity measure involving shared nearest neighbor (SNN) information. The novelty of this paper lies in selecting a subset of features by using this discriminative property of a secondary distance, so that the neighborhood structure of a pattern that exists in the original feature space is preserved also in the reduced space.

SNN has been used in the merging step of agglomerative clustering [17,24], for clustering high dimensional data sets [15,21], and in identifying outliers in subspaces of high dimensional data [28]. It is found to be less affected by the distance concentration effect that occurs in higher dimensions, and is more robust than primary distances while providing better separability in the presence of irrelevant and redundant features [22].

A popular feature selection algorithm, based on the nearest neighbor approach, is ReliefF [26,36]. Another well-known algorithm is SPEC [41], which ranks a feature based on its alignment to the leading eigenvectors of the pair-wise similarity matrix of samples. Thereby it helps to preserve the geometric structure of the data. However, these algorithms handle each feature individually, while neglecting possible correlation between different sets of features. Zhao et al. [42] overcame this limitation by collectively evaluating sets of features to solve the combinatorial optimization formulation, using a sequential forward selection (SPFS-SFS) [42] approach.

∗ Corresponding author. Tel.: +91 3325753109.
E-mail addresses: pkundu2003@gmail.com (P.P. Kundu), sushmita@isical.ac.in (S. Mitra).

http://dx.doi.org/10.1016/j.asoc.2015.08.042
1568-4946/© 2015 Published by Elsevier B.V.

In this paper we employ the SNN-based distance for feature selection. It chooses a reduced set of features while preserving the pairwise sample similarity. A feature evaluation criterion is formulated in terms of the SNN distance, and is simultaneously optimized with the feature set cardinality in a multi-objective framework. We employ the multi-objective genetic algorithm NSGA-II [10] as an optimization technique to traverse the search space and find a non-dominated set of features. NSGA-II is a randomized search, guided by the principle of evolution and natural genetics, with a large amount of implicit parallelism.

The rest of the paper is organized as follows. In Section 2 we present an outline of multi-objective optimization and shared nearest neighbors, followed by some basic measures for evaluating feature subspaces. The feature selection algorithm is introduced in Section 3. The experimental results and comparative study, on various real datasets, are described in Section 4. Finally, Section 5 concludes the article.

Fig. 1. Mapping from feasible solution space into objective function space.

2. Preliminaries

In this section we present the mathematical background related to the proposed shared nearest neighbor based feature selection algorithm. We begin with some concepts from multi-objective optimization and shared nearest neighbors, followed by a few feature subset evaluation indices.

Fig. 2. Pareto optimal front, or non-dominated solutions, of F1 and F2.


2.1. Multi-objective optimization

Multi-objective optimization [9] trades off between a vector of objective functions F(x) = [F1(x), F2(x), ..., FM(x)], where M is the number of objectives and x (∈ R^Z) is a vector of Z decision variables. Unlike single-objective optimization problems, this technique tries to optimize two or more conflicting characteristics represented by the objective functions. Modeling this situation in a single-objective format would amount to a heuristic determination of a number of parameters employed in expressing such a scalar-combination-type objective function. The multi-objective technique, on the other hand, is engaged with minimization or maximization of a vector of objectives F(x) that can be subject to a number of inequality constraints (pi) and/or equality constraints (qk) or bounds. In other words, this methodology has

    Minimize/Maximize F(x)                                   (1)

    subject to  pi(x) ≤ 0,       i = 1, 2, ..., I;
                qk(x) = 0,       k = 1, 2, ..., K;
                xjL ≤ xj ≤ xjU,  j = 1, 2, ..., Z;

where I and K are the numbers of inequality and equality constraints respectively. Each decision variable xj takes a value within lower bound xjL and upper bound xjU, with the bounds composing a decision variable space D. Z denotes the number of components of x. The set of solutions x that satisfy all (I + K) constraints and all 2Z variable bounds constitutes the feasible solution space Ω. As the objective functions compete with each other, there is no unique solution to this problem. Instead, the concept of non-dominance [9] (also called Pareto optimality [7]) must be used to characterize the objectives. The objective function space Λ is defined as Λ = {f ∈ R^M | f = F(x), x ∈ Ω}. A mapping from the feasible solution space into the objective function space, in two dimensions, is delineated in Fig. 1.

The concept of optimality behind multi-objective optimization handles a set of solutions. The conditions for a solution to dominate another are listed here. A solution x(1) is said to dominate the other solution x(2) if the following two conditions are true [9]:

1. The solution x(1) is no worse than x(2) in all M objectives, i.e. Ft(x(1)) ⊴ Ft(x(2)) for all t = 1, 2, ..., M.
2. The solution x(1) is strictly better than x(2) in at least one of the M objectives, i.e. Ft(x(1)) ◁ Ft(x(2)) for at least one t ∈ {1, 2, ..., M}.

Here the operator ◁, written t1 ◁ t2 for two solutions t1 and t2, denotes that solution t1 is better than solution t2 on a particular objective (and ⊴ denotes "no worse than"). If any of the above conditions is not satisfied, then the solution x(1) does not dominate the solution x(2). The mutually non-dominated solutions constitute the Pareto optimal front of the objective functions. A typical Pareto-optimal front over two objective functions is depicted in Fig. 2. Here we simultaneously optimize the conflicting requirements of the multiple objective functions.

Genetic algorithms may be used as a tool for multi-objective optimization. In this article we have used the Non-dominated Sorting Genetic Algorithm (NSGA-II) [10], which has been shown to converge to the global Pareto front while simultaneously maintaining the diversity of the population [10].

2.2. Shared nearest neighbor distance

A fundamental form of shared nearest-neighbor similarity measure is the 'overlap' [24]. Let a data set X consist of N = |X| sample points, let s ∈ N+, and let NNs(x) ⊆ X be the set of s nearest neighbors of x ∈ X as determined by some specified primary distance or similarity measure, viz. Euclidean, city block, or cosine distance. A primary similarity or distance measure is any function which determines a ranking of sample points relative to a query; it is not necessary for the data points to be represented as vectors. The query pertains to the s nearest neighbors of a sample.

The 'overlap' between sample points x and y is defined to be the intersection size SNNs(x, y) = |NNs(x) ∩ NNs(y)|. It is an alternative to conventional similarity, and is sometimes called a secondary similarity measure as it is based on the rankings induced by a specified primary similarity measure. The similarity measure implemented here is based on this 'overlap'. It is similar to the cosine of the angle between the zero-one set membership vectors for NNs(x) and NNs(y), and is defined as [15,20]

    simcoss(x, y) = SNNs(x, y) / s.                          (2)
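The 'overlap' and the simcoss similarity of Eq. (2) are straightforward to compute once the primary neighbor lists are available. A minimal NumPy sketch (illustrative only; the Euclidean primary distance and the function names are our own choices, not prescribed by the paper):

```python
import numpy as np

def knn_indices(X, s):
    """Return the s-nearest-neighbor index sets NNs(x) for every row of X,
    using Euclidean distance as the primary measure (self excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    order = np.argsort(D, axis=1)
    return [set(order[i, :s]) for i in range(len(X))]

def snn_overlap(nn, i, j):
    """SNNs(x, y) = |NNs(x) ∩ NNs(y)| -- the 'overlap'."""
    return len(nn[i] & nn[j])

def simcos(nn, i, j, s):
    """Secondary similarity of Eq. (2): simcoss(x, y) = SNNs(x, y) / s."""
    return snn_overlap(nn, i, j) / s
```

For two points in the same tight cluster the overlap, and hence simcos, is high; for points in different clusters it drops toward zero.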

Transforming to the distance form [22], we have

    dacoss(x, y) = arccos(simcoss(x, y)).                    (3)

This distance is symmetric and satisfies the triangle inequality; therefore, it is a metric [22]. There also exist other similar distance forms, like the linear inversion dinvs(x, y) = 1 − simcoss(x, y) and the logarithmic form dlns(x, y) = − ln(simcoss(x, y)) [22]. However, these distances do not satisfy the triangle inequality property. All of these distance functions decrease monotonically with respect to the similarity value between the points x and y. The proposed algorithm, outlined in the following section, employs the dacoss(x, y) distance to evaluate feature subsets.

2.3. Evaluation of subspaces

Feature subsets can be evaluated in terms of sample similarity and redundancy. Two such criteria are the Jaccard score (JAC) and the redundancy rate (RED).

The JAC evaluates the proficiency of a selected feature subset in preserving pairwise sample similarity, and is computed as [42]

    JAC(MF, M, m) = (1/N) Σ_{i=1..N} |NN(i, m, MF) ∩ NN(i, m, M)| / |NN(i, m, MF) ∪ NN(i, m, M)|.   (4)

Here MF = XF XF^T is the similarity matrix computed over the selected feature set F (using the inner product), XF is the pattern set with F features, M is the similarity matrix computed in the original feature space, and NN(i, m, M) and NN(i, m, MF) denote the m nearest neighbors of the ith sample according to M and MF respectively. JAC measures the average overlap of the neighborhoods specified by MF and M, with a higher score indicating better preservation of sample similarity.

The RED assesses the average linear correlation among all feature pairs over a subset of F features, and is measured as [42]

    RED(F) = (1 / (d(d − 1))) Σ_{fi, fj ∈ F, i > j} ρij,     (5)

where Pearson's correlation coefficient ρij is defined as [40]

    ρij = Σ_{k=1..N} (x(k, i) − x̄(i))(x(k, j) − x̄(j)) / sqrt( Σ_{k=1..N} (x(k, i) − x̄(i))² · Σ_{k=1..N} (x(k, j) − x̄(j))² )   (6)

between feature pairs fi and fj, with x̄(i) = (1/N) Σ_{k=1..N} x(k, i), and the cardinality of F being d. A larger value of this measure indicates that more features are strongly correlated, implying greater redundancy in F. Therefore a smaller value of RED(F) corresponds to the selection of a better feature subset.

3. Multi-objective feature selection

The proposed feature selection method selects a small subset of features which preserves the pairwise common natural grouping present in the s-size neighborhood of the original feature space, while simultaneously reducing the size of the feature set. Let PDMs be the pairwise secondary distance matrix of N × N dimension, where N is the number of patterns in a data set. We have

    PDMs(i, j) = dacoss(i, j).                               (7)

It is obvious from the definition that PDMs is symmetric and its principal diagonal elements are always zero. Hence the upper triangular part of matrix PDMs contains information about the pairwise common natural groupings of all N data points in the original feature space.

Patterns which are closer to the mean of a data distribution are typically closer to all other patterns over any feature subset. It is shown [35] that this tendency gets amplified with higher dimensionality, such that patterns residing in the proximity of the mean appear closer to all other patterns as compared to the analogous situation in a reduced space. This tendency results in high-dimensional patterns that are closer to the mean having increased inclusion probability in the k-NN lists of other patterns. Such patterns are termed hubs, and are the most relevant [35]. Real datasets are usually clustered, with patterns being organized into groups produced by a mixture of distributions. Therefore, hubs tend to be closer to their respective cluster centers. Our algorithm tries to include these special patterns by selecting those with lower values of dacoss. These are used as representative points of the datasets, since a lower value of dacoss is indicative of a higher value of SNNs. Moreover, dacoss is computed in the original high-dimensional space, with a lower value implying an increased inclusion probability for the patterns that are closer to the cluster means in the dataset.

Algorithm 2 is used to generate a reduced set of patterns Xsel. As |Xsel| ≪ N, it reduces the computational complexity of the subsequent optimization process. PDMs1 is computed employing s1 nearest neighbors on the sample set Xsel, with its dimension being Nsel × Nsel (Nsel = |Xsel|), where Nsel is a user-defined parameter. It may be noted that the choice Nsel ≪ N serves to reduce the effect of outliers on the feature selection process.

Let PDMs1,fredu(i, j) be the pairwise secondary distance, with s1 nearest neighbors being evaluated on the set Xsel over a subset fredu of features. Here fredu is a reduced subset of the original features, and is considered for evaluation. In a multi-objective framework the proposed algorithm simultaneously reduces the size of the feature subset while preserving the pairwise topological neighborhood information present in the s-size neighborhood of the original feature space. This neighborhood information is incorporated in terms of the objective function

    F1 = Σ_{i=1..Nsel−1} Σ_{j=i+1..Nsel} |PDMs1(i, j) − PDMs1,fredu(i, j)|.   (8)

The second objective function is the cardinality of the reduced feature set, and is expressed by minimizing

    F2 = |fredu|.                                            (9)

We employ NSGA-II [10,5] for heuristically exploring this search space.

A given feature subset is represented as a binary string, also called a chromosome, with a "0" or "1" in position k specifying the absence or presence of the kth feature in the set. The length of the chromosome is equal to the total number of available features in the data. It represents a prospective solution of the problem at hand. A population of such chromosomes is evaluated by optimizing the two objective functions, in a multi-objective framework, in order to enhance the fitness.

Algorithm 1. MFSSNN
Input: Pattern set X, with N sample points and D features.
       Size of neighborhoods s and s1, cardinality of reduced set Nsel.
Output: A feature subset ffinal.
1: Construct the pairwise dissimilarity matrix PDMs using Eq. (7).
2: Construct a set Xsel of samples using Algorithm 2.
3: Calculate PDMs1 with s1 nearest neighbors on the set Xsel.
4: Select feature subset(s) by simultaneously optimizing F1 [Eq. (8)] and F2 [Eq. (9)] in a multi-objective framework.

The multi-objective GA proceeds to find a fit set of individuals (here, feature subsets) by reproducing new children chromosomes from older parents. In the process it employs the operators selection, crossover (where parts of two parent chromosomes are mixed to

Fig. 3. Pareto optimal front for the proposed algorithm over datasets (a) MF, (b) USPS, (c) COIL20, (d) ORL, (e) Ozone-onehr, and (f) Ozone-eighthr.
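The two objectives traced out in fronts such as Fig. 3 follow directly from Eqs. (7)-(9). A hedged NumPy sketch (the helper names are our own, and the Euclidean primary distance is an illustrative choice; the paper's experiments use the cosine distance):

```python
import numpy as np

def pairwise_dacos(X, s):
    """PDM of Eq. (7): secondary distance dacos_s = arccos(SNN_s / s)
    for all sample pairs, Euclidean primary distance, self excluded."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = [set(np.argsort(row)[:s]) for row in D]
    N = len(X)
    pdm = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            sim = len(nn[i] & nn[j]) / s
            pdm[i, j] = pdm[j, i] = np.arccos(np.clip(sim, -1.0, 1.0))
    return pdm

def objectives(X_sel, chromosome, pdm_full, s1):
    """F1 [Eq. (8)] and F2 [Eq. (9)] for a binary feature mask over X_sel;
    pdm_full is the precomputed PDM on all features."""
    mask = np.asarray(chromosome, dtype=bool)
    pdm_redu = pairwise_dacos(X_sel[:, mask], s1)
    iu = np.triu_indices(len(X_sel), k=1)
    f1 = np.abs(pdm_full - pdm_redu)[iu].sum()   # upper-triangle sum
    f2 = int(mask.sum())                         # cardinality of fredu
    return f1, f2
```

With the all-ones chromosome the reduced and original neighborhoods coincide, so F1 vanishes; dropping features trades a larger F1 against a smaller F2, which is exactly the conflict NSGA-II explores.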

Fig. 4. Classification performance of different feature subsets, selected from the Pareto optimal front, for datasets (a) MF, (b) USPS, (c) COIL20, (d) ORL, (e) Ozone-onehr, and (f) Ozone-eighthr.
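The per-subset accuracies plotted in Fig. 4 come from cross-validated classifiers. As a lightweight stand-in for that protocol, a leave-one-out k-NN evaluator can be sketched in NumPy (illustrative only; the paper uses 10-fold cross validation with k-NN, NB and SVM):

```python
import numpy as np

def knn_accuracy(X, y, k=1):
    """Leave-one-out k-NN accuracy: each sample is classified by a majority
    vote among its k nearest neighbors (itself excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)               # leave the query point out
    correct = 0
    for i in range(len(X)):
        votes = y[np.argsort(D[i])[:k]]
        values, counts = np.unique(votes, return_counts=True)
        correct += values[np.argmax(counts)] == y[i]
    return correct / len(X)
```

Running this on the columns retained by a candidate chromosome gives the kind of accuracy-versus-d curve shown in the figure.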

Fig. 5. Comparative study of k-NN classifier performance using feature subsets (of different cardinality) selected by MFSSNN from the Pareto optimal front, with other state-of-the-art methods, for datasets (a) MF, (b) USPS, (c) COIL20, (d) ORL, (e) Ozone-onehr, and (f) Ozone-eighthr.
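The subsets compared in Fig. 5 are drawn from non-dominated fronts, identified by the dominance test of Section 2.1. A minimal sketch for two minimized objectives (function names are ours):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective and
    strictly better in at least one -- the two conditions of Section 2.1."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Extract the Pareto-optimal front from a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

NSGA-II layers this test into a full non-dominated sorting of the population, but the front itself is exactly the set this filter retains.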

Fig. 6. Performance in terms of JAC over different feature subsets, selected from the Pareto optimal front, for datasets (a) MF, (b) USPS, (c) COIL20, (d) ORL, (e) Ozone-onehr, and (f) Ozone-eighthr.
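The JAC values of Fig. 6 follow Eq. (4), with neighborhoods taken from the inner-product similarity matrices of Section 2.3. A small NumPy sketch (helper names are illustrative):

```python
import numpy as np

def _neighbors(S, i, m):
    """Indices of the m samples most similar to sample i under similarity
    matrix S (self excluded)."""
    order = [j for j in np.argsort(-S[i]) if j != i]
    return set(order[:m])

def jac_score(X, feature_mask, m):
    """Jaccard score of Eq. (4): mean |NN ∩ NN_F| / |NN ∪ NN_F| over all
    samples, with M = X X^T and MF = XF XF^T."""
    M = X @ X.T
    MF = X[:, feature_mask] @ X[:, feature_mask].T
    total = 0.0
    for i in range(len(X)):
        a, b = _neighbors(M, i, m), _neighbors(MF, i, m)
        total += len(a & b) / len(a | b)
    return total / len(X)
```

Keeping every feature reproduces the original neighborhoods, so the score is 1; shrinking the subset can only preserve or degrade the overlap.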

Table 1
Performance comparison with related algorithms. Accuracy (%) is reported for the k-NN, NB and SVM classifiers, with standard deviations in parentheses; the row labeled "D" corresponds to using all D original features.

Data set and parameters      d     Algorithm   k-NN          NB            SVM           JAC    RED

MF                           155   SPFS-SFS    91.2 (0.25)   94.9 (0.10)   83.1 (0.20)   0.90   0.0253
N = 2000, D = 649                  ReliefF     83.1 (0.28)   94.0 (0.12)   84.7 (0.16)   0.90   0.0122
#Class = 10, Nsel = 80             SPEC        96.5 (0.14)   93.3 (0.10)   94.1 (0.11)   0.32   0.0053
s = 50, s1 = 50                    Proposed    93.8 (0.21)   93.0 (0.11)   87.4 (0.15)   0.96   0.0107
                                   D           95.1 (0.24)   95.9 (0.15)   88.9 (0.10)   –      –

USPS                         89    SPFS-SFS    69.5 (0.17)   52.1 (0.07)   64.0 (0.07)   0.09   0.0619
N = 9298, D = 256                  ReliefF     94.8 (0.07)   82.1 (0.07)   90.2 (0.06)   0.41   0.0405
#Class = 10, Nsel = 80             SPEC        94.4 (0.09)   81.4 (0.06)   90.4 (0.06)   0.43   0.0388
s = 50, s1 = 50                    Proposed    95.7 (0.07)   83.8 (0.05)   91.4 (0.05)   0.55   0.0387
                                   D           98.3 (0.06)   82.5 (0.06)   96.2 (0.03)   –      –

COIL20                       253   SPFS-SFS    89.8 (0.24)   63.7 (0.51)   80.2 (0.41)   0.11   0.0633
N = 1440, D = 1024                 ReliefF     98.4 (0.22)   78.3 (0.32)   93.4 (0.29)   0.30   0.1390
#Class = 20, Nsel = 80             SPEC        84.4 (0.35)   61.9 (0.36)   73.6 (0.45)   0.12   0.0665
s = 50, s1 = 50                    Proposed    99.7 (0.08)   90.7 (0.30)   95.5 (0.14)   0.67   0.0737
                                   D           99.8 (0.08)   92.6 (0.40)   95.9 (0.19)   –      –

ORL                          756   SPFS-SFS    95.6 (0.11)   89.4 (0.39)   92.5 (0.36)   0.48   0.0953
N = 400, D = 2576                  ReliefF     95.4 (0.17)   81.6 (0.38)   92.6 (0.26)   0.30   0.1312
#Class = 40, Nsel = 60             SPEC        94.9 (0.17)   87.0 (0.35)   92.3 (0.39)   0.42   0.1009
s = 50, s1 = 50                    Proposed    97.6 (0.15)   94.0 (0.30)   98.0 (0.25)   0.77   0.0649
                                   D           97.8 (0.14)   94.1 (0.30)   98.1 (0.18)   –      –

Ozone-onehr                  31    SPFS-SFS    95.0 (0.22)   70.2 (0.08)   97.0 (0.15)   0.01   0.1879
N = 2536, D = 72                   ReliefF     95.2 (0.17)   70.2 (0.08)   96.9 (0.08)   0.05   0.1596
#Class = 2, Nsel = 80              SPEC        94.6 (0.14)   68.6 (0.06)   95.8 (0.30)   0.01   0.1191
s = 50, s1 = 50                    Proposed    95.2 (0.22)   65.0 (0.14)   97.0 (0.07)   0.99   0.1733
                                   D           97.0 (0.05)   21.2 (0.13)   97.12 (0.10)  –      –

Ozone-eighthr                36    SPFS-SFS    91.3 (0.14)   67.0 (0.04)   92.5 (0.30)   0.01   0.1578
N = 2534, D = 72                   ReliefF     90.5 (0.20)   66.2 (0.10)   93.0 (0.15)   0.05   0.2000
#Class = 2, Nsel = 80              SPEC        90.7 (0.04)   68.0 (0.13)   92.3 (0.18)   0.01   0.1141
s = 50, s1 = 50                    Proposed    92.0 (0.15)   65.0 (0.01)   94.0 (0.14)   0.99   0.2000
                                   D           93.4 (0.04)   24.3 (0.03)   93.7 (0.02)   –      –
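The RED column of Table 1 follows Eqs. (5) and (6); since ρij is Pearson's correlation, np.corrcoef yields it directly. A hedged sketch (the function name is ours):

```python
import numpy as np

def red_score(X, feature_idx):
    """Redundancy rate of Eq. (5):
    RED(F) = (1 / (d (d - 1))) * sum over pairs i > j of rho_ij."""
    C = np.corrcoef(X[:, feature_idx], rowvar=False)  # Pearson's rho, Eq. (6)
    d = len(feature_idx)
    iu = np.triu_indices(d, k=1)                      # each pair counted once
    return C[iu].sum() / (d * (d - 1))
```

Two duplicated features give the maximum pairwise ρ = 1, inflating RED; a subset of uncorrelated features keeps it near zero, which Table 1 rewards as lower redundancy.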

create an offspring) and mutation (where bit(s) of a single parent 2. USPS is a handwritten digit database [23]. It contains 9298 hand-
are randomly perturbed to create an offspring). Crossover probabil- written images.2 over 16 × 16 pixels, and has 10 classes.
ity pc (with scattered crossover function) and mutation probability 3. ORL database consists of a total of 400 face images of 40 subjects.3
pm are used. Crowded-comparison operator [10] is used for selec- The original images are subsampled to a size of 56 × 46 pix-
tion, with preference for solutions having higher non-domination els, with 256 gray levels per pixel. Thus each face image can be
rank and separated from each other based on crowded distance. The represented by a 2576-dimensional feature vector.
chromosomes associated with the non-dominated solutions with 4. COIL20 is a database of gray scale images of 20 objects, each hav-
respect to the fitness functions are decoded to obtain the reduced ing 72 images [34]. The original images are subsampled down to
feature subsets. The algorithm stops when the weighted average 32 × 32 pixels, with 256 gray levels per pixel.
change in the fitness function value, over 50 generations, is less 5. Ozone-onehr contains ground level ozone data measured at one
than the average change in value of the spread of the Pareto set hour peak values,4 from 1998 to 2004, in the Houston, Galveston
[10]. and Brazoria areas.
The multi-objective feature selection algorithm, based on 6. Ozone-eighthr consists of ground level ozone data,5 from eight
Shared Nearest Neighbors (MFSSNN), is outlined as Algorithm 1. hour peak values, collected as above.

Algorithm 2. A heuristic for constructing Xsel


Since s « N, the performance of SNN is found to be reasonably
Input: The pair wise dissimilarity matrix PDMs and Nsel .
Output: Reduced sample set Xsel robust to the choice of s [22]. Here we selected s as 50 and s1 = s,
1: Find the minimum entry (>0) of each row of PDMs and store as with s1 < Nsel < 2 * s1 for our experiments. s was selected according
min rowi , with i ∈ 1, . . ., N − 1. to the guidelines provided in Ref [22]. Results of Table 1 were gener-
2: Sort min rowi in ascending order along with indices. ated using cosine distance as the primary measure. Multi-objective
3: Select top Nsel index values of min rowi .
4: Generate sample set Xsel with these selected points.
4. Experimental results

The effectiveness of the algorithm was evaluated by externally validating the selected feature subsets in terms of their predictive accuracy, as measured by well-known classifiers such as k-nearest neighbor (k-NN) [14] for k = 1, 3, 5, Naive Bayes (NB) [12] and Support Vector Machine (SVM) [25], using 10-fold cross validation. The process was repeated 50 times and the results were averaged. Feature selection was performed on the training data whenever a separate training set was available; otherwise 90% of the samples were used as training data. The class labels of the training data were not used during selection, since the method is unsupervised. The paired Student's t-test for unequal mean and variance [29,4] was used to compute the statistical significance of the obtained results, with the threshold for rejecting the null hypothesis set at 0.05. We used six sets of real data, whose characteristics are listed below.

1. Multiple features (MF) dataset consists of 2000 samples from 10 classes of handwritten numerals ('0'–'9'), having 649 real-valued features. These are extracted from a collection of Dutch utility maps, available in the UCI Machine Learning Repository.1

NSGA-II has been used to optimize the evaluation criteria of Eqs. (8) and (9), for selecting a minimal set of features. The parameter settings used were crossover probability pc = 0.8, population size = 100, and number of generations = 300, with the mutation probability pm being varied over the generations, between 0.01 and 0.2, based on a Gaussian function. The selection operator used for the GA is stochastic universal sampling (SUS). The GA parameters were determined after extensive experimental evaluation. Fig. 3 depicts the Pareto optimal front, for the six datasets, using the proposed feature selection algorithm. The two objective functions F1 and F2, of Eqs. (8) and (9), are plotted along the two axes.

We provide in Fig. 4 the classification accuracy (%) with respect to the cardinality of the feature subsets (selected by our algorithm) over the six datasets, for the classifiers k-NN (k = 1, 3, 5), NB and SVM. It is observed that the classification performance is more or less stable for different cardinalities d of the feature subsets belonging to the Pareto optimal front, across all the classifiers over all datasets. However, k-NN is generally better.

Fig. 5 depicts a comparative study of the k-NN classifier performance using MFSSNN for selecting different feature subsets from the Pareto optimal front, along with those selected from several

1 http://www.ics.uci.edu/~mlearn/MLRepository.html.
2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html%23usps.
3 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
4 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
5 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
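The text states only that pm follows a Gaussian function of the generation index, bounded between 0.01 and 0.2; the exact curve is not given. A minimal sketch of one such schedule is below, where the bell-curve centre and width are our assumptions, not values from the paper:

```python
import math

# Assumed bounds and run length, taken from the stated GA settings.
PM_MIN, PM_MAX = 0.01, 0.2
GENERATIONS = 300

def mutation_probability(gen, total=GENERATIONS, sigma_frac=0.25):
    """Hypothetical Gaussian schedule for pm at 0-based generation `gen`.

    pm peaks at PM_MAX mid-run and decays toward PM_MIN at both ends;
    the centre (total/2) and width (sigma_frac * total) are illustrative.
    """
    centre = total / 2.0
    sigma = sigma_frac * total
    gauss = math.exp(-((gen - centre) ** 2) / (2.0 * sigma ** 2))
    return PM_MIN + (PM_MAX - PM_MIN) * gauss

# pm stays within the stated [0.01, 0.2] band across all 300 generations.
schedule = [mutation_probability(g) for g in range(GENERATIONS)]
```

Any other bounded bell-shaped curve would satisfy the description equally well; the point is only that mutation pressure is highest in the middle of the run.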
760 P.P. Kundu, S. Mitra / Applied Soft Computing 37 (2015) 751–762
[Fig. 7 appears here: six panels plotting RED against the number of selected features d, one panel per dataset.]

Fig. 7. Performance in terms of RED over different feature subsets, selected from the Pareto optimal front, for datasets (a) MF, (b) USPS, (c) COIL20, (d) ORL, (e) Ozone-onehr, and (f) Ozone-eighthr.
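Alongside RED, the evaluation uses JAC to measure how well sample similarity survives feature selection. Eq. (4) is not reproduced in this excerpt, so the sketch below only illustrates the general idea with a plain Jaccard overlap of k-nearest-neighbor sets before and after reduction; the function names and parameters are illustrative, not the authors' exact formulation:

```python
from math import dist  # Euclidean distance between coordinate tuples

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def neighborhood_jaccard(full, reduced, k=3):
    """Mean Jaccard overlap of each sample's k-NN set computed in the
    full feature space versus the reduced feature space.

    A value near 1 means the neighborhood structure (and hence pairwise
    sample similarity) is preserved after feature selection.
    """
    n = len(full)
    total = 0.0
    for i in range(n):
        a = knn_indices(full, i, k)
        b = knn_indices(reduced, i, k)
        total += len(a & b) / len(a | b)
    return total / n
```

For example, dropping a feature that is constant across all samples leaves every pairwise distance unchanged, so this score is exactly 1.0.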
state-of-the-art methods, over different datasets. In most cases algorithm MFSSNN performs better than the other feature selection methods.

Figs. 6 and 7 illustrate the variation in preservation of sample similarity using JAC [Eq. (4)] and the variation in redundancy using RED [Eq. (5)], respectively, with respect to the cardinality of the feature subsets over the six datasets. It is found from Fig. 6 that the value of JAC remains stable for datasets MF, Ozone-onehr and Ozone-eighthr over feature subsets of higher cardinality. However, in the case of the USPS, COIL20 and ORL datasets its value increases with higher subset cardinality. We can thus conclude that patterns which are similar in a higher dimensional space remain similar in the reduced space; the structural similarity is appropriately preserved over reduced dimensions. Our algorithm is also able to eliminate redundant features during selection, as demonstrated in Fig. 7.

The performance of Algorithm 1, for the six datasets, was also compared to that of SPFS-SFS [42], ReliefF [26] and SPEC [41]. The results are presented in Table 1 for a subset of features, chosen from the plots of Figs. 3 and 4. The cardinality of the reduced feature subsets, in each case, is listed in column 2. The third, fourth and fifth columns indicate the average classification accuracy involving 10-fold cross-validation, using the k-NN (for k = 1), Naive Bayes (NB) and Support Vector Machine (SVM) classifiers respectively. The values within parentheses represent the standard deviations over 50 independent runs. The last two columns depict the effectiveness of the selected feature subset in preserving pairwise sample similarity, in terms of JAC, and the feature subset redundancy, in terms of RED. Statistical significance of the classification performance of the compared algorithms was also tested. The last row, corresponding to each dataset, contains the average, cross-validated classification accuracy over the original feature space.

The proposed feature selection algorithm MFSSNN performs the best (as highlighted in bold in the table) over all classifiers, as well as in terms of sample similarity, for the datasets USPS, COIL20 and ORL. Interestingly, the classification accuracy (%) is found to be comparable for feature cardinality d with respect to the original feature space of cardinality D (d < D). However, for the MF data its performance is comparable to that of the three other algorithms. Moreover, since our feature selection algorithm is unsupervised, the efficacy of its performance becomes even more apparent. In the case of the two Ozone datasets, MFSSNN performs well with the classifiers SVM and k-NN, while producing comparable performance with NB.

The time complexity of the proposed algorithm largely depends on the cost of building the dissimilarity matrix PDMs as well as on the multi-objective optimization technique. We need O(sN²D) floating point operations for constructing PDMs. Next we generate PDMs1 for each randomly selected feature subset from the set Xsel, which requires O(s1n²D) floating point operations. This is followed by optimization using Eq. (8). When NSGA-II is used with a population size St over g generations, the complexity of the optimization process becomes O(MSt²gs1n²D). Here we have used two objectives, i.e. M = 2. Hence the overall time complexity of Algorithm 1 becomes O((sN² + St²gs1n²)D).

5. Conclusions

In this article we have developed a new unsupervised feature selection algorithm which tries to preserve sample similarity in a reduced feature space, based on the concept of shared nearest neighbor distance. It simultaneously reduces the cardinality of the feature subsets in a multi-objective framework. The results demonstrate that the reduced feature subsets not only preserved the predictive accuracy of the classifiers in the reduced feature space, but also improved it a little in some cases with respect to the original feature space. The validation index indicated that sample similarity was also preserved in the reduced space, with the selected features having very little correlation amongst them. A comparative study with related algorithms like SPFS-SFS, ReliefF and SPEC demonstrated the suitability of our algorithm.

Our algorithm tries to preserve pair-wise structural similarity through a set of representative points (termed hubs [35]), which lie closer to the cluster-means of a dataset. When these points are identified correctly, our algorithm performs better than many other related methods (as demonstrated in Fig. 5 and Table 1). However, appropriate selection of this representative set is a bottleneck, and sometimes results in a collection of patterns other than hubs. This is an area which needs further investigation, and is a reason why our algorithm fails in some cases. We also plan to design other multi-objective frameworks, specific to this problem, in order to alleviate the situation. We shall further investigate our method on more datasets, as well as on datasets which have no label information associated with them.

References

[1] M. Alwadi, G. Chetty, A novel feature selection scheme for energy efficient wireless sensor networks, in: Proceedings of International Conference of Algorithms and Architectures for Parallel Processing, 2012, pp. 264–273.
[2] E. Amaldi, V. Kann, On the approximation of minimizing non zero variables or unsatisfied relations in linear systems, Theor. Comput. Sci. 209 (1998) 237–260.
[3] M. Ashraf, G. Chetty, D. Tran, D. Sharma, Hybrid approach for diagnosing thyroid, hepatitis, and breast cancer based on correlation based feature selection and Naïve Bayes, in: Proceedings of International Conference of Neural Information Processing, 2012, pp. 272–280.
[4] A. Aspin, Tables for use in comparisons whose accuracy involves two variances, Biometrika (1949) 245–271.
[5] M. Banerjee, S. Mitra, H. Banka, Evolutionary-rough feature selection in gene expression data, IEEE Trans. Syst. Man Cybern. C: Appl. Rev. 37 (2007) 622–632.
[6] K.S. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is "nearest neighbor" meaningful?, in: Proceeding of International Conference on Database Theory (ICDT), 1999, pp. 217–235.
[7] Y. Censor, Pareto optimality in multiobjective problems, Appl. Math. Optim. 4 (1977) 41–59.
[8] M. Dash, H. Liu, J. Yao, Dimensionality reduction for unsupervised data, in: Proc. of the Nineteenth IEEE International Conference on Tools with AI, Newport Beach, CA, USA, 1997, pp. 532–539.
[9] K. Deb, Multi-Objective Optimization using Evolutionary Algorithms, John Wiley, London, 2001.
[10] K. Deb, S. Agarwal, A. Pratap, T. Meyarivan, A fast and elitist multi-objective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2002) 182–197.
[11] M. Devaney, A. Ram, Efficient feature selection in conceptual clustering, in: Proc. of the Fourteenth International Conference on Machine Learning, Nashville, TN, 1997, pp. 92–97.
[12] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, Englewood Cliffs, 1982.
[13] F. Douglas, Knowledge acquisition via incremental conceptual clustering, Mach. Learn. 2 (1987) 139–172.
[14] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley, New Jersey, 2001.
[15] L. Ertoz, M. Steinbach, V. Kumar, Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, in: Proceedings of SIAM International Conference on Data Mining (SDM'03), 2003, pp. 333–352.
[16] M. Grineva, M. Grinev, D. Lizorkin, Extracting key terms from noisy and multitheme documents, in: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, 2009, pp. 661–670.
[17] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: Proceedings of SIGMOD Conference 1998, 1998, pp. 73–84.
[18] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[19] Y. He, K. Fataliyev, L. Wang, Feature selection for stock market analysis, in: Proceedings of International Conference on Neural Information Processing, 2013, pp. 737–744.
[20] M.E. Houle, Navigating massive datasets via local clustering, in: Proceeding of Knowledge Discovery and Data Mining (KDD), 2003, pp. 333–352.
[21] M.E. Houle, The relevant-set correlation model for data clustering, Stat. Anal. Data Min. 1 (3) (2008) 157–176.
[22] M.E. Houle, H.P. Kriegel, P. Kröger, E. Schubert, A. Zimek, Can shared-neighbor distances defeat the curse of dimensionality?, in: Proceedings of 22nd International Conference on Scientific and Statistical Database Management (SSDBM), 2010.
[23] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554.
[24] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared near neighbors, IEEE Trans. Comput. C-22 (1973) 1025–1034.
[25] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of 10th European Conference on Machine Learning ECML, 1998, pp. 137–142.
[26] K. Kira, L. Rendell, A practical approach to feature selection, in: D. Sleeman, P. Edwards (Eds.), Proceedings of International Conference on Machine Learning, Aberdeen, Morgan Kaufmann, July 1992, pp. 368–377.
[27] R. Kohavi, G. John, Wrappers for feature selection, Artif. Intell. 97 (1997) 273–324.
[28] H.P. Kriegel, P. Kröger, E. Schubert, A. Zimek, Outlier detection in axis-parallel subspaces of high dimensional data, in: Proceedings of Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD 2009), 2009, pp. 831–838.
[29] E.L. Lehmann, Testing of Statistical Hypothesis, John Wiley, New York, 1976.
[30] H. Liu, E.R. Dougherty, J.G. Dy, K. Torkkola, E. Tuv, H. Peng, C.H.Q. Ding, F. Long, M.E. Berens, L. Parsons, Z. Zhao, L. Yu, G. Forman, Evolving feature selection, IEEE Intell. Syst. 20 (2005) 64–76.
[31] H. Liu, H. Motoda, R. Setiono, Z. Zhao, Feature selection: an ever evolving frontier in data mining, in: JMLR: Workshop and Conference Proceedings: The Fourth Workshop on Feature Selection in Data Mining, 2010, pp. 4–13.
[32] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng. 17 (2005) 491–502.
[33] L.C. Molina, L. Belanche, À. Nebot, Feature selection algorithms: a survey and experimental evaluation, in: Proceedings of ICDM 2002, 2002, pp. 306–313.
[34] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report No. CUCS-006-96, Dept. of Computer Science, Columbia University, New York, 1996.
[35] M. Radovanović, A. Nanopoulos, M. Ivanović, Hubs in space: popular nearest neighbors in high dimensional data, J. Mach. Learn. Res. 11 (2010) 2487–2531.
[36] M. Robnik-Sikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (2003) 23–69.
[37] L. Talavera, Dependency-based feature selection for clustering symbolic data, Intell. Data Anal. 4 (2000) 19–28.
[38] J. Tang, X. Hu, H. Gao, H. Liu, Unsupervised feature selection for multi-view data in social media, in: Symposium on Data Mining, 2013, pp. 270–278.
[39] J. Tang, H. Liu, Coselect: feature selection with instance selection for social media data, in: Symposium on Data Mining, 2013, pp. 695–708.
[40] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Elsevier, CA, USA, 2009.
[41] Z. Zhao, H. Liu, Spectral feature selection for supervised and unsupervised learning, in: Proceedings of International Conference on Machine Learning (ICML), 2007, pp. 1151–1157.
[42] Z. Zhao, L. Wang, H. Liu, J. Ye, On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng. 25 (2013) 619–632.
