Applied Soft Computing
Volume 12, Issue 11, November 2012, Pages 3698–3700
Short communication
An initial seed selection algorithm for k-means clustering of georeferenced
data to improve replicability of cluster assignments for mapping application
Fouad Khan,
Central European University-Environmental Sciences and Policy Department, Nador utca 9, 1051 Budapest, Hungary
Received 6 February 2012, Revised 25 June 2012, Accepted 11 July 2012, Available online 24 July 2012
DOI: 10.1016/j.asoc.2012.07.021
Abstract
K-means is one of the most widely used clustering algorithms in various disciplines, especially for large
datasets. However, the method is known to be highly sensitive to the initial selection of cluster-center seeds.
K-means++ has been proposed to overcome this problem and has been shown to have better accuracy
and computational efficiency than k-means. In many clustering problems, though – such as when
classifying georeferenced data for mapping applications – standardization of the clustering methodology,
specifically the ability to arrive at the same cluster assignment on every run of the method, i.e. the replicability
of the methodology, may be of greater significance than any perceived measure of accuracy, especially
when the solution is known to be non-unique, as in the case of k-means clustering. Here we propose a
simple initial seed selection algorithm for k-means clustering along one attribute that draws initial cluster
boundaries along the "deepest valleys", or greatest gaps, in the dataset. It thus incorporates a measure that
maximizes the distance between consecutive cluster centers, augmenting the conventional k-means
optimization for minimum distance between a cluster center and its members. Unlike existing
initialization methods, no additional parameters or degrees of freedom are introduced into the clustering
algorithm. This improves the replicability of cluster assignments by as much as 100% over k-means and
k-means++, virtually reducing the variance over different runs to zero, without introducing any additional
parameters to the clustering process. Further, the proposed method is more computationally efficient than
k-means++ and, in some cases, more accurate.
Highlights
► We use a new initial seeding method for k-means clustering.
► The method improves replicability by at least 90% compared to k-means++.
► The method does not introduce any new parameters to the clustering algorithm.
► The method is especially suited for clustering of georeferenced data for mapping.
Keywords
Classification;
Grouping of data;
Natural breaks;
Jenks natural breaks;
Natural breaks ArcGIS;
Jenks ArcGIS
1. Introduction
Clustering, or the classification of data into groups that represent some measure of homogeneity
across a given variable range or across the values of multiple variables, is a much-studied
problem in pattern recognition. K-means clustering is one of the most widely used methods for
solving this problem and for assigning data to clusters. The method in its initial formulation was
first proposed by MacQueen in 1967 (MacQueen 1967), though the approximation developed by
Lloyd in 1982 (Lloyd 1982) has proven the most popular in application. The method assumes
a priori knowledge of the number of clusters k and requires seeding with initial values for the
centers of these clusters in order to be implemented. These initial seed values have been shown
to be an important determinant of the eventual assignment of data to clusters; in other words,
k-means clustering is highly sensitive to the initial selection of seed values for the cluster centers
(Peña, Lozano et al. 1999).
K-means++ has been proposed to overcome this problem and has been shown to produce a
substantial improvement in algorithm accuracy and computational efficiency, or speed (Ostrovsky,
Rabani et al. 2006; Arthur and Vassilvitskii 2007). The algorithm assesses the performance of the
initial seed selection based on the sum of squared differences between the members of a cluster
and the cluster center, normalized to the data size. While this is a worthwhile means of assessing
method performance, it may be noted that in many clustering applications the replicability of the
resultant cluster assignment can be far more desirable than cluster homogeneity as perceived
through an objective measure.
We encountered one such application of the clustering problem while trying to cluster
georeferenced data into classes for mapping and visualization through a Geographical
Information Systems (GIS) software suite. The commercially available GIS software ArcGIS, for
instance, utilizes a proprietary modification of Jenks' natural breaks algorithm (Jenks 1967) to
classify the values of a variable for visualization in maps (ArcGIS 2009). The classification this
method obtains seems to reproduce itself with remarkable accuracy on each run: the clustering
bounds do not vary from run to run, even with variable values given to eleven significant figures.
Jenks' algorithm differs only slightly from k-means clustering. K-means using Lloyd's algorithm
aims to minimize the cost function C defined in Equation 2:
C = \sum_{i=1}^{n} \min_{1 \le j \le k} \mathrm{dist}(d_i, c_j)^2        Equation 2
where n is the number of data points, k is the number of clusters and dist(d_i, c_j) is the Euclidean
distance between point d_i and its closest center c_j. The algorithm runs as follows:
a) Select centers c1,…,ck at random from the data.
b) Minimize the cost function C by assigning each data point d1,…,dn to the cluster with the
closest center.
c) Calculate new centers c1,…,ck as the means of the clusters assigned in step b.
d) Repeat steps b and c until no change is observed in the center values c1,…,ck.
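For illustration only, a minimal Python sketch of steps a) to d) along one attribute is given below; the names lloyd_kmeans, init and max_iter are ours and not part of any published implementation, and the optional init argument is included so that the seeding schemes discussed later can be supplied explicitly.

```python
import random

def lloyd_kmeans(data, k, init=None, max_iter=100):
    """Lloyd's k-means along one attribute; `init` optionally supplies seed centers."""
    data = list(data)
    # a) Select k initial centers at random from the data unless seeds are supplied.
    centers = list(init) if init is not None else random.sample(data, k)
    for _ in range(max_iter):
        # b) Assign every data point to the cluster with the closest center.
        clusters = [[] for _ in range(k)]
        for d in data:
            clusters[min(range(k), key=lambda j: (d - centers[j]) ** 2)].append(d)
        # c) Recompute each center as the mean of its cluster (keep the old value if empty).
        new_centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        # d) Stop when no change is observed in the center values.
        if new_centers == centers:
            break
        centers = new_centers
    # Cost C of Equation 2: sum of squared distances to the closest centers.
    cost = sum(min((d - c) ** 2 for c in centers) for d in data)
    return centers, cost
```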
Jenks' algorithm differs in that, instead of C, it minimizes the cost function J defined in Equation 3:
J = \sum_{j=1}^{k} \sum_{d_i \in S_j} \mathrm{dist}(d_i, c_j)^2 \; - \; \sum_{j=1}^{k-1} \sum_{l=j+1}^{k} \mathrm{dist}(c_j, c_l)^2        Equation 3
where S_j denotes the set of data points assigned to cluster j.
As can be seen in Equation 3, Jenks' algorithm not only searches for the minimum distance between
data points and the centers of the clusters they belong to, but also for the maximum difference
between the cluster centers themselves (Jenks 1967).
If we are trying to develop a methodology for geo-processing, say a utility that studies the
scaling characteristics of a city and models the distribution of housing sizes within different
size clusters, it can be essential to have a clustering mechanism that produces nearly identical
results each time. Drawing inspiration from Jenks' algorithm, we propose an initial seed
selection algorithm for k-means clustering that produces the same clusters on each run. We
compare our results to those obtained by k-means as well as by the widely used k-means++ initial
seed selection methodology. K-means++ selects the initial centers as follows:
a) Select one center at random from dataset.
b) Calculate squared distance of each point from the nearest of all selected centers and sum
the squared distances.
c) Choose the next center at random and calculate the sum of squared distances. Re-select this
center and calculate the sum of squared distances again. Repeat for a given 'number of trials'
and select the center with the minimum sum of squared distances as the next center.
d) Repeat steps b and c until k centers have been selected.
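As a rough sketch only, the k-means++ seeding procedure in steps a) to d) could be coded as below. The 'number of trials' appears as the parameter n_trials; drawing each candidate with probability proportional to its squared distance to the already selected centers is our assumption, taken from the standard k-means++ formulation (Arthur and Vassilvitskii 2007), and the function name is ours.

```python
import random

def kmeanspp_seeds(data, k, n_trials=5):
    """k-means++-style seeding along one attribute, following steps a)-d) above."""
    data = list(data)
    # a) Select the first center at random from the dataset.
    centers = [random.choice(data)]
    while len(centers) < k:
        # b) Squared distance of each point to the nearest already-selected center.
        dist2 = [min((d - c) ** 2 for c in centers) for d in data]
        # c) Try several candidates and keep the one that minimizes the total sum of
        #    squared distances; n_trials is the extra parameter the proposed method avoids.
        best_candidate, best_cost = None, float("inf")
        for _ in range(n_trials):
            candidate = random.choices(data, weights=dist2, k=1)[0]
            cost = sum(min(d2, (d - candidate) ** 2) for d, d2 in zip(data, dist2))
            if cost < best_cost:
                best_candidate, best_cost = candidate, cost
        centers.append(best_candidate)
        # d) Repeat until k centers have been selected.
    return centers
```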
The methodology is novel in that, unlike other initial seed selection algorithms, it does not
introduce any new parameters (such as the number of trials in k-means++) into the clustering
algorithm, thereby avoiding additional degrees of freedom. By drawing cluster boundaries along the
deepest valleys, or largest gaps, in the data series, the method introduces a measure of distance
between cluster centers, augmenting the k-means optimization for minimum distance between a
cluster center and its members. Additionally, unlike initialization algorithms such as k-means++,
there is no randomness involved and the initial clusters obtained are always the same.
2. Materials and Method
We propose the following method for calculating the initial seed centers for k-means clustering
along one attribute.
a) Sort the data points in order of increasing magnitude d1,…,dn such that d1 has the
minimum and dn has the maximum magnitude.
b) Calculate the Euclidean distances Di between consecutive points di and di+1 as shown in
Equation 4:
Di = di+1 − di; where i = 1,…,(n−1)        Equation 4
c) Sort D in descending order without changing the index i of each Di. Identify the k−1 index
values (i1,…,i(k−1)) that correspond to the k−1 largest values of Di.
d) Sort i1,…,i(k−1) in ascending order. The set (i1,…,i(k−1),ik), where ik = n, now forms the set
of indices of the data values di that serve as the upper bounds of clusters 1,…,k.
e) The corresponding set of indices of the data values di that serve as the lower bounds of
clusters 1,…,k is simply (i0, i1+1,…,i(k−1)+1), where i0 = 1.
f) The cluster centers c are now simply calculated as the means of the di values falling
within the upper and lower bounds defined above. This set of cluster centers (c1,…,ck)
forms the initial seed centers.
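A minimal Python sketch of steps a) to f), equivalent in intent to the spreadsheet mentioned below, is given here; zero-based indices are used instead of the one-based indices in the text, and the function name deepest_valley_seeds is ours.

```python
def deepest_valley_seeds(data, k):
    """Initial seed centers for one-attribute k-means, drawn at the k-1 largest
    gaps ('deepest valleys') in the sorted data, following steps a)-f) above."""
    # a) Sort the data points in order of increasing magnitude.
    d = sorted(data)
    n = len(d)
    # b) Distances D_i between consecutive points (Equation 4).
    gaps = [d[i + 1] - d[i] for i in range(n - 1)]
    # c) Indices of the k-1 largest gaps, kept in ascending order.
    cuts = sorted(sorted(range(n - 1), key=lambda i: gaps[i], reverse=True)[:k - 1])
    # d) Upper-bound indices of clusters 1..k (the last upper bound is n-1).
    uppers = cuts + [n - 1]
    # e) Lower-bound indices of clusters 1..k (the first lower bound is 0).
    lowers = [0] + [i + 1 for i in cuts]
    # f) Initial centers are the means of the points between each bound pair.
    return [sum(d[lo:hi + 1]) / (hi - lo + 1) for lo, hi in zip(lowers, uppers)]
```

Passing these centers to the Lloyd's-algorithm sketch given earlier, for example lloyd_kmeans(data, k, init=deepest_valley_seeds(data, k)), then yields a deterministic cluster assignment for a given dataset.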
The methodology discussed above simply draws the cluster boundaries at the points in the data
where the gap between consecutive data values is largest, that is, where the data has its deepest
'valleys'. In this way, a measure of distance is introduced between consecutive cluster centers.
The method can be easily implemented for small to medium size datasets by using the
spreadsheet freely available for download at http://ge.tt/api/1/files/7FON8KH/0/blob?download.
To test the replicability of the cluster assignments produced using this methodology, the same
data was clustered with it ten times. The variance observed in the cluster centers over these ten
runs was calculated and averaged over the number of cluster centers. For comparison, a similar
analysis was performed with k-means and with the widely used k-means++ initial seeding
methodology, and the variance averaged over the number of cluster centers was calculated.
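Assuming the sketches given earlier are used, the replicability measure reported in Table 3 can be reproduced along the following lines; matching centers across runs by sorting them, and the helper name center_variance, are our own choices rather than part of the published procedure.

```python
from statistics import pvariance

def center_variance(data, k, seed_fn=None, runs=10):
    """Variance of each (sorted) cluster center across repeated runs,
    averaged over the k centers, as used for Table 3."""
    all_centers = []
    for _ in range(runs):
        init = seed_fn(data, k) if seed_fn is not None else None  # None -> random k-means seeding
        centers, _ = lloyd_kmeans(data, k, init=init)             # sketch defined earlier
        all_centers.append(sorted(centers))
    # Average over the k centers of the variance of each center position across runs.
    return sum(pvariance([run[j] for run in all_centers]) for j in range(k)) / k
```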
The analysis was run on five different datasets. The first is the popular Iris dataset from the UC
Irvine Machine Learning Repository (UCIMLR) (Fisher 1936); attribute one was used for
clustering, and the 150 data points were classified into 5 clusters. The second is US census
block-wise population data for the Metropolitan Statistical Area (MSA) of St. George, Utah. The
population, land area and water-body area data were downloaded from the US Census Bureau
website (US Census Bureau 2010). The area of each census block was calculated by summing its
water and land areas, and the population density was estimated by dividing the block's population
by its area. The 1450 data points were clustered along population density into 10 clusters. The
third is the Abalone dataset from the UCIMLR (Nash, Sellers et al. 1994); attribute 5 was used for
clustering, and the 4177 instances were clustered into 25 classes. The fourth is the cloud cover
data of Philippe Collard (Collard 1989); the data in column 3 were used for cluster analysis, and
the 1024 points were clustered into 50 clusters. The fifth is randomly generated, normally
distributed data with mean 10 and standard deviation 1; the 10,000 points were clustered into
100 clusters.
3. Results
While the objective in developing this method is to produce more replicable results, the sums of
squared differences between cluster members and cluster centers were also compared for the
proposed method and k-means++; they are juxtaposed in Table 1. As can be seen in Table 1,
k-means++ in general continues to produce more accurate clustering, though for two of the five
datasets our method produced better results.
Table 1: Sum of Squared Differences between Cluster Members and their Closest Centers (Normalized to Data Size)

Dataset       k-means++      Proposed method    Reduction %
Iris          0.042243916    0.037471719        11.30%
St. George    2.39419E-07    1.76868E-07        26.13%
Abalone       0.000817549    0.001229598        -50.40%
Cloud         2.379979794    5.22916047         -119.71%
Normal        0.000644885    0.001465068        -127.18%
As shown in Table 2, our proposed method is also significantly faster than k-means++, clustering
as much as 89% faster in some cases. The advantage in clustering speed is gained during initial
seed selection, where k-means++ takes significantly longer than both our proposed method and
k-means (Arthur and Vassilvitskii 2007).
Table 2: Algorithm Running Time (Seconds)

Dataset       k-means++       Proposed method    Reduction %
Iris          0.101           0.011              89.11%
St. George    2.312999994     0.438000001        81.06%
Abalone       19.79400002     16.191             18.20%
Cloud         7.771000001     1.886000005        75.73%
Normal        207.8150008     145.3870012        30.04%
The premier advantage of our proposed method over k-means and k-means++, though, is in
improving method replicability. The results are presented in Table 3. As can be seen, in every
case the variance was virtually reduced to zero using our method, an improvement of at least
90% over k-means++ and k-means.
Table 3: Variance of Centers over Ten (10) Runs, Averaged over the Number of Clusters

Dataset       Proposed method    k-means++      Reduction %    k-means      Reduction %
Iris          4.73317E-31        0.046361574    100.00%        0.499704     100.00%
St. George    1.12847E-37        1.22722E-36    90.80%         1.23E-36     90.80%
Abalone       2.37968E-32        0.003285155    100.00%        0.005395     100.00%
Cloud         1.72981E-28        31.54401321    100.00%        22.24461     100.00%
Normal        5.75868E-31        0.009478013    100.00%        0.054631     100.00%
4. Discussions and Conclusion
The initial seed selection method we propose reduces the variance of the clustering to zero,
accurate up to eleven significant figures, for clustering along one attribute or dimension. A
further advantage of the proposed initialization method is that, unlike k-means++, it does not
introduce any new variables into the analysis, such as the number of trials. Almost perfect
replicability and the avoidance of additional degrees of freedom make the method especially
suited for inclusion in a protocol, standard methodology or algorithm. Further, the method
produces results faster than k-means++ and hence is more computationally efficient, at least in
two-dimensional space.
The method has applications in all areas of data analysis where a Jenks-style ‘natural’
classification with a high level of replicability may be needed. It has the following distinct
advantages over other initialization methods and over a plain k-means implementation:
• The results are highly replicable.
• The method is fast and easy to implement.
• No additional degrees of freedom or modifiable parameters are introduced that may need
expert input to obtain replicable results.
• The clustering may be more ‘natural’ in the manner of Jenks’ algorithm, considering that
a measure of distance between cluster centers is introduced to augment the k-means
optimization for minimum distance between cluster members and cluster center.
The above advantages can render the initialization method highly useful in all areas where large
datasets have to be handled or a ‘natural’ classification of data is sought. This includes, for
instance, bioanalysis, where density-based clustering is commonly deployed; there the method
can be made part of a more detailed analysis regime with confidence that the replicability of the
results will not be negatively affected by the clustering algorithm. In the areas of market
segmentation and computer vision, the method can be used to standardize clustering results. This
makes the method especially suited to utility development for GIS applications.
References
ArcGIS (2009). "What is the source for ArcMap's Jenks Optimization classification?" Ask a
Cartographer. Retrieved April 15, 2011, from http://mappingcenter.esri.com/index.cfm?fa=ask.answers&q=541.
Arthur, D. and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algorithms. New Orleans, Louisiana,
Society for Industrial and Applied Mathematics: 1027-1035.
Collard, P. (1989). Philippe Collard's cloud cover data.
Fisher, R. A. (1936). Iris data set. R. A. Fisher, UC Irvine Machine Learning Repository.
Jenks, G. F. (1967). "The data model concept in statistical mapping." International Yearbook of
Cartography 7: 186-190.
Lloyd, S. (1982). "Least squares quantization in PCM." IEEE Transactions on Information Theory 28(2): 129-137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. L. M. Le
Cam and J. Neyman (eds.), University of California Press. I: 281-297.
Nash, W. J., T. L. Sellers, et al. (1994). Abalone data set. S. F. Division, UC Irvine Machine Learning
Repository.
Ostrovsky, R., Y. Rabani, et al. (2006). The effectiveness of Lloyd-type methods for the k-means problem.
Symposium on Foundations of Computer Science.
Peña, J. M., J. A. Lozano, et al. (1999). "An empirical comparison of four initialization methods for the K-
Means algorithm." Pattern Recognition Letters 20(10): 1027-1040.
US Census Bureau. (2010). "US census 2010." from http://www2.census.gov/census_2010/04-Summary_File_1/.