5.  y = sqrt(P_y);                 // vector of Euclidean distances
6.  G = min(1,...,k)(ceil(y));     // index of the nearest centre
7.  r(1,...,nm/k) = cluster_reform(r(1,...,k), G)
8.  Q = C                          // move the old centres into Q
9.  C_1 = mean(r_1); C_2 = mean(r_2); ...; C_k = mean(r_k);
    } end while
18. Procedure stop
B. Proposed pseudocode for the data clustering algorithm

Replace step 4 with steps 11-17:

11: // Calculate the variance of the m columns of the data matrix nm
    D_1 = Variance([(1,...,n), 1])
    D_2 = Variance([(1,...,n), 2])
    ...
    D_m = Variance([(1,...,n), m])
12: E = ceil(max(D));              // find the column with the maximum variance
13: F = Sortpro([(1,...,n), E]);   // sort on the maximum-variance column
14: // Partition matrix [(1,...,n), E] into k subsets and store in vector form
    G_1(1,...,n/k)          = [(1,...,n/k), E]
    G_2(n/k+1,...,2n/k)     = [(n/k+1,...,2n/k), E]
    ...
    G_k((k-1)n/k+1,...,n)   = [((k-1)n/k+1,...,n), E]
15: for q = 1 to k do
        C_q = median(G_q);
16:     // find the index of the median data point in vector G_q
    end for
17: // return to step 5 of the existing algorithm
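As a concrete illustration, steps 11-17 can be sketched in MATLAB, the language used for the simulation below. The function name variance_init, the data matrix X (n rows, m columns) and the rounding of the partition edges are illustrative assumptions rather than the authors' code:

    % A minimal MATLAB sketch of steps 11-17, assuming X is an n-by-m data
    % matrix and k the number of clusters; all names are illustrative only.
    function C = variance_init(X, k)
        [n, m] = size(X);
        D = var(X, 0, 1);            % step 11: variance of each column
        [~, E] = max(D);             % step 12: column with maximum variance
        [~, order] = sort(X(:, E));  % step 13: sort rows on that column
        Xs = X(order, :);
        edges = round(linspace(0, n, k + 1));  % step 14: k near-equal subsets
        C = zeros(k, m);
        for q = 1:k                  % steps 15-16: median point of each subset
            part = Xs(edges(q)+1:edges(q+1), :);
            C(q, :) = median(part, 1);
        end
    end                              % step 17: pass C to the existing step 5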
IV. SIMULATION AND RESULT ANALYSIS
The simulation was carried out in two steps: Step 1: k-means; Step 2: the proposed clustering algorithm.
The simulation model was implemented in the MATLAB language. To implement the proposed data clustering algorithm, the k-means m-file that ships with the Statistics Toolbox was modified, and the algorithm was invoked with the necessary parameters. The input to the program was the simulated multivariate normally distributed dataset and the Iris dataset, while the output was in turn passed to the confusion-matrix m-file to derive the confusion matrix for each consecutive run. After generating the matrix, Eq. (2) given by [11], which can be written as r = (1/N) sum_{i=1..k} a_i, was used to calculate the accuracy of each run. This procedure was repeated several times and sometimes produced irregular results; however, the best run was chosen at the end. The performance measures accuracy, adjusted Rand index, entropy and speed were used to show the improvement of the proposed data clustering algorithm over the existing algorithm. The results are reported on a PC with the following configuration: Intel(R) Core(TM) 2 Duo, 1.83 GHz, 2038 MB RAM, 160 GB HDD, with WLAN and Bluetooth. Here N is the number of samples in the dataset and a_i is the number of data samples occurring in both cluster i and its corresponding class, i.e. the maximal value in row i of the confusion matrix.
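As a concrete reading of Eq. (2), the following sketch computes the accuracy from a k-by-c confusion matrix of cluster-versus-class counts; the function name and input layout are assumptions for illustration:

    % A minimal sketch of the accuracy measure: take the maximal entry of
    % each cluster's row of the confusion matrix (its dominant class) and
    % divide the total by N. CM is assumed to hold k-by-c counts.
    function acc = clustering_accuracy(CM)
        N = sum(CM(:));          % total number of samples
        a = max(CM, [], 2);      % a_i: dominant-class count for cluster i
        acc = sum(a) / N;        % r = (1/N) * sum_i a_i
    end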
A. Artificial datasets
These datasets were generated from a multivariate normal distribution in which the variance of each variable is assumed to be equal and the covariance is zero. Also, in order to compare performance when some outliers are present among the objects, outliers were added to the generated datasets. These outliers were likewise generated from a multivariate normal distribution.
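A minimal sketch of how such a dataset could be generated in MATLAB is shown below; the number of clusters, sample sizes, means and the wider spread used for the outliers are illustrative assumptions, and mvnrnd belongs to the Statistics Toolbox already used in the simulation:

    % Generate k Gaussian clusters with equal variance and zero covariance,
    % then append a handful of outliers from a second normal distribution.
    k = 3; n_per = 100; m = 4; sigma2 = 1.0;
    X = zeros(k * n_per, m);
    for c = 1:k
        mu = repmat(5 * c, 1, m);             % distinct mean per cluster
        rows = (c-1)*n_per+1 : c*n_per;
        X(rows, :) = mvnrnd(mu, sigma2 * eye(m), n_per);
    end
    outliers = mvnrnd(zeros(1, m), 25 * eye(m), 10);  % assumed outlier spread
    X = [X; outliers];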
B. Fisher's Iris dataset
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Aylmer Fisher in 1936 as an example of discriminant analysis. It is sometimes called Anderson's Iris dataset, because Edgar Anderson collected the data to quantify the geographic variation of Iris flowers on the Gaspé Peninsula (Wikipedia Free Encyclopedia). R.A. Fisher's Iris dataset is often referenced in the field of pattern recognition. It consists of 3 groups (classes) of 50 patterns each. Each group corresponds to one species of Iris flower: Iris setosa (class C1), Iris versicolor (class C2), and Iris virginica (class C3). Every pattern has 4 features (attributes), representing petal width, petal length, sepal width, and sepal length (expressed in centimeters).
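For reference, the dataset ships with the MATLAB Statistics Toolbox used in the simulation and can be loaded as follows (meas and species are the toolbox's own variable names):

    % Load Fisher's Iris data: meas is 150-by-4 (the four features in cm),
    % species is a 150-by-1 cell array of class labels (50 per species).
    load fisheriris
    size(meas)        % ans = 150 4
    unique(species)   % {'setosa'; 'versicolor'; 'virginica'}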
Tables I and II summarize the results on the accuracy of k-means and of the proposed clustering algorithm. The number of clusters was varied from 3 to 10 for a fixed number of iterations (10), and the best results at the end of the iterations were used. The results show that the accuracy peaks when the number of clusters is 3 and falls as the number of clusters increases. Comparing the standard results in Table I with the simulated results in Table II shows that the accuracy of the proposed method is higher at K=3 and likewise falls as the number of clusters increases. Accuracies of 88.9% and 89.3% were obtained at the same number of clusters, K=3. Hence the proposed algorithm achieved an accuracy of 89.3%, against 88.9% for the existing method, thereby improving on it by 0.4 percentage points (about 0.45%).
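The protocol just described might be scripted along the following lines, where kmeans is the Statistics Toolbox routine the authors modified, grp2idx maps the class labels to integers, and keeping the best of 10 runs per K is the selection rule stated above:

    % Vary K from 3 to 10; for each K keep the best accuracy over 10 runs.
    g = grp2idx(species);                     % true classes as integers
    best = zeros(1, 8);
    for K = 3:10
        for it = 1:10
            idx = kmeans(meas, K);            % cluster assignments
            CM = accumarray([idx, g], 1);     % confusion matrix (K-by-3)
            best(K-2) = max(best(K-2), sum(max(CM, [], 2)) / numel(idx));
        end
    end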
TABLE I: SUMMARY OF RESULTS ON ACCURACY FOR EXISTING DATA CLUSTERING ALGORITHM

No. of clusters (K):  3      4      5      6      7      8      9     10
Accuracy:             0.889  0.693  0.666  0.666  0.66   0.66   0.6   0.6

TABLE II: SUMMARY OF RESULTS ON ACCURACY FOR PROPOSED DATA CLUSTERING ALGORITHM

No. of clusters (K):  3      4      5      6      7      8      9     10
Accuracy:             0.893  0.693  0.666  0.666  0.66   0.66   0.6   0.6
C. Adjusted Rand index versus number of clusters (Iris dataset) using Euclidean distances

Fig. 2 shows the adjusted Rand index against the number of clusters when the number of clusters was varied from K=2 to K=10 for both the existing and the proposed method. The adjusted Rand index ranges from 0 to 1 and is best at 1. The graph shows that from K=2 to K=5 the adjusted Rand indices are equal and do not vary, but at K=6 clusters the existing algorithm drops and shows the characteristics of a local optimum, while the proposed algorithm is stable and decreases at a lower rate than the existing method. At K=10 there remains a small difference between the two: the existing method achieves an adjusted Rand index of 53%, compared with 63.7% for the proposed method. On average the proposed algorithm performed better than the existing method.
Fig. 2: Adjusted Rand Index for K-means and Proposed
Clustering Algorithm using the Iris Dataset under Euclidean
Distances
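The paper does not spell the index out; the sketch below computes both the Rand index of Fig. 3 and the adjusted Rand index of Fig. 2 from two label vectors, following the standard pair-counting definitions rather than the authors' code:

    % A minimal sketch of the (adjusted) Rand index between two labelings.
    function [ri, ari] = rand_indices(labels1, labels2)
        n = numel(labels1);
        [~, ~, g1] = unique(labels1);        % map labels to 1..r
        [~, ~, g2] = unique(labels2);        % map labels to 1..c
        CM = accumarray([g1(:), g2(:)], 1);  % contingency table
        nij = sum(CM(:) .* (CM(:) - 1)) / 2; % pairs together in both
        a = sum(sum(CM, 2) .* (sum(CM, 2) - 1)) / 2;  % together in labels1
        b = sum(sum(CM, 1) .* (sum(CM, 1) - 1)) / 2;  % together in labels2
        np = n * (n - 1) / 2;                % total number of pairs
        ri  = (np + 2 * nij - a - b) / np;   % Rand index
        ex  = a * b / np;                    % expected value under chance
        ari = (nij - ex) / ((a + b) / 2 - ex);   % adjusted Rand index
    end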
D. Rand index versus number of clusters (multivariate normal distribution datasets) using Euclidean distances

Fig. 3 shows the Rand index against the number of clusters, under Euclidean distances, when the number of clusters was varied from K=2 to K=10. The graph shows that at every cluster setting the proposed method scored higher than the existing method, except at K=6, 7, 8 and 9. As the number of clusters increased, both schemes increased in Rand index value up to a peak of 0.62929 at K=9 for the proposed method and 0.63091 at K=9 for the existing algorithm. Rand indices of 0.53939 and 0.55192 were obtained at the same number of clusters, K=2, for the existing and proposed methods respectively. Hence the clustering results of the proposed algorithm on the multivariate normal distribution datasets, using Euclidean distances, are of better quality than those of the existing algorithm.
Fig. 3 : Rand Index for K-means and Proposed Clustering
Algorithm using the Multivariate Normal Distribution
Dataset under Euclidean Distances
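A Fig. 3-style sweep could then be reproduced roughly as follows, reusing rand_indices from the previous sketch; the data X and the true labels y stand for the assumed synthetic dataset of Section IV-A:

    % Sweep the number of clusters and record the Rand index at each K.
    Ks = 2:10;
    ri = zeros(size(Ks));
    for i = 1:numel(Ks)
        idx = kmeans(X, Ks(i));          % existing or proposed variant
        ri(i) = rand_indices(idx, y);    % Rand index against true labels
    end
    plot(Ks, ri, '-o'); xlabel('Number of clusters K'); ylabel('Rand index');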
E. Time in clustering the Iris dataset at a fixed number of clusters

Table III shows the time spent clustering the Iris dataset with the proposed method and with the existing method. The simulation was run for 10 trials and the time was recorded at the end of each one. From Table III, the existing algorithm was faster at trials 2, 3, 4, 5, 6 and 7, and the proposed method faster at trials 1, 8, 9 and 10. The average value is 0.0451 s for k-means and 0.0439 s for the proposed method, a 2.7% reduction in average clustering time. The range and mean value are also tabulated in Table IV. Thus, while clustering the Iris dataset at a fixed number of clusters K=3, the response time of the proposed method is faster than that of the existing method.
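The per-trial times in Table III could be collected with a harness like the one below; tic/toc is MATLAB's standard timer, and which variant of kmeans runs in each trial is left open:

    % Record the wall-clock time of each of 10 clustering trials at K=3.
    trials = 10; K = 3;
    t = zeros(trials, 1);
    for i = 1:trials
        tic;
        kmeans(meas, K);   % the algorithm under test
        t(i) = toc;        % elapsed seconds for this trial
    end
    fprintf('average %.4f s, range (%.3f-%.3f) s\n', mean(t), min(t), max(t));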
TABLE III: ALGORITHM COMPARISON FOR IRIS DATASET AT K=3

Trial No.   K-means (sec)   Proposed method (sec)
1           0.362           0.320
2           0.011           0.020
3           0.007           0.017
4           0.005           0.013
5           0.008           0.018
6           0.006           0.010
7           0.007           0.011
8           0.009           0.007
9           0.023           0.011
10          0.013           0.012
TABLE IV: COMPARATIVE RESULTS

Algorithm         Average value (sec)   Range (low-high) (sec)   Mean value (sec)
K-means           0.0451                (0.005-0.365)            0.0451
Proposed method   0.0439                (0.010-0.320)            0.0439
V. CONCLUSION

The results from the performance evaluation showed that the proposed data clustering algorithm can be incorporated into a web-based search engine to provide better performance. The response-time results show that the time taken to retrieve documents will be reduced, while the accuracy and adjusted Rand index results show that users' queries will return consistent results that meet their search criteria, as compared to the existing web search engines. The proposed model was able to reduce the problem of speed while increasing accuracy to a considerable level over the existing approach. It would therefore be worthwhile for web search engine designers to incorporate this model into an existing web-based search engine so that web users can retrieve their documents at a faster rate and with higher accuracy.
References
[1] A. Jain and M. Murty, Data Clustering: A Review. ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[2] M. Belal Al-Daoud, A New Algorithm for Cluster Initialization. Proceedings of the World Academy of Science, Engineering and Technology, vol. 4, pp. 74-76, 2005.
[3] C.C. Hsu and Y.C. Chen, Mining of Mixed Data with Application to Catalog Marketing. Expert Systems with Applications, vol. 32, pp. 12-23, 2007.
[4] C.M. Benjamin, K.W. Fung, and E. Martin, Encyclopaedia of Data Warehousing and Mining. Montclair State University, USA, 2006.
[5] D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, Partitioning-Based Clustering for Web Document Categorization. Decision Support Systems, vol. 27, pp. 329-341, 1999.
[6] O. Zamir, Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis, University of Washington, 1999.
[7] G. Fung, A Comprehensive Overview of Basic Clustering Algorithms. Technical Report, University of Wisconsin, Madison, 2001.
[8] O. Zamir and O. Etzioni, Web Document Clustering. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46-54, 1998.
[9] O.M. San, V.N. Huynh, and Y. Nakamori, An Alternative Extension of the K-Means Algorithm for Clustering. International Journal of Applied Mathematics and Computer Science, vol. 14, pp. 241-247, 2004.
[10] H.-S. Park, J.-S. Lee, and C.-H. Jun, A K-Means-Like Algorithm for K-Medoids Clustering and Its Performance. Department of Industrial and Management Engineering, POSTECH, San 31, Hyoja-dong, 780-784, 2006.
[11] S. Sambasivam and N. Theodosopoulos, Advanced Data Clustering Methods of Mining Web Documents. Issues in Informing Science and Information Technology, vol. 3, pp. 563-579, 2006.
[12] Y.M. Cheung, k*-Means: A New Generalized k-Means Clustering Algorithm. Pattern Recognition Letters, vol. 24, pp. 2883-2893, 2003.
[13] Z. Huang, Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, vol. 2, pp. 283-304, 1998.