A Study On K-Means Clustering in Text Mining Using Python
Available at http://www.ijcsonline.com/
Abstract
According to statistics, India has 195,248,950 Internet users, the second-largest Internet user base in the world.
The total number of websites rose to 672,985,183 in 2013. Text mining is an emerging research area because the
information on the web grows every day. Users do not know how the documents displayed are linked to the query they
gave. Sometimes the documents are relevant, and many times they are irrelevant to the query typed by the user. These
appropriate and inappropriate results are due to the clustering algorithm applied. Getting a proper results page from
these websites is possible only through clustering. Clustering is a fundamental process in many disciplines, and
cluster analysis is used to group similar collections of patterns based on similarity factors. This paper discusses
the tasks of text mining algorithms and clustering techniques. Of the different types of clustering algorithm
available, the K-Means clustering algorithm is presented in detail along with its strengths and limitations. The paper
also covers the computation measures the algorithm uses to identify similar objects to cluster, the applications of
clustering, and the tools used for clustering in different applications. Related work on the K-Means clustering
algorithm in text mining and other applications is presented, with the conclusion that the K-Means algorithm can be
combined with other algorithms to get efficient results.
I. INTRODUCTION
Text mining is the retrieval of information of different patterns from unstructured textual data in web
repositories. Text mining is a variation on a field called data mining, which tries to find interesting patterns in
large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in
Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from
unstructured text [8]. Typically, only a small fraction of the many available documents will be relevant to a given
individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for
analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the
importance and relevance of the documents, or find patterns and trends across multiple documents. Thus, text mining
has become an increasingly popular and essential theme in data mining [9].
II. TASKS OF TEXT MINING ALGORITHMS [7]
A. Text Categorization
Assigning documents to pre-defined categories. Many statistical approaches have been applied, such as Regression
Models and Support Vector Machines.
B. Text Clustering
Finding groups of similar data objects based on a similarity function. The methods applied are categorized as
Hierarchical and Partitioning.
C. Concept Mining
The task of discovering concepts, which combines categorization and clustering approaches to find concepts and their
relations in text collections.
D. Information Retrieval
Retrieving information from a collection of available information resources, depending on the user's query.
E. Information Extraction
The task of automatically extracting structured information from unstructured or semi-structured documents.
III. CLUSTERING TECHNIQUES
Clustering is the grouping of similar data sets with the same content, for example grouping the same text messages in
e-mail or the same content from different books. Text clustering algorithms are classified into many types, namely
distance-based algorithms, frequent sequence algorithms, feature selection and extraction algorithms, and
density-based algorithms. A clustering algorithm discovers groups in the set of documents such that documents within
a group are more similar than documents across groups [2].
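To make the idea of a similarity function concrete, the following is a minimal Python sketch. It assumes cosine similarity over raw term counts, which is one common choice and is not prescribed by the text above; the helper names term_vector and cosine_similarity and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Lower-case the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, used only for illustration.
docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in databases",
    "football match results and scores",
]
vectors = [term_vector(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
print(sim_01, sim_02)
```

Under this measure the two mining-related documents score higher than the unrelated pair, which is the property a clustering algorithm relies on when it groups documents.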
560 | International Journal of Computer Systems, ISSN-(2394-1065), Vol. 03, Issue 08, August, 2016
Dr. (Ms). Ananthi Sheshasayee et al A Study on K-Means Clustering in Text Mining Using Python
The following conditions help to increase the effectiveness of clustering [1].
A. Similarity Measure: Only similar documents should be grouped together, but similarity is hard to define.
B. Dimension Reduction: The size of the data needs to be reduced, by removing irrelevant words from the text
collection, to make operations more efficient.
C. Cluster Labels: Each cluster needs an appropriate, distinct name so that the clusters can be identified clearly.
D. Number of Clusters: The number of clusters has to be decided in advance, which is difficult when little
information is available.
E. Overlapping of Clusters: The algorithm should accept overlapping clusters, since certain documents cover several
topics.
F. Scalability: The algorithm should be usable irrespective of the size of the data.
Fig 2. Key Tasks of Clustering. Clustering Tasks: (1) Document Representation: convert the documents into structured
form. (2) Definition of Similarity Measure: similarities between two documents. (3) Clustering Logic: determining
which clusters the documents are assigned to, based on the similarity measure.
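The three tasks named in Fig 2 can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the stop-word list, the sample documents, the use of Euclidean distance, and the choice of the first k documents as initial centroids are all assumptions made here so that the run is deterministic.

```python
import math

# Hypothetical stop-word list (dimension reduction) and sample documents.
STOP_WORDS = {"by", "the", "a", "of"}
docs = [
    "cluster documents by text similarity",
    "football team wins match",
    "text documents cluster analysis",
    "football match score",
]

# Task 1, document representation: term-count vectors over a shared vocabulary.
tokens = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
points = [tuple(doc.count(w) for w in vocab) for doc in tokens]


# Task 2, similarity measure: Euclidean distance (smaller means more similar).
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(members):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(members) for col in zip(*members))


# Task 3, clustering logic: assign each document to its nearest centroid,
# recompute the centroids, and repeat until the assignment stops changing.
def k_means(points, k, max_iter=100):
    centroids = list(points[:k])  # assumption: first k points as seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels, centroids


labels, centroids = k_means(points, k=2)
print(labels)
```

Note that k is passed in by the caller, which reflects condition D above: the number of clusters has to be chosen before the algorithm runs.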
Hierarchical Clustering
These methods either start with one cluster and split it into smaller and smaller clusters (divisive), or
start with single-object clusters and merge similar clusters into larger and larger clusters
(agglomerative), resulting in a tree of clusters.
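The merging (bottom-up) variant can be sketched in a few lines of Python. This is a minimal sketch that assumes one-dimensional points and single-linkage distance (the distance between the closest members of two clusters); the function names and the sample points are hypothetical, and the recorded merge list stands in for the tree of clusters.

```python
def single_link_distance(a, b, points):
    """Smallest distance between any member of cluster a and any member of b."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []  # (left cluster, right cluster, distance), in merge order
    while len(clusters) > target_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_link_distance(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two tight groups and one outlier.
points = [1, 3, 10, 11, 30]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)
print(merges)
```

Reading the merge list from first to last gives the tree from the leaves upward; stopping early, as with target_clusters=2 here, cuts the tree at that level.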
Density-Based Clustering
For each data point in a cluster at least a minimum
number of points must exist within a given radius. Each
algorithm which generates mutually exclusive frequent sets, taken as the initial points of the k-means clustering
algorithm. This displays the highly related documents appearing together with the same features [13].
Neetu Sharma et al. use the K-means algorithm and the Random Forest Classifier in the WEKA tool and conclude that
applying clustering before classification on the data file poach.arff from WORDNET optimized the performance [14].
VI. CONCLUSION
The performance of a clustering algorithm depends on the structure, the amount and the representativeness of the
data. Some of the applications where clustering is widely used are discussed in this paper, which shows the
importance of clustering in Text Mining. Many other clustering algorithms are available, each with pros and cons,
and they can be combined to get better results.
REFERENCES
[1] Francis Musembi Kwale, “A Critical Review of K-Means Text Clustering Algorithms”, International Journal of
Advanced Research in Computer Science, Volume 4, No. 9, ISSN No. 0976-5697.
[2] Dan Munteanu, Severin Bumbaru, “A Survey Of Text Clustering Techniques Used For Web Mining”, The Annals Of
“Dunarea De Jos” University Of Galati, Fascicle III, ISSN 1221-454x.
[3] C. J. Van Rijsbergen, “Information Retrieval”, Butterworths, London.
[4] Pushplata, Mr. Ram Chatterjee, “An Analytical Assessment on Document Clustering”, I.J. Computer Network and
Information Security, 5, 63-71, DOI: 10.5815/ijcnis.2012.05.08.
[5] Ms. S. Prabha, Dr. K. Duraiswamy, Ms. M. Sharmila, “Analysis of Different Clustering Techniques in Data and Text
Mining”, International Journal of Computer Science Engineering (IJCSE), Vol. 3 No. 02, ISSN: 2319-7323.
[6] Mrs. S. C. Punitha, Dr. M. Punithavalli, “A Comparative Study to Find a Suitable Method for Text Document
Clustering”, International Journal of Computer Science & Information Technology, Vol 3, No. 6.
[7] Mrs. Sayantani Ghosh, Mr. Sudipta Roy, and Prof. Samir K. Bandyopadhyay, “A Tutorial Review On Text Mining
Algorithms”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 4,
ISSN: 2278-1021.
[8] Vishal Gupta, Gurpreet S. Lehal, “A Survey of Text Mining Techniques and Applications”, Journal of Emerging
Technologies in Web Intelligence, Vol. 1, No. 1.
[9] R. Sagayam, S. Srinivasan, S. Roshni, “A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques”,
International Journal of Computational Engineering Research, Vol. 2, Issue 5, pp. 1443-1446.
[10] “Comparative Study of Clustering Algorithms On Textual Databases”, Thesis submitted to Technical University
Ilmenau, Germany.
[11] O. J. Oyelade, O. O. Oladipupo, I. C. Obagbuwa, “Application Of K-Means Clustering Algorithm For Prediction Of
Students' Academic Performance”, (IJCSIS) International Journal of Computer Science and Information Security,
Vol. 7, Issue 1.
[12] Bader Aljaber, Nicola Stokes, James Bailey, Jian Pei, “Document Clustering Of Scientific Texts using Citation
Contexts”, Information Retrieval, DOI 10.1007/s10791-009-9108-x, Springer Science+Business Media, LLC.
[13] Anil Kumar Pandey, T. Jaya Laxmi, “Web Document Clustering for Finding Expertise in Research Area”, BVICAM's
International Journal of Information Technology, Vol. 1 No. 2, ISSN 0973-5658.
[14] Neetu Sharma, Dr. S. Niranjan, “Optimization Of Word Sense Disambiguation Using Clustering In Weka”,
International Journal of Computer Technology & Applications, Vol 3 (4), 1598-1604, ISSN: 2229-6093.
[15] L.V. Bijuraj, “Clustering and its Applications”, Proceedings of National Conference on New Horizons in IT,
ISBN 978-93-82338-79-6.
[16] https://code.google.com/p/sofia-ml
[17] http://nlp.fi.muni.cz/projekty/gensim
[18] http://mahout.apache.org
[19] http://radimrehurek.com/gensim
[20] http://carrotsearch.com/lingo3g
[21] http://graphlab.org
[22] Toby Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications. Sebastopol, CA:
O'Reilly Media.