
International Journal of Computer Trends and Technology (IJCTT) – Volume 6, Number 2 – Dec 2013

Efficient Preprocessing and Patterns Identification Approach for Text Mining

Pattan Kalesha 1, M. Babu Rao 2, Ch. Kavitha 3

1 (M.Tech, GEC, Gudlavalleru)
2 (Professor of CSE, GEC, Gudlavalleru)
3 (Professor of IT, GEC, Gudlavalleru)

Abstract – Due to the rapid expansion of digital data, knowledge discovery and data mining have attracted a significant amount of attention for turning such data into helpful information and knowledge. Text categorization continues to be one of the most researched NLP problems on account of the ever-increasing volume of electronic documents and digital libraries. We present a novel text categorization method that bases its decision on multiple attributes. Since most existing text mining methods adopt term-based approaches, they all suffer from the difficulties of polysemy and synonymy. Existing pattern discovery techniques include the processes of pattern deploying and pattern evolving, which strengthen the effect of using and updating discovered patterns when searching for relevant and interesting information. However, current association rule methods fall short in two respects when applied to pattern classification: one is that the method ignores information about a word's frequency within a text; the other is that the method must prune rules whenever a large number of rules is generated. In this proposed work, documents are preprocessed before pattern discovery is applied; the document dataset is preprocessed using tokenization, stemming, and probability filtering approaches. The proposed approach gives better decision rules compared to the existing approach.

Keywords: Patterns, Rules, Stemming, Probability.

I. INTRODUCTION

Many data mining techniques have been proposed for mining useful patterns in text documents. It is a challenging issue to find accurate knowledge (or features) in text documents that really helps users find what they want. Existing Information Retrieval (IR) systems provide many term-based methods to address this challenge. Term-based methods suffer from the problems of polysemy and synonymy: polysemy means a single word having different meanings, and synonymy means different words having the same meaning. In this paper we work with pattern (or phrase)-based approaches, which perform better in comparative studies than term-based methods. This improves the accuracy of evaluating support and term weights, because discovered patterns are usually more specific than whole documents [2].

Due to the rapid increase of digital data made available in recent years, knowledge discovery and data mining [1] have attracted a large amount of attention, along with an imminent need for turning such data into useful information and knowledge. Many applications, for instance market analysis and business decision making, may benefit from the information and knowledge extracted from a considerable amount of data. Knowledge discovery can be viewed as the nontrivial extraction of information from large databases: information that is implicitly present in the data, previously unknown, and potentially useful for users. Data mining is therefore a vital step in the process of knowledge discovery in databases. In the past decade, a wide range of data mining techniques have been presented in an effort to perform different knowledge discovery tasks. These techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining, and closed pattern mining.

Most of them are proposed with the aim of developing efficient mining algorithms that locate particular patterns within a reasonable and acceptable time frame. With the large number of patterns produced by data mining approaches, how to effectively use and update these patterns is still an open research issue.

II. LITERATURE SURVEY

Many types of text representation have been proposed in the past. A well-known one is the bag of words, which uses keywords (terms) as elements of the vectors in the feature space. The issue with the bag-of-words approach is how to select a limited number of features from an enormous set of words or terms in order to raise the system's efficiency and avoid overfitting. In [1], a combination of unigrams and bigrams was chosen for document indexing in text categorization (TC) and evaluated on a variety of feature evaluation functions (FEF). A phrase-based text representation for Web document management was also proposed in [2]. In [3], data mining techniques have been utilized for text analysis by extracting co-occurring terms as descriptive phrases from document collections.


However, the overall impact of text mining systems using phrases as text representation showed no significant improvement. The likely reason is that a phrase-based method has "lower consistency of assignment and lower document frequency for terms," as noted in [4]. In [5], [6], hierarchical clustering was used to determine synonymy and hyponymy relations between keywords.

Nevertheless, the challenging issue is how to deal effectively with the large quantity of discovered patterns. Regarding this issue, closed sequential patterns have been utilized for text mining in [1], which proposed that the concept of closed patterns in text mining is useful and has the potential to improve the performance of text mining. A pattern taxonomy model was also developed in [4] and [3] to further improve effectiveness by exploiting closed patterns in text mining. Additionally, a two-stage model that used both term-based methods and pattern-based methods was introduced in [2] to significantly improve the performance of information filtering. Natural language processing (NLP) is a modern computational technology that can assist people in understanding the meaning of text documents. For a long time, NLP struggled to cope with the uncertainties of human languages. Recently, a concept-based model [3], [4] was presented to bridge the gap between NLP and text mining by analyzing terms at the sentence and document levels. Pattern-based techniques were introduced in [3] to significantly improve the performance of information filtering.

The pattern taxonomy model (PTM) [2] considers positive documents and negative documents, and the term weights are adjusted based on the term weights in the positive and negative documents. Using this technique we can increase the maximum likelihood that documents with more overlapping terms and less content are retrieved with accurate results, as shown in the process below. In this phase the desired patterns are evolved from the clusters obtained in the earlier phases. This is an important phase of the model, since it actually evolves the patterns that will match the keywords of a user who wants relevant information from a large database, generally held in electronic form [7].

Fig. 1: Document text processing analysis (stages: Data pre-processing → Phrase Based Analysis → Similarity Based Analysis → Pattern Evolving and Mining).

III. PROPOSED SYSTEM

Fig. 2: Proposed document text processing methodology (flow: Select Category or Topic → Documents → Extract Words → Tokenization → Structural Filtering → Find distinct words → Calculate distinct words frequency → Calculate probability of distinct words → Determine maximum probability of distinct words → Calculate topic count).

Document clustering can loosely be defined as the clustering of documents. Clustering is the process of understanding the similarity and/or dissimilarity between given objects and thus dividing them into meaningful subgroups that share common characteristics. Good clusters are those in which the members inside a cluster have a great deal of similar characteristics.

Tokenization is the task of chopping text up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
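As a rough illustration of the tokenization and structural filtering steps in Fig. 2 — a minimal Python sketch, not the authors' implementation; the tiny stopword list and function names are assumptions made for the example:

    import re

    # Assumed toy stopword list; the paper's own list is not given.
    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "me", "your"}

    def tokenize(text):
        # Keep alphabetic runs only, which drops punctuation and numbers.
        return re.findall(r"[A-Za-z]+", text)

    def structural_filter(tokens):
        # Remove stopwords and one-character tokens as a stand-in for structural filtering.
        return [t for t in tokens if t.lower() not in STOPWORDS and len(t) > 1]

    raw = "Friends, Romans, Countrymen, lend me your ears;"
    tokens = tokenize(raw)
    print(tokens)                     # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
    print(structural_filter(tokens))  # ['Friends', 'Romans', 'Countrymen', 'lend', 'ears']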



A. Data Preprocessing

In the data preprocessing stage several steps are used to prepare the documents for the next stage. Preprocessing takes a plain text document as input and produces as output a set of tokens (which can be single terms or n-grams) to be included in the vector model. These steps can typically be stated as follows:

1. Punctuation, special characters, and numbers are filtered from the document.
2. Each document is partitioned into phrases, and each phrase is tokenized into its constituent words.
3. Stopwords that appear in the stopword list we provide are removed.
4. A POS (Part-Of-Speech) tag is obtained for each remaining word, and every word that is neither a verb nor a noun is eliminated.
5. Words with very low frequency, as well as words that occur too often, are removed.

B. Document Representation

In this stage each document is represented in the form given in Fig. 2, by detecting each new phrase and assigning an id to each cleaned, unrepeated phrase (one that neither contains the same words as an existing phrase nor carries the same meaning). The main challenge is to detect phrases that convey the same meaning. This is done by constructing a feature vector for each new phrase; a similarity measure is then calculated recursively between each new phrase and the phrases already added to the feature vector, and when the similarity exceeds a threshold (an assumed value), one of them is discarded.

After obtaining the term weights of all topic phrases, it is easy to apply the cosine similarity to compute the similarity of any two documents. Let the vectors dx = {x1, x2, ..., xM} and dy = {y1, y2, ..., yM} denote two documents dx and dy, where xi and yi are the weights of the corresponding topic phrase terms. Then the similarity of the two documents is calculated by the cosine formula in (1) [6-8]:

sim(dx, dy) = (∑ xi·yi) / (√(∑ xi²) · √(∑ yi²))    (1)
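A minimal Python sketch of the cosine similarity in (1), assuming the two documents have already been reduced to fixed-length vectors of topic-phrase weights; the sample weights below are made up for illustration:

    import math

    def cosine_similarity(dx, dy):
        # Cosine of the angle between two equal-length weight vectors, as in Eq. (1).
        dot = sum(x * y for x, y in zip(dx, dy))
        norm_x = math.sqrt(sum(x * x for x in dx))
        norm_y = math.sqrt(sum(y * y for y in dy))
        if norm_x == 0 or norm_y == 0:
            return 0.0  # convention: an all-zero vector shares nothing with any document
        return dot / (norm_x * norm_y)

    dx = [0.4, 0.0, 0.7, 0.1]   # hypothetical topic-phrase weights for document dx
    dy = [0.3, 0.2, 0.5, 0.0]   # hypothetical topic-phrase weights for document dy
    print(round(cosine_similarity(dx, dy), 3))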
C. Probability Calculation

Input: a set S of N documents and a set T of K topics
Output: an array of distinct words and an array of distinct word counts based on K
Method:
1. Read S, N;              // read the training documents one by one to split all words
2. Read T, K;              // read the topics
3. Preprocess D to get Wi  // preprocess to identify the distinct words
4. for each Wi in D
5.    for each K of T
6.       Pi = ∑Wi / Wj     // probability calculation of distinct words
7.       count = count + 1 // the count is accumulated per topic
8.    end
9. end
10. return Wi;
11. return count;
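The listing above leaves Pi = ∑Wi/Wj only loosely specified, so the Python sketch below is one plausible reading under that assumption: Pi is taken as the relative frequency of a distinct word within the documents of a topic, and the number of distinct words is counted per topic. The toy corpus is invented for illustration.

    from collections import Counter

    def word_probabilities(docs_by_topic):
        # docs_by_topic: topic -> list of tokenized documents (lists of words)
        probabilities = {}    # topic -> {word: relative frequency}
        distinct_counts = {}  # topic -> number of distinct words seen for that topic
        for topic, docs in docs_by_topic.items():
            counts = Counter(w for doc in docs for w in doc)
            total = sum(counts.values())
            probabilities[topic] = {w: c / total for w, c in counts.items()}
            distinct_counts[topic] = len(counts)
        return probabilities, distinct_counts

    corpus = {
        "oil":   [["opec", "output", "ceiling"], ["oil", "prices", "opec"]],
        "money": [["fed", "repurchase", "dlr"]],
    }
    probs, counts = word_probabilities(corpus)
    print(probs["oil"]["opec"])  # 2 of 6 tokens, i.e. about 0.333
    print(counts["oil"])         # 5 distinct words under the "oil" topic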
D. Sample Pseudo Code

Data preprocessing represents the first step. At this stage cleaning techniques can be applied, such as stopword removal, stemming, or term pruning according to the TF/IDF values (term frequency / inverse document frequency). The next step in building the associative classifier is the generation of association rules using an apriori-based algorithm. Once the entire set of rules has been generated, an important step is to apply pruning techniques to reduce the set of association rules found in the text corpora. The last stage in this process is the use of the association rule set in the prediction of classes for new documents. The first three steps belong to the training process, while the last one represents the testing phase. If a document D is assigned to a set of categories C = {c1, c2, ..., cm} and, after pruning, the set of terms T = {t1, t2, ..., tn} is retained, the following transaction is used to model the document D: {c1, c2, ..., cm, t1, t2, ..., tn}. The association rules are then discovered from such transactions representing all documents in the collection.

The association rules discovered in this stage of the process are further processed to build the associative classifier. Using the apriori algorithm on the transactions representing the documents would generate a very large number of association rules, most of them irrelevant for classification. There are two approaches that we have considered in building an associative text classifier. The first one, ARPAC (Association Rule-based Patterns with All Categorized), extracts association rules from the entire training set following the constraints discussed above. As a result we propose a second solution, ARP-BC (Associative Rule-based Pattern by Category), that solves the existing problems. In this approach we consider each set of documents belonging to one category as a separate text collection from which to generate association rules. If a document belongs to more than one category, the document is present in each set associated with the categories it falls into. The ARP-BC algorithm finds association rules on the training set of the text collection when the text corpora are divided into subsets by category.
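To make the transaction modelling above concrete, the following self-contained Python sketch builds a transaction {c1, ..., cm, t1, ..., tn} for each training document and emits high-confidence rules of the form {terms} -> category, keeping only term pairs that reach a minimum support in an apriori-like spirit. The documents, categories, and thresholds are made-up illustrations, not the authors' data or code.

    from itertools import combinations

    def make_transaction(categories, terms):
        # Model a document as the transaction {c1..cm, t1..tn} described above.
        return set(categories) | set(terms)

    def class_rules(transactions, categories, min_sup=2, min_conf=0.8):
        # Enumerate term pairs, keep the frequent ones, and emit {terms} -> category rules.
        rules = []
        terms_only = [t - categories for t in transactions]
        for a, b in combinations(sorted(set().union(*terms_only)), 2):
            covered = [t for t in transactions if {a, b} <= t]
            if len(covered) < min_sup:
                continue  # prune infrequent term pairs
            for c in categories:
                hits = sum(1 for t in covered if c in t)
                conf = hits / len(covered)
                if conf >= min_conf:
                    rules.append(((a, b), c, len(covered), conf))
        return rules

    docs = [({"earn"}, ["fed", "dlr", "repurchase"]),
            ({"crude"}, ["opec", "output", "oil"]),
            ({"crude"}, ["opec", "oil", "prices"])]
    cats = {"earn", "crude"}
    trans = [make_transaction(c, t) for c, t in docs]
    for antecedent, cat, sup, conf in class_rules(trans, cats):
        print(antecedent, "->", cat, "support:", sup, "confidence:", conf)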


Algorithm: Pruning the set of association rules

Input: the set of association rules found in the association rule mining phase(s) and the training text collection (D)
Output: a set of rules used in the classification process
Method:
1. Sort the rules according to Definition 1
2. for each rule in the set S
3.    find all rules that are more specific according to Definition 2
4.    prune those that have lower confidence
5. a new set of rules S is generated
6. for each rule R in the set S
7.    go over D and find those transactions that are covered by the rule R
8.    if R classifies correctly at least one transaction
9.       select R
10.      remove those cases that are covered by R
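A hedged Python sketch of the rule-selection idea in the pruning algorithm above, assuming the ranking criteria of Def. 2 given later in this section (higher confidence, then higher support, then a shorter antecedent) and representing each rule as (antecedent set, category, support, confidence). It illustrates the database-coverage step, not the authors' exact procedure; the rules and transactions are invented for the example.

    def rank_key(rule):
        # Ordering from Def. 2: higher confidence, then higher support, then fewer antecedent items.
        antecedent, _category, support, confidence = rule
        return (-confidence, -support, len(antecedent))

    def prune_rules(rules, transactions):
        # Keep a rule only if it correctly classifies at least one still-uncovered
        # transaction, then remove the transactions it covers (database coverage).
        selected = []
        remaining = list(transactions)
        for rule in sorted(rules, key=rank_key):
            antecedent, category, _sup, _conf = rule
            covered = [(items, label) for items, label in remaining if antecedent <= items]
            if any(label == category for _items, label in covered):
                selected.append(rule)
                remaining = [(items, label) for items, label in remaining
                             if not antecedent <= items]
        return selected

    rules = [({"opec", "oil"}, "crude", 2, 1.0),
             ({"oil"}, "crude", 3, 0.75),
             ({"fed"}, "earn", 1, 1.0)]
    transactions = [({"opec", "oil", "output"}, "crude"),
                    ({"fed", "dlr"}, "earn")]
    # The lower-ranked {"oil"} rule is dropped here: every transaction it covers is
    # already handled by the higher-ranked rules.
    print(prune_rules(rules, transactions))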

In the ARP-BC algorithm, step (2) generates the frequent itemsets. In steps (3-13) all the k-frequent itemsets are generated and merged with the category Ci. Steps (16-18) generate the association rules. The document space is reduced in each iteration by eliminating the transactions that do not contain any of the frequent itemsets. This step is done by the FilterTable(Di-1, Fi-1) function. The problem of the very large number of rules generated leads us to the pruning methods presented next.
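A rough per-category sketch of the ARP-BC loop described above, in Python. Candidate generation is simplified (k-itemsets are enumerated directly from the remaining transactions), and the documents, category, and support threshold are invented, so treat it as one reading of the FilterTable(Di-1, Fi-1) idea rather than the published algorithm.

    from itertools import combinations

    def frequent_itemsets_per_category(docs, category, min_sup=2, max_k=3):
        # ARP-BC style: mine frequent term sets only from documents of one category,
        # shrinking the document table at every level with a FilterTable step.
        table = [set(terms) for cats, terms in docs if category in cats]
        all_frequent = []
        for k in range(1, max_k + 1):
            candidates = {frozenset(c) for t in table for c in combinations(sorted(t), k)}
            frequent = [c for c in candidates
                        if sum(1 for t in table if c <= t) >= min_sup]
            if not frequent:
                break
            all_frequent.extend(frequent)
            # FilterTable(D_{k-1}, F_{k-1}): drop transactions with no frequent k-itemset.
            table = [t for t in table if any(c <= t for c in frequent)]
        # Each frequent itemset is merged with the category to act as a "terms -> category" rule.
        return [(set(f), category) for f in all_frequent]

    docs = [({"crude"}, ["opec", "oil", "output"]),
            ({"crude"}, ["opec", "oil", "prices"]),
            ({"earn"},  ["fed", "dlr"])]
    print(frequent_itemsets_per_category(docs, "crude"))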
Although the rules are similar to those produced by a rule-based induction system, the approach is different. In addition, the number of words belonging to the antecedent is not restricted here, while in some studies with rule-based induction systems the rules generated have only one or a pair of words as antecedent.

Def 2: Given two rules R1 and R2, R1 is higher ranked than R2 if:
(1) R1 has higher confidence than R2;
(2) if the confidences are equal, supp(R1) must exceed supp(R2);
(3) if both the confidences and supports are equal, R1 has fewer attributes in its left-hand side than R2.

With the set of association rules sorted, the goal is to select a subset that will build an efficient and effective classifier. In our approach we attempt to select a high quality subset of rules by selecting those rules that are general and have high confidence.

IV. RESULTS

A. Existing approach:

In this section the existing results are based on a single-attribute decision system.

BEST RULES USING PTM(IPE):

1. Text=FED SETS 1.5 BILLION DLR CUSTOMER REPURCHASE, FED SAYS
   2 ==> class-att=0 2::::: Weight(min sup):(1)
2. Text=OPEC WITHIN OUTPUT CEILING, SUBROTO SAYS Opec remains within its agreed output ceiling of 15.8 mln barrels a day, and had expected current fluctuations in the spot market of one or two dlrs, Indonesian Energy Minister Subroto said. He told reporters after meeting with President Suharto that present weakness in the spot oil market was the result of warmer weather in the U.S. and Europe which reduced demand for oil. Prices had also been forced down because refineries were using up old stock, he said. But Opec would get through this period if members stuck together. REUTER
   2 ==> class-att=0 2::::: Weight(min sup):(1)
REUTER
   1 ==> class-att=0 1::::: Weight(min sup):(1)

=== Evaluation ===

Elapsed time: 4.757s

B. Proposed Approach:

The proposed approach found 1514 rules (displaying the top 25 rules):

1. [reuter=1]: 1441 ----(IMPLIES)------>[&#3=1]: 1441
<RULE CONFIDENCE:(1)> Weighted Accuracy:(99.22)
2. [&lt=1]: 904 ----(IMPLIES)------>[&#3=1]: 904


<RULE CONFIDENCE:(1)> Weighted Accuracy:(62.24)
3. [reuter=1, of=1]: 1077 ----(IMPLIES)------>[&#3=1]: 1077 <RULE CONFIDENCE:(1)> Weighted Accuracy:(74.16)
4. [reuter=1, th=1]: 960 ----(IMPLIES)------>[&#3=1]: 960 <RULE CONFIDENCE:(1)> Weighted Accuracy:(66.1)
5. [reuter=1, to=1]: 931 ----(IMPLIES)------>[&#3=1]: 931 <RULE CONFIDENCE:(1)> Weighted Accuracy:(64.1)
6. [reuter=1, and=1]: 944 ----(IMPLIES)------>[&#3=1]: 944 <RULE CONFIDENCE:(1)> Weighted Accuracy:(65)
7. [reuter=1, in=1]: 915 ----(IMPLIES)------>[&#3=1]: 915 <RULE CONFIDENCE:(1)> Weighted Accuracy:(63)
8. [reuter=1, said=1]: 934 ----(IMPLIES)------>[&#3=1]: 934 <RULE CONFIDENCE:(1)> Weighted Accuracy:(64.31)
9. [reuter=1, &lt=1]: 900 ----(IMPLIES)------>[&#3=1]: 900 <RULE CONFIDENCE:(1)> Weighted Accuracy:(61.97)
10. [reuter=1, a=1]: 866 ----(IMPLIES)------>[&#3=1]: 866 <RULE CONFIDENCE:(1)> Weighted Accuracy:(59.63)
11. [reuter=1, it=1]: 775 ----(IMPLIES)------>[&#3=1]: 775 <RULE CONFIDENCE:(1)> Weighted Accuracy:(53.36)
12. [reuter=1, for=1]: 744 ----(IMPLIES)------>[&#3=1]: 744 <RULE CONFIDENCE:(1)> Weighted Accuracy:(51.23)
13. [reuter=1, mln=1]: 702 ----(IMPLIES)------>[&#3=1]: 702 <RULE CONFIDENCE:(1)> Weighted Accuracy:(48.34)
14. [of=1, th=1]: 906 ----(IMPLIES)------>[&#3=1]: 906 <RULE CONFIDENCE:(1)> Weighted Accuracy:(62.38)
15. [of=1, and=1]: 873 ----(IMPLIES)------>[&#3=1]: 873 <RULE CONFIDENCE:(1)> Weighted Accuracy:(60.11)
16. [of=1, a=1]: 823 ----(IMPLIES)------>[&#3=1]: 823 <RULE CONFIDENCE:(1)> Weighted Accuracy:(56.67)
17. [th=1, to=1]: 873 ----(IMPLIES)------>[&#3=1]: 873 <RULE CONFIDENCE:(1)> Weighted Accuracy:(60.11)
18. [th=1, and=1]: 820 ----(IMPLIES)------>[&#3=1]: 820 <RULE CONFIDENCE:(1)> Weighted Accuracy:(56.46)
19. [th=1, said=1]: 900 ----(IMPLIES)------>[&#3=1]: 900 <RULE CONFIDENCE:(1)> Weighted Accuracy:(61.97)
20. [th=1, a=1]: 813 ----(IMPLIES)------>[&#3=1]: 813 <RULE CONFIDENCE:(1)> Weighted Accuracy:(55.98)
21. [th=1, it=1]: 747 ----(IMPLIES)------>[&#3=1]: 747 <RULE CONFIDENCE:(1)> Weighted Accuracy:(51.43)
22. [to=1, said=1]: 855 ----(IMPLIES)------>[&#3=1]: 855 <RULE CONFIDENCE:(1)> Weighted Accuracy:(58.87)
23. [and=1, in=1]: 776 ----(IMPLIES)------>[&#3=1]: 776 <RULE CONFIDENCE:(1)> Weighted Accuracy:(53.43)
24. [and=1, said=1]: 789 ----(IMPLIES)------>[&#3=1]: 789 <RULE CONFIDENCE:(1)> Weighted Accuracy:(54.33)
25. [and=1, a=1]: 743 ----(IMPLIES)------>[&#3=1]: 743 <RULE CONFIDENCE:(1)> Weighted Accuracy:(51.16)

C. Performance Analysis:

             NumberOfRules   Time(ms)
Proposed     1000            273
Existing     467             442

The graph plots the number of rules generated and the time taken to generate them for the proposed and existing approaches.

V. CONCLUSION AND FUTURE SCOPE

Many data mining techniques have been proposed in the last decade. These techniques include sequential pattern mining, maximum pattern mining, association rule mining, frequent itemset mining, and closed pattern mining [1]. However, using this discovered knowledge (or these patterns) in the field of text mining is difficult and often ineffective. The reason is that some useful long patterns with high specificity lack support (i.e., the low-frequency problem). In order to rectify the problems in existing approaches, the proposed method provides robust pattern discovery with decision-making rules. The proposed preprocessing framework gives better execution time compared to existing approaches. In future, this work will be extended to an ontology-based framework to give more accurate results in web intelligence.

REFERENCES

[1] N. Zhong, "Effective Pattern Discovery for Text Mining," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 1.
[2] C. Kadu, "Hybrid Approach to Improve Pattern Discovery in Text Mining," International Journal of Advanced Research in Computer and Communication Engineering.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
[4] H. Ahonen, O. Heinonen, M. Klemettinen, and A.I. Verkamo, "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections," Proc. IEEE Int'l Forum on Research and Technology Advances in Digital Libraries (ADL '98), pp. 2-11, 1998.


[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[6] T. Chau and A.K.C. Wong, "Pattern Discovery by Residual Analysis and Recursive Partitioning," IEEE Trans. Knowledge and Data Eng., vol. 11, pp. 833-852, Nov./Dec. 1999.
[7] N. Jindal, B. Liu, and E.-P. Lim, "Finding Unusual Review Patterns Using Unexpected Rules."
