General Architecture of Text Mining Systems

Text mining systems aim to extract useful information from unstructured document collections through identifying patterns. They follow a general model with four main areas: preprocessing tasks to prepare data, core mining operations to discover patterns, presentation components for browsing results, and refinement techniques to optimize discovery. Preprocessing transforms documents into a structured format while core operations detect distributions, frequent concepts, and associations. Background knowledge sources can provide additional context to guide pattern analysis.

Uploaded by

vandana_korde
Copyright
© Attribution Non-Commercial (BY-NC)

Text mining

Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a
document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text
mining seeks to extract useful information from data sources through the identification and exploration of
interesting patterns. In the case of text mining, however, the data sources are document collections, and
interesting patterns are found not among formalized database records but in the unstructured textual data in
the documents in these collections.
Text mining derives much of its inspiration and direction from seminal research on data mining.
It is therefore not surprising that text mining and data mining systems share many high-level
architectural similarities. For instance, both types of systems rely on preprocessing routines,
pattern-discovery algorithms, and presentation-layer elements such as visualization tools to enhance the
browsing of answer sets. Further, text mining adopts many of the specific types of patterns in its core
knowledge discovery operations that were first introduced and vetted in data mining research.
Because data mining assumes that data have already been stored in a structured format, much of its
preprocessing focus falls on two critical tasks: scrubbing and normalizing data, and creating extensive
numbers of table joins. In contrast, for text mining systems, preprocessing operations center on the
identification and extraction of representative features for natural language documents. These preprocessing
operations are responsible for transforming unstructured data stored in document collections into a more
explicitly structured intermediate format, a concern that does not arise for most data mining systems.
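As a rough illustration of this transformation, the sketch below turns raw text into a structured intermediate form in which each document is represented by a set of concepts. The lexicon, the sample documents, and the function name `extract_concepts` are assumptions made up for the example; real systems use far richer feature extraction.

```python
import re

# Hypothetical concept lexicon: surface term -> canonical concept label.
CONCEPT_LEXICON = {
    "ibm": "IBM",
    "microsoft": "Microsoft",
    "merger": "merger",
    "acquisition": "merger",
    "lawsuit": "litigation",
}

def extract_concepts(text):
    """Return the set of concepts whose lexicon terms occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {CONCEPT_LEXICON[t] for t in tokens if t in CONCEPT_LEXICON}

# Assumed sample collection of unstructured documents.
raw_documents = {
    "d1": "IBM announced an acquisition on Monday.",
    "d2": "Microsoft faces a lawsuit over the merger.",
}

# Structured intermediate format: document id -> set of concepts.
processed = {doc_id: extract_concepts(text) for doc_id, text in raw_documents.items()}
```

Here `processed` plays the role of the intermediate format on which later mining operations would run.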
Moreover, because of the centrality of natural language text to its mission, text mining also draws on
advances made in other computer science disciplines concerned with the handling of natural language.
Perhaps most notably, text mining exploits techniques and methodologies from the areas of information
retrieval, information extraction, and corpus-based computational linguistics.

GENERAL ARCHITECTURE OF TEXT MINING SYSTEMS


At an abstract level, a text mining system takes in input (raw documents) and generates various types of
output (e.g., patterns, maps of connections, trends). Figure I.2 illustrates this basic paradigm. A
human-centered view of knowledge discovery, however, yields a slightly more complex input–output paradigm
for text mining (see Figure I.3). In this paradigm, the user is part of a prolonged interactive loop of
querying, browsing, and refining, resulting in answer sets that, in turn, guide the user toward new
iterative series of querying, browsing, and refining actions.
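The loop of querying and refining can be pictured with a small sketch. Everything here is assumed for illustration: the collection, the scripted list of refinements standing in for a live user, and the `query` helper, which simply returns documents containing all required concepts.

```python
def query(collection, required):
    """Return the ids of documents that contain every required concept."""
    return sorted(d for d, concepts in collection.items() if required <= concepts)

# Assumed processed collection: document id -> set of concepts.
collection = {
    "d1": {"IBM", "merger"},
    "d2": {"Microsoft", "merger", "litigation"},
    "d3": {"IBM", "litigation"},
}

# Scripted stand-in for the constraints a user adds on successive iterations.
refinements = [{"merger"}, {"litigation"}]

constraints = set()
history = []
for extra in refinements:
    constraints |= extra                      # refine the current query
    answer_set = query(collection, constraints)
    history.append((sorted(constraints), answer_set))  # what the user would browse
```

Each entry in `history` pairs the query as it stood at that step with the answer set it produced, which is the information a user would inspect before deciding on the next refinement.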

Functional Architecture
On a functional level, text mining systems follow the general model provided by some classic data mining
applications and are thus roughly divisible into four main areas: (a) preprocessing tasks, (b) core mining
operations, (c) presentation layer components and browsing functionality, and (d) refinement techniques.
- Preprocessing Tasks include all those routines, processes, and methods required to prepare data for a text
mining system’s core knowledge discovery operations. These tasks typically center on data source
preprocessing and categorization activities. Preprocessing tasks generally convert the information from each
original data source into a canonical format before applying various types of feature extraction methods
against these documents to create a new collection of documents fully represented by concepts. Where
possible, preprocessing tasks may also extract or apply rules for creating document date stamps, or do both.
Occasionally, preprocessing tasks may even include specially designed methods used in the initial fetching of
appropriate “raw” data from disparate original data sources.
- Core Mining Operations are the heart of a text mining system and include pattern discovery, trend
analysis, and incremental knowledge discovery algorithms. Among the commonly used patterns for
knowledge discovery in textual data are distributions (and proportions), frequent and near frequent concept
sets, and associations. Core mining operations can also concern themselves with comparisons between some of
these patterns and with the identification of levels of “interestingness” in them. Advanced or
domain-oriented text mining systems, or both, can also augment the quality of their various operations by
leveraging background knowledge sources. These core mining operations in a text mining system have
also been referred to, collectively, as knowledge distillation processes.
- Presentation Layer Components include GUI and pattern browsing functionality as well as access to the
query language. Visualization tools and user-facing query editors and optimizers also fall under this
architectural category. Presentation layer components may include character-based or graphical tools for
creating or modifying concept clusters as well as for creating annotated profiles for specific concepts or
patterns.
- Refinement Techniques, at their simplest, include methods that filter redundant information and cluster
closely related data but may grow, in a given text mining system, to represent a full, comprehensive suite of
suppression, ordering, pruning, generalization, and clustering approaches aimed at discovery optimization.
These techniques have also been described as postprocessing.

Preprocessing tasks and core mining operations are the two most critical areas for any text mining system
and typically describe serial processes within a generalized view of text mining system architecture, as
shown in Figure I.4.
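The pattern types named under core mining operations, followed by a simple refinement step, can be sketched as follows. This is a minimal illustration under stated assumptions: the toy collection, the thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE`, and the restriction to concept pairs are all choices made for the example, and support and confidence are used here in their usual association-rule sense rather than as any particular system's definitions.

```python
from collections import Counter
from itertools import combinations

# Toy processed collection: each document is a set of concepts (assumed data).
docs = [
    {"IBM", "merger"},
    {"IBM", "merger", "litigation"},
    {"Microsoft", "merger"},
    {"IBM", "litigation"},
]

# Distribution: proportion of documents that mention each concept.
counts = Counter(concept for d in docs for concept in d)
distribution = {concept: n / len(docs) for concept, n in counts.items()}

# Frequent concept sets, restricted to pairs for brevity.
MIN_SUPPORT = 2
pair_counts = Counter(pair for d in docs for pair in combinations(sorted(d), 2))
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_SUPPORT}

def confidence(a, b):
    """Confidence of the association a => b: support(a and b) / support(a)."""
    both = sum(1 for d in docs if a in d and b in d)
    return both / counts[a]

# Candidate associations in both directions for every frequent pair.
rules = {(a, b): confidence(a, b)
         for pair in frequent_pairs
         for a, b in (pair, pair[::-1])}

# Refinement step: prune associations below a confidence threshold.
MIN_CONFIDENCE = 0.8
strong_rules = {rule: c for rule, c in rules.items() if c >= MIN_CONFIDENCE}
```

A presentation layer would then display `distribution`, `frequent_pairs`, and `strong_rules` for browsing; the pruning at the end stands in for the much broader family of refinement techniques listed above.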
At a slightly more granular level of detail, the processed document collection is frequently
intermediated, with respect to core mining operations, by some form of flat, compressed, or hierarchical
representation of its data (or a combination of these) to better support core mining operations such as
hierarchical tree browsing. This is illustrated in Figure I.5. The schematic in Figure I.5 also factors in
the typical positioning of refinement functionality. Further, it adds somewhat more detail with respect to
the relative functioning of core data mining algorithms.

Many text mining systems, and certainly those operating on highly domain-specific data sources such as
medicine, financial services, high tech, genomics, proteomics, and chemical compounds, can benefit
significantly from access to special background or domain-specific data sources. Background knowledge is
often used for providing constraints to, or auxiliary information about, concepts found in the text mining
system’s document collection.
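To make the idea of a hierarchical intermediate representation more concrete, the sketch below rolls document counts up a small concept taxonomy so that each node could be shown in a tree browser with the number of documents beneath it. The taxonomy in `PARENT`, the sample documents, and the helper `ancestors` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical taxonomy: concept -> parent category (None marks the root).
PARENT = {
    "companies": None,
    "software": "companies",
    "hardware": "companies",
    "Microsoft": "software",
    "Oracle": "software",
    "Intel": "hardware",
}

def ancestors(concept):
    """Yield the concept followed by each of its ancestors up to the root."""
    while concept is not None:
        yield concept
        concept = PARENT[concept]

# Assumed processed collection: each document is a set of leaf concepts.
docs = [{"Microsoft"}, {"Microsoft", "Intel"}, {"Oracle"}]

# For every node, count the documents mentioning it or any of its descendants.
node_counts = Counter()
for concepts in docs:
    covered = {node for c in concepts for node in ancestors(c)}
    node_counts.update(covered)
```

Collecting the covered nodes in a set before updating the counter ensures a document is counted once per node even when it mentions several concepts under the same category.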
The background knowledge for a text mining system can be created in various ways. One common way is to
run parsing routines against external knowledge sources, such as formal ontologies, after which unary or
binary predicates for the concept-labeled documents in the text mining system’s document collection are
identified. These unary and binary predicates, which describe properties of the entities represented
by each concept deriving from the expert or “gold standard” information sources, are in turn put to use by a
text mining system’s query engine. In addition, such constraints can be used in a text mining system’s front
end to allow a user to either (a) create initial queries based around these constraints or (b) refine queries
over time by adding, subtracting, or concatenating constraints.

Commonly, background knowledge is preserved within a text mining system’s architecture in a persistent
store accessible by various elements of the system. This type of persistent store is sometimes loosely
referred to as a system’s knowledge base. The typical position of a knowledge base within the architecture
of a text mining system can be seen in Figure I.7.

These generalized architectures are meant to be more descriptive than prescriptive in that they represent
some of the most common frameworks found in the present generation of text mining systems. Good sense,
however, should be the guide for prospective system architects of text mining applications, and thus
significant variation on the general themes that have been identified is possible. System architects and
developers could include more of the filters typically found in a text mining system’s browser, or even
within subroutines contained among the system’s store of refinement techniques, as “preset” options within
search algorithms included in its main discovery algorithms. Likewise, it is conceivable that a particular
text mining system’s refinement techniques or main discovery algorithms might later find a very fruitful use
for background knowledge.
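The use of unary and binary predicates as query constraints can be sketched in a few lines. The predicate names, the facts they contain, the sample documents, and both helper functions are assumptions invented for this example; in a real system the predicates would come from parsing an external source and would live in a persistent knowledge base rather than in in-memory dictionaries.

```python
# Hypothetical background knowledge, e.g. as parsed from an external ontology.
unary = {    # predicate name -> set of concepts it holds for
    "is_company": {"IBM", "Microsoft"},
    "is_person": {"Bill Gates"},
}
binary = {   # predicate name -> set of (subject, object) concept pairs
    "founder_of": {("Bill Gates", "Microsoft")},
}

# Assumed concept-labeled document collection.
docs = {
    "d1": {"IBM", "merger"},
    "d2": {"Bill Gates", "Microsoft"},
    "d3": {"Bill Gates", "charity"},
}

def docs_with(predicate):
    """Ids of documents mentioning at least one concept that satisfies a unary predicate."""
    allowed = unary[predicate]
    return sorted(d for d, concepts in docs.items() if concepts & allowed)

def docs_with_relation(predicate):
    """Ids of documents mentioning both ends of some pair in a binary predicate."""
    return sorted(
        d for d, concepts in docs.items()
        if any(s in concepts and o in concepts for s, o in binary[predicate])
    )

# Refining a query by combining two constraints.
companies_and_people = sorted(set(docs_with("is_company")) & set(docs_with("is_person")))
```

Adding or removing a predicate from the intersection corresponds to the adding and subtracting of constraints that a front end might expose to the user.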

Bibliography:

Ronen Feldman (Bar-Ilan University, Israel) and James Sanger (ABS Ventures, Waltham, Massachusetts),
The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data.
