General Architecture of Text Mining Systems

Text mining systems aim to extract useful information from unstructured document collections through identifying patterns. They follow a general model with four main areas: preprocessing tasks to prepare data, core mining operations to discover patterns, presentation components for browsing results, and refinement techniques to optimize discovery. Preprocessing transforms documents into a structured format while core operations detect distributions, frequent concepts, and associations. Background knowledge sources can provide additional context to guide pattern analysis.

Uploaded by

vandana_korde

Text mining

Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a
document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text
mining seeks to extract useful information from data sources through the identification and exploration of
interesting patterns. In the case of text mining, however, the data sources are document collections, and
interesting patterns are found not among formalized database records but in the unstructured textual data in
the documents in these collections.
Text mining derives much of its inspiration and direction from seminal research on data mining. It is
therefore not surprising that text mining and data mining systems evince many high-level architectural
similarities. For instance, both types of systems rely on preprocessing routines, pattern-discovery
algorithms, and presentation-layer elements such as visualization tools to enhance the browsing
of answer sets. Further, text mining adopts many of the specific types of patterns in its core knowledge
discovery operations that were first introduced and vetted in data mining research.
Because data mining assumes that data have already been stored in a structured format, much of its
preprocessing focus falls on two critical tasks: scrubbing and normalizing data, and creating extensive
numbers of table joins. In contrast, for text mining systems, preprocessing operations center on the
identification and extraction of representative features for natural language documents. These preprocessing
operations are responsible for transforming unstructured data stored in document collections into a more
explicitly structured intermediate format, a concern that is not relevant for most data mining
systems.
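
As a rough illustration of what "transforming unstructured data into a structured intermediate format" can
mean in practice, the sketch below turns raw text into term-count features. It assumes a simple bag-of-words
representation; the stop-word list, the function names, and the sample documents are all hypothetical and
are not taken from the source.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}

def extract_features(text):
    """Turn one raw document into a term -> count mapping."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def preprocess(collection):
    """Map each document id to its structured feature representation."""
    return {doc_id: extract_features(text) for doc_id, text in collection.items()}

# Hypothetical two-document collection.
raw_documents = {
    "d1": "Text mining extracts patterns from document collections.",
    "d2": "Data mining assumes the data is stored in a structured format.",
}
structured = preprocess(raw_documents)
for doc_id, features in structured.items():
    print(doc_id, dict(features))
```

Real systems typically use richer features (for example, extracted entities or concepts) than raw term
counts; the point here is only that each document ends up as a record that later operations can query.
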
Moreover, because of the centrality of natural language text to its mission, text mining also draws on
advances made in other computer science disciplines concerned with the handling of natural language.
Perhaps most notably, text mining exploits techniques and methodologies from the areas of information
retrieval, information extraction, and corpus-based computational linguistics.

GENERAL ARCHITECTURE OF TEXT MINING SYSTEMS

At an abstract level, a text mining system takes in input (raw documents) and generates various types of
output (e.g., patterns, maps of connections, trends). Figure I.2 illustrates this basic paradigm. A
human-centered view of knowledge discovery, however, yields a slightly more complex input–output paradigm for
text mining (see Figure I.3). This paradigm is one in which a user is part of what might be seen as a
prolonged interactive loop of querying, browsing, and refining, resulting in answer sets that, in turn, guide
the user toward new iterative series of querying, browsing, and refining actions.
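
The query, browse, and refine loop described above can be sketched as a small simulation. This is only
an illustration under stated assumptions: the concept-labeled documents, the function names, and the
"user" who always refines with the most frequent co-occurring concept are hypothetical.

```python
from collections import Counter

# Hypothetical concept-labeled documents (the output of preprocessing).
DOCUMENTS = {
    "d1": {"merger", "bank", "regulation"},
    "d2": {"merger", "bank", "lawsuit"},
    "d3": {"merger", "airline"},
    "d4": {"bank", "regulation"},
}

def query(required):
    """Return ids of documents that contain every required concept."""
    return sorted(d for d, concepts in DOCUMENTS.items() if required <= concepts)

def browse(answer_set, required):
    """Count the concepts that co-occur with the query in the answer set."""
    counts = Counter()
    for doc_id in answer_set:
        counts.update(DOCUMENTS[doc_id] - required)
    return counts

def explore(start, rounds):
    """Simulate a user who refines the query with the top co-occurring concept."""
    required = set(start)
    history = []
    for _ in range(rounds):
        answers = query(required)
        history.append((sorted(required), answers))
        suggestions = browse(answers, required)
        if not suggestions:
            break
        # Highest count first; ties broken by name so the choice is deterministic.
        best = min(suggestions, key=lambda c: (-suggestions[c], c))
        required.add(best)
    return history

for step, (q, answers) in enumerate(explore({"merger"}, rounds=3), start=1):
    print(step, q, answers)
```

Each pass narrows the answer set, and the answer set in turn suggests the next refinement, which is the
feedback relationship the paragraph describes.
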

Functional Architecture
On a functional level, text mining systems follow the general model provided by some classic data mining
applications and are thus roughly divisible into four main areas: (a) preprocessing tasks, (b) core mining
operations, (c) presentation layer components and browsing functionality, and (d) refinement techniques.
- Preprocessing Tasks include all those routines, processes, and methods required to prepare data for a text
mining system’s core knowledge discovery operations. These tasks typically center on data source
preprocessing and categorization activities. Preprocessing tasks generally convert the information from each
original data source into a canonical format before applying various types of feature extraction methods
against these documents to create a new collection of documents fully represented by concepts. Where
possible, preprocessing tasks may also extract document date stamps, apply rules for creating them, or do
both. Occasionally, preprocessing tasks may even include specially designed methods used in the initial
fetching of appropriate “raw” data from disparate original data sources.
- Core Mining Operations are the heart of a text mining system and include pattern discovery, trend
analysis, and incremental knowledge discovery algorithms. Among the commonly used patterns for
knowledge discovery in textual data are distributions (and proportions), frequent and near frequent concept
sets, and associations. Core mining operations can also concern themselves with comparisons between, and
the identification of levels of “interestingness” in, some of these patterns. Advanced or domain-oriented
text mining systems, or both, can also augment the quality of their various operations by leveraging
background knowledge sources. These core mining operations in a text mining system have
also been referred to, collectively, as knowledge distillation processes.
- Presentation Layer Components include GUI and pattern browsing functionality as well as access to the
query language. Visualization tools and user-facing query editors and optimizers also fall under this
architectural category. Presentation layer components may include character-based or graphical tools for
creating or modifying concept clusters as well as for creating annotated profiles for specific concepts or
patterns.
- Refinement Techniques, at their simplest, include methods that filter redundant information and cluster
closely related data but may grow, in a given text mining system, to represent a full, comprehensive suite of
suppression, ordering, pruning, generalization, and clustering approaches aimed at discovery optimization.
These techniques have also been described as postprocessing.
Preprocessing tasks and core mining operations are the two most critical areas for any text mining system
and typically describe serial processes within a generalized view of text mining system architecture, as
shown in Figure I.4.
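
To make the pattern types named above more concrete, the following sketch computes a concept distribution,
frequent concept pairs, and simple association rules over concept-labeled documents, then applies two
refinement steps (pruning by a confidence threshold and ordering). The data, the function names, and the
use of support and confidence as the measures are assumptions made for illustration; the source does not
prescribe particular algorithms.

```python
from itertools import combinations
from collections import Counter

# Hypothetical concept-labeled documents, as produced by preprocessing.
DOCS = [
    {"iran", "oil", "sanctions"},
    {"iran", "oil"},
    {"oil", "opec"},
    {"iran", "sanctions"},
    {"oil", "opec", "prices"},
]

def concept_distribution(docs):
    """Proportion of documents in which each concept occurs."""
    counts = Counter(c for doc in docs for c in doc)
    return {c: n / len(docs) for c, n in counts.items()}

def frequent_pairs(docs, min_support):
    """Concept pairs whose co-occurrence count reaches min_support."""
    counts = Counter()
    for doc in docs:
        counts.update(combinations(sorted(doc), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

def associations(docs, min_support, min_confidence):
    """Rules lhs -> rhs built from frequent pairs, pruned and ordered."""
    single = Counter(c for doc in docs for c in doc)
    rules = []
    for (a, b), n in frequent_pairs(docs, min_support).items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / single[lhs]
            if confidence >= min_confidence:  # refinement: prune weak rules
                rules.append((lhs, rhs, n, confidence))
    # Refinement: order the surviving rules, strongest first.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

print(concept_distribution(DOCS))
for lhs, rhs, support, confidence in associations(DOCS, 2, 0.6):
    print(f"{lhs} -> {rhs}  support={support}  confidence={confidence:.2f}")
```

The serial relationship is visible in the code: the mining functions take only the concept sets that
preprocessing produced, and refinement acts on the mined rules rather than on the documents.
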
At a slightly more granular level of detail, one will often find that the processed document collection is
itself frequently intermediated, with respect to core mining operations, by some form of flat, compressed, or
hierarchical representation of its data, or a combination of these, to better support core mining operations
such as hierarchical tree browsing. This is illustrated in Figure I.5. The schematic in Figure I.5 also
factors in the typical positioning of refinement functionality. Further, it adds somewhat more detail with
respect to the relative functioning of core data mining algorithms.
Many text mining systems (and certainly those operating on highly domain-specific data sources, such as
medicine, financial services, high tech, genomics, proteomics, and chemical compounds) can benefit
significantly from access to special background or domain-specific data sources. Background knowledge is
often used for providing constraints to, or auxiliary information about, concepts found in the text mining
system’s document collection.
The background knowledge for a text mining system can be created in various ways. One common way is to
run parsing routines against external knowledge sources, such as formal ontologies, after which unary or
binary predicates for the concept-labeled documents in the text mining system’s document collection are
identified. These unary and binary predicates, which describe properties of the entities represented
by each concept deriving from the expert or “gold standard” information sources, are in turn put to use by a
text mining system’s query engine. In addition, such constraints can be used in a text mining system’s front
end to allow a user to either (a) create initial queries based around these constraints or (b) refine queries
over time by adding, subtracting, or concatenating constraints.
Commonly, background knowledge is preserved within a text mining system’s architecture in a persistent store
accessible by various elements of the system. This type of persistent store is sometimes loosely referred to
as a system’s knowledge base. The typical position of a knowledge base within the system architecture of a
text mining system can be seen in Figure 1.7.
These generalized architectures are meant to be more descriptive than prescriptive in that they represent
some of the most common frameworks found in the present generation of text mining systems. Good sense,
however, should be the guide for prospective system architects of text mining applications, and thus
significant variation on the general themes that have been identified is possible. System architects and
developers could, for example, include more of the filters typically found in a text mining system’s browser,
or even within subroutines contained among the system’s store of refinement techniques, as “preset” options
within search algorithms included in its main discovery algorithms. Likewise, it is conceivable that a
particular text mining system’s refinement techniques or main discovery algorithms might later find a very
fruitful use for background knowledge.
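
The use of unary and binary predicates as query constraints, described earlier in this section, can be
sketched as follows. Everything here is a hypothetical illustration: the predicate names, the facts in the
small in-memory knowledge base, and the `constrained_query` function are assumptions, not part of any
particular text mining system.

```python
# Hypothetical background knowledge, as it might be parsed from an external ontology.
UNARY = {                      # predicate -> concepts for which it holds
    "is_country": {"iran", "france"},
    "is_company": {"total", "shell"},
}
BINARY = {                     # predicate -> (subject, object) pairs
    "headquartered_in": {("total", "france")},
}

# Hypothetical concept co-occurrences found by core mining operations.
CO_OCCURRENCES = [
    ("total", "iran"),
    ("shell", "iran"),
    ("total", "france"),
    ("iran", "france"),
]

def holds(predicate, *args):
    """Check a unary or binary predicate against the knowledge base."""
    if len(args) == 1:
        return args[0] in UNARY.get(predicate, set())
    return tuple(args) in BINARY.get(predicate, set())

def constrained_query(pairs, left_type, right_type, exclude_relation=None):
    """Keep pairs whose concepts satisfy the unary constraints and, optionally,
    are not already linked by a known binary relation."""
    results = []
    for left, right in pairs:
        if not (holds(left_type, left) and holds(right_type, right)):
            continue
        if exclude_relation and holds(exclude_relation, left, right):
            continue
        results.append((left, right))
    return results

# Initial query: company-country co-occurrences.
print(constrained_query(CO_OCCURRENCES, "is_company", "is_country"))
# Refined query: additionally drop pairs already explained by the knowledge base.
print(constrained_query(CO_OCCURRENCES, "is_company", "is_country", "headquartered_in"))
```

The second call shows refinement by adding a constraint: the same answer set is narrowed without
re-running any mining step, because the constraint is evaluated against the persistent knowledge base.
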
Bibliography:

Ronen Feldman (Bar-Ilan University, Israel) and James Sanger (ABS Ventures, Waltham, Massachusetts),
“The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data”.
