08-Text Mining

The document provides an overview of text mining, explaining its definition, goals, and processes involved in analyzing unstructured and semi-structured textual data. It contrasts text mining with data mining, highlights various applications, and outlines the challenges faced in the field. Additionally, it details the steps in the text mining process, including feature generation, selection, and various operations like classification and sentiment analysis.

Uploaded by

alprn13aydn

Text Mining

CSE4416 – Introduction to Data Mining

Asst. Prof. Dr. Göksu Tüysüzoğlu

Outline

◘ What is Text Mining?

◘ Goals of Text Mining
◘ History
◘ Text Mining vs Data Mining
◘ Text Mining Process (6 Steps)
◘ Text Mining Applications
◘ Text Mining Operations
Introduction

◘ Information explosion
◘ About 80% of information is stored as text
Documents, Journals, Reports, Web Pages, E-mails...
◘ Difficult to extract specific information
◘ It is necessary to automatically analyze, organize, summarize...

[Figure: bar chart comparing the shares of unstructured and structured data (scale 0 to 100) for data volume and market cap]
Text Resources

• E-books 
• Scientific Articles
• News Articles
• Technical Documents
• Web Pages
• Email
• Contracts
• Patent Portfolios
• Customer Complaint Letters
• Transcripts of Phone Calls with Customers
• ...
What is Text Mining?

◘ Text Mining is the process of analyzing the relations, the patterns, and
the rules among textual data (semi-structured or unstructured text)
◘ Text Mining is the discovery by computer of new, previously unknown
information, by automatically extracting information from different
written resources.

[Figure label: Knowledge]
Text Mining Research Communities

◘ Text mining research integrates work from several research communities, such as:
– Information Retrieval (IR)
– Information Extraction (IE)
– Natural Language Processing
– ...
“Search” versus “Discover”

| | Search (goal-oriented) | Discover (opportunistic) |
|---|---|---|
| Structured data | Data retrieval | Data mining |
| Unstructured data (text) | Information retrieval | Text mining |
Data Mining vs. Text Mining

| | Data Mining | Text Mining |
|---|---|---|
| Data object | Numerical & categorical data | Textual data |
| Data structure | Structured | Unstructured & semi-structured |
| Data representation | Straightforward | Complex |
| Space dimension | < tens of thousands | > tens of thousands |
| Methods | Data analysis, machine learning, statistics, neural networks | Information retrieval, NLP, ... |
| Maturity | Broad implementation since 1994 | Broad implementation starting from 2000 |
| Market | Large and mid-size companies | Corporate workers and individual users |
Text Mining Examples

◘ Discover information that not even the writer knows.

– e.g., Discovering a new method for hair growth that is described as a side effect of a different procedure

◘ Rediscover the information that the author encoded in the text

– e.g., Automatically extracting a product’s name from a web page.
Goals of Text Mining

◘ Goals:
– Minimize the amount of text we must read.
– Isolate pieces of information, such as the name of a person and the person’s role in the company.
– Discover relationships between terms
– Discover valid, useful, and previously unknown knowledge
Semi-Structured Data

◘ Text databases are, in general, semi-structured

Example:

Structured attribute/value pairs:
• Title
• Author
• Publication_Date
• Length
• Category

Unstructured:
• Abstract
• Content
Unstructured Text

◘ Morphology - Structure of words

– Prefix - word stem - suffix, inflected word forms (e.g. manag- e/es/ing/ed)

◘ Syntax - Combination of words to form phrases

– Noun and verb groups, subject – predicate – object

◘ Semantics - Meaning of words and phrases

– Synonyms: Different words, but same meaning
True, right, correct
– Homonyms: Same word, but different meanings
Fall - to drop down
Fall - the season between summer and winter
Challenges of Text Mining

◘ Large textual database

– over 50,000,000,000 web pages
◘ High dimensionality
– Consider each word/phrase as a dimension
◘ Ambiguity
– Word ambiguity
• Pronouns (he, she …)
"John told Tom that he should go to the store."
• “buy”, “purchase”
– Semantic ambiguity
• The king saw the rabbit with his glasses.
◘ Noisy data
• Example: Spelling mistakes
◘ Not well structured text
– Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
Text Mining Process (6 Steps)
1. Collect relevant documents
– Identify source and documents
– Retrieve them from Web or from internal file systems
2. Pre-processing documents
– Syntactic/Semantic text analysis
3. Feature Generation
– Text document is represented by the words it contains (e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”})
– Identifies a word by its root (e.g., flying, flew → fly)
– Stop words: The most common words are unlikely to help text mining
e.g., “the”, “a”, “an”, “you” … ([Link])
4. Feature Selection
– Reduce dimensionality by removing irrelevant features
e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
5. Text mining operations
– Patterns and relationships are discovered.
e.g., classification, clustering, association rule mining

6. Evaluation of text mining results

Feature Generation Example
Hi,
Here is your weekly update (that unfortunately hasn't gone out in about a month). Not
much action here right now.
1) Due to the unwavering insistence of a member of the group, the
[Link] package is now completely independent of the d2k
application.
2) Transformations are now handled differently in Tables. Previously, transformations were
done using a TransformationModule. That module could then be added to a list that an
ExampleTable kept. Now, there is an interface called Transformation and a sub-interface
called ReversibleTransformation.

After lowercasing and removing stop words:

hi, weekly update (that unfortunately gone out month). much action here right now. 1) due
unwavering insistence member group, [Link] package now
completely independent d2k application. 2) transformations now handled differently tables.
previously, transformations done using transformationmodule. module added list
exampletable kept. now, interface called transformation sub-interface called
reversibletransformation.

After stemming and removing punctuation:

hi week update unfortunate go out month much action here right now 1 due unwaver
insistence member group ncsa d2k modules core datatype package now complete
independence d2k application 2 transformation now handle different table previous
transformation do use transformationmodule module add list exampletable keep now
interface call transformation sub-interface call reversibletransformation
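
The three stages shown in this example (raw text, stop-word removal, stemming) can be sketched in a few lines. This is only an illustration: the stop-word list and the suffix-stripping `toy_stem` function are assumptions made for the sketch, not the tools used to produce the slide. A rule this simple cannot map irregular forms such as "flew" to "fly"; a real stemmer or lemmatizer would be needed for that.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "you", "of", "is", "your", "here", "in", "to"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "es", "ed", "s", "e")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def generate_features(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


print(generate_features("Lord of the rings"))  # ['lord', 'ring']
print([toy_stem(w) for w in ["manage", "manages", "managing", "managed"]])
```

The second call reproduces the morphology example from earlier in the deck: all four inflected forms reduce to the stem "manag".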
Feature Selection Example
hi week update unfortunate go out month much action here right now 1 due unwaver
insistence member group ncsa d2k modules core datatype package now complete
independence d2k application 2 transformation now handle different table previous
transformation do use transformationmodule module add list exampletable keep now
interface call transformation sub-interface call reversibletransformation
All generated features (44):
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, 1, due, unwaver, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, 2, transformation, handle, different, table, previous, use, transformationmodule, add, list, exampletable, keep, interface, call, sub-interface, reversibletransformation

Selected features (36):
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence, member, group, modules, do, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, use, add, list, keep, interface, call, sub-interface
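
The slide does not state which criterion was used to drop features. One common filter, shown here as a sketch, keeps a term only if it appears in enough documents to be informative but not in almost all of them. The function name, the thresholds `min_df` and `max_df_ratio`, and the three toy documents are assumptions for illustration.

```python
from collections import Counter


def select_features(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies within the given bounds."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


# Hypothetical, already pre-processed documents.
docs = [
    ["now", "transformation", "table", "d2k"],
    ["now", "transformation", "module", "list"],
    ["now", "interface", "table", "call"],
]

selected = select_features(docs)
print(selected)  # ['table', 'transformation']
```

Here "now" is dropped because it occurs in every document, and terms such as "d2k" are dropped because they occur in only one.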
BoW (Bag of Words)
Document-Term Matrix

◘ Doc 1: I love dogs.

◘ Doc 2: I hate dogs and knitting.
◘ Doc 3: Knitting is my hobby and my passion.

Document-Term Matrix
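
As a minimal sketch, the document-term matrix for the three documents above can be built with the standard library alone. The tokenizer (lowercasing, letters only) and the alphabetical column order are choices made for this illustration.

```python
import re
from collections import Counter

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and my passion.",
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term; Doc 3 gets a count of 2 in the "my" column because the word occurs twice.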
TF-IDF

signature words
TF-IDF Example

◘ Consider a document containing 100 words wherein the word

«dog» appears 3 times.

TF(dog) = 3 / 100 = 0.03.

◘ Assume we have 10 million documents and the word

«dog» appears in one thousand of these.

IDF(dog) = log10(10,000,000 / 1,000) = 4.

◘ Thus, the TF-IDF weight for the word «dog» is

TF-IDF(dog) = TF(dog) × IDF(dog) = 0.03 × 4 = 0.12.
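
The arithmetic of this example can be checked with a short sketch. It assumes the plain definitions implied by the numbers above: TF is the term count divided by the document length, and IDF is the base-10 logarithm of the total number of documents divided by the number of documents containing the term. Other TF-IDF variants (smoothing, natural logarithm) exist and give different values.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: share of the document's words that are the term."""
    return term_count / doc_length


def idf(n_docs, docs_with_term):
    """Inverse document frequency using a base-10 logarithm."""
    return math.log10(n_docs / docs_with_term)


tf_dog = tf(3, 100)
idf_dog = idf(10_000_000, 1_000)
tfidf_dog = tf_dog * idf_dog

print(tf_dog, idf_dog, round(tfidf_dog, 2))
```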

Text Mining Applications

◘ Email: Spam filtering

◘ Marketing: Discover distinct groups of potential buyers according to a text-based user profile
– e.g. amazon
◘ Industry: Identifying groups of competitors' web pages
– e.g., competing products and their prices
◘ Job seeking: Identify parameters in searching for jobs
– e.g., [Link]
◘ Biomedical applications: Gene cluster identification, Protein interactions, Gene-disease associations, Protein-disease associations
Example – Job Information Extraction

[Link]-Job2

JobTitle: Ice Cream Guru

Employer: [Link]
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: [Link]/jobs_midwest.html
OtherCompanyJobs: [Link]-Job1
Text Mining Operations

1. Text clustering: Clustering groups of automatically retrieved text into a list of meaningful categories

2. Text classification: Cataloguing texts into categories

3. Document summarization: Creating a shortened version of a text containing the most important elements

4. Associations: Occurrence of A together with B

5. Question answering: Giving the user a (short) answer to their question

6. Sentiment analysis: Identifying and extracting subjective information in source materials (e.g., emotion, beliefs)
Text Mining Operations
1-Clustering
◘ Given:
– Set of documents and a similarity measure among documents

◘ Similarity measure:
– Problem-specific measures
e.g., how many words are common in these documents

◘ Find:
– Clustering of related documents
• Documents in one cluster are more similar to one another
• Documents in separate clusters are less similar to one another

◘ Methods:
– Partitioning Methods
– Hierarchical Methods

[Figure: a documents source and a similarity measure feed a clustering system, which outputs groups of documents]
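
A hierarchical method driven by a shared-word similarity can be sketched as follows. The Jaccard measure, the single-link merge rule, the `threshold` parameter, and the four toy documents are assumptions chosen for this illustration; the slide only asks for some problem-specific similarity measure.

```python
def jaccard(doc_a, doc_b):
    """Share of distinct words that two documents have in common."""
    a, b = set(doc_a), set(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold):
    """Single-link agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(docs))]
    while True:
        best = None  # (similarity, index of cluster x, index of cluster y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    jaccard(docs[i], docs[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, x, y)
        if best is None:
            return clusters  # no pair is similar enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Hypothetical pre-processed documents.
docs = [
    ["football", "fan", "game"],
    ["football", "game", "league"],
    ["stock", "market", "price"],
    ["market", "price", "index"],
]

print(cluster(docs, threshold=0.3))  # [[0, 1], [2, 3]]
```

The two sports documents share two of four distinct words (similarity 0.5), as do the two finance documents, while the cross-topic pairs share none, so merging stops with two clusters.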
Text Mining Operations
2-Classification
BOISE, Idaho (CNN) -- Cooler weather Monday in the U.S. Pacific Northwest may help more than
27,000 firefighters in their marathon battle against dozens of wildfires. Ten new large fires were reported
overnight into Monday, bringing to 40 the number of large fires ablaze, said officials at the National
Interagency Fire Center. They said between 300,000 and 400,000 acres are aflame. The good news is
that five large fires were contained Sunday. So far, about 2.8 million acres of forest have burned this
year, making 2001 an average year for fires. Center officials said 241 fires were reported Sunday into
Monday but 96 percent were contained or extinguished in "initial attacks" by firefighters. A large fire is
defined as a fire burning uncontained and extending over 100 acres or more.

• Politics
• Economic
• Social
• Environmental
• Cultural

e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”.
Text Mining Operations
2-Classification
◘ Use training set to build the model

◘ Use testing set to test the model

◘ Classification Techniques
– Decision trees
– Neural networks
– Bayesian classification
Text Mining Operations
2-Classification – Decision Tree
Categorization of Documents

Model Construction:
1. Training a decision tree on the training archive (checked against the test archive)
Example: Annotation of the training archive with n types of events

Model Usage:
2. Applying the decision tree to new documents to obtain classified documents
Example: Tagging new documents with one of n known events
Text Mining Operations
2-Classification – Decision Tree

[Figure: training set → learn classifier → model; the model is then applied to the test set]

Training set:

| Ex# | Text | Hooligan |
|---|---|---|
| 1 | An English football fan … | Yes |
| 2 | During a game in Italy … | Yes |
| 3 | England has been beating France … | Yes |
| 4 | Italian football fans were cheering … | No |
| 5 | An average USA salesman earns 75K | No |
| 6 | The game in London was horrific | Yes |
| 7 | Manchester city is likely to win the championship | Yes |
| 8 | Rome is taking the lead in the football league | Yes |

Test set:

| Text | Hooligan |
|---|---|
| A Danish football fan | ? |
| Turkey is playing vs. France. The Turkish fans … | ? |
Text Mining Operations
2-Classification – Neural Network

[Figure: multi-layer neural network for the “Hooligan” class]

Output: Hooligan (threshold)

Weights vector: 2.5, -0.9, 1.2

Hidden nodes

Weights vector: 0.9, 3, -0.4, -0.8, 0, 0.7, 1.2, -0.2, 1, -2

Input vector: as, ball, cool, pain, Spain, foot, break, FC, soccer, football

◘ Multi-layer
Text Mining Operations
2-Classification – Bayesian Classification
◘ The classification problem may be formalized using probabilities:
P(C|X) = probability that example X belongs to class C
e.g. P(Hooligan | English, fan, married…)

◘ Idea: assign to example X the class label C such that P(C|X) is maximal
◘ Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
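
One way to apply Bayes' theorem to text is a naive Bayes classifier, which additionally assumes that words are independent given the class. That independence assumption, the add-one (Laplace) smoothing, and the choice to skip unseen words are additions made for this sketch. The training texts are the eight “Hooligan” examples from the decision tree slide, used as they appear there (truncated).

```python
import math
import re
from collections import Counter, defaultdict

TRAINING_SET = [
    ("An English football fan", "Yes"),
    ("During a game in Italy", "Yes"),
    ("England has been beating France", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester city is likely to win the championship", "Yes"),
    ("Rome is taking the lead in the football league", "Yes"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary


def predict(model, text):
    """Return the class C that maximizes log P(C) + sum of log P(word | C)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    # Words never seen in training carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for token in tokens:
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_SET)
print(predict(model, "A Danish football fan"))
```

P(X) is the same for every class, so it can be dropped when only the most probable class is needed; the sketch therefore compares P(X|C)·P(C) in log space.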
Text Mining Operations
3-Associations
◘ Example:

In a given food-industry corpus:

“98% of the documents that discuss apple juice relate it to the chromatography analytic technique”

X → Y: “apple juice → chromatography”
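
The percentage quoted in such a rule is its confidence: among the documents that mention X, the share that also mention Y. A minimal sketch of computing support and confidence follows; the four-document corpus is hypothetical and does not reproduce the 98% figure.

```python
def rule_stats(docs, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [d for d in docs if antecedent <= d]
    with_both = [d for d in with_x if consequent <= d]
    support = len(with_both) / len(docs)
    confidence = len(with_both) / len(with_x) if with_x else 0.0
    return support, confidence


# Hypothetical corpus: each document is the set of terms it mentions.
docs = [
    {"apple juice", "chromatography", "sugar"},
    {"apple juice", "chromatography"},
    {"apple juice", "packaging"},
    {"milk", "pasteurization"},
]

support, confidence = rule_stats(docs, {"apple juice"}, {"chromatography"})
print(support, round(confidence, 2))  # 0.5 0.67
```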

Text Mining Operations
4-Summarizing
◘ High-level summary or survey of all main points

◘ Example: sentence extraction from a single document

◘ Method (How to summarize a collection?)

– Start with training set, allows evaluation
– Create heuristics to identify important sentences
– Classification function estimates the probability a given sentence is
included in the abstract
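
Sentence extraction can be illustrated with a simple heuristic in place of a trained classification function: score each sentence by the average corpus frequency of its words and keep the top-scoring ones. The scoring rule, the function name, and the sample text are assumptions for this sketch, not the method described on the slide.

```python
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


# Hypothetical three-sentence document.
text = (
    "Fires burned across the region. "
    "Firefighters contained five large fires. "
    "The weather was cool."
)

print(summarize(text, n_sentences=1))  # ['Fires burned across the region.']
```

With a labeled training set, the `score` function could be replaced by a classifier that estimates the probability that a sentence belongs in the abstract, as the slide describes.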
Text Mining Operations
5-Question Answering
◘ Question Answering is a computer science discipline within the
fields of information retrieval and natural language processing.
◘ A QA system generates valid answers to questions asked by a user.

◘ Examples
– Question: Where is the Louvre Museum located?
Answer: The Louvre Museum is located in Paris.
– Question: Who was the prime minister of Australia during the Great
Depression?
Answer: James Scullin (Labor) 1929–31

◘ Method:
– Heuristics about question type: who, when, where
– Match up noun phrases within and across documents
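
The question-type heuristic can be sketched as a lookup from the question word to the kind of answer expected. The type labels (`PERSON`, `DATE`, `LOCATION`) and the function name are hypothetical; a full system would then search the documents for noun phrases of that type.

```python
# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
}


def expected_answer_type(question):
    """Guess the answer type from the first word of the question."""
    words = question.strip().lower().split()
    first_word = words[0] if words else ""
    return ANSWER_TYPES.get(first_word, "UNKNOWN")


print(expected_answer_type("Where is the Louvre Museum located?"))  # LOCATION
print(expected_answer_type(
    "Who was the prime minister of Australia during the Great Depression?"
))  # PERSON
```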
Text Mining Operations
6-Sentiment analysis
◘ Computational study of opinions, sentiments and emotions in text
◘ Task: Classifying the expressed opinion of a text (positive, negative,
neutral)
◘ Sentiment analysis uses NLP, text analysis, and computational
techniques to automate the extraction or classification of sentiment
from text
◘ Different levels of Sentiment Analysis:
– Document based: One score for the whole document (e.g. review)
– Sentence based: Does sentence express positive, neutral, negative opinion
– Aspect based: Focus on a specific aspect and identify what people like/dislike
◘ It has become a hot area in decision-making
– 97% of customers read online reviews for local businesses in 2017 (Local Consumer Review Survey 2017)
– 85% of consumers trust online reviews as much as personal recommendations (Local Consumer Review Survey 2017)
Text Mining Operations
6-Sentiment analysis
Keras IMDB Movie Review Dataset [1]
◘ 50,000 reviews (25,000 for training and 25,000 for testing, each set containing 12,500 positive and 12,500 negative reviews)
◘ It is a binary (0 = negative or 1 = positive) classification problem.
– The negative reviews have a score of 4 or lower out of 10,
– The positive reviews have a score of 7 or higher out of 10.
– Thus, neutral rated reviews are not included in the train/test sets.

◘ A convolutional neural network (CNN) was used for the sentiment classification task.
◘ The length of the input sequences was set to 500 (truncate longer reviews or pad shorter reviews with zeros at the end)
◘ Total of 2,813 wrong and 22,187 correct classifications (Accuracy: 88.75%)
◘ [1] Maas et al. (2011): Learning Word Vectors for Sentiment Analysis
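
The input preparation and the reported accuracy can be sketched without any deep learning library. The helper names are hypothetical, and keeping the first 500 tokens when truncating is an assumption, since the slide does not say which end of a long review is cut.

```python
def pad_or_truncate(sequence, max_len=500, pad_value=0):
    """Cut the sequence to max_len tokens, or pad it with zeros at the end."""
    trimmed = sequence[:max_len]  # assumption: keep the first max_len tokens
    return trimmed + [pad_value] * (max_len - len(trimmed))


def accuracy(n_correct, n_wrong):
    """Fraction of classifications that were correct."""
    return n_correct / (n_correct + n_wrong)


short_review = [5, 3, 8]           # hypothetical word indices
long_review = list(range(1, 8))    # hypothetical word indices 1..7

print(pad_or_truncate(short_review, max_len=5))  # [5, 3, 8, 0, 0]
print(pad_or_truncate(long_review, max_len=5))   # [1, 2, 3, 4, 5]
print(round(accuracy(22187, 2813) * 100, 2))     # 88.75
```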
Language Models

◘ A language model uses machine learning to estimate a probability distribution over words, which is used to predict the most likely next word in a sentence given the preceding words.
◘ Language models learn from text and can be used for
– producing original text,
– predicting the next word in a text,
– part-of-speech (POS) tagging,
– text summarization,
– machine translation,
– question answering,
– speech recognition,
– optical character recognition and
– handwriting recognition.
Language Models

◘ PROBABILISTIC LANGUAGE MODEL

– N-grams

◘ NEURAL NETWORK-BASED LANGUAGE MODELS

– RECURRENT NEURAL NETWORKS (RNN)
Long short-term memory (LSTM) or a gated recurrent unit (GRU)
– TRANSFORMERS
OpenAI GPT, Google’s BERT
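
The simplest probabilistic language model of this kind is a bigram model (an n-gram with n = 2), which estimates the next-word distribution from counts of adjacent word pairs. The sketch below uses a hypothetical three-sentence corpus and no smoothing; the `<s>` and `</s>` sentence-boundary markers are a common convention, assumed here.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Relative frequencies of the words observed after prev."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()} if total else {}


# Hypothetical training corpus.
corpus = ["the dog barks", "the dog sleeps", "the cat sleeps"]

counts = train_bigrams(corpus)
dist = next_word_distribution(counts, "the")
print(dist)
print(max(dist, key=dist.get))  # dog
```

After "the", the model assigns probability 2/3 to "dog" and 1/3 to "cat", so "dog" is predicted as the most likely next word.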
TRANSFORMER MODELS

◘ BERT uses encoders only, GPT uses

decoders only.
◘ Both options understand language
including syntax and semantics.
• BERT: classification, question answering, summarization, named entity recognition
• GPT: translation, text generation
◘ The outputs of the core models are
different:
• BERT (encoder): Embeddings
representing words with attention
information in a certain context
• GPT (decoder): Next words with
probabilities
