08-Text Mining

The document provides an overview of text mining, explaining its definition, goals, and processes involved in analyzing unstructured and semi-structured textual data. It contrasts text mining with data mining, highlights various applications, and outlines the challenges faced in the field. Additionally, it details the steps in the text mining process, including feature generation, selection, and various operations like classification and sentiment analysis.

Uploaded by

alprn13aydn
Copyright
© © All Rights Reserved

Text Mining

CSE4416 – Introduction to Data Mining

Asst. Prof. Dr. Göksu Tüysüzoğlu


Outline

◘ What is Text Mining?
◘ Goals of Text Mining
◘ History
◘ Text Mining vs Data Mining
◘ Text Mining Process (6 Steps)
◘ Text Mining Applications
◘ Text Mining Operations
Introduction

◘ Information explosion
◘ 80% of information is stored as text
Documents, Journals, Reports, Web Pages, E-mails...
◘ It is difficult to extract specific information
◘ Text must be analyzed, organized, and summarized automatically

[Figure: bar chart (0-100%) of unstructured vs. structured data, by data volume and market cap]
Text Resources

• E-books 
• Scientific Articles
• News Articles
• Technical Documents
• Web Pages
• Email
• Contracts
• Patent Portfolios
• Customer Complaint Letters
• Transcripts of Phone Calls with Customers
• ...
What is Text Mining?

◘ Text Mining is the process of analyzing the relations, the patterns, and
the rules among textual data (semi-structured or unstructured text)
◘ Text Mining is the discovery by computer of new, previously unknown
information, by automatically extracting information from different
written resources.

[Figure label: Knowledge]
Text Mining Research Communities

◘ Text mining integrates research from several research communities, such as:
– Information Retrieval (IR)
– Information Extraction (IE)
– Natural Language Processing
– ...
“Search” versus “Discover”

                          Search (goal-oriented)    Discover (opportunistic)
Structured Data           Data Retrieval            Data Mining
Unstructured Data (Text)  Information Retrieval     Text Mining
Data Mining vs. Text Mining

                     Data Mining                          Text Mining
Data object          Numerical & categorical data         Textual data
Data structure       Structured                           Unstructured & semi-structured
Data representation  Straightforward                      Complex
Space dimension      < tens of thousands                  > tens of thousands
Methods              Data analysis, machine learning,     Information retrieval, NLP, ...
                     statistics, neural networks
Maturity             Broad implementation since 1994      Broad implementation starting from 2000
Market               Large and mid-size companies         Corporate workers and individual users
Text Mining Examples

◘ Information that not even the writer knows.
– e.g., Discovering a new method for hair growth that is described as a side effect of a different procedure
◘ Rediscover the information that the author encoded in the text
– e.g., Automatically extracting a product’s name from a web page.
Goals of Text Mining

◘ Goals:
– Minimize the amount of text we must read.
– Isolate pieces of information, such as the name of a person and
the person’s role in the company.
– Discover relationships between terms
– Discover valid, useful, and previously unknown knowledge
Semi-Structured Data

◘ Text databases are, in general, semi-structured

Example:

Structured attribute/value pairs:
• Title
• Author
• Publication_Date
• Length
• Category
Unstructured:
• Abstract
• Content
Unstructured Text

◘ Morphology - Structure of words


– Prefix - word stem - suffix, inflected word forms (e.g. manag- e/es/ing/ed)

◘ Syntax - Combination of words to form phrases


– Noun and verb groups, subject – predicate – object

◘ Semantics - Meaning of words and phrases


– Synonyms: Different words, but same meaning
True, right, correct
– Homonyms: Same word, but different meanings
Fall - to drop down
Fall - the season between summer and winter
Challenges of Text Mining

◘ Large textual database
– over 50,000,000,000 web pages
◘ High dimensionality
– Consider each word/phrase as a dimension
◘ Ambiguity
– Word ambiguity
• Pronouns (he, she …)
"John told Tom that he should go to the store."
• “buy”, “purchase”
– Semantic ambiguity
• The king saw the rabbit with his glasses.
◘ Noisy data
• Example: Spelling mistakes
◘ Not well structured text
– Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
Text Mining Process (6 Steps)
1. Collect relevant documents
– Identify source and documents
– Retrieve them from Web or from internal file systems
2. Pre-processing documents
– Syntactic/Semantic text analysis
3. Features Generation
– Text document is represented by the words it contains (e.g., “Lord of the rings” → {“the”,
“Lord”, “rings”, “of”})
– Identifies a word by its root (e.g., flying, flew → fly)
– Stop words: The most common words are unlikely to help text mining
e.g., “the”, “a”, “an”, “you” … ([Link])
4. Features Selection
– Reduce dimensionality by removing irrelevant features
e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
5. Text mining operations
– Patterns and relationships are discovered.
e.g., classification, clustering, association rule mining

6. Evaluation of text mining results


Feature Generation Example
Hi,
Here is your weekly update (that unfortunately hasn't gone out in about a month). Not
much action here right now.
1) Due to the unwavering insistence of a member of the group, the
[Link] package is now completely independent of the d2k
application.
2) Transformations are now handled differently in Tables. Previously, transformations were
done using a TransformationModule. That module could then be added to a list that an
ExampleTable kept. Now, there is an interface called Transformation and a sub-interface
called ReversibleTransformation.

hi, weekly update (that unfortunately gone out month). much action here right now. 1) due
unwavering insistence member group, [Link] package now
completely independent d2k application. 2) transformations now handled differently tables.
previously, transformations done using transformationmodule. module added list
exampletable kept. now, interface called transformation sub-interface called
reversibletransformation.

hi week update unfortunate go out month much action here right now 1 due unwaver
insistence member group ncsa d2k modules core datatype package now complete
independence d2k application 2 transformation now handle different table previous
transformation do use transformationmodule module add list exampletable keep now
interface call transformation sub-interface call reversibletransformation
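
The two transformations illustrated in this example (stop-word removal, then reduction of words to a root form) can be sketched in a few lines of Python. This is only a minimal sketch: the stop-word list and the suffix rules are small assumptions made for illustration, not the lists used to produce the text above, so its output will not match the example word for word. A real pipeline would normally rely on an established stemmer.

```python
import re

# Assumed, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "you", "of", "is", "are", "in", "to", "your"}

# Assumed toy suffix rules; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def generate_features(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(generate_features("Transformations are now handled differently in Tables."))
print(generate_features("Lord of the rings"))
```

The order of the steps matters here: stop words are removed before stemming so that the stop-word list can be written with ordinary word forms.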
Feature Selection Example
hi week update unfortunate go out month much action here right now 1 due unwaver
insistence member group ncsa d2k modules core datatype package now complete
independence d2k application 2 transformation now handle different table previous
transformation do use transformationmodule module add list exampletable keep now
interface call transformation sub-interface call reversibletransformation

Features before selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, 1, due, unwaver,
insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete,
independence, application, 2, transformation, handle, different, table, previous, use,
transformationmodule, add, list, exampletable, keep, interface, call, sub-interface,
reversibletransformation

Features after selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence,
member, group, modules, do, core, datatype, package, complete, independence, application,
transformation, handle, different, table, previous, use, add, list, keep, interface, call,
sub-interface
BoW (Bag of Words)
Document-Term Matrix

◘ Doc 1: I love dogs.
◘ Doc 2: I hate dogs and knitting.
◘ Doc 3: Knitting is my hobby and my passion.

Document-Term Matrix
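
As a minimal sketch, the document-term matrix for the three documents above can be built as follows. The tokenization rule (lowercase, letters only) and the alphabetical column order are assumptions made for this illustration; no stop-word removal or stemming is applied.

```python
import re

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and my passion.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(d) for d in docs]

# One column per distinct term, in alphabetical order.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document; each cell counts how often the term occurs.
matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the bag-of-words vector of one document: word order is discarded and only the counts remain.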
TF-IDF

signature words
TF-IDF Example

◘ Consider a document containing 100 words in which the word «dog» appears 3 times.

TF(dog) = 3 / 100 = 0.03

◘ Assume we have 10 million documents and the word «dog» appears in one thousand of them.

IDF(dog) = log10(10,000,000 / 1,000) = 4

◘ Thus, the TF-IDF weight for the word «dog» is

TF-IDF(dog) = TF(dog) × IDF(dog) = 0.03 × 4 = 0.12
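
The arithmetic of this example can be checked with a short sketch. It assumes the base-10 logarithm, which is what yields the value 4 above; the function names are chosen here for illustration.

```python
import math


def tf(term_count, total_terms):
    """Term frequency: share of the document's words that are the term."""
    return term_count / total_terms


def idf(num_documents, documents_with_term):
    """Inverse document frequency with a base-10 logarithm (assumed)."""
    return math.log10(num_documents / documents_with_term)


tf_dog = tf(3, 100)
idf_dog = idf(10_000_000, 1_000)
tfidf_dog = tf_dog * idf_dog

print(tf_dog, idf_dog, round(tfidf_dog, 2))
```

Other variants of IDF (natural logarithm, added smoothing terms) are also in common use and give different absolute weights for the same counts.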


Text Mining Applications

◘ Email: Spam filtering
◘ Marketing: Discover distinct groups of potential buyers according to a user's text-based profile
– e.g., Amazon
◘ Industry: Identifying groups of competitors' web pages
– e.g., competing products and their prices
◘ Job seeking: Identify parameters in searching for jobs
– e.g., [Link]
◘ Biomedical applications: Gene cluster identification, Protein interactions, Gene-disease associations, Protein-disease associations
Example – Job Information Extraction

[Link]-Job2

JobTitle: Ice Cream Guru


Employer: [Link]
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: [Link]/jobs_midwest.html
OtherCompanyJobs: [Link]-Job1
Text Mining Operations

1. Text clustering: Clustering groups of automatically retrieved text into a list of meaningful categories
2. Text classification: Cataloguing texts into categories
3. Document summarization: Creating a shortened version of a text containing the most important elements
4. Associations: Occurrence of A together with B
5. Question answering: Giving the user a (short) answer to their question
6. Sentiment analysis: Identifying and extracting subjective information in source materials (e.g., emotion, beliefs)
Text Mining Operations
1-Clustering
◘ Given:
– Set of documents and a similarity measure among documents

◘ Similarity measure:
– Problem-specific measures
e.g., how many words are common in these documents

◘ Find:
– Clustering of related documents
• Documents in one cluster are more similar to one another
• Documents in separate clusters are less similar to one another

◘ Methods:
– Partitioning Methods
– Hierarchical Methods

[Figure: documents source and similarity measure feed a clustering system, which outputs groups of documents]
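
A minimal sketch of the idea, assuming the "words in common" similarity mentioned above is measured with the Jaccard coefficient and that clusters are merged whenever any two of their documents reach a chosen threshold (a simple single-link, hierarchical-style procedure). The example documents and the threshold value are hypothetical.

```python
import re


def words(text):
    """Set of distinct lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words of the two documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold):
    """Merge clusters until no two clusters contain a sufficiently similar pair."""
    bags = [words(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    merged = True
    while merged:
        merged = False
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if any(jaccard(bags[i], bags[j]) >= threshold
                       for i in clusters[x] for j in clusters[y]):
                    clusters[x].extend(clusters.pop(y))
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]


docs = [
    "the team won the football match",
    "football fans watched the match",
    "stock prices fell sharply",
    "investors sold stock as prices fell",
]
print(cluster(docs, threshold=0.3))
```

The result lists document indices per cluster. Partitioning methods such as k-means work differently: they fix the number of clusters in advance and need a vector representation of each document.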
Text Mining Operations
2-Classification
BOISE, Idaho (CNN) -- Cooler weather Monday in the U.S. Pacific Northwest may help more than
27,000 firefighters in their marathon battle against dozens of wildfires. Ten new large fires were reported
overnight into Monday, bringing to 40 the number of large fires ablaze, said officials at the National
Interagency Fire Center. They said between 300,000 and 400,000 acres are aflame. The good news is
that five large fires were contained Sunday. So far, about 2.8 million acres of forest have burned this
year, making 2001 an average year for fires. Center officials said 241 fires were reported Sunday into
Monday but 96 percent were contained or extinguished in "initial attacks" by firefighters. A large fire is
defined as a fire burning uncontained and extending over 100 acres or more.

• Politics
• Economic
• Social
• Environmental
• Cultural

e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”.
Text Mining Operations
2-Classification
◘ Use training set to build the model

◘ Use testing set to test the model

◘ Classification Techniques
– Decision trees
– Neural networks
– Bayesian classification
Text Mining Operations
2-Classification – Decision Tree
Categorization of Documents

Model Construction
1. Training a decision tree on the Training Archive
Example: Annotation of the training archive with n types of events

Model Usage
2. Applying the decision tree to New Documents (or the Test Archive) to obtain Classified Documents
Example: Tagging new documents with one of n known events
Text Mining Operations
2-Classification – Decision Tree


Training Set
Ex#  Text                                                Hooligan (class)
1    An English football fan …                           Yes
2    During a game in Italy …                            Yes
3    England has been beating France …                   Yes
4    Italian football fans were cheering …               No
5    An average USA salesman earns 75K                   No
6    The game in London was horrific                     Yes
7    Manchester city is likely to win the championship   Yes
8    Rome is taking the lead in the football league      Yes

Test Set
Text                                                     Hooligan (class)
A Danish football fan …                                  ?
Turkey is playing vs. France. The Turkish fans …         ?

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
Text Mining Operations
2-Classification – Neural Network

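
[Figure: multi-layer neural network]
Input vector (words): as, ball, cool, pain, Spain, foot, break, FC, soccer, football
Weights vector (input → hidden nodes): 0.9, 3, -0.4, -0.8, 0, 0.7, 1.2, -0.2, 1, -2
Weights vector (hidden nodes → output): 2.5, -0.9, 1.2
Output: Threshold → Hooligan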

◘ Multi-layer
Text Mining Operations
2-Classification – Bayesian Classification
◘ The classification problem may be formalized using probabilities:
P(C|X) = probability that example X belongs to class C
e.g. P(Hooligan | English, fan, married…)

◘ Idea: assign to example X the class label C such that P(C|X) is maximal
◘ Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
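
The rule "pick the class C that maximizes P(C|X)" can be sketched as a small text classifier. The sketch below makes two assumptions that the slide does not state: words are treated as conditionally independent given the class (the "naive" Bayes assumption), and add-one smoothing is used so unseen word/class pairs do not produce zero probabilities. The training sentences are hypothetical. Because P(X) is the same for every class, it is dropped from the comparison.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """examples: list of (text, label). Collect class and word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log P(C) + sum of log P(word | C)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data in the spirit of the Hooligan example.
model = train([
    ("football fans fought after the game", "yes"),
    ("fans threw bottles at the football match", "yes"),
    ("the salesman earns a good salary", "no"),
    ("the company reported strong sales", "no"),
])
print(predict(model, "football fans at the game"))
print(predict(model, "the salesman reported sales"))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on longer documents.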
Text Mining Operations
3-Associations
◘ Example:

In a given food-industry corpus:

“98% of the documents that discuss apple juice relate it to the chromatography analytical technique”

X → Y : “apple juice → chromatography”
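
The percentage attached to a rule X → Y is usually called its confidence: of the documents that mention X, the fraction that also mention Y. A minimal sketch, assuming each document has already been reduced to a set of terms; the four-document corpus is hypothetical and far smaller than anything that would justify a 98% figure.

```python
def support(documents, terms):
    """Fraction of documents that contain every term in `terms`."""
    terms = set(terms)
    return sum(terms <= doc for doc in documents) / len(documents)


def confidence(documents, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the corpus."""
    both = set(antecedent) | set(consequent)
    return support(documents, both) / support(documents, antecedent)


# Hypothetical corpus: each document is the set of terms it mentions.
corpus = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography", "sugar"},
    {"apple juice", "packaging"},
    {"milk", "pasteurization"},
]

print(support(corpus, {"apple juice"}))
print(confidence(corpus, {"apple juice"}, {"chromatography"}))
```

In practice a rule is reported only if both its support and its confidence exceed chosen minimum values, which filters out rules based on very few documents.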


Text Mining Operations
4-Summarizing
◘ High-level summary or survey of all main points

◘ Example: sentence extraction from a single document

◘ Method (How to summarize a collection?)
– Start with a training set, which allows evaluation
– Create heuristics to identify important sentences
– Classification function estimates the probability a given sentence is
included in the abstract
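
Sentence extraction can be illustrated with a much simpler stand-in for the trained classification function described above: a hand-written heuristic that scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones. This is a sketch under that assumption, not the method from the slide, and the sample text is hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


text = ("Wildfires burned large areas. "
        "Firefighters contained five large wildfires. "
        "Rain is expected.")
print(summarize(text, num_sentences=1))
```

A learned approach would replace `score` with a model trained on documents paired with their abstracts, which is also what makes evaluation against held-out abstracts possible.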
Text Mining Operations
5-Question Answering
◘ Question Answering is a computer science discipline within the
fields of information retrieval and natural language processing.
◘ A QA system generates valid answers to questions asked by a user.

◘ Examples
– Question: Where is the Louvre Museum located?
Answer: The Louvre Museum is located in Paris
– Question: Who was the prime minister of Australia during the Great
Depression?
Answer: James Scullin (Labor) 1929–31

◘ Method:
– Heuristics about question type: who, when, where
– Match up noun phrases within and across documents
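
The first heuristic (question type) can be sketched as a lookup from the question word to the kind of answer to search for. The answer-type labels below are hypothetical names chosen for illustration; matching noun phrases across documents, the second step, is not covered by this sketch.

```python
# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}


def expected_answer_type(question):
    """Guess the answer type from the first word of the question."""
    tokens = question.strip().lower().split()
    if not tokens:
        return "UNKNOWN"
    return ANSWER_TYPES.get(tokens[0], "UNKNOWN")


print(expected_answer_type("Where is the Louvre Museum located?"))
print(expected_answer_type("Who was the prime minister of Australia during the Great Depression?"))
```

Knowing the expected type narrows the candidate answers: for a "where" question, only phrases recognized as locations need to be considered.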
Text Mining Operations
6-Sentiment analysis
◘ Computational study of opinions, sentiments and emotions in text
◘ Task: Classifying the expressed opinion of a text (positive, negative,
neutral)
◘ Sentiment analysis uses NLP, text analysis, and computational
techniques to automate the extraction or classification of sentiment
from text
◘ Different levels of Sentiment Analysis:
– Document based: One score for the whole document (e.g. review)
– Sentence based: Does sentence express positive, neutral, negative opinion
– Aspect based: Focus on a specific aspect and identify what people like/dislike
◘ It has become a hot area in decision-making
– 97% of customers read online reviews for local businesses in 2017 (Local Consumer Review Survey 2017)
– 85% of consumers trust online reviews as much as personal recommendations
(Local Consumer Review Survey 2017)
Text Mining Operations
6-Sentiment analysis
Keras IMDB Movie Review Dataset [1]
◘ 50,000 reviews (25,000 for training and 25,000 for testing, each set with 12,500
positive and 12,500 negative reviews)
◘ It is a binary (0 = negative or 1 = positive) classification problem.
– Negative reviews have a score of 4 or less out of 10,
– Positive reviews have a score of 7 or more out of 10.
– Thus, neutral rated reviews are not included in the train/test sets.

◘ A convolutional neural network (CNN) was used for the sentiment classification task.
◘ The length of the input sequences was set to 500 (longer reviews are truncated and
shorter reviews are padded with zeros at the end)
◘ A total of 2,813 wrong and 22,187 correct classifications (Accuracy: 88.75%)
◘ [1] Maas et al. (2011): Learning Word Vectors for Sentiment Analysis
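
The fixed input length described above can be sketched in plain Python. The function name is hypothetical, and one detail is an assumption: the slide does not say which end of a long review is cut, so this sketch keeps the first `max_length` tokens. Padding is added at the end, as stated above.

```python
def pad_or_truncate(sequence, max_length=500, pad_value=0):
    """Keep the first max_length items and right-pad shorter sequences."""
    trimmed = list(sequence[:max_length])
    return trimmed + [pad_value] * (max_length - len(trimmed))


# Reviews are assumed to be lists of integer word indices.
short_review = [12, 7, 431]
long_review = list(range(1, 601))

print(pad_or_truncate(short_review, max_length=5))
print(len(pad_or_truncate(long_review)))

# Accuracy reported above: correct / (correct + wrong).
accuracy = 22187 / (22187 + 2813)
print(round(accuracy * 100, 2))
```

Library helpers for this step exist, but their defaults for the padding and truncation side may differ from the behavior described here, so the settings should be checked explicitly.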
Language Models

◘ A language model uses machine learning to estimate a probability distribution over
words, which is used to predict the most likely next word in a sentence based on the
preceding words.
◘ Language models learn from text and can be used for
– producing original text,
– predicting the next word in a text,
– part-of-speech (POS) tagging,
– text summarization,
– machine translation,
– question answering,
– speech recognition,
– optical character recognition and
– handwriting recognition.
Language Models

◘ PROBABILISTIC LANGUAGE MODEL


– N-grams

◘ NEURAL NETWORK-BASED LANGUAGE MODELS


– RECURRENT NEURAL NETWORKS (RNN)
Long short-term memory (LSTM) or gated recurrent unit (GRU)
– TRANSFORMERS
OpenAI GPT, Google’s BERT
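
The probabilistic (N-gram) approach can be illustrated with the smallest useful case, a bigram model (N = 2), which predicts the next word from the single preceding word using counts. This is a minimal sketch: the three training sentences are hypothetical, the `<s>` start marker is a convention assumed here, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Relative frequencies of the words seen after `prev`."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()} if total else {}


def most_likely_next(counts, prev):
    """Most probable next word, or None if `prev` was never seen."""
    dist = next_word_distribution(counts, prev)
    return max(dist, key=dist.get) if dist else None


model = train_bigram(["i love dogs", "i love knitting", "i hate dogs"])
print(next_word_distribution(model, "i"))
print(most_likely_next(model, "i"))
```

Neural language models address the main weakness visible here: a count-based model assigns no probability to word sequences it has never seen and cannot use context beyond the last N-1 words.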
TRANSFORMER MODELS

◘ BERT uses encoders only, GPT uses decoders only.
◘ Both options understand language
including syntax and semantics.
• BERT: classification, questions
and answers, summarization,
named entity recognition
• GPT: translation, text generation
◘ The outputs of the core models are
different:
• BERT (encoder): Embeddings
representing words with attention
information in a certain context
• GPT (decoder): Next words with
probabilities
