Statistical Natural Language Processing
Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Honors AI
First offering in Spring 2010
MW 7:00-8:15pm
Whether it will be offered again depends on enrollment
Isn't more difficult than the regular section, but covers more topics at a faster pace
Web search technologies
Small-scale project to be done in teams of 1-2 people
Undergraduate AI Courses
COGS/CS 4314: Intelligent Systems Analysis (not offered every year)
COGS/CS 4315: Intelligent Systems Design (not offered every year)
CS 4365: Artificial Intelligence (Spring, and probably Fall)
CS 4365 (Honors): Artificial Intelligence (Spring only)
CS 4375: Introduction to Machine Learning (Fall only)
CS 4391: Introduction to Computer Vision (not offered every year; may be offered Fall or Spring)
Graduate AI Courses
CS 6320: Natural Language Processing
CS 6322: Information Retrieval
CS 6364: Artificial Intelligence
CS 6373: Intelligent Systems
CS 6375: Machine Learning
CS 6384: Computer Vision
CS 6395: Speech Recognition, Synthesis & Understanding
CS 6V81: Statistical Natural Language Processing
The Intelligent Systems Group
Dr. Sanda Harabagiu: information retrieval, natural language processing
Dr. Vasileios Hatzivassiloglu: natural language processing, bioinformatics
Dr. Yang Liu: speech and language processing
Dr. Dan Moldovan: natural language processing, knowledge representation
Dr. Vincent Ng: natural language processing
Dr. Haim Schweitzer: computer vision
Statistical Natural Language Processing
Where are the Flying Cars?
According to science fiction, the future has talking machines.
Metropolis (1926): false Maria
Star Wars: Episode IV: C-3PO (Maria's influence?)
2001: A Space Odyssey (1968): HAL (the HAL-9000)
Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Requires both understanding and generation
Natural Language Processing (NLP)
Natural language: languages that people use to communicate with one another
Ultimate goal: to build computer systems that perform as well at using natural languages as humans do
Immediate goal: to build computer systems that can process text and speech more intelligently
[Diagram: language flows from the human to the computer via language understanding, and back from the computer via language generation]
Why NLP?
Lots of information is in natural language format
Documents
News broadcasts
User utterances
Lots of users want to communicate in natural language
"Do what I mean!"
NLP is Useful
Application: Text Summarization
[Diagram: e-mail, letters, editorials, technical reports, and newswires are condensed into a multi-document summary in response to a request: "Summarize the public commentary regarding the prohibition of potassium hydroxide for peeling peaches."]
NLP is Useful
Application: Information Retrieval
[Diagram: an information need ("Advantages of using potassium hydroxide in any aspect of organic farming, especially ...") is matched against a text collection, returning relevant documents ranked by score: doc 1, doc 2, doc 3, ..., doc n]
NLP is Useful
Application: Question Answering
Retrieve not just relevant documents, but return the answer
[Diagram: the query "Which country has the largest part of the Amazon forest?" is run against a text collection, and the answer "Brazil" is returned]
NLP is Useful
Application: Information Extraction
AFGHANISTAN MAY BE PREPARING FOR ANOTHER TEST
Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.
Disaster Type: earthquake
location: Afghanistan
date: today
magnitude: 6.9
magnitude-confidence: high
damage:
  human-effect:
    victim: Thousands of people
    number: Thousands
    outcome: dead
  physical-effect:
    object: entire villages
    outcome: damaged
NLP is Useful
Application: Machine Translation
[Diagram: the question "How do you say octopus in Japanese?" is posed to a Japanese-to-English translator, whose output is shown only as "?"]
Bill Gates, 1997: "now we're betting the company on these natural interface technologies"
NLP is Interdisciplinary
Linguistics: models of language; emphasizes 100% accuracy
Psychology: models of cognitive processes; emphasizes biological and/or cognitive plausibility
Mathematics and statistics: properties of models; emphasizes formal aspects
vs.
NLP
Computational study of language use
Scientific: to explore the nature of linguistic communication
Engineering: to enable effective human-machine communication
Definite engineering aspect in addition to a scientific one
Emphasis on computational, not cognitive, plausibility
Models of language: 95% correct is OK
Why study NLP?
Challenging
AI-complete
Borrows from the notion of NP-completeness
To solve NLP, you'd need to solve all of the problems in AI
Turing test
Turing (1950): "Computing Machinery and Intelligence" posits that engaging effectively in linguistic behavior is a sufficient condition for having achieved intelligence.
The Turing Test
Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes
But little kids can do NLP
Why is NLP hard?
Why is NLP hard?
Ambiguity!!! at all levels of analysis
Phonetics and phonology
Concerns how words are related to the sounds that realize them
Important for speech-based systems
"I scream" vs. "ice cream"
"It's very hard to recognize speech" vs. "It's very hard to wreck a nice beach"
Why is NLP hard?
Ambiguity!!! at all levels of analysis
Morphology
Concerns how words are constructed from sub-word units
"Unionized"
Union-ized?
Un-ionized (as in chemistry)?
Why is NLP hard?
Ambiguity!!! at all levels of analysis
Syntax
Concerns sentence structure
Different syntactic structure implies different interpretation
Squad helps dog bite victim.
[NP squad] [VP helps [NP dog bite victim]]
[NP squad] [VP helps [NP dog] [INF-CLAUSE bite victim]]
Why is NLP hard?
Ambiguity!!! at all levels of analysis
Semantics
Concerns what words mean and how these meanings combine to form sentence meanings
Jack invited Mary to the Halloween ball.
A dance, or some big sphere with Halloween decorations?
Why is NLP hard?
Ambiguity!!! at all levels of analysis
Discourse
Concerns how the immediately preceding sentences affect the interpretation of the next sentence
The city council refused to give the women a permit because they feared violence.
The city council refused to give the women a permit because they advocated violence.
I'm Afraid I Can't Do That
The task seems so difficult! What resources do we need?
Knowledge about language
Knowledge about the world
An Idea
Have computers learn models of language
Statistical NLP:
Learns statistical models that capture language properties from a corpus (text samples)
Helps ease the knowledge acquisition bottleneck
Why is statistical language learning possible?
Usage of words exhibits statistical regularities.
Probabilities are Realistic
"It's hard to recognize speech"
vs.
"It's hard to wreck a nice beach"
Which is more likely? (both are grammatical)
Applications: speech recognition, handwriting recognition, spelling correction, ...
General problem in statistical NLP: density estimation
P("It's hard to recognize speech")
P("It's hard to wreck a nice beach")
No, Really, It's a Crazy Idea
Late '50s to '80s: statistical NLP in disfavor
"It is fair to assume that neither sentence
(1) Colorless green ideas sleep furiously
nor
(2) Furiously sleep ideas green colorless
has ever occurred ... Hence, in any statistical model these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not." [Chomsky 1957]
Who Are You Calling Crazy?
"I don't believe in this statistics stuff"
"That's not learning, that's statistics"
"Knowledge-intensive NLP is going nowhere fast"
"Every time I fire a linguist, my performance goes up"
Which sentence is more likely?
"It's hard to recognize speech"
vs.
"It's hard to wreck a nice beach"
Statistical approach: density estimation
P("It's hard to recognize speech")
P("It's hard to wreck a nice beach")
Estimate these probabilities from a corpus (text sample):
Count the number of times the sentence appears in the corpus
Divide the count by the total number of sentences in the corpus
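A minimal Python sketch of this counting estimator (all names hypothetical), assuming the corpus is given as a list of sentence strings:

from collections import Counter

def sentence_probability(sentence, corpus_sentences):
    """Naive density estimation: the relative frequency of the
    whole sentence among all sentences in the corpus."""
    counts = Counter(corpus_sentences)
    return counts[sentence] / len(corpus_sentences)

corpus = ["it's hard to recognize speech",
          "it's hard to recognize speech",
          "it's hard to wreck a nice beach"]
print(sentence_probability("it's hard to recognize speech", corpus))  # 2/3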
Is there any problem with this approach?
Solution 1: Use a larger corpus
Many sentences may still not appear even in a larger corpus; their probability will be zero!
Problems
We may not be able to find these sentences even in a very large corpus
Even if we do, each of them may appear only once or twice
Solution 2: Use a language model
A language model assigns a probability to a sentence
How?
A Simple Two-Step Approach
Goal: assign a probability to a sentence
"Let's take a talk."
Step 1: Compute the probability of each word in the sentence using the previous N-1 words.
Assume N=3. Compute
P(let's | 2nd-prev-word=null, prev-word=null)
P(take | 2nd-prev-word=null, prev-word=let's)
P(a | 2nd-prev-word=let's, prev-word=take)
P(talk | 2nd-prev-word=take, prev-word=a)
A Simple Two-Step Approach
How to compute P(talk | 2nd-prev-word=take, prev-word=a)?
Collect statistics from the corpus!
Count the number of times we see "take a talk" in the corpus
Count the number of times we see "take a" in the corpus
Divide these two numbers
Now we know how to compute the probability of each word
Step 2: Multiply the probabilities of the words to get the probability of the sentence (see the sketch below)
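A minimal Python sketch of the two-step approach (function and variable names hypothetical), assuming the corpus is a list of whitespace-tokenized sentence strings:

from collections import Counter

def train_ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word contexts, padding each
    sentence with n-1 null markers on the left."""
    ngrams, contexts = Counter(), Counter()
    for sent in sentences:
        words = ["<null>"] * (n - 1) + sent.split()
        for i in range(n - 1, len(words)):
            ngrams[tuple(words[i - n + 1:i + 1])] += 1
            contexts[tuple(words[i - n + 1:i])] += 1
    return ngrams, contexts

def sentence_probability(sentence, ngrams, contexts, n=3):
    """Step 1: P(word | previous n-1 words) by dividing counts.
    Step 2: multiply the per-word probabilities together."""
    words = ["<null>"] * (n - 1) + sentence.split()
    prob = 1.0
    for i in range(n - 1, len(words)):
        gram = tuple(words[i - n + 1:i + 1])
        if contexts[gram[:-1]] == 0:
            return 0.0  # unseen context: the estimate is zero
        prob *= ngrams[gram] / contexts[gram[:-1]]
    return prob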
An Example
P(Let's take a talk)
= P(Let's | null, null) * P(take | null, Let's)
* P(a | Let's, take) * P(talk | take, a)
Does language modeling solve the problems of
(1) not seeing a sentence in a corpus at all?
(2) not seeing a sentence frequently enough?
Problems Solved???
To some extent
More likely to be able to find short word sequences than long word sequences in a corpus
Still, there is no guarantee that we will be able to find "Let's take a"
If we cannot, the probability of the sentence will be zero, even if the sentence is sensible
Solution 3: Use a Language Model with Small N
Use N=2
P(Let's take a talk)
= P(Let's | null) * P(take | Let's) * P(a | take) * P(talk | a)
Use N=1
P(Let's take a talk)
= P(Let's) * P(take) * P(a) * P(talk)
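Continuing the earlier sketch (same hypothetical helpers), these smaller models are obtained just by changing n:

# Bigram (N=2) and unigram (N=1) models, reusing the
# train_ngram_counts / sentence_probability sketch above.
corpus = ["let's take a walk", "let's take a talk"]
ngrams2, contexts2 = train_ngram_counts(corpus, n=2)
print(sentence_probability("let's take a talk", ngrams2, contexts2, n=2))
ngrams1, contexts1 = train_ngram_counts(corpus, n=1)
print(sentence_probability("let's take a talk", ngrams1, contexts1, n=1))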
Problems Solved???
To a larger extent
It is less likely, though not impossible, for word sequences of length one or two to be missing from a corpus
Other problems?
Comparing Language Models
Is a language model with N=3 better than one with N=2?
If so, how do we compare them?
Generate a sentence using the language model
Generate each word from left to right
At each point, we are in a different state
Roll a die to determine which word to output
Example
To generate "Let's go outside and take a talk" with N=3:
Current state: <null, null>. Roll a die that generates "Let's" with probability P(Let's | null, null)
Current state: <null, Let's>. Roll a die that generates "go" with probability P(go | null, Let's)
Current state: <Let's, go>. Roll a die that generates "outside" with probability P(outside | Let's, go)
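A minimal Python sketch of this left-to-right sampling loop (names hypothetical), building trigram successor counts like those above:

import random
from collections import Counter, defaultdict

def sample_sentence(sentences, max_len=20):
    """Generate a sentence from a trigram model: at each state
    (the two previous words), roll a die weighted by the counts
    of the words that followed that state in the corpus."""
    successors = defaultdict(Counter)
    for sent in sentences:
        words = ["<null>", "<null>"] + sent.split() + ["<end>"]
        for i in range(2, len(words)):
            successors[(words[i - 2], words[i - 1])][words[i]] += 1
    state, output = ("<null>", "<null>"), []
    for _ in range(max_len):
        choices = successors[state]
        if not choices:
            break
        word = random.choices(list(choices), weights=list(choices.values()))[0]
        if word == "<end>":
            break
        output.append(word)
        state = (state[1], word)
    return " ".join(output)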
Experimental Results
Corpus: Complete works of Shakespeare
N=1: Will rash been and by I the me loves gentle me not
slavish page, the and hour; ill let
N=2: What means, sir. I confess she? Then all sorts, he is
trim, captain.
N=3: Fly, and will rid me these news of price. Therefore
the sadness of parting, as they say, tis done.
N=4: They say all lovers swear more performance than
they are wont to keep obliged faith unforfeited!
Solution 4: Smoothing
Goal: make sure no N-gram (i.e., word sequence of length N) has zero probability
Currently, P(take a talk) =
(number of times "take a talk" is seen in corpus) / (number of 3-grams in corpus)
After smoothing, P(take a talk) =
(number of times "take a talk" is seen in corpus + 1) / (number of 3-grams in corpus + number of distinct 3-grams)
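A minimal Python sketch of this add-one scheme (names hypothetical), following the slide's formula for the probability of a 3-gram:

from collections import Counter

def smoothed_trigram_probability(trigram, sentences):
    """Add-one smoothing: every 3-gram, seen or unseen, gets a
    nonzero probability: (count + 1) / (total + distinct)."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    total = sum(counts.values())
    distinct = len(counts)
    return (counts[trigram] + 1) / (total + distinct)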
Linguistics vs. Statistics
"It is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular."
--- The Mathematics of Statistical Machine Translation [Brown, Della Pietra, Della Pietra, and Mercer, 1993]
Summary
Different NLP tasks require the collection of different statistics from a text corpus. Just need to be creative.
Language learning is in general a difficult task due to ambiguities at different levels, and is complicated by the fact that there are almost always exceptions to a general rule
It is relatively easy to start doing research in NLP: only basic knowledge of statistics and linguistics is needed
Some Famous Imitation Games
1960s: ELIZA
1990s: ALICE
Loebner Prize: win $100,000 if you pass the Turing test
ELIZA: Impersonating a Psychotherapist
1960s: ELIZA (Joseph Weizenbaum)
[Link]
Eliza: Hello. I am ELIZA. How can I help you?
You: Well I feel sad
Eliza: Do you often feel sad?
You: not very often, but it's becoming more common
Eliza: Please go on.
ALICE
Human: hi my name is Carla
ALICE: How are you doing, Carla.
Human: what is your name?
ALICE: My name is ALICE.
Human: what is your full name?
ALICE: ALICE = Artificial Linguistic Internet Computer Entity.
Human: when were you born?
ALICE: I was connected to the net on November 23, 1995.
Human: Have you passed the Loebner Prize?
ALICE: The Loebner Prize is an annual Turing Test sponsored
by Hugh Loebner. Dr. Wallace and ALICE won it in 2000. The
ALICE program was ranked "most human computer" by the
panel of judges.
Human: who is Dr. Wallace?
ALICE: He is a famous computer scientist, the author of ALICE,
and a graduate of Carnegie Mellon.
Human: Where does Dr. Wallace work?
ALICE: he works from his home in San Francisco.
The Loebner Contest
A modern version of the Turing Test, held annually, with a $100,000 cash prize.
Restricted topic and limited time.
Participants include a set of humans, a set of computers, and a set of judges.
Scoring
Rank from least human to most human.
If better than a human, win $100,000. (Nobody has won yet)
Morphological Segmentation
Segmentation of words into prefixes, suffixes, and roots.
unfriendly = un + friend + ly
Some words do not have a prefix
validate = valid + ate
Some words do not have a suffix
devalue = de + value
How to automatically segment a word by computing statistics from a corpus?
Morphological Segmentation
Input: text corpus
Output: segmented words

Word           Frequency   Segmentation
aback          157         aback
abacus         6           abacus
abacuses       3           abacus+es
abalone        77          abalone
abandon        2781        abandon
abandoned      4696        abandon+ed
abandoning     1082        abandon+ing
abandonment    378         abandon+ment
abandonments   23          abandon+ment+s
abandons       117         abandon+s
...
A Word Segmentation Algorithm
Basic idea:
1. Learn prefixes, suffixes, and roots from the corpus
2. Segment the words using the learned prefixes and suffixes
A Word Segmentation Algorithm
Let V be the vocabulary (i.e., the set of words in the corpus)
Let A and B be two character sequences.
Let AB be the concatenation of A and B.
Prefix and suffix learning algorithm:
AB and A in V → B is a suffix ("singing" and "sing" → "ing" is a suffix)
AB and B in V → A is a prefix ("preset" and "set" → "pre" is a prefix)
A Word Segmentation Algorithm
Problem: the assumption does not always hold
"diverge" and "diver" are in V → "ge" is a suffix. Wrong!
Many of the learned prefixes and suffixes are erroneous
Solution: score each learned prefix and suffix and retain only those whose scores are above a pre-defined threshold (see the sketch below)
After learning, we can try to use them to segment words.
Suppose we learn that "ate" is a suffix. Then:
candidate = candid + ate
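A minimal Python sketch of the suffix-learning step (names hypothetical; the support-count score is an assumption, since the slides do not specify the scoring function). Prefix learning is symmetric: if AB and B are both in V, propose A as a prefix.

from collections import Counter

def learn_suffixes(vocabulary, threshold=5):
    """If AB and A are both words, propose B as a suffix; score
    each proposal by the number of word pairs supporting it and
    keep those above a pre-defined threshold."""
    vocab = set(vocabulary)
    scores = Counter()
    for word in vocab:
        for i in range(1, len(word)):
            stem, rest = word[:i], word[i:]
            if stem in vocab:
                scores[rest] += 1  # e.g., "singing"/"sing" supports "ing"
    return {s for s, c in scores.items() if c >= threshold}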
Determining Most Frequent Part-of-Speech
Task: determine the most frequent POS of a word
a: DET
buy: VERB
mother: NOUN
beautiful: ADJECTIVE
beautifully: ADVERB
carry: VERB
Useful for part-of-speech tagging
Too time-consuming to do this by hand, so let's learn
Determining Most Frequent Part-of-Speech
Approach:
Group words that are likely to have the same POS together (to form, e.g., 100 groups)
Hand-label each group with a POS tag
How to generate groups of words with similar POS?
Idea: use contextual information
The boy is going to the library.
The lady went to the market.
("boy" and "lady" are nouns)
The left word and/or the right word are useful indicators.
Determining Most Frequent Part-of-Speech
Create a profile for each word w in the vocabulary that tells us whether a word has ever appeared to the left/right of w.
Example profile for "boy":
the-left: yes, a-left: yes, happy-left: no, cry-left: no
the-right: no, a-right: no, happy-right: no, cry-right: no
Compare profiles
Words with similar profiles tend to have the same POS?
Determining Most Frequent Part-of-Speech
Profiles too big
Use only the N most frequent left-words and N most frequent right-words
Determiners are more likely to remain in the profile than verbs, for instance
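A minimal Python sketch of profile construction and comparison (names hypothetical; Jaccard overlap is an assumed similarity measure, as the slides do not name one):

from collections import Counter

def build_profiles(sentences, n_context=50):
    """For each word, record which of the most frequent words have
    ever appeared immediately to its left or right."""
    tokens = [w for s in sentences for w in s.split()]
    frequent = {w for w, _ in Counter(tokens).most_common(n_context)}
    profiles = {}
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            profile = profiles.setdefault(w, set())
            if i > 0 and words[i - 1] in frequent:
                profile.add(("left", words[i - 1]))
            if i + 1 < len(words) and words[i + 1] in frequent:
                profile.add(("right", words[i + 1]))
    return profiles

def similarity(p1, p2):
    """Jaccard overlap between two profiles."""
    return len(p1 & p2) / len(p1 | p2) if p1 | p2 else 0.0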
Identifying Semantically Similar Nouns
Task: identify/group nouns that are semantically similar
How do we know that "boy" is more similar to words like "girl", "man", "woman", and "individual" than to "car", "ship", "aeroplane", etc.?
Idea: use contextual information
Similar words tend to occur in similar contexts
What kind of context is useful to capture?
Identifying Semantically Similar Nouns
For each noun, collect all verbs for which the noun can serve as the subject
boy: speak, play, cry, laugh, jump, ...
Capture context using the governing verbs
The profile for each noun consists of these verbs
Compare profiles
Words with similar profiles tend to be semantically similar?
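A minimal Python sketch (names hypothetical), assuming the corpus has already been parsed into (subject, verb) pairs:

from collections import defaultdict

def noun_profiles(subject_verb_pairs):
    """Profile each noun by the set of verbs it has appeared
    with as the subject."""
    profiles = defaultdict(set)
    for noun, verb in subject_verb_pairs:
        profiles[noun].add(verb)
    return profiles

pairs = [("boy", "play"), ("boy", "cry"), ("girl", "play"),
         ("girl", "laugh"), ("car", "accelerate")]
profiles = noun_profiles(pairs)
# Compare profiles with any set similarity, e.g., Jaccard:
boy, girl = profiles["boy"], profiles["girl"]
print(len(boy & girl) / len(boy | girl))  # 1/3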
Pronoun Resolution
Task: find the noun phrase to which "it" refers
They know full well that companies held tax money aside for collection later on the basis that the government said it1 was going to collect it2.
Given a corpus, what kind of statistics can we collect that can help us resolve occurrences of "it" correctly?
it1 is the subject of "collect"
it2 is the object of "collect"
Pronoun Resolution
Using the corpus, compute the number of times each noun phrase in the paragraph serves as the subject of "collect"
The ones that have high counts are likely to be the referent of it1
Similarly, compute the number of times each noun phrase in the paragraph serves as the object of "collect"
The ones that have high counts are likely to be the referent of it2
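A minimal Python sketch of this counting idea (names hypothetical), assuming the corpus is available as (subject, verb, object) triples:

from collections import Counter

def rank_candidates(candidates, verb, triples, role="subject"):
    """Score each candidate noun phrase by how often it fills the
    given role (subject or object) of the verb in the corpus."""
    idx = 0 if role == "subject" else 2
    counts = Counter(t[idx] for t in triples if t[1] == verb)
    return sorted(candidates, key=lambda np: counts[np], reverse=True)

triples = [("government", "collect", "tax"),
           ("government", "collect", "money"),
           ("companies", "pay", "tax")]
print(rank_candidates(["companies", "government"], "collect", triples))
# ['government', 'companies']  ->  likely referent of it1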
Supervised Learning
Learning from an annotated text corpus
E.g., a corpus annotated with part-of-speech tags
Human knowledge is encoded in the form of annotations
Machine learning algorithms can be used to learn from annotated corpora
Supervised methods typically outperform unsupervised methods
Learning for a Resource-Scarce Language
Project annotations from a resource-rich language to a resource-scarce language
[Diagram: NP bracketings projected across a pair of aligned sentences: [NP That] perhaps was [NP the happiest moment] of [NP his life].]
Learning
I like candy
I candy like
F-measure
MUC scoring program
F-measure := harmonic mean of recall and precision: F = 2 * P * R / (P + R)
Recall: % of coref links correctly found by the system; a measure of coverage; want high recall
Precision: % of coref links found by the system that are correct; a measure of accuracy; want high precision
Want a high F-measure
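A minimal sketch of the harmonic mean in Python (names hypothetical):

def f_measure(recall, precision):
    """Harmonic mean of recall and precision; 0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.8, 0.6))  # ~0.686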
NLP is Challenging
It is often said that NLP is AI-complete:
All the difficult problems in artificial intelligence manifest themselves in NLP problems.
This idea dates back at least to the Turing Test:
"The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include" [Turing, Computing Machinery and Intelligence, 1950]
NLP is Cross-Disciplinary
Excellent opportunities for interdisciplinary work
Linguistics: models of language; emphasizes 100% accuracy
Psychology: models of cognitive processes; emphasizes biological/cognitive plausibility
Mathematics and statistics: properties of models; emphasizes formal aspects
On the whole, NLP tends to be applications-oriented
95% is OK
Models need be neither biologically plausible nor mathematically satisfying
Statistical NLP
Statistical NLP: Infer language properties from text samples
Helps ease the knowledge acquisition bottleneck
The Turing Test
Three rooms contain a person, a computer, and an interrogator.
The interrogator can communicate with the other two by teleprinter.
The interrogator tries to determine which is the person and which is the machine.
The machine tries to fool the interrogator into believing that it is the person.
If the machine succeeds, then we conclude that the machine has exhibited intelligence.
Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes