CSCI 5832 Natural Language Processing: Jim Martin

This document discusses natural language processing and statistical sequence classification techniques. It reviews hidden Markov models (HMMs) for part-of-speech (POS) tagging and introduces maximum entropy Markov models (MEMMs). It explains how HMMs, MaxEnt models, and MEMMs can be used for statistical sequence classification tasks like POS tagging. The document also covers concepts like transition probabilities, observation probabilities, the Viterbi algorithm, and entropy as it relates to uncertainty in language modeling.

Uploaded by

Eman Asem

CSCI 5832

Natural Language Processing

Jim Martin
Lecture 9

10/11/20 1
Today 2/19

• Review HMMs for POS tagging
• Entropy intuition
• Statistical sequence classifiers
  - HMMs
  - MaxEnt
  - MEMMs

Statistical Sequence Classification
• Given an input sequence, assign a label (or tag) to each element of the sequence
  - Or... given an input tape, write a tag out to an output tape for each cell on the input tape
• Can be viewed as a classification task if we view
  - The individual cells on the input tape as things to be classified
  - The tags written on the output tape as the class labels

POS Tagging as Sequence Classification
• We are given a sentence (an “observation” or “sequence of observations”)
  - Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Statistical Sequence Classification
• We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest.
• Hat ^ means “our estimate of the best one”
• argmax_x f(x) means “the x such that f(x) is maximized”

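The equation that went with the bullets on the hat and argmax notation was an image and did not survive extraction. Reconstructed from those bullets (the shorthand $t_1^n$ for t1…tn and $w_1^n$ for w1…wn is an assumption), it presumably reads:

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
```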
Road to HMMs
• This equation is guaranteed to give us the best tag sequence
• But how to make it operational? How to compute this value?
• Intuition of Bayesian classification:
  - Use Bayes rule to transform it into a set of other probabilities that are easier to compute

Using Bayes Rule
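The body of this slide was an equation image that was lost in extraction. A reconstruction of the usual Bayes-rule step, assuming the shorthand $t_1^n$ for the tag sequence and $w_1^n$ for the word sequence:

```latex
\begin{aligned}
\hat{t}_1^n &= \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \\
            &= \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} \\
            &= \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
\end{aligned}
```

The denominator can be dropped because $P(w_1^n)$ is the same for every candidate tag sequence.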

Likelihood and Prior
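The equations for this slide are also missing. A sketch of the likely content, assuming the two standard HMM simplifications (each word depends only on its own tag; each tag depends only on the previous tag):

```latex
\begin{aligned}
\text{likelihood:}\quad P(w_1^n \mid t_1^n) &\approx \prod_{i=1}^{n} P(w_i \mid t_i) \\
\text{prior:}\quad P(t_1^n) &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \\
\hat{t}_1^n &\approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{aligned}
```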

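Transition Probabilities
• Tag transition probabilities p(ti|ti-1)
  - Determiners are likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
    - So we expect P(NN|DT) and P(JJ|DT) to be high
  - Compute P(NN|DT) by counting in a labeled corpus:
    $P(\text{NN} \mid \text{DT}) = \frac{C(\text{DT}, \text{NN})}{C(\text{DT})}$
    where C(DT, NN) is the number of times DT is immediately followed by NN.
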
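Observation Probabilities
• Word likelihood probabilities p(wi|ti)
  - VBZ (3sg pres verb) likely to be “is”
  - Compute P(is|VBZ) by counting in a labeled corpus:
    $P(\text{is} \mid \text{VBZ}) = \frac{C(\text{VBZ}, \text{is})}{C(\text{VBZ})}$
    where C(VBZ, is) is the number of times “is” is tagged VBZ.
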
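A minimal sketch of estimating both tables by counting, assuming the labeled corpus is a list of sentences of (word, tag) pairs. The function name `estimate_hmm`, the start marker `<s>`, and the two-sentence toy corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Return (transition, emission) relative-frequency estimates."""
    tag_counts = Counter()         # C(tag)
    prev_counts = Counter()        # C(tag) counted only where a next tag follows
    transition_counts = Counter()  # C(prev_tag, tag)
    emission_counts = Counter()    # C(tag, word)
    for sentence in tagged_sentences:
        prev = "<s>"               # assumed sentence-start marker
        for word, tag in sentence:
            prev_counts[prev] += 1
            transition_counts[(prev, tag)] += 1
            emission_counts[(tag, word.lower())] += 1
            tag_counts[tag] += 1
            prev = tag
    # Dividing by prev_counts keeps each row of the transition table summing to 1.
    transition = {(p, t): c / prev_counts[p] for (p, t), c in transition_counts.items()}
    emission = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transition, emission

corpus = [
    [("That", "DT"), ("flight", "NN")],
    [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")],
]
transition, emission = estimate_hmm(corpus)
print(transition[("DT", "NN")], emission[("NN", "hat")])
```

On a corpus this small the estimates are of course meaningless; the point is only the shape of the computation.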
An Example: the verb “race”
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?

Disambiguating “race”
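The figure for this slide is missing. As a rough sketch of the comparison it presumably made, the code below scores the two candidate tags for “race” in “to/TO race/?? tomorrow/NR” using only the terms that differ between the two taggings. All probability values here are illustrative assumptions, not numbers recovered from the slide.

```python
# Illustrative probabilities only: assumed values, not taken from the slide.
transition = {
    ("TO", "VB"): 0.83, ("TO", "NN"): 0.00047,
    ("VB", "NR"): 0.0027, ("NN", "NR"): 0.0012,
}
emission = {("VB", "race"): 0.00012, ("NN", "race"): 0.00057}

def score(tag, prev_tag="TO", next_tag="NR", word="race"):
    """P(tag | prev_tag) * P(word | tag) * P(next_tag | tag)."""
    return transition[(prev_tag, tag)] * emission[(tag, word)] * transition[(tag, next_tag)]

scores = {tag: score(tag) for tag in ("VB", "NN")}
best_tag = max(scores, key=scores.get)
print(scores, best_tag)
```

With these assumed numbers the verb reading wins, mostly because P(VB|TO) is far larger than P(NN|TO).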

Example

Markov chain = “First-order Observable Markov Model”
• A set of states
  - Q = q1, q2…qN; the state at time t is qt
• Transition probabilities:
  - a set of probabilities A = a01a02…an1…ann.
  - Each aij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
• Current state only depends on previous state
  $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

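A small sketch of what the first-order assumption buys: the probability of a whole state sequence is just a product of one start probability and one transition probability per step. The two-state chain and every number in it are made up for illustration, and a separate `initial` table stands in for the a0j start transitions.

```python
# Hypothetical two-state chain; all probabilities are invented for illustration.
initial = {"hot": 0.6, "cold": 0.4}          # plays the role of the a0j entries
A = {
    "hot": {"hot": 0.7, "cold": 0.3},
    "cold": {"hot": 0.4, "cold": 0.6},
}

def sequence_probability(states):
    """P(q1 ... qn) = P(q1) * product over i of P(qi | qi-1)."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

p = sequence_probability(["hot", "hot", "cold"])
print(p)  # 0.6 * 0.7 * 0.3
```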
Hidden Markov Models
• States Q = q1, q2…qN
• Observations O = o1, o2…oN
  - Each observation is a symbol from a vocabulary V = {v1, v2, … vV}
• Transition probabilities
  - Transition probability matrix A = {aij}
    $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$
• Observation likelihoods
  - Output probability matrix B = {bi(k)}
    $b_i(k) = P(X_t = o_k \mid q_t = i)$
• Special initial probability vector π
    $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

Transitions between the hidden states of HMM, showing A probs

B observation likelihoods for POS HMM

The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi intuition: we are looking for the best ‘path’
Position:        S1         S2   S3          S4       S5
Word:            promised   to   back        the      bill
Candidate tags:  VBD, VBN   TO   VB, JJ, RB  DT, NNP  NN, VB

The Viterbi Algorithm
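The algorithm figure for this slide is missing. Below is a minimal sketch of Viterbi decoding, assuming the HMM is stored as nested dictionaries (`start_p` for π, `trans_p` for A, `emit_p` for B). The three-tag toy model and its numbers are hypothetical; real implementations usually work in log space to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, best_prob) for the observation sequence."""
    # v[t][s]: probability of the best path ending in state s at time t
    v = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for s in states:
            # Best predecessor: max (not sum) over previous states
            best_prev = max(states, key=lambda r: v[t - 1][r] * trans_p[r][s])
            v[t][s] = (v[t - 1][best_prev] * trans_p[best_prev][s]
                       * emit_p[s].get(observations[t], 0.0))
            backpointer[t][s] = best_prev
    # Follow backpointers from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Hypothetical toy model; every number is invented for illustration.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"race": 0.6, "bill": 0.4},
    "VB": {"race": 0.7, "back": 0.3},
}
path, prob = viterbi(["the", "race"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Even though “race” is slightly more likely under VB than NN in this toy B table, the DT-to-NN transition dominates, so the best path tags it NN.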

Viterbi example

Information Theory
• Who is going to win the World Series next year?
• Well there are 30 teams. Each has a chance, so there’s a 1/30 chance for any team…? No.
  - Rockies? Big surprise, lots of information
  - Yankees? No surprise, not much information

Information Theory
• How much uncertainty is there when you don’t know the outcome of some event (answer to some question)?
• How much information is to be gained by knowing the outcome of some event (answer to some question)?

Aside on logs
• Base doesn’t matter. Unless I say otherwise, I mean base 2.
• Probabilities lie between 0 and 1. So log probabilities are negative (or zero) and range from 0 (log 1) to –infinity (log 0).
• The – is a pain so at some point we’ll make it go away by multiplying by -1.

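A small sketch tying the World Series intuition to negated base-2 logs: the quantity -log2 p is large for unlikely outcomes (big surprise) and small for likely ones. The name `surprisal` and the two non-uniform probabilities are assumptions for illustration; only the 1/30 figure comes from the slides.

```python
import math

def surprisal(p):
    """Information, in bits, gained by observing an outcome of probability p."""
    return -math.log2(p)

uniform = surprisal(1 / 30)   # every one of the 30 teams equally likely
long_shot = surprisal(0.01)   # hypothetical probability for an unlikely winner
favorite = surprisal(0.25)    # hypothetical probability for a favorite
print(uniform, long_shot, favorite)
```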
Entropy
• Let’s start with a simple case, the probability of word sequences with a unigram model
• Example
  - S = “One fish two fish red fish blue fish”
  - P(S) = P(One)P(fish)P(two)P(fish)P(red)P(fish)P(blue)P(fish)
  - Log P(S) = Log P(One) + Log P(fish) + … + Log P(fish)

Entropy cont.
• In general that’s
  $\log P(S) = \sum_{w \in S} \log P(w)$
• But note that
  - the order doesn’t matter
  - that words can occur multiple times
  - and that they always contribute the same each time
  - so rearranging…
  $\log P(S) = \sum_{w \in V} \mathrm{Count}(w) \log P(w)$

Entropy cont.
• One fish two fish red fish blue fish
• Fish fish fish fish one two red blue
  $\log P(S) = 4 \log P(\text{fish}) + 1 \log P(\text{one}) + 1 \log P(\text{two}) + 1 \log P(\text{red}) + 1 \log P(\text{blue})$

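A quick check of the rearrangement, summing log probabilities once per token and once per word type weighted by its count. As an assumption for the sake of having numbers, the unigram probabilities are estimated from the sentence itself.

```python
import math
from collections import Counter

tokens = "one fish two fish red fish blue fish".split()
counts = Counter(tokens)
# Assumption: P(w) is the relative frequency of w in this one sentence.
prob = {w: c / len(tokens) for w, c in counts.items()}

# Sum over positions in S
by_token = sum(math.log2(prob[w]) for w in tokens)
# Sum over the vocabulary, weighted by Count(w)
by_type = sum(counts[w] * math.log2(prob[w]) for w in counts)
print(by_token, by_type)
```

Both sums give the same value, since reordering the words only reorders the terms.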
Entropy cont.
• Now let’s divide both sides by N, the length of the sequence:
  $\frac{1}{N} \log P(S) = \frac{1}{N} \sum_{w \in V} \mathrm{Count}(w) \log P(w)$
• That’s basically an average of the logprobs

Entropy
• Now assume the sequence is really really long.
• Moving the N into the summation you get
  $\sum_{w \in V} \frac{\mathrm{Count}(w)}{N} \log P(w)$
• Rewriting (for a long sequence, Count(w)/N approaches P(w)) and getting rid of the minus sign
  $H(S) = -\sum_{w \in V} P(w) \log P(w)$

Entropy
• Think about this in terms of uncertainty or surprise.
  - The more likely a sequence is, the lower the entropy. Why?
  $H(S) = -\sum_{w \in V} P(w) \log P(w)$

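A minimal sketch of the entropy formula, assuming a distribution is given as a dict from words to probabilities. The skewed distribution reuses the relative frequencies of the fish sentence; the uniform one is added only for comparison.

```python
import math

def entropy(distribution):
    """H = -sum of P(w) * log2 P(w), in bits; zero-probability words contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

fish = {"fish": 0.5, "one": 0.125, "two": 0.125, "red": 0.125, "blue": 0.125}
uniform = {w: 0.2 for w in fish}   # same vocabulary, no word favored

print(entropy(fish), entropy(uniform))
```

The skewed distribution has lower entropy than the uniform one: when one word is very likely, there is less uncertainty about what comes next.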
Model Evaluation
• Remember the name of the game is to come up with statistical models that capture something useful in some body of text or speech.
• There are precisely a gazillion ways to do this
  - N-grams of various sizes
  - Smoothing
  - Backoff…

Model Evaluation
• Given a collection of text and a couple of models, how can we tell which model is best?
• Intuition… the model that assigns the highest probability to a set of withheld text
  - Withheld text? Text drawn from the same distribution (corpus), but not used in the creation of the model being evaluated.

Model Evaluation
• The more you’re surprised at some event that actually happens, the worse your model was.
• We want models that minimize your surprise at observed outcomes.
• Given two models and some training data and some withheld test data… which is better?

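One way to make “which is better?” concrete is to compare the average surprise (negative log2 probability per word) each model assigns to the withheld text. The sketch below assumes two add-k smoothed unigram models; the helper names, the k values, and the tiny train/held-out split are all hypothetical.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, k):
    """Add-k smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}

def cross_entropy(model, tokens):
    """Average negative log2 probability per token (lower means less surprise)."""
    return -sum(math.log2(model[w]) for w in tokens) / len(tokens)

train = "one fish two fish red fish blue fish".split()
held_out = "red fish blue fish one fish".split()   # assumed to contain only known words
vocab = sorted(set(train))

model_a = train_unigram(train, vocab, k=1.0)
model_b = train_unigram(train, vocab, k=100.0)     # heavy smoothing, nearly uniform
scores = {"a": cross_entropy(model_a, held_out),
          "b": cross_entropy(model_b, held_out)}
better = min(scores, key=scores.get)
print(scores, better)
```

Choosing the model with the lowest average surprise is the same as choosing the one that assigns the highest probability to the withheld text.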
Three HMM Problems
• Given a model and an observation sequence
  - Compute Argmax P(states | observation seq)
    - Viterbi
  - Compute P(observation seq | model)
    - Forward
  - Compute P(model | observation seq)
    - EM (magic)

Viterbi
• Given a model and an observation sequence, what is the most likely state sequence?
  - The state sequence is the set of labels assigned
  - So using Viterbi with an HMM solves the sequence classification task

Forward
• Given an HMM model and an observed sequence, what is the probability of that sequence?
  - P(sequence | Model)
  - Sum of the probabilities of all the paths in the model that could have produced that sequence
  - So...
• How do we change Viterbi to get Forward?

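One answer to the closing question, as a sketch: keep the same left-to-right table, but replace the max over previous states with a sum, and drop the backpointers. The dictionary layout (`start_p`, `trans_p`, `emit_p`) and the toy model are assumptions for illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) by summing over all state paths."""
    # alpha[s]: total probability of all paths that emit the prefix and end in s
    alpha = {s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        alpha = {
            # Sum (where Viterbi takes a max) over previous states
            s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s].get(obs, 0.0)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model; every number is invented for illustration.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"race": 0.6, "bill": 0.4},
    "VB": {"race": 0.7, "back": 0.3},
}
total = forward(["the", "race"], states, start_p, trans_p, emit_p)
print(total)
```

Here the total is the sum of the two non-zero paths, DT NN and DT VB.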
Who cares?
• Suppose I have two different HMM models extracted from some training data.
• And suppose I have a good-sized set of held-out data (not used to produce the above models).
• How can I tell which model is the better model?

Learning Models
• Now assume that you just have a single HMM model (pi, A, and B tables)
• How can I produce a second model from that model?
  - Rejigger the numbers... (in such a way that the tables still function correctly)
  - Now how can I tell if I’ve made things better?

EM
• Given an HMM structure and a sequence, we can learn the best parameters for the model without explicit training data.
  - In the case of POS tagging all you need is unlabelled text.
  - Huh? Magic. We’ll come back to this.

Generative vs. Discriminative Models
• For POS tagging we start with the question... P(tags | words) but we end up via Bayes at
  - P(words | tags) P(tags)
  - That’s called a generative model
  - We’re reasoning backwards from the models that could have produced such an output

Disambiguating “race”

Discriminative Models
• What if we went back to the start to
  - Argmax P(tags|words) and didn’t use Bayes?
  - Can we get a handle on this directly?
  - First let’s generalize to P(tags|evidence)
  - Let’s make some independence assumptions and consider the previous state and the current word as the evidence. How does that look as a graphical model?

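Written out, one reading of that independence assumption (previous tag and current word as the only evidence for each tag) is the factorization below. This is a sketch in the shorthand $t_1^n$, $w_1^n$, which is an assumption and not notation recovered from the slide.

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
\approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i)
```

Each factor conditions the tag directly on the evidence, so no P(words | tags) term appears.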
MaxEnt Tagging

MaxEnt Tagging
• This framework allows us to throw in a wide range of “features”. That is, evidence that can help with the tagging.

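A rough sketch of what “throwing in features” can look like: each position is described by a handful of binary features, and the tag distribution is a softmax over summed feature weights. The feature templates, function names, and hand-set weights are all hypothetical; a real MaxEnt tagger would learn the weights from labeled data.

```python
import math

def extract_features(words, i, prev_tag):
    """Binary features for position i; these templates are illustrative assumptions."""
    word = words[i]
    return {
        f"word={word.lower()}",
        f"prev_tag={prev_tag}",
        f"suffix2={word[-2:].lower()}",
        f"capitalized={word[0].isupper()}",
    }

def tag_distribution(features, weights, tags):
    """P(tag | features): softmax over the summed weights of the active features."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

# Hypothetical hand-set weights, keyed by (feature, tag).
weights = {
    ("prev_tag=TO", "VB"): 2.0,
    ("prev_tag=TO", "NN"): -1.0,
    ("word=race", "NN"): 0.5,
}
words = "Secretariat is expected to race tomorrow".split()
dist = tag_distribution(extract_features(words, 4, "TO"), weights, ["VB", "NN"])
print(dist)
```

Adding a new kind of evidence only means adding a feature template and its weights; the model form stays the same.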
Statistical Sequence Classification
