Introduction to maximum entropy models
Adwait Ratnaparkhi
Yahoo! Labs
Introduction
• This talk is geared for Technical Yahoos…
– Who are unfamiliar with machine learning or maximum entropy
– Who are familiar with machine learning but not maximum
entropy
– Who want to see an example of machine learning on the grid
– Who are looking for an introductory talk
• Lab session is geared for Technical Yahoos…
– Who brought their laptop
– Who want to obtain more numerical intuition about the
optimization algorithms
– Who don’t mind coding a little Perl
– Who have some patience in case things go awry!
Outline
• What is maximum entropy modeling?
• Example applications of maximum entropy framework
• Lab session
– Look at the mapper and reducer of a maximum entropy
parameter estimation algorithm
What is a maximum entropy model?
• A framework for combining evidence into a probability model
• Probability can then be used for classification, or input to
next component
– Content match: P(click | page, ad, user)
– Part-of-speech tagging: P(part-of-speech-tag | word )
– Categorization: P(category | page )
• Principle of maximum entropy is used as optimization
criterion
– Relationship to maximum likelihood optimization
– Same model form as multi-class logistic regression
Machine learning approach
• Data with labels
– Automatic: ad click
• sponsored search, content match, graphical ads
– Manual: editorial process
• page categories, part-of-speech tags
• Objective
– Training phase
• Find a (large) training set
• Use training set to construct a classifier or probability model
– Test phase
• Find an (honest) test set
• Use model to assign labels to previously unseen data
Why machine learning? Why maxent?
• Why machine learning?
– Pervasive ambiguity
• Any task with natural language features
– For these tasks, it is easier to collect and annotate data than to hand-code an expert system
– For certain web tasks, annotation is free (e.g., clicks)
• Why maxent?
– Little restriction on kinds of evidence
• No independence assumption
– Works well with sparse features
– Works well in parallel settings, like Hadoop
– Appeal of maxent interpretation
Sentence Boundary Detection
• Which “.” denotes a sentence boundary?
Mr. B. Green from X.Y.Z. Corp. said that the U.S. budget deficit hit $1.42 trillion for the year that ended Sept. 30. The previous year’s deficit was $459 billion.
(The correct boundary is the “.” after “Sept. 30”.)
• Model: P({yes, no} | candidate boundary)
Part-of-speech tagging
• What is the part of speech of flies ?
– Fruit flies like a banana.
– Time flies like an arrow.
• Model: P(tag | current word, surrounding words, …)
– End goal is sequence of tags
Content match ad clicks
P(click|page, ad, user)
Ambiguity resolution: an artificial example
• Tagging unknown, or “out-of-vocabulary”, words:
– Given word, predict POS tag based on spelling features
• Assume all words are either:
– Proper Noun (NNP)
– Verb Gerund (VBG)
• Find a model: P(tag | word) where tag is in {NNP,VBG}
• Assume a training set
– Editorially derived (word, tag) pairs
– Training data tags are { NNP, VBG }
– Reminder: artificial example!
Ambiguity resolution (cont’d)
• Ask the data
– q1: Is the first letter of the word capitalized?
– q2: Does the word end in ing?
• Evidence for (q1, NNP)
– Clinton, Bush, Reagan as NNP
• Evidence for (q2, VBG)
– trading, offering, operating as VBG
• Rules based on q1,q2 can conflict
– Boeing
• How to choose in ambiguous cases?
Features represent evidence
• a = what we are predicting
• b = what we observe
• Terminology
– Here, the input to a feature is the (a,b) pair
– But in much of the ML literature, the input to a feature is just b
$$f_{y,q}(a,b) = \begin{cases} 1 & \text{if } a = y \text{ and } q(b) = \text{true} \\ 0 & \text{otherwise} \end{cases}$$
$$f_1(a,b) = 1 \text{ if } a = \mathrm{NNP} \text{ and } q_1(b) = \text{true}$$
$$f_2(a,b) = 1 \text{ if } a = \mathrm{VBG} \text{ and } q_2(b) = \text{true}$$
Combine the features: probability model
$$p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}$$
$$Z(b) = \sum_{a'} \prod_{j=1}^{k} \alpha_j^{f_j(a',b)}$$
$$f_j(a,b) \in \{0,1\}: \text{features} \qquad \alpha_j > 0: \text{parameters}$$
• Probability is product of weights of active features
• Why this form?
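To make the model form concrete, here is a minimal Python sketch that scores the two-feature example from the previous slides; the α values below are made up for illustration and would normally come from training.

def q1(word):                 # capitalization question from the earlier slide
    return word[0].isupper()

def q2(word):                 # "ends in -ing" question
    return word.endswith("ing")

# Indicator features pair an outcome with a question, as in f_{y,q}(a, b).
FEATURES = [("NNP", q1), ("VBG", q2)]

# Hypothetical parameter values (alpha_j > 0); real values come from training.
ALPHAS = [5.0, 3.0]

def score(tag, word):
    """Unnormalized product of the weights of the active features."""
    s = 1.0
    for (a, q), alpha in zip(FEATURES, ALPHAS):
        if tag == a and q(word):
            s *= alpha
    return s

def p(tag, word, tags=("NNP", "VBG")):
    """p(a | b) = score(a, b) / Z(b)."""
    z = sum(score(t, word) for t in tags)
    return score(tag, word) / z

print(p("NNP", "Boeing"), p("VBG", "Boeing"))   # 0.625 and 0.375 with these weights

With these made-up weights the capitalization feature outweighs the -ing feature, so Boeing comes out as NNP, mirroring the worked example on the next slide.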
Probabilities for ambiguity resolution
$$p(\mathrm{NNP} \mid \text{Boeing}) = \frac{1}{Z}\,\alpha_1^{f_1(a,b)}\,\alpha_2^{f_2(a,b)} = \frac{1}{Z}\,\alpha_1$$
$$p(\mathrm{VBG} \mid \text{Boeing}) = \frac{1}{Z}\,\alpha_1^{f_1(a,b)}\,\alpha_2^{f_2(a,b)} = \frac{1}{Z}\,\alpha_2$$
• How do we find optimal parameter values?
Maximum likelihood estimation
$$Q = \left\{\, p \;\middle|\; p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)} \right\}$$
$$r(a,b) = \text{normalized frequency of } (a,b) \text{ in the training data}$$
$$L(p) = \sum_{a,b} r(a,b) \log p(a \mid b)$$
$$p_{\mathrm{ML}} = \arg\max_{p \in Q} L(p)$$
Principle of maximum entropy (Jaynes, 1957)
• Use the probability model that is maximally uncertain w.r.t. the observed evidence
• Why? Anything else assumes a fact you have not
observed.
$$P = \{\text{models consistent with the evidence}\}$$
$$H(p) = \text{entropy of } p$$
$$p_{\mathrm{ME}} = \arg\max_{p \in P} H(p)$$
Maxent example
• Task: estimate joint distribution p(A,B)
– A is in {x,y}
– B is in {0,1}
• Define a feature f
– Assume some expected value over a training set
$$f(a,b) = 1 \text{ iff } (a = y \text{ and } b = 1),\ 0 \text{ otherwise}$$
$$E[f] = 7/10 = 0.7 \qquad \sum_{a,b} p(a,b) = 1$$

p(A,B)   B=0   B=1
A=x       ?     ?
A=y       ?    0.7
Maxent example (cont’d)
• Define entropy: $$H(p) = -\sum_{a,b} p(a,b)\log p(a,b)$$
• One way to meet the constraints: H(p) = 1.25

p(A,B)   B=0    B=1
A=x      0.05   0.2
A=y      0.05   0.7

• Maxent way to meet the constraints: H(p) = 1.35

p(A,B)   B=0    B=1
A=x      0.1    0.1
A=y      0.1    0.7
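A quick numerical check of the two tables above, in bits (log base 2); the values come out near the 1.25 and 1.35 quoted above.

import math

def entropy(cells):
    """H(p) = -sum over cells of p(a,b) * log2 p(a,b)."""
    return -sum(p * math.log2(p) for p in cells if p > 0)

one_way = [0.05, 0.2, 0.05, 0.7]   # the ad hoc table above, E[f] = 0.7
maxent  = [0.1, 0.1, 0.1, 0.7]     # remaining mass spread uniformly

print(entropy(one_way))   # ~1.26 bits
print(entropy(maxent))    # ~1.36 bits: the maxent table is the less committal one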
Conditional maximum entropy (Berger et al, 1995)
$$E_r f_j = \sum_{a,b} r(a,b)\, f_j(a,b) \quad \text{(observed expectation of } f_j\text{)}$$
$$E_p f_j = \sum_{a,b} r(b)\, p(a \mid b)\, f_j(a,b) \quad \text{(model's expectation of } f_j\text{)}$$
$$P = \{\, p \mid E_p f_j = E_r f_j,\ j = 1 \ldots k \,\}$$
$$H(p) = -\sum_{a,b} r(b)\, p(a \mid b) \log p(a \mid b)$$
$$p_{\mathrm{ME}} = \arg\max_{p \in P} H(p)$$
Duality of ML and ME
• Under ME it must be the case that:
$$p_{\mathrm{ME}}(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}$$
• ML and ME solutions are the same
– $p_{\mathrm{ME}} = p_{\mathrm{ML}}$
– ML: form is assumed without justification
– ME: constraints are assumed, form is derived
Extensions: Minimum divergence modeling
• Kullback-Leibler divergence
– Measures “distance” between 2 probability distributions
– Not symmetric!
– See (Cover and Thomas, Elements of Information Theory)
$$D(p,q) = \sum_{a,b} r(b)\, p(a \mid b) \log \frac{p(a \mid b)}{q(a \mid b)}$$
$$D(p,q) \ge 0, \qquad D(p,q) = 0 \text{ iff } p = q$$
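A direct transcription of the divergence above into Python, as a sketch; here p and q map (a, b) pairs to the conditional probabilities p(a | b) and q(a | b), and r maps each context b to its empirical probability.

import math

def kl_divergence(p, q, r):
    """D(p, q) = sum over (a,b) of r(b) * p(a|b) * log( p(a|b) / q(a|b) )."""
    total = 0.0
    for (a, b), p_ab in p.items():
        if p_ab > 0:
            total += r[b] * p_ab * math.log(p_ab / q[(a, b)])
    return total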
Extensions : Minimum divergence models (cont’d)
$$P = \{\, p \mid E_p f_j = E_r f_j,\ j = 1 \ldots k \,\}$$
$$p_{\mathrm{MD}} = \arg\min_{p \in P} D(p,q)$$
$$p_{\mathrm{MD}}(a \mid b) = \frac{q(a \mid b)\prod_{j=1}^{k} \alpha_j^{f_j(a,b)}}{\sum_{a'} q(a' \mid b)\prod_{j=1}^{k} \alpha_j^{f_j(a',b)}}$$
• Minimum divergence framework:
– Start with prior model q
– From the set of consistent models P, minimize KL divergence to q
• Parameters will reflect deviation from prior model
– Use case: prior model is static
• Same as maximizing entropy when q is uniform
• See (Della Pietra et al, 1992) for an example in language modeling
Parameter estimation (an incomplete list)
• Generalized Iterative Scaling (Darroch & Ratcliff, 1972)
– Find correction feature and constant
– Iterative updates
• Improved iterative scaling (Della Pietra et al., 1997)
• Conjugate gradient
• Sequential conditional GIS (Goodman, 2002)
• Correction-free GIS (Curran and Clark, 2003)
• GIS details:
– Define correction feature: $$f_{k+1}(a,b) = C - \sum_{j=1}^{k} f_j(a,b)$$
– Updates: $$\alpha_j^{(0)} = 1, \qquad \alpha_j^{(n)} = \alpha_j^{(n-1)} \left[\frac{E_r f_j}{E_{p^{(n-1)}} f_j}\right]^{1/C}$$
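A sketch of a single GIS iteration on in-memory data, assuming training events of the form (label, weight, questions) and features keyed by (question, label) pairs; the correction feature is omitted for brevity, and the names here are illustrative rather than the lab code.

import math
from collections import defaultdict

def gis_iteration(events, alphas, C, labels):
    """One GIS update: alpha_j <- alpha_j * (E_r f_j / E_p f_j) ** (1 / C).

    events: list of (label, weight, [questions]) training instances.
    alphas: dict mapping (question, label) feature pairs to current parameters;
    features are assumed to be selected from observed (question, label) pairs.
    C: correction constant (max number of active features per event).
    """
    observed = defaultdict(float)   # E_r f_j
    expected = defaultdict(float)   # E_p f_j under the current model
    for label, weight, questions in events:
        # Observed expectation: features that fire with the true label.
        for q in questions:
            if (q, label) in alphas:
                observed[(q, label)] += weight
        # Model expectation: p(a | b) for every possible outcome a.
        scores = {a: math.prod(alphas.get((q, a), 1.0) for q in questions)
                  for a in labels}
        z = sum(scores.values())
        for a in labels:
            p = scores[a] / z
            for q in questions:
                if (q, a) in alphas:
                    expected[(q, a)] += weight * p
    return {f: alpha * (observed[f] / expected[f]) ** (1.0 / C)
            for f, alpha in alphas.items()}

Each call to gis_iteration corresponds to one map/reduce pass in the grid version described later.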
Comparisons
• Same model form as multi-class logistic regression
• Diverse forms of evidence
• Compared to decision trees:
– Advantage: No data fragmentation
– Disadvantage: No feature construction
• Compared to naïve Bayes
– No independence assumptions
• Scales well on sparse feature sets
– Parameter estimation (GIS) is O([# of training samples] × [# of predictions] × [avg. # of features per training event])
Disadvantages
• “Perfect” predictors cause parameters to diverge
– Suppose the word “the” only occurred with the tag DT
– The estimation algorithm forces p(a|b) = 1 in order to meet the constraints
• Parameter for (the, DT) will diverge to infinity
• May beat out other parameters estimated from many examples!
• A remedy
– Gaussian priors or “fuzzy maximum entropy” (Chen &
Rosenfeld, 2000)
– Discount observed expectations
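With a Gaussian prior on the log parameters (writing $\lambda_j = \log \alpha_j$), the training objective picks up a penalty term of roughly this form, which keeps the parameters of “perfect” predictors from diverging; this is a sketch of the idea rather than the exact Chen & Rosenfeld formulation:

$$L'(p) = \sum_{a,b} r(a,b) \log p(a \mid b) - \sum_{j=1}^{k} \frac{\lambda_j^2}{2\sigma_j^2}$$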
How to specify a maxent model
• Outcomes
– What are we predicting?
• Questions
– What information is useful for predicting the outcome?
• Feature selection
– Candidate feature set consists of all (outcome, question) pairs
– Given candidate feature set, what subset do we use?
Finding the part-of-speech
• Part of Speech (POS) Tagging
– Return a sequence of POS tags
Input: Fruit flies like a banana
Output: N N V D N
Input: Time flies like an arrow
Output: N V P D N
• Train maxent models from the POS tags of the Penn Treebank
(Marcus et al, 1993)
• Use heavily pruned search procedure to find highest
probability tag sequence
Model for POS tagging
(Ratnaparkhi, 1996)
• Outcomes
– 45 POS tags (Penn Treebank)
• Question Patterns:
– common words: word identity
– rare words: presence of prefix, suffix, capitalization, hyphen,
and numbers
– previous 2 tags
– surrounding 2 words
• Feature Selection
– count cutoff of 10
A training event
• Example:
– ...stories about well-heeled communities and ...
NNS IN ???
• Outcome: JJ (adjective)
• Questions
– w[i-1]=about, w[i-2]=stories, w[i+1]=communities, w[i+2]=and,
t[i-1]=IN, t[i-2][i-1]=NNS IN,
pre[1]=w, pre[2]=we, pre[3]=wel, pre[4]=well,
suf[1]=d,suf[2]=ed,suf[3]=led,suf[4]=eled
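A sketch of how question strings like those above could be generated for position i of a tagged sentence; the helper below is illustrative (the capitalization, hyphen, and number questions are omitted) and is not the original tagger's code.

def questions_for(words, tags, i, counts, rare_cutoff=5):
    """Generate question strings for predicting the tag of words[i].

    counts maps words to training-set frequencies; words below rare_cutoff
    get the spelling questions instead of the word-identity question.
    """
    w = words[i]
    qs = [
        f"w[i-1]={words[i-1]}" if i >= 1 else "w[i-1]=*BOUNDARY*",
        f"w[i-2]={words[i-2]}" if i >= 2 else "w[i-2]=*BOUNDARY*",
        f"w[i+1]={words[i+1]}" if i + 1 < len(words) else "w[i+1]=*BOUNDARY*",
        f"w[i+2]={words[i+2]}" if i + 2 < len(words) else "w[i+2]=*BOUNDARY*",
        f"t[i-1]={tags[i-1]}" if i >= 1 else "t[i-1]=*BOUNDARY*",
        f"t[i-2][i-1]={tags[i-2]} {tags[i-1]}" if i >= 2 else "t[i-2][i-1]=*BOUNDARY*",
    ]
    if counts.get(w, 0) < rare_cutoff:
        # Rare words: spelling questions (prefixes and suffixes up to length 4).
        qs += [f"pre[{n}]={w[:n]}" for n in range(1, 5) if len(w) >= n]
        qs += [f"suf[{n}]={w[-n:]}" for n in range(1, 5) if len(w) >= n]
    else:
        qs.append(f"w[i]={w}")   # common words: the word identity itself
    return qs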
Finding the best POS sequence
$$a_1^* \ldots a_n^* = \arg\max_{a_1 \ldots a_n} \prod_{i=1}^{n} p(a_i \mid b_i)$$
• Find the maximum-probability sequence of n tags
– Use a "top K" breadth-first search:
– Tag left to right, but maintain only the top K ranked hypotheses
• Best ranked hypothesis is not guaranteed to be optimal
• Alternative: Conditional random fields (Lafferty et al,
2001)
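A sketch of the "top K" search, assuming a hypothetical scoring function p_tag(tag, words, i, history) that applies the maxent model at one position:

def beam_search(words, tagset, p_tag, K=5):
    """Tag left to right, keeping only the top-K ranked partial tag sequences."""
    beam = [([], 1.0)]                       # (tag sequence so far, probability)
    for i in range(len(words)):
        candidates = []
        for history, prob in beam:
            for tag in tagset:
                candidates.append((history + [tag],
                                   prob * p_tag(tag, words, i, history)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:K]
    return beam[0][0]                        # best-ranked, not guaranteed optimal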
Performance
Domain                                           Word accuracy   Unknown word accuracy   Sentence accuracy
English: Wall St. Journal                        96.6%           85.3%                   47.3%
Spanish: CRATER corpus (with re-mapped tagset)   97.7%           83.3%                   60.4%
Summary
• Errors:
– are typically the words that are difficult to annotate
• that, about, more
• Architecture
– Can be ported easily for similar tasks with different tag sets,
esp. named entity detection
• Name detector
– Tags = { begin_Name, continue_Name, other }
– Sequence probability can be used downstream
• Available for download
– MXPOST & MXTERMINATOR
Maxent model for Keystone content-match
• Yahoo’s Keystone content-match uses a click model
$$P(\text{click} \mid \text{page},\text{ad},\text{user}) = \frac{1}{Z(\text{page},\text{ad},\text{user})} \prod_{j} \alpha_j^{f_j(\text{click},\,\text{page},\,\text{ad},\,\text{user})}$$
• Use features of the page, ad, and user to predict the click
• Use cases:
– Select ads with the (page, ad) cosine score, then use the click model to re-rank
– Select ads directly with click model score
Maxent model for Keystone content-match
• Outcomes: click (1) or no click (0)
• Questions:
– Unigrams, phrases, categories on the page side
– Unigrams, phrases, categories on the ad side
– The user’s BT category
• Feature selection
– Count cutoff, mutual information
Some recent work
• Recent work
– Using the (page,ad) cosine score (Opal) as a feature
– Using page domain → ad bid phrase mappings
– Using user BT → ad bid phrase mappings
– Using user age+gender → bid phrase mappings
• Contact
– Andy Hatch (aohatch)
– Abraham Bagherjeiran (abagher)
Maxent on the grid
• Several grid maxent implementations at Yahoo
• Correction-Free GIS (Curran and Clark, 2003)
– Implemented for verification and lab session
– Not product-ready code!
• Input for each iteration
– Training data format:
[label] [weight] [q1 … qN]
– Feature set format:
[q] [label] [parameter]
• Output for each iteration
– New feature set format:
[q] [label] [new parameter]
Maxent on the grid (cont’d)
• Map phase (parallelization across training data)
– Collect observed feature expectations
– Collect model’s feature expectations w.r.t. current model
– Use feature name as key
• Reduce phase (parallelization across model parameters)
– For each key (feature):
• Sum up observed and model feature expectations
• Do the parameter update
– Write the new model
• Repeat for N iterations
Maxent on the grid (cont’d)
• maxent_mapper
– args: [file of model params]
– stdin: [training data, one instance per line]
– stdout: [feature name] [observed val] [current param] [model val]
• maxent_reducer
– args: [iteration] [correction constant]
– stdin: ( input from maxent_mapper, sorted by key )
– stdout: [feature name] [new param]
• Uses Hadoop streaming; can also be run off the grid
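A much-simplified Python sketch of the mapper and reducer described above, following the I/O formats listed on this slide; the lab implementation is not Python and handles additional details, so treat this purely as an outline.

import sys
from collections import defaultdict

def load_model(path):
    """Feature set format: [q] [label] [parameter]."""
    model = {}
    for line in open(path):
        q, label, param = line.split()
        model[(q, label)] = float(param)
    return model

def mapper(model, stdin=sys.stdin, out=sys.stdout):
    """Reads training data ([label] [weight] [q1 ... qN]) and emits, per feature:
    [feature name] \t [observed val] \t [current param] \t [model val]."""
    labels = {label for (_q, label) in model}
    obs = defaultdict(float)
    exp = defaultdict(float)
    for line in stdin:
        fields = line.split()
        label, weight, questions = fields[0], float(fields[1]), fields[2:]
        for q in questions:                       # observed expectations
            if (q, label) in model:
                obs[(q, label)] += weight
        scores = {a: 1.0 for a in labels}         # model expectations
        for a in labels:
            for q in questions:
                scores[a] *= model.get((q, a), 1.0)
        z = sum(scores.values())
        for a in labels:
            for q in questions:
                if (q, a) in model:
                    exp[(q, a)] += weight * scores[a] / z
    for (q, a), param in model.items():
        out.write(f"{q} {a}\t{obs[(q, a)]}\t{param}\t{exp[(q, a)]}\n")

def reducer(C, stdin=sys.stdin, out=sys.stdout):
    """Sums the mappers' statistics per feature and writes [feature name] [new param]."""
    totals = defaultdict(lambda: [0.0, 1.0, 0.0])   # observed, current param, model
    for line in stdin:
        key, obs, param, mod = line.rstrip("\n").split("\t")
        totals[key][0] += float(obs)
        totals[key][1] = float(param)               # same current param from every mapper
        totals[key][2] += float(mod)
    for key, (obs, param, mod) in totals.items():
        new_param = param * (obs / mod) ** (1.0 / C) if mod > 0 else param
        out.write(f"{key}\t{new_param}\n")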
Lab session
• Log in to a UNIX machine or Mac
• mkdir maxent_lab; cd maxent_lab
• svn checkout
svn+ssh://[Link]/yahoo/adsciences/contextual
advertising/streaming_maxent/trunk/GIS .
• cd src; make clean all
• cd ../unit_test
• ./doloop
Lab Exercise: Sentence Boundary Detection
• Problem: given a “.” in free text, classify it:
– Yes, it is a sentence boundary
– No, it is not
• Not a super-hard problem, but not super-easy!
– A hand-coded baseline can achieve a high result
– With foreign languages, hand-coding is tougher
• cd data
• The Penn Treebank corpus: [Link]
– gunzip -c [Link] | head
– Newline indicates sentence boundary
– The Penn Treebank tokenizes text for NLP
• Undid tokenization as much as possible for this exercise
Lab: data generation
• cd ../lab/data
• Look at [Link]
– creates train, development-test, and test splits
– Feature extraction
• [true label] [weight] [q1] … [qN]
no 1.0 *default* prefix=Nov
yes 1.0 *default* prefix=29
• *default* is always on
– Model can estimate prior probability of Yes and No
• Prefix: char sequence before “.” up to space char
• run ./[Link]
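A sketch of the feature extraction just described, assuming one sentence per line in the input (as in the lab data); the function name and details are illustrative, not the actual [Link] script. Each yielded tuple would then be written out in the [true label] [weight] [q1] … [qN] format.

def boundary_events(lines):
    """Yield (label, weight, questions) for each '.' in the corpus.

    Each input line holds one sentence, so a '.' that ends the line is a
    boundary ('yes') and any other '.' is not ('no').
    """
    for line in lines:
        line = line.rstrip("\n")
        for i, ch in enumerate(line):
            if ch != ".":
                continue
            label = "yes" if line[i + 1:].strip() == "" else "no"
            # Prefix question: character sequence before the '.' up to a space.
            prefix = line[:i].rsplit(" ", 1)[-1]     # e.g. 'Nov' in 'Nov. 29.'
            yield (label, 1.0, ["*default*", f"prefix={prefix}"])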
Lab: feature selection and training
• cd ../train
• Look at dotrain
– selectfeatures
• select features with frequency >= cutoff
– train
• find correction constant
• Iterate (each iteration is one map/reduce job)
– maxent_mapper: collect stats
– maxent_reducer: find update
• Look at dotest
– [Link]: Classifies as “yes” if prob > 0.5
– Evaluation: # of correctly classified test instances
• Run ./dotrain
Lab: matching expectations
• GIS should bring model expectation closer to observed
expectation
• After the 1st map ([Link]):

Feature          Observed   Parameter    Model
prefix=$1 no     462        1            233
prefix=$1 yes    4          1            233

• After the 9th map ([Link]):

Feature          Observed   Parameter    Model
prefix=$1 no     462        1.51773      461.574
prefix=$1 yes    4          0.00731677   4.42558
Lab: results
• Log-likelihood of training data must increase
[Figure: log likelihood of the training data by iteration (1-9), increasing from about -35,000 toward 0]
• Accuracy:
– Train: 46670 correct out of 47771, or 97.70%
– Dev: 14940 correct out of 15579, or 95.90%
Lab: your turn!!!
• Beat this result (on the development set only)!
• Things to try
– Feature extraction
• Look at data and find other things that would be useful for
sentence boundary detection
• data/[Link]
– Suffix features
– Feature classes
– Feature selection
• Pay attention to the number of features
– Number of iterations
– Pay attention to train vs. test accuracy
Lab: Now let’s try the real test set
• Take your best model, and try it on the test set
./dotest [Link] ../data/te
• Did you beat the baseline?
13937 correct out of 14586, or 95.55%
• Who has the highest result?