
Introduction to maximum entropy models
Adwait Ratnaparkhi
Yahoo! Labs
Introduction
• This talk is geared for Technical Yahoos…
– Who are unfamiliar with machine learning or maximum entropy
– Who are familiar with machine learning but not maximum entropy
– Who want to see an example of machine learning on the grid
– Who are looking for an introductory talk
• Lab session is geared for Technical Yahoos…
– Who brought their laptop
– Who want to obtain more numerical intuition about the optimization algorithms
– Who don’t mind coding a little Perl
– Who have some patience in case things go awry!



Outline
• What is maximum entropy modeling?
• Example applications of maximum entropy framework
• Lab session
– Look at the mapper and reducer of a maximum entropy parameter estimation algorithm


What is a maximum entropy model?
• A framework for combining evidence into a probability model
• The probability can then be used for classification, or as input to the next component
– Content match: P(click | page, ad, user)
– Part-of-speech tagging: P(part-of-speech-tag | word)
– Categorization: P(category | page)
• The principle of maximum entropy is used as the optimization criterion
– Relationship to maximum likelihood optimization
– Same model form as multi-class logistic regression


Machine learning approach
• Data with labels
– Automatic: ad click
• sponsored search, content match, graphical ads
– Manual: editorial process
• page categories, part-of-speech tags
• Objective
– Training phase
• Find a (large) training set
• Use training set to construct a classifier or probability model
– Test phase
• Find an (honest) test set
• Use model to assign labels to previously unseen data



Why machine learning? Why maxent?
• Why machine learning?
– Pervasive ambiguity
• Any task with natural language features
– For these tasks, it is easier to collect and annotate data than to hand-code an expert system
– For certain web tasks, annotation is free (e.g., clicks)
• Why maxent?
– Little restriction on kinds of evidence
• No independence assumption
– Works well with sparse features
– Works well in parallel settings, like Hadoop
– Appeal of maxent interpretation



Sentence Boundary Detection
• Which “.” denotes a sentence boundary?

Mr. B. Green from X.Y.Z. Corp. said that the U.S. budget
deficit hit $1.42 trillion for the year that ended Sept. 30. The
previous year’s deficit was $459 billion.

Correct boundary: the “.” after “30” (the periods in “Mr.”, “X.Y.Z.”, “Corp.”, and “Sept.” are not boundaries)

• Model: P({yes|no} | candidate boundary)



Part-of-speech tagging
• What is the part of speech of flies?
– Fruit flies like a banana.
– Time flies like an arrow.

• Model: P(tag | current word, surrounding words, …)
– End goal is a sequence of tags


Content match ad clicks

P(click|page, ad, user)



Ambiguity resolution: an artificial example
• Tagging unknown, or “out-of-vocabulary”, words:
– Given word, predict POS tag based on spelling features
• Assume all words are either:
– Proper Noun (NNP)
– Verb Gerund (VBG)
• Find a model: P(tag | word) where tag is in {NNP,VBG}
• Assume a training set
– Editorially derived (word, tag) pairs
– Training data tags are { NNP, VBG }
– Reminder: artificial example!

Ambiguity resolution (cont’d)
• Ask the data
– q1: Is first letter of word capitalized ?
– q2: Does word end in ing ?
• Evidence for (q1, NNP)
– Clinton, Bush, Reagan as NNP
• Evidence for (q2, VBG)
– trading, offering, operating as VBG
• Rules based on q1,q2 can conflict
– Boeing
• How to choose in ambiguous cases?

Features represent evidence
• a = what we are predicting
• b = what we observe
• Terminology
– Here, input to feature is (a,b) pair
– But in a lot of ML literature, input to feature is just (b)

f_{y,q}(a,b) = 1 if a = y AND q(b) = true, 0 otherwise

f1(a,b) = 1 if a = NNP and q1(b) = true, 0 otherwise
f2(a,b) = 1 if a = VBG and q2(b) = true, 0 otherwise
Combine the features: probability model
p(a | b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}

Z(b) = Σ_{a'} ∏_{j=1..k} α_j^{f_j(a',b)}

f_j(a,b) ∈ {0,1} : features
α_j > 0 : parameters

• Probability is the product of the weights of the active features
• Why this form?
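To make the model form concrete, here is a minimal Perl sketch (illustrative weights and helper names, not the lab code) that computes p(a | b) for the two-feature NNP/VBG example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed weights for f1 = (NNP, capitalized) and f2 = (VBG, ends in "ing").
my %alpha = ( f1 => 2.5, f2 => 1.8 );    # illustrative values only

# Indicator features f_j(a, b) in {0, 1}.
sub f1 { my ($tag, $word) = @_; return ($tag eq 'NNP' && $word =~ /^[A-Z]/) ? 1 : 0; }
sub f2 { my ($tag, $word) = @_; return ($tag eq 'VBG' && $word =~ /ing$/)   ? 1 : 0; }

# p(a | b) = (1/Z(b)) * prod_j alpha_j^{f_j(a, b)}
sub p {
    my ($a_out, $b_in) = @_;
    my %score;
    for my $tag ('NNP', 'VBG') {
        $score{$tag} = $alpha{f1} ** f1($tag, $b_in) * $alpha{f2} ** f2($tag, $b_in);
    }
    my $Z = $score{NNP} + $score{VBG};    # Z(b) sums over all outcomes a'
    return $score{$a_out} / $Z;
}

# "Boeing" activates both questions, so the two weights compete.
printf "p(NNP|Boeing) = %.3f\n", p('NNP', 'Boeing');
printf "p(VBG|Boeing) = %.3f\n", p('VBG', 'Boeing');
```

With these assumed weights, p(NNP | Boeing) ≈ 0.58: the model prefers NNP only because α1 > α2.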
Probabilities for ambiguity resolution

p(NNP | Boeing) = (1/Z) α1^{f1(NNP, Boeing)} α2^{f2(NNP, Boeing)} = (1/Z) α1

p(VBG | Boeing) = (1/Z) α1^{f1(VBG, Boeing)} α2^{f2(VBG, Boeing)} = (1/Z) α2

(For Boeing, f1 fires only with NNP and f2 only with VBG, so each probability reduces to a single weight over Z.)

• How do we find optimal parameter values?
Maximum likelihood estimation

Q = { p : p(a | b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)} }

r(a,b) = normalized frequency of (a,b) in the training set

L(p) = Σ_{a,b} r(a,b) log p(a | b)

p_ML = argmax_{p∈Q} L(p)
Principle of maximum entropy (Jaynes, 1957)
• Use the probability model that is maximally uncertain w.r.t. the observed evidence
• Why? Anything else assumes a fact you have not observed.

P = { models consistent with evidence }
H(p) = entropy of p
p_ME = argmax_{p∈P} H(p)
Maxent example
• Task: estimate joint distribution p(A,B)
– A is in {x,y}
– B is in {0,1}
• Define a feature f
– Assume some expected value over a training set

f(a,b) = 1 iff (a = y AND b = 1), 0 otherwise

E[f] = 7/10 = 0.7

Σ_{a,b} p(a,b) = 1

p(A,B)   B=0   B=1
A=x      ?     ?
A=y      ?     0.7
Maxent example (cont’d)
• Define entropy: H(p) = −Σ_{a,b} p(a,b) log p(a,b)

• One way to meet the constraints: H(p) = 1.25

p(A,B)   B=0    B=1
A=x      0.05   0.2
A=y      0.05   0.7

• The maxent way to meet the constraints: H(p) = 1.35

p(A,B)   B=0   B=1
A=x      0.1   0.1
A=y      0.1   0.7
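The two entropy values can be checked with a few lines of Perl (log base 2; the slide's 1.25 and 1.35 are these values truncated to two digits):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# H(p) = -sum_{a,b} p(a,b) log2 p(a,b), for the two tables above.
sub entropy {
    my $h = 0;
    $h -= $_ * log($_) / log(2) for grep { $_ > 0 } @_;
    return $h;
}

printf "one way:    H = %.3f\n", entropy(0.05, 0.2, 0.05, 0.7);  # 1.257
printf "maxent way: H = %.3f\n", entropy(0.1, 0.1, 0.1, 0.7);    # 1.357
```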
Conditional maximum entropy (Berger et al., 1996)

E_r f_j = observed expectation of f_j = Σ_{a,b} r(a,b) f_j(a,b)

E_p f_j = model's expectation of f_j = Σ_{a,b} r(b) p(a | b) f_j(a,b)

P = { p : E_p f_j = E_r f_j , j = 1…k }

H(p) = −Σ_{a,b} r(b) p(a | b) log p(a | b)

p_ME = argmax_{p∈P} H(p)
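For intuition, a toy Perl sketch of how the two expectations are computed from a training set (the counts and the uniform stand-in model are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy training set: [label, word, count]; r(a,b) = count / total.
my @events = ( ['NNP', 'Clinton', 3], ['VBG', 'trading', 6], ['NNP', 'Boeing', 1] );
my $total = 0;
$total += $_->[2] for @events;

# f1(a, b): a = NNP and the word is capitalized.
sub f1 { my ($tag, $word) = @_; return ($tag eq 'NNP' && $word =~ /^[A-Z]/) ? 1 : 0; }

# Stand-in for the current model p(a | b); uniform over the two tags here.
sub p_model { return 0.5; }

my ($e_obs, $e_model) = (0, 0);
for my $ev (@events) {
    my ($label, $word, $count) = @$ev;
    my $r_ab = $count / $total;
    $e_obs += $r_ab * f1($label, $word);                   # E_r[f1]
    # Each word occurs with a single label here, so r(b) = r(a,b).
    for my $tag ('NNP', 'VBG') {
        $e_model += $r_ab * p_model() * f1($tag, $word);   # E_p[f1]
    }
}
printf "E_r[f1] = %.2f   E_p[f1] = %.2f\n", $e_obs, $e_model;   # 0.40   0.20
```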
Duality of ML and ME
• Under ME it must be the case that:

p_ME(a | b) = (1/Z(b)) ∏_{j=1..k} α_j^{f_j(a,b)}

• The ML and ME solutions are the same:
– p_ME = p_ML
– ML: the form is assumed without justification
– ME: the constraints are assumed, the form is derived
Extensions: Minimum divergence modeling
• Kullback-Leibler divergence
– Measures “distance” between 2 probability distributions
– Not symmetric!
– See (Cover and Thomas, Elements of Information Theory)

D(p, q) = Σ_{a,b} r(b) p(a | b) log [ p(a | b) / q(a | b) ]

D(p, q) ≥ 0
D(p, q) = 0 iff p = q
Extensions: Minimum divergence models (cont’d)

P = { p : E_p f_j = E_r f_j , j = 1…k }

p_MD = argmin_{p∈P} D(p, q)

p_MD(a | b) = q(a | b) ∏_{j=1..k} α_j^{f_j(a,b)} / Σ_{a'} q(a' | b) ∏_{j=1..k} α_j^{f_j(a',b)}

• Minimum divergence framework:
– Start with a prior model q
– From the set of consistent models P, minimize the KL divergence to q
• Parameters will reflect deviation from the prior model
– Use case: the prior model is static
• Same as maximizing entropy when q is uniform
• See (Della Pietra et al., 1992) for an example in language modeling
Parameter estimation (an incomplete list)
• Generalized Iterative Scaling (Darroch & Ratcliff, 1972)
– Find correction feature and constant
– Iterative updates
• Improved Iterative Scaling (Della Pietra et al., 1997)
• Conjugate gradient
• Sequential conditional GIS (Goodman, 2002)
• Correction-free GIS (Curran and Clark, 2003)

Define correction feature:

f_{k+1}(a,b) = C − Σ_{j=1..k} f_j(a,b)

GIS updates:

α_j^(0) = 1

α_j^(n) = α_j^(n−1) · [ E_r f_j / E_{p^(n−1)} f_j ]^{1/C}
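A toy Perl sketch of the GIS update step (the observed/model expectations are the ones shown on the "matching expectations" lab slide later in the deck; the correction constant C is an assumed value):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One GIS update per feature: alpha_j <- alpha_j * (E_r[f_j] / E_p[f_j])^(1/C).
my $C = 5;                                   # correction constant (assumed value)
my %alpha    = ( 'prefix=$1 no' => 1.0 );    # current parameters
my %observed = ( 'prefix=$1 no' => 462 );    # E_r[f_j], observed expectation
my %model    = ( 'prefix=$1 no' => 233 );    # E_p[f_j], under the current model

for my $f (sort keys %alpha) {
    $alpha{$f} *= ( $observed{$f} / $model{$f} ) ** ( 1 / $C );
    printf "%s\t%.5f\n", $f, $alpha{$f};
}
```

Since the observed expectation exceeds the model's here, the weight moves up; over iterations the two expectations converge.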
Comparisons
• Same model form as multi-class logistic regression
• Diverse forms of evidence
• Compared to decision trees:
– Advantage: No data fragmentation
– Disadvantage: No feature construction
• Compared to naïve Bayes:
– No independence assumptions
• Scales well on sparse feature sets
– Parameter estimation (GIS) is O([# of training samples] × [# of predictions] × [avg. # of features per training event])
Disadvantages
• “Perfect” predictors cause parameters to diverge
– Suppose the word “the” only occurred with the tag DT
– The estimation algorithm forces p(a | b) = 1 in order to meet the constraints
• The parameter for (the, DT) will diverge to infinity
• It may beat out other parameters estimated from many examples!
• A remedy:
– Gaussian priors or “fuzzy maximum entropy” (Chen & Rosenfeld, 2000)
– Discount the observed expectations
How to specify a maxent model
• Outcomes
– What are we predicting?
• Questions
– What information is useful for predicting the outcome?
• Feature selection
– The candidate feature set consists of all (outcome, question) pairs
– Given the candidate feature set, what subset do we use?
Finding the part-of-speech
• Part of Speech (POS) Tagging
– Return a sequence of POS tags

Input: Fruit flies like a banana
Output: N N V D N

Input: Time flies like an arrow
Output: N V P D N

• Train maxent models from the POS tags of the Penn Treebank (Marcus et al., 1993)
• Use a heavily pruned search procedure to find the highest-probability tag sequence
Model for POS tagging
(Ratnaparkhi, 1996)

• Outcomes
– 45 POS tags (Penn Treebank)
• Question Patterns:
– common words: word identity
– rare words: presence of prefix, suffix, capitalization, hyphen,
and numbers
– previous 2 tags
– surrounding 2 words
• Feature Selection
– count cutoff of 10

A training event
• Example:
– ...stories about well-heeled communities and ...
NNS IN ???
• Outcome: JJ (adjective)
• Questions
– w[i-1]=about, w[i-2]=stories, w[i+1]=communities, w[i+2]=and,
t[i-1]=IN, t[i-2][i-1]=NNS IN,
pre[1]=w, pre[2]=we, pre[3]=wel, pre[4]=well,
suf[1]=d,suf[2]=ed,suf[3]=led,suf[4]=eled

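A hypothetical Perl sketch of how the question strings above can be generated for position i (the layout and names are illustrative, not the actual tagger code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Generate question strings for word i, following the patterns above.
my @w = qw(stories about well-heeled communities and);
my @t = qw(NNS IN);    # tags already assigned to words 0 .. i-1
my $i = 2;             # current position: "well-heeled"

my @q = ( "w[i-1]=$w[$i-1]", "w[i-2]=$w[$i-2]",
          "w[i+1]=$w[$i+1]", "w[i+2]=$w[$i+2]",
          "t[i-1]=$t[$i-1]", "t[i-2][i-1]=$t[$i-2] $t[$i-1]" );
for my $n (1 .. 4) {
    push @q, "pre[$n]=" . substr($w[$i], 0, $n);    # w, we, wel, well
    push @q, "suf[$n]=" . substr($w[$i], -$n);      # d, ed, led, eled
}
print join(', ', @q), "\n";
```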
Finding the best POS sequence

a*_1 … a*_n = argmax_{a_1 … a_n} ∏_{i=1..n} p(a_i | b_i)

• Find the maximum-probability sequence of n tags
– Use “top K” breadth-first search:
– Tag left-to-right, but maintain only the top K ranked hypotheses
• The best-ranked hypothesis is not guaranteed to be optimal
• Alternative: conditional random fields (Lafferty et al., 2001)
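A minimal Perl sketch of the top-K search (the beam width and the placeholder probability function are assumptions; the real model conditions on the surrounding words and tags):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $K    = 3;
my @tags = qw(N V P D);

# Placeholder for the maxent model p(a_i | b_i); uniform here for illustration.
sub p_tag { my ($tag, $word, $prev_tags) = @_; return 1.0 / @tags; }

sub tag_sentence {
    my @words = @_;
    my @beam = ( { tags => [], prob => 1.0 } );    # one empty hypothesis
    for my $word (@words) {
        my @next;
        for my $hyp (@beam) {
            for my $tag (@tags) {
                push @next, { tags => [ @{ $hyp->{tags} }, $tag ],
                              prob => $hyp->{prob} * p_tag($tag, $word, $hyp->{tags}) };
            }
        }
        # Keep only the top K ranked hypotheses.
        @next = sort { $b->{prob} <=> $a->{prob} } @next;
        splice(@next, $K) if @next > $K;
        @beam = @next;
    }
    return $beam[0];    # best-ranked, but not guaranteed optimal
}

my $best = tag_sentence(qw(Time flies like an arrow));
print join(' ', @{ $best->{tags} }), "\n";
```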
Performance

Domain                                           Word accuracy   Unknown word accuracy   Sentence accuracy
English: Wall St. Journal                        96.6%           85.3%                   47.3%
Spanish: CRATER corpus (with re-mapped tagset)   97.7%           83.3%                   60.4%
Summary
• Errors:
– are typically the words that are difficult to annotate
• that, about, more

• Architecture
– Can be ported easily for similar tasks with different tag sets,
esp. named entity detection
• Name detector
– Tags = { begin_Name, continue_Name, other }
– Sequence probability can be used downstream

• Available for download
– MXPOST & MXTERMINATOR
Maxent model for Keystone content-match
• Yahoo’s Keystone content-match uses a click model
P(click | page, ad, user) = (1/Z(page, ad, user)) ∏_j α_j^{f_j(click, page, ad, user)}

• Use features of the page, ad, and user to predict the click
• Use cases:
– Select ads with a (page, ad) cosine score, then use the click model to re-rank
– Select ads directly with the click model score
Maxent model for Keystone content-match
• Outcomes: click (1) or no click (0)
• Questions:
– Unigrams, phrases, categories on the page side
– Unigrams, phrases, categories on the ad side
– The user’s BT category
• Feature selection
– Count cutoff, mutual information

Some recent work
• Recent work
– Using the (page, ad) cosine score (Opal) as a feature
– Using page domain → ad bid phrase mappings
– Using user BT → ad bid phrase mappings
– Using user age+gender → bid phrase mappings
• Contact
– Andy Hatch (aohatch)
– Abraham Bagherjeiran (abagher)
Maxent on the grid
• Several grid maxent implementations at Yahoo
• Correction-Free GIS (Curran and Clark, 2003)
– Implemented for verification and lab session
– Not product-ready code!
• Input for each iteration
– Training data format:
[label] [weight] [q1 … qN]
– Feature set format:
[q] [label] [parameter]
• Output for each iteration
– New feature set format:
[q] [label] [new parameter]

Maxent on the grid (cont’d)
• Map phase (parallelization across training data)
– Collect observed feature expectations
– Collect model’s feature expectations w.r.t. current model
– Use feature name as key
• Reduce phase (parallelization across model parameters)
– For each key (feature):
• Sum up observed and model feature expectations
• Do the parameter update
– Write the new model
• Repeat for N iterations

Maxent on the grid (cont’d)
• maxent_mapper
– args: [file of model params]
– stdin: [training data, one instance per line]
– stdout: [feature name] [observed val] [current param] [model val]
• maxent_reducer
– args: [iteration] [correction constant]
– stdin: ( input from maxent_mapper, sorted by key )
– stdout: [feature name] [new param]
• Uses Hadoop streaming; can also be run off the grid
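For intuition, a simplified Perl sketch of a streaming reducer over this format (argument handling is cut down to just a correction constant; the product code's interface is as listed above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reads mapper lines "[feature] [observed] [param] [model]" sorted by feature;
# writes "[feature] [new param]" using the GIS-style update.
my $C = shift @ARGV || 5;    # correction constant (example default)

my ($key, $obs, $model, $param);

sub flush {
    printf "%s\t%g\n", $key, $param * ($obs / $model) ** (1 / $C) if defined $key;
}

while (my $line = <STDIN>) {
    chomp $line;
    my ($f, $o, $p, $m) = split ' ', $line;
    if (!defined $key or $f ne $key) {
        flush();                                  # emit the previous feature
        ($key, $obs, $model, $param) = ($f, 0, 0, $p);
    }
    $obs   += $o;    # sum observed expectations across mappers
    $model += $m;    # sum model expectations across mappers
}
flush();
```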
Lab session
• Login to a UNIX machine or Mac
• mkdir maxent_lab; cd maxent_lab
• svn checkout svn+ssh://[Link]/yahoo/adsciences/contextual advertising/streaming_maxent/trunk/GIS .
• cd src; make clean all
• cd ../unit_test
• ./doloop

Lab Exercise: Sentence Boundary Detection
• Problem: given a “.” in free text, classify it:
– Yes, it is a sentence boundary
– No, it is not
• Not a super-hard problem, but not super-easy!
– Hand-coded baseline can get high result
– With foreign languages, hand-coding is tougher
• cd data
• The Penn Treebank corpus: [Link]
– gunzip -c [Link] | head
– Newline indicates sentence boundary
– Penn treebank tokenizes text for NLP
• Undid tokenization as much as possible for this exercise

Lab: data generation
• cd ../lab/data
• Look at [Link]
– creates train/development test/test
– Feature extraction
• [true label] [weight] [q1] … [qN]

no 1.0 *default* prefix=Nov
yes 1.0 *default* prefix=29
• *default* is always on
– Model can estimate prior probability of Yes and No
• Prefix: char sequence before “.” up to space char
• run ./[Link]

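An illustrative Perl take on the prefix question (the function name is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# prefix = the character sequence before the "." back to the previous space.
sub prefix_feature {
    my ($text, $dot_pos) = @_;            # $dot_pos: index of the "." in $text
    my $left = substr($text, 0, $dot_pos);
    $left =~ s/.*\s//;                    # keep only the chunk after the last space
    return "prefix=$left";
}

my $s = "deficit hit \$1.42 trillion for the year that ended Sept. 30.";
print prefix_feature($s, index($s, "Sept.") + 4), "\n";    # prefix=Sept
```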
Lab: feature selection and training
• cd ../train
• Look at dotrain
– selectfeatures
• select features whose frequency ≥ the cutoff
– train
• find correction constant
• Iterate (each iteration is one map/reduce job)
– maxent_mapper: collect stats
– maxent_reducer: find update

• Look at dotest
– [Link]: Classifies as “yes” if prob > 0.5
– Evaluation: # of correctly classified test instances
• Run ./dotrain

Lab: matching expectations
• GIS should bring the model expectation closer to the observed expectation

• After the 1st map ([Link]):

Feature           observed   parameter   model
prefix=$1 no      462        1           233
prefix=$1 yes     4          1           233

• After the 9th map ([Link]):

Feature           observed   parameter    model
prefix=$1 no      462        1.51773      461.574
prefix=$1 yes     4          0.00731677   4.42558
Lab: results
• Log-likelihood of training data must increase

[Chart: log-likelihood of the training data over iterations 1-9, rising from about -35000 toward 0]

• Accuracy:
– Train: 46670 correct out of 47771, or 97.70%
– Dev: 14940 correct out of 15579, or 95.90%
Lab: your turn!!!
• Beat this result (on the development set only)!
• Things to try
– Feature extraction
• Look at data and find other things that would be useful for
sentence boundary detection
• data/[Link]
– Suffix features
– Feature classes
– Feature selection
• Pay attention to the number of features
– Number of iterations
– Pay attention to train vs. test accuracy

Lab: Now let’s try the real test set
• Take your best model, and try it on the test set

./dotest [Link] ../data/te

• Did you beat the baseline?

13937 correct out of 14586, or 95.55%

• Who has the highest result?

