IMA 2000
in Computational Linguistics
Christopher Manning
Natural language understanding traditions
The logical tradition
Natural language understanding traditions
The formal language theory tradition (Chomsky 1957)
Speech and NLP: A probabilistic view
A = acoustic signal    W = words
T = syntactic (tree) structures    M = meanings
In spoken language use, we have a distribution:
P(A, W, T, M)
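To make this joint distribution usable it has to be factored; one illustrative possibility, first the exact chain rule and then some plausible (but not obligatory) independence assumptions, is:

P(A, W, T, M) = P(M) · P(T | M) · P(W | T, M) · P(A | W, T, M)
             ≈ P(M) · P(T | M) · P(W | T) · P(A | W)

Speech recognition, for example, works with the P(W) and P(A | W) pieces of such a factorization.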
[Parse tree for "Fed raises interest rates 0.5 % in effort to control inflation"]
Where are the ambiguities?
The bad effects of V/N ambiguities (1)
[S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]
The bad effects of V/N ambiguities (2)
[S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]]
The bad effects of V/N ambiguities (3)
[S [NP [N Fed] [N raises] [N interest]] [VP [V rates] [NP [CD 0.5] [N %]]]]
Phrasal attachment ambiguities
[Two parse trees for "Fed raises interest rates 0.5 % in effort to control inflation", differing in where the modifiers "in effort" and "to control inflation" attach]
The many meanings of interest [n.]
Readiness to give attention to or to learn about something
Quality of causing attention to be given
Activity, subject, etc., which one gives time and attention to
The advantage, advancement or favor of an individual or group
A stake or share (in a company, business, etc.)
Money paid regularly for the use of money
Hidden Markov Models – POS example
[HMM figure: hidden states emitting the observed words "Fed raises interest rates"]
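To make the picture concrete, here is a minimal Viterbi decoder for a toy two-tag HMM over these four words; all transition and emission probabilities are invented for illustration rather than estimated from data:

# Toy HMM POS tagging with Viterbi decoding.
# Tags: N (noun), V (verb). All probabilities below are invented for illustration.
words = ["Fed", "raises", "interest", "rates"]
tags = ["N", "V"]

start = {"N": 0.8, "V": 0.2}                      # P(tag_1)
trans = {("N", "N"): 0.5, ("N", "V"): 0.5,        # P(tag_i | tag_{i-1})
         ("V", "N"): 0.7, ("V", "V"): 0.3}
emit = {("N", "Fed"): 0.1, ("V", "Fed"): 0.01,    # P(word | tag)
        ("N", "raises"): 0.02, ("V", "raises"): 0.1,
        ("N", "interest"): 0.1, ("V", "interest"): 0.02,
        ("N", "rates"): 0.1, ("V", "rates"): 0.05}

# viterbi[i][t] = (best probability of a tag sequence for words[:i+1] ending in t, backpointer)
viterbi = [{t: (start[t] * emit[(t, words[0])], None) for t in tags}]
for i in range(1, len(words)):
    row = {}
    for t in tags:
        best_prev = max(tags, key=lambda p: viterbi[i - 1][p][0] * trans[(p, t)])
        score = viterbi[i - 1][best_prev][0] * trans[(best_prev, t)] * emit[(t, words[i])]
        row[t] = (score, best_prev)
    viterbi.append(row)

# Follow backpointers from the best final tag to recover the tag sequence.
last = max(tags, key=lambda t: viterbi[-1][t][0])
seq = [last]
for i in range(len(words) - 1, 0, -1):
    last = viterbi[i][last][1]
    seq.append(last)
print(list(zip(words, reversed(seq))))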
Attachment ambiguities
[Two parse trees for "I saw the man with a telescope": in one the PP "with a telescope" attaches to the VP (the seeing was done with a telescope); in the other it attaches to the NP "the man" (the man has the telescope)]
Likelihood ratios for PP attachment
Likely attachment chosen by a (log) likelihood ratio:
λ(v, n, p) = log₂ [ P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) ]
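In practice the two attachment probabilities are estimated from counts; a small sketch, with an invented counts table and a purely illustrative llr helper:

# Sketch: log-likelihood ratio for PP attachment, in the spirit of Hindle & Rooth-style counts.
# The counts below are invented; real systems estimate them from a (partially) parsed corpus.
from math import log2

counts = {
    ("saw", "man", "with"): {"verb_attach": 20, "noun_attach": 60},
}

def llr(v, n, p):
    c = counts[(v, n, p)]
    total = c["verb_attach"] + c["noun_attach"]
    p_verb = c["verb_attach"] / total   # estimate of P(Attach(p)=v | v, n)
    p_noun = c["noun_attach"] / total   # estimate of P(Attach(p)=n | v, n)
    return log2(p_verb / p_noun)

score = llr("saw", "man", "with")
print(score)   # positive favors verb attachment, negative favors noun attachment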
Simple Bayesian Inference for NLP
Central conception in early work: The “noisy channel” model.
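A minimal sketch of the noisy-channel recipe (choose the message W that maximizes P(W) · P(observation | W)), with invented candidates and probabilities:

# Noisy-channel decoding: argmax over messages W of P(W) * P(observation | W).
# All candidates and probabilities are invented for illustration.
candidates = ["Fed raises interest rates", "Fed razes interest rates"]
p_source = {"Fed raises interest rates": 1e-6,     # source / language model P(W)
            "Fed razes interest rates": 1e-9}
p_channel = {"Fed raises interest rates": 0.3,     # channel model P(observation | W)
             "Fed razes interest rates": 0.4}

best = max(candidates, key=lambda w: p_source[w] * p_channel[w])
print(best)   # the source model overrides the slightly better channel match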
Probabilistic inference in more generality
Overall there is a joint distribution of all the variables, e.g., P(s, t, m, d), which we want to be able to represent compactly.
Some items in this distribution are observed; we infer the remaining, hidden ones by conditioning:
P(Hidden | Obs = o₁)
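For a joint small enough to write down explicitly, such conditioning is just enumeration and renormalization; a sketch with an invented two-variable joint:

# Inference by enumeration over a small explicit joint distribution.
# The variables and numbers are invented; the point is conditioning: P(Hidden | Obs = o1).
joint = {  # P(hidden, obs)
    ("noun", "capitalized"): 0.20, ("noun", "lowercase"): 0.30,
    ("verb", "capitalized"): 0.05, ("verb", "lowercase"): 0.45,
}

def posterior(obs_value):
    # Keep the entries consistent with the observation, then renormalize.
    consistent = {h: p for (h, o), p in joint.items() if o == obs_value}
    z = sum(consistent.values())
    return {h: p / z for h, p in consistent.items()}

print(posterior("capitalized"))   # P(Hidden | Obs = "capitalized")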
Machine Learning for NLP
Method \ Problem    POS tagging               WSD                      Parsing
Naive Bayes                                   Gale et al. (1992)
(H)MM               Charniak et al. (1993)
Decision Trees      Schmid (1994)             Mooney (1996),           Magerman (1995)
                                              Ringuette (1994)
Decision List/TBL   Brill (1995)                                       Brill (1993)
kNN/MBL             Daelemans et al. (1996)   Ng and Lee (1996)        Zavrel et al. (1997)
Maximum entropy     Ratnaparkhi (1996)                                 Ratnaparkhi et al. (1994)
Neural networks     Benello et al. (1989)                              Henderson and Lane (1998)
Distinctiveness of NLP as an ML problem
Language allows the complex compositional encoding of meanings.
[Plot: word frequency (log scale, from 1 up past 10000) against frequency rank (log scale), showing the Zipfian distribution of words in text]
f ∝ 1/r    or, there is a k such that f · r = k
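A quick empirical check: count word frequencies in any sizeable text file and look at frequency × rank; the file name below is a placeholder:

# Sketch: check Zipf's law (f proportional to 1/r, i.e. f * r roughly constant) on a text file.
# "corpus.txt" is a placeholder path; use any large plain-text file.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as fh:
    words = fh.read().lower().split()

freqs = sorted(Counter(words).values(), reverse=True)
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        f = freqs[rank - 1]
        print(f"rank {rank:5d}  frequency {f:8d}  f*r = {f * rank}")
# For natural language text, the f*r column stays within roughly an order of magnitude.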
Simple linear models of language
Markov models a.k.a. n-gram models: each word is predicted from the preceding few words via a conditional distribution.
Conditional Probability Table (CPT): e.g., P(X | both)
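Such a CPT is just relative frequencies of bigram counts; a sketch over an invented miniature corpus:

# Sketch: estimating a bigram Conditional Probability Table P(next word | previous word)
# by relative frequency. The tiny "corpus" is invented for illustration.
from collections import Counter, defaultdict

corpus = "both sides agree that both parties will meet".split()

bigrams = Counter(zip(corpus, corpus[1:]))
prev_totals = Counter(corpus[:-1])

cpt = defaultdict(dict)   # cpt[prev][next] = P(next | prev)
for (prev, nxt), c in bigrams.items():
    cpt[prev][nxt] = c / prev_totals[prev]

print(cpt["both"])   # the CPT row P(X | "both"): {'sides': 0.5, 'parties': 0.5}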
[Tree fragments with annotated node labels: NPsg VPsg ("... rises to ..."); Srose → NPprofits VProse (computation is bottom-up)]
Charniak (1997) example
A. h = profits ; c = NP
B. ph = rose ; pc = S
C. P(h | ph, c, pc)   (the probability of a constituent's head word)
D. P(r | h, c, pc)   (the probability of the rule expanding it, here NPprofits → JJ NNSprofits)
[Trees: Srose → NP VProse, with the NP then annotated as NPprofits and expanded as JJ NNSprofits]
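Put together, a tree's probability in a model of this shape is a product of head and rule probabilities over its constituents; the sketch below uses invented probability tables, not Charniak's trained estimates:

# Sketch: scoring a tree in the style of Charniak (1997): for each constituent, multiply
# P(head | parent head, category, parent category) and P(rule | head, category, parent category).
# The probability tables below are invented for illustration.
p_head = {("profits", "rose", "NP", "S"): 0.01,
          ("rose", None, "S", None): 0.002}        # head of the root, which has no parent
p_rule = {("NP -> JJ NNS", "profits", "NP", "S"): 0.2,
          ("S -> NP VP", "rose", "S", None): 0.3}

constituents = [  # (rule, head, category, parent head, parent category)
    ("S -> NP VP", "rose", "S", None, None),
    ("NP -> JJ NNS", "profits", "NP", "rose", "S"),
]

prob = 1.0
for rule, h, c, ph, pc in constituents:
    prob *= p_head[(h, ph, c, pc)] * p_rule[(rule, h, c, pc)]
print(prob)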
Charniak (1997) linear interpolation/shrinkage
Sparse, specific estimates are smoothed by linearly interpolating them with estimates from less specific models.
For discrete distributions – common in NLP! – we can lay the data out as a table of counts (N = 150):

                    f1: +hyphen       f1: −hyphen       Total
f2: + -al           Y: 8   N: 2       Y: 18  N: 27      Y: 26  N: 29
f2: − -al           Y: 10  N: 20      Y: 3   N: 62      Y: 13  N: 82
Total               Y: 18  N: 22      Y: 21  N: 89      Y: 39  N: 111
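The interpolation itself is a weighted sum of estimates at different levels of specificity; the lambda weights and probabilities below are invented (in practice the weights are tuned on held-out data):

# Sketch of linear interpolation ("shrinkage"): smooth a sparse, specific estimate toward
# broader ones. The conditioning hierarchy, weights, and probabilities are invented.
def interpolate(estimates, lambdas):
    # estimates: most specific first; lambdas should sum to 1.
    return sum(lam * est for lam, est in zip(lambdas, estimates))

p_specific = 0.0      # e.g., P(h | ph, c, pc): unseen in training data
p_backoff = 0.04      # e.g., P(h | c, pc): less specific, better supported
p_uniform = 0.001     # very coarse fallback

p_smoothed = interpolate([p_specific, p_backoff, p_uniform], [0.6, 0.3, 0.1])
print(p_smoothed)     # nonzero even though the most specific estimate was zero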
Loglinear/exponential (“maxent”) models
Most common modeling choice is a loglinear model:
log P(X1 = x1, ..., Xp = xp) = Σ_C λ_C(x_C)

p(x, c) = (1/Z) ∏_{i=1..K} α_i^{f_i(x, c)}

K is the number of features, α_i is the weight for feature f_i, and Z is a normalizing constant. Log form:

log p(x, c) = − log Z + Σ_{i=1..K} f_i(x, c) × log α_i
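Written as a classifier over classes c (normalizing Z over the candidate classes for a fixed x), the same formula is easy to compute directly; the two features echo the hyphen / -al table above, but the weights are invented rather than fitted by maximum-entropy training:

# Sketch of a loglinear ("maxent") classifier: score(x, c) = prod_i alpha_i ** f_i(x, c),
# normalized over the candidate classes. Features and weights are illustrative only.
from math import prod

def features(word, c):
    # Binary features that fire only for class "Y", so the two classes get different scores.
    return [int(c == "Y" and "-" in word),
            int(c == "Y" and word.endswith("al"))]

alphas = [2.5, 1.8]                    # one weight alpha_i per feature f_i

def p(word, c, classes=("Y", "N")):
    score = lambda cl: prod(a ** f for a, f in zip(alphas, features(word, cl)))
    z = sum(score(cl) for cl in classes)        # normalizing constant Z
    return score(c) / z

print(p("seismological", "Y"), p("seismological", "N"))
print(p("co-operative", "Y"), p("co-operative", "N"))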
Beyond augmented PCFGs
For branching process models, relative frequency probabilities are the maximum likelihood estimates; this breaks down once trees carry non-local dependencies.
[Tree with a non-local dependency: a reflexive pronoun ("herself"[refl]) linked to its antecedent elsewhere in the tree]
Abney (1997) and Johnson et al. (1999) develop log-linear models over whole parse trees.
Documents as vector spaces
Latent semantic indexing via singular value decomposition
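A sketch of LSI on an invented term-document count matrix, using a truncated SVD from numpy:

# Sketch of latent semantic indexing: truncated SVD of a small term-document count matrix.
# The matrix is invented for illustration (rows = terms, columns = documents).
import numpy as np

A = np.array([[2, 0, 1, 0],    # "interest"
              [1, 0, 2, 0],    # "rate"
              [0, 1, 0, 2],    # "hobby"
              [0, 2, 0, 1]],   # "attention"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the top k latent dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each row: a document in the latent space

# The first and third documents (the columns sharing "interest"/"rate") end up
# with essentially identical coordinates in the reduced space.
print(np.round(doc_vectors, 2))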
[Diagram: NLP problems plotted against axes of Complexity and Scale: GOFAI story understanding, language understanding, anaphora resolution, text categorization, text IR]
Miller et al. (1996) [BBN]
System over ATIS air travel domain
Miller et al. (1996) [BBN]
Three-stage n-best pipeline:
Bibliography
Abney, S. P. 1997. Stochastic attribute-value grammars. Computational Linguistics 23(4):597–618.
Benello, J., A. W. Mackie, and J. A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech and Language 3:203–217.
Brill, E. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In ACL 31, 259–265.
Brill, E. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 21(4):543–565.
Carpenter, B. 1999. Type-Logical Semantics. Cambridge, MA: MIT Press.
Charniak, E. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI ’97), 598–603.