Probabilistic Models in Computational Linguistics

Christopher Manning
Depts. of Computer Science and Linguistics
Stanford University
[Link]
1
Aims and Goals of Computational Linguistics
 To be able to understand and act on human languages
 To be able to fluently produce human languages
 Applied goals: machine translation, question answering, information retrieval, speech-driven personal assistants, text mining, report generation, . . .

The big questions for linguistic science
 What kinds of things do people say?
 What do these things say/ask/request about the world?

I will argue that answering these involves questions of frequency, probability, and likelihood

2
Natural language understanding traditions
 The logical tradition
  Gave up the goal of dealing with imperfect natural languages in the development of formal logics
  But the tools were taken and re-applied to natural languages (Lambek 1958, Montague 1973, etc.)
  These tools give rich descriptions of natural language structure, and particularly the construction of sentence meanings (e.g., Carpenter 1999)
    e.g., NP:α together with NP\S:β yields S:β(α)
  They don’t tell us about word meaning or use

3
Natural language understanding traditions
 The formal language theory tradition (Chomsky 1957)
  Languages are generated by a grammar, which defines the strings that are members of the language (others are ungrammatical)
    e.g., NP → Det Adj* N    Adj → clever
  The generation process of the grammar puts structures over these language strings
  This process is reversed in parsing the language
  These ideas are still usually present in the symbolic backbone of most statistical NLP systems
  Often insufficient attention to meaning
4
Why Probabilistic Language Understanding?
 Language use is situated in a world context
 People write or say the little that is needed to be understood in a certain discourse situation
 Consequently
  Language is highly ambiguous
  Tasks like interpretation and translation involve (probabilistically) reasoning about meaning, using world knowledge not in the source text
 We thus need to explore quantitative techniques that move away from the unrealistic categorical assumptions of much of formal linguistic theory (and earlier computational linguistics)
5
Why probabilistic linguistics?
 Categorical grammars aren’t predictive: their notions of grammaticality and ambiguity do not accord with human perceptions
 They don’t tell us what “sounds natural”
  Grammatical but unnatural, e.g.: In addition to this, she insisted that women were regarded as a different existence from men unfairly.
 Need to account for variation of languages across speech communities and across time
 People are creative: they bend language ‘rules’ as needed to achieve their novel communication needs
 Consequently, “All grammars leak” (Sapir 1921:39)
6
Psycholinguistics in one slide
 Humans rapidly and incrementally accumulate and integrate information from world and discourse context and the current utterance so as to interpret what someone is saying in real time. Often commit early.
 They can often finish each other’s sentences!
 If a human starts hearing Pick up the yellow plate and there is only one yellow item around, they’ll already have locked on to it before the word yellow is finished
 Our NLP models don’t incorporate context into recognition like this, or disambiguate without having heard whole words (and often following context as well)
7
StatNLP: Relation to wider context
 Matches move from logic-based AI to probabilistic AI
  Knowledge → probability distributions
  Inference → conditional distributions
 Probabilities give opportunity to unify reasoning, planning, and learning, with communication
 There is now widespread use of machine learning (ML) methods in NLP (perhaps even overuse?)
 Now, an emphasis on empirical validation and the use of approximation for hard problems

8
Speech and NLP: A probabilistic view
 A: acoustic signal    W: words
 T: syntactic (tree) structures    M: meanings
 In spoken language use, we have a distribution: P(A, W, T, M)
 In written language, just: P(W, T, M)
 Speech people have usually looked at: P(W|A) – the rest of the hidden structure is ignored
 NLP people interested in the ‘more hidden’ structure – T and often M – but sometimes W is observable
 E.g., there is much work looking at the parsing problem P(T|W). Language generation is P(W|M).
9
Why is NLU difficult? The hidden structure of language is hugely ambiguous
 Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 17 May 2000)
 [Tree diagram: [S [NP [NNP Fed]] [VP [V raises] [NP [NN interest] [NN rates]] [NP [CD 0.5] [NN %]] [PP [P in] [NP [NN effort] [VP to [V control] [NP [NN inflation]]]]]]]]
10
Where are the ambiguities?
 Part-of-speech ambiguities: [diagram of alternative tags (NNP, VB, VBZ, VBP, NN, NNS, CD) over the words Fed raises interest rates 0.5 % in effort to control inflation]
 Syntactic attachment ambiguities
 Word sense ambiguities: Fed → “federal agent”; interest → a feeling of wanting to know or learn more
 Semantic interpretation ambiguities above the word level
11
The bad effects of V/N ambiguities (1)
 [Tree: [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]] – raises read as the verb]
12
The bad effects of V/N ambiguities (2)
 [Tree: [S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]] – interest read as the verb]
13
The bad effects of V/N ambiguities (3)
 [Tree: [S [NP [N Fed] [N raises] [N interest]] [VP [V rates] [NP [CD 0.5] [N %]]]] – rates read as the verb]
14
Phrasal attachment ambiguities
 [Two tree diagrams for Fed raises interest rates 0.5 % in effort to control inflation: in one, the PP in effort to control inflation attaches as a unit to the VP headed by raises; in the other, in effort attaches to the NP 0.5 % and to control inflation attaches to the VP]
15
The many meanings of interest [n.]
 Readiness to give attention to or to learn about something
 Quality of causing attention to be given
 Activity, subject, etc., which one gives time and attention to
 The advantage, advancement or favor of an individual or group
 A stake or share (in a company, business, etc.)
 Money paid regularly for the use of money

Converse: words or senses that mean (almost) the same: image, likeness, portrait, facsimile, picture

16
Hidden Markov Models – POS example
 [Diagram: hidden states X1 … X5 connected by transition probabilities aij and emitting outputs via emission probabilities bik; states ⟨s⟩ NNP VBZ NN NNS generate the observations ⟨s⟩ Fed raises interest rates]
 Top row is unobserved states, interpreted as POS tags
 Bottom row is observed output observations
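
To make the picture concrete, here is a minimal Viterbi decoding sketch in Python for a toy HMM over these tags and words; all transition and emission probabilities are invented for illustration, not estimated from any corpus.

# Toy HMM: states (tags), transition probabilities a_ij, emission probabilities b_ik
states = ["NNP", "VBZ", "NN", "NNS"]

trans = {  # P(tag | previous tag); "<s>" is the start state
    "<s>": {"NNP": 0.6, "VBZ": 0.1, "NN": 0.2, "NNS": 0.1},
    "NNP": {"NNP": 0.1, "VBZ": 0.5, "NN": 0.2, "NNS": 0.2},
    "VBZ": {"NNP": 0.1, "VBZ": 0.1, "NN": 0.5, "NNS": 0.3},
    "NN":  {"NNP": 0.1, "VBZ": 0.2, "NN": 0.2, "NNS": 0.5},
    "NNS": {"NNP": 0.2, "VBZ": 0.3, "NN": 0.3, "NNS": 0.2},
}
emit = {   # P(word | tag), with a tiny floor used for unlisted words
    "NNP": {"Fed": 0.9},
    "VBZ": {"raises": 0.7, "rates": 0.2},
    "NN":  {"interest": 0.7, "rates": 0.2},
    "NNS": {"raises": 0.3, "rates": 0.6},
}

def viterbi(words):
    """Most probable tag sequence under the toy HMM."""
    best = [{t: (trans["<s>"][t] * emit[t].get(words[0], 1e-6), None) for t in states}]
    for w in words[1:]:
        prev_col, col = best[-1], {}
        for t in states:
            p, back = max((prev_col[s][0] * trans[s][t] * emit[t].get(w, 1e-6), s)
                          for s in states)
            col[t] = (p, back)
        best.append(col)
    prob, tag = max((best[-1][t][0], t) for t in states)
    tags = [tag]
    for col in reversed(best[1:]):          # follow backpointers right to left
        tag = col[tag][1]
        tags.append(tag)
    return list(reversed(tags)), prob

print(viterbi(["Fed", "raises", "interest", "rates"]))  # -> NNP VBZ NN NNS on these numbers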

17
Attachment ambiguities
 [Two tree diagrams for I saw the man with a telescope: in one, the PP with a telescope attaches to the NP the man; in the other, it attaches to the VP headed by saw]
 Schematically: v n1 p n2 – does the preposition p attach to the verb v or to the noun n1?
18
Likelihood ratios for PP attachment
 Likely attachment chosen by a (log) likelihood ratio:

   λ(v, n, p) = log2 [ P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) ]
              = log2 [ P(VAp = 1 | v) P(NAp = 0 | v) / P(NAp = 1 | n) ]

 If (large) positive, decide verb attachment [e.g., below]; if (large) negative, decide noun attachment.
 Moscow sent more than 100,000 soldiers into Afghanistan

   λ(send, soldiers, into) ≈ log2 [ (0.049 × 0.9993) / 0.0007 ] ≈ 6.13

 Attachment to verb is about 70 times more likely.
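
The arithmetic on this slide can be checked directly; a small Python snippet using only the three probabilities quoted above:

import math

p_verb = 0.049 * 0.9993          # P(VAp = 1 | v) * P(NAp = 0 | v)
p_noun = 0.0007                  # P(NAp = 1 | n)

lam = math.log2(p_verb / p_noun)
print(round(lam, 2))             # ~6.13, so decide verb attachment
print(round(2 ** lam))           # ~70: verb attachment about 70 times more likely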
19
(Multinomial) Naive Bayes classifiers for WSD
 x⃗ is the context (something like a 100 word window)
 ck is a sense of the word to be disambiguated

   Choose c′ = argmax_ck P(ck | x⃗)
             = argmax_ck P(x⃗ | ck) P(ck) / P(x⃗)
             = argmax_ck [ log P(x⃗ | ck) + log P(ck) ]
             = argmax_ck [ Σ_{vj in x⃗} log P(vj | ck) + log P(ck) ]

 An effective method in practice, but also an example of a structure-blind ‘bag of words’ model
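
A minimal sketch of such a classifier in Python, assuming a tiny invented training set of sense-labelled contexts and add-one smoothing (neither is from the slide):

import math
from collections import Counter, defaultdict

train = [  # (sense, context words) -- invented toy data for the noun "interest"
    ("money",   "the bank paid interest on the deposit".split()),
    ("money",   "interest rates rose at the bank".split()),
    ("feeling", "she showed great interest in linguistics".split()),
    ("feeling", "his interest in the topic grew".split()),
]

prior = Counter(sense for sense, _ in train)          # counts for P(c_k)
word_counts = defaultdict(Counter)                    # counts for P(v_j | c_k)
for sense, words in train:
    word_counts[sense].update(words)
vocab = {w for c in word_counts.values() for w in c}

def classify(context, alpha=1.0):
    """argmax_c [ sum_j log P(v_j | c) + log P(c) ], with add-alpha smoothing."""
    best, best_score = None, float("-inf")
    for sense in prior:
        score = math.log(prior[sense] / sum(prior.values()))
        denom = sum(word_counts[sense].values()) + alpha * len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + alpha) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

print(classify("rates at the bank rose sharply".split()))   # -> "money" on this toy data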


20
Statistical Computational Linguistic Methods
 Many (related) techniques are used:
  n-grams, history-based models, decision trees / decision lists, memory-based learning, loglinear models, HMMs, neural networks, vector spaces, graphical models, decomposable models, PCFGs, probabilistic LSI, . . .
 Predictive and robust
 Good for learning (well, supervised learning works well; unsupervised learning is still hard)
 The list looks pretty similar to speech work . . . because we copied from them


21
NLP as a classification problem
 Central to recent advances in NLP has been reconceptualizing NLP as a statistical classification problem
 We – preferably someone else – hand-annotate data, and then learn using standard ML methods
 Annotated data items are feature vectors x⃗i with a classification ci
 Our job is to assign an unannotated data item x⃗ to one of the classes ck (or possibly to the doubt D or outlier O categories, though in practice rarely used)

22
Simple Bayesian Inference for NLP
 Central conception in early work: the “noisy channel” model. We want to determine English text given an acoustic signal, OCRed text, French text, . . .
 [Diagram: Generator –p(i)→ I → Noisy Channel –p(o|i)→ O → Decoder → Î]

   I           O          Î
   words       speech     words
   POS tags    words      POS tags
   L1 words    L2 words   L1 words

 î = argmax_i P(i|o) = argmax_i P(i) × P(o|i)

23
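
A sketch of the decoder as a bare argmax over candidate inputs; the OCR-style candidates and all probabilities below are invented for illustration:

def decode(o, candidates, prior, channel):
    """Pick the input i maximizing P(i) * P(o | i)."""
    return max(candidates, key=lambda i: prior[i] * channel[i].get(o, 0.0))

prior = {"interest": 0.7, "interests": 0.3}             # source model P(i)
channel = {                                             # channel model P(o | i)
    "interest":  {"interost": 0.02, "interest": 0.90},
    "interests": {"interost": 0.001, "interests": 0.90},
}
print(decode("interost", list(prior), prior, channel))  # -> "interest"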
Probabilistic inference in more generality
 Overall there is a joint distribution of all the variables
  e.g., P(s, t, m, d)
 We assume a generative or causal model that factorizes the joint distribution:
  e.g., P(t) P(s|t) P(m|t) P(d|m)
 This allows the distribution to be represented compactly
 Some items in this distribution are observed
 We do inference to find other parts:
  P(Hidden | Obs = o1)
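
A minimal sketch of inference by enumeration in such a factorized joint, with binary toy variables and made-up tables (not any particular NLP model):

from itertools import product

P_t = {0: 0.6, 1: 0.4}                                   # P(t)
P_s = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}         # P(s | t)
P_m = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}         # P(m | t)
P_d = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}         # P(d | m)

def joint(t, s, m, d):
    return P_t[t] * P_s[t][s] * P_m[t][m] * P_d[m][d]

def posterior_m(d_obs):
    """P(m | d = d_obs): sum out t and s, then normalize."""
    unnorm = {m: sum(joint(t, s, m, d_obs) for t, s in product((0, 1), repeat=2))
              for m in (0, 1)}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

print(posterior_m(1))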

24
Machine Learning for NLP

  Method \ Problem    POS tagging              WSD                  Parsing
  Naive Bayes                                  Gale et al. (1992)
  (H)MM               Charniak et al. (1993)
  Decision Trees      Schmid (1994)            Mooney (1996)        Magerman (1995)
                      Ringuette (1994)
  Decision List/TBL   Brill (1995)                                  Brill (1993)
  kNN/MBL             Daelemans et al. (1996)  Ng and Lee (1996)    Zavrel et al. (1997)
  Maximum entropy     Ratnaparkhi (1996)                            Ratnaparkhi et al. (1994)
  Neural networks     Benello et al. (1989)                         Henderson and Lane (1998)

25
Distinctiveness of NLP as an ML problem
 Language allows the complex compositional encoding of thoughts, ideas, feelings, . . . , intelligence.
 We are minimally dealing with hierarchical structures (branching processes), and often want to allow more complex forms of information sharing (dependencies).
 Enormous problems with data sparseness
 Both features and assigned classes regularly involve multinomial distributions over huge numbers of values (often in the tens of thousands)
 Generally dealing with discrete distributions though!
 The distributions are very uneven, and have fat tails


26
The obligatory Zipf’s law slide: Zipf’s law for the Brown corpus
 [Figure: log-log plot of word frequency against frequency rank for the Brown corpus; the points fall along a roughly straight line, from frequencies near 100,000 at rank 1 down to frequency 1 around rank 100,000]
 f ∝ 1/r, or: there is a k such that f · r = k
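
A quick way to eyeball Zipf’s law on a corpus is to tabulate f · r for the top-ranked words; the file name corpus.txt below is a stand-in for whatever tokenized text you have, not the Brown corpus itself:

from collections import Counter

def zipf_table(tokens, top=10):
    """(rank, word, frequency, frequency * rank) for the most frequent words."""
    freqs = Counter(tokens).most_common(top)
    return [(r, w, f, f * r) for r, (w, f) in enumerate(freqs, start=1)]

tokens = open("corpus.txt").read().lower().split()   # assumed plain-text corpus file
for rank, word, f, k in zipf_table(tokens):
    print(rank, word, f, k)    # on a real corpus the last column is roughly constant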
27
Simple linear models of language
 Markov models a.k.a. n-gram models:
  [Diagram: a word chain W1 → W2 → W3 → W4 with transition probabilities aij, e.g. ⟨s⟩ In both ??]
 Word sequence is predicted via a conditional distribution
 Conditional Probability Table (CPT): e.g., P(X | both)
   e.g., P(of | both) = 0.066    P(to | both) = 0.041
   (see the sketch after this list)
 Amazingly successful as a simple engineering model
 Hidden Markov Models (above, for POS tagging)
 Linear models panned by Chomsky (1957)
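
A sketch of estimating such a CPT by relative frequency from a token list; the toy sentence is invented, whereas the P(of | both) figures above came from a real corpus:

from collections import Counter, defaultdict

def bigram_cpt(tokens):
    """P(w_n | w_{n-1}) by relative frequency, with <s> as the start symbol."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(["<s>"] + tokens[:-1], tokens):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

cpt = bigram_cpt("in both cases the model works in both directions".split())
print(cpt["both"])    # {'cases': 0.5, 'directions': 0.5}
print(cpt["in"])      # {'both': 1.0}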


28
Why we need recursive structure
 The velocity of the seismic waves rises to . . .
  [Tree: [NPsg [DT The] [NN velocity] [PP [IN of] [NPpl the seismic waves]]] [VPsg rises to . . .] – the singular verb agrees with the distant head noun velocity, not the adjacent plural waves]
 Or you can use dependency grammar representations – isomorphisms exist. (Ditto link grammar.)


29
Probabilistic context-free grammars (PCFGs)

A PCFG G consists of:
 A set of terminals, {w^k}
 A set of nonterminals, {N^i}, with a start symbol, N^1
 A set of rules, {N^i → ζ^j} (where ζ^j is a sequence of terminals and nonterminals)
 A set of probabilities on rules such that:  ∀i  Σ_j P(N^i → ζ^j) = 1

 A generalization of HMMs to tree structures
 A similar algorithm to the Viterbi algorithm is used for finding the most probable parse
30
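
A minimal sketch of a PCFG as a data structure – rules grouped by left-hand side, probabilities checked to sum to one, and the probability of one hand-built tree; the grammar and numbers are illustrative, and this is not a parser:

# Each left-hand side maps to a list of (right-hand side, probability) pairs.
rules = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("DT", "NN"), 1.0)],
    "VP":  [(("VBD", "NP"), 0.7), (("VBD",), 0.3)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("dog",), 0.5), (("cat",), 0.5)],
    "VBD": [(("saw",), 1.0)],
}

# For every nonterminal N^i the rule probabilities must sum to 1
assert all(abs(sum(p for _, p in rhss) - 1.0) < 1e-9 for rhss in rules.values())

def rule_prob(lhs, rhs):
    return dict(rules[lhs])[rhs]

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used in it.
    A tree is (label, children); a leaf child is a plain string."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(label, rhs)
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
           ("VP", [("VBD", ["saw"]), ("NP", [("DT", ["the"]), ("NN", ["cat"])])])])
print(tree_prob(t))   # 1 * 1 * 1 * 0.5 * 0.7 * 1 * 1 * 1 * 0.5 = 0.175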
Expectation Maximization (EM) algorithm
 For both HMMs and PCFGs, we can use EM estimation to learn the ‘hidden’ structure from plain text data
 We start with initial probability estimates
 E-step: we work out the expectation of the hidden variables, given the current parameters for the model
 M-step: (assuming these are right), we calculate the maximum likelihood estimates for the parameters
 Repeat until convergence. . . (Dempster et al. 1977)
 It’s an iterative hill-climbing algorithm that can get stuck in local maxima
 Frequently not effective if we wish to imbue the hidden states with meanings the algorithm doesn’t know
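
A toy sketch of the E-step/M-step loop, here for a two-component mixture of unigram distributions over a few invented “documents”; the real HMM and PCFG cases compute the E-step expectations with the forward–backward and inside–outside algorithms instead:

import random

docs = [["a", "a", "b"], ["a", "b", "b"], ["c", "c", "d"], ["c", "d", "d"]]
vocab = sorted({w for d in docs for w in d})

random.seed(0)
pi = [0.5, 0.5]                                   # mixture weights
theta = []                                        # per-component word distributions
for _ in range(2):
    raw = {w: random.random() for w in vocab}     # random initial estimates
    z = sum(raw.values())
    theta.append({w: v / z for w, v in raw.items()})

for _ in range(50):
    # E-step: posterior responsibility of each component for each document
    resp = []
    for d in docs:
        scores = []
        for k in range(2):
            p = pi[k]
            for w in d:
                p *= theta[k][w]
            scores.append(p)
        z = sum(scores)
        resp.append([s / z for s in scores])
    # M-step: maximum likelihood re-estimates given those responsibilities
    pi = [sum(r[k] for r in resp) / len(docs) for k in range(2)]
    for k in range(2):
        counts = {w: 1e-9 for w in vocab}         # tiny constant avoids zeros
        for r, d in zip(resp, docs):
            for w in d:
                counts[w] += r[k]
        z = sum(counts.values())
        theta[k] = {w: c / z for w, c in counts.items()}

print([round(p, 3) for p in pi])
print([{w: round(p, 2) for w, p in th.items()} for th in theta])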


31
Modern Statistical Parsers
 A greatly increased ability to do accurate, robust, broad-coverage parsing (Charniak 1997, Collins 1997, Ratnaparkhi 1997, Charniak 2000)
 Achieved by converting parsing into a classification task and using statistical/machine learning methods
 Statistical methods (fairly) accurately resolve structural and real-world ambiguities
 Much faster: rather than being cubic in the sentence length or worse, for modern statistical parsers parsing time is made linear (by using beam search)
 Provide probabilistic language models that can be integrated with speech recognition systems.
32


Parsing as classification decisions
E.g., Charniak (1997)
 A very simple, conservative model of lexicalized PCFG
  [Tree: [S(rose) [NP(profits) [JJ(corporate) corporate] [NNS(profits) profits]] [VP(rose) [V(rose) rose]]] – each nonterminal is annotated with its lexical head]
 Probabilistic conditioning is “top-down” (but actual computation is bottom-up)
33
Charniak (1997) example
 [Diagram of the expansion S(rose) → NP(profits) VP(rose):]
  A. h = profits; c = NP
  B. ph = rose; pc = S
  C. P(h | ph, c, pc)
  D. P(r | h, c, pc)
 [Trees: S(rose) dominating NP(profits) and VP(rose); then NP(profits) expanding to JJ NNS(profits)]
34
Charniak (1997) linear interpolation/shrinkage

   P̂(h | ph, c, pc) = λ1(e) P_MLE(h | ph, c, pc)
                     + λ2(e) P_MLE(h | C(ph), c, pc)
                     + λ3(e) P_MLE(h | c, pc)
                     + λ4(e) P_MLE(h | c)

 λi(e) is here a function of how much one would expect to see a certain occurrence, given the amount of training data, word counts, etc.
 C(ph) is semantic class of parent headword
 Techniques like these for dealing with data sparseness are vital to successful model construction


35
Charniak (1997) shrinkage example

                          P(prft | rose, NP, S)   P(corp | prft, JJ, NP)
   P(h | ph, c, pc)       0                       0.245
   P(h | C(ph), c, pc)    0.00352                 0.0150
   P(h | c, pc)           0.000627                0.00533
   P(h | c)               0.000557                0.00418

 Allows utilization of rich highly conditioned estimates, but smoothes when sufficient data is unavailable
 One can’t just use MLEs: one commonly sees previously unseen events, which would have probability 0.
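
A sketch of the interpolation itself, using the four estimates from the P(corp | prft, JJ, NP) column above; the λ weights here are invented, whereas in Charniak (1997) they depend on how much relevant training data was seen:

def interpolate(estimates, lambdas):
    """Weighted combination of MLEs conditioned on progressively less context."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, estimates))

estimates = [0.245, 0.0150, 0.00533, 0.00418]   # most- to least-conditioned, from the table
lambdas   = [0.4, 0.3, 0.2, 0.1]                # assumed weights, not from the paper
print(interpolate(estimates, lambdas))          # a smoothed estimate of P(h | ph, c, pc)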
36
Unifying different approaches
 Most StatNLP work is using loglinear/exponential models
 For discrete distributions – common in NLP! – we can build a contingency table model of the joint distribution of the data.
 Example contingency table: predicting POS JJ (N = 150)

                  f1: +hyphen       f1: −hyphen       total
   f2: + -al      Y: 8   N: 2       Y: 18  N: 27      Y: 26  N: 29
   f2: − -al      Y: 10  N: 20      Y: 3   N: 62      Y: 13  N: 82
   total          Y: 18  N: 22      Y: 21  N: 89      Y: 39  N: 111
37
Loglinear/exponential (“maxent”) models
 Most common modeling choice is a loglinear model:

   log P(X1 = x1, . . . , Xp = xp) = Σ_C λ_C(x_C),   where C ⊂ {1, . . . , p}

 Maximum entropy loglinear models:

   p(x⃗, c) = (1/Z) Π_{i=1..K} α_i^{f_i(x⃗, c)}

  K is the number of features, α_i is the weight for feature f_i, and Z is a normalizing constant. Log form:

   log p(x⃗, c) = −log Z + Σ_{i=1..K} f_i(x⃗, c) × log α_i

 Generalized iterative scaling gives unique ML solution
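
A sketch of evaluating such a model with two binary features like those in the contingency table (hyphenation and the -al suffix); the weights α_i are invented and nothing here performs iterative scaling:

import math

alphas = {"bias_JJ": 0.5, "has_hyphen": 2.0, "ends_in_al": 3.0}   # toy feature weights

def features(word, cls):
    """Binary features f_i(x, c); in this toy model they fire only for class JJ."""
    if cls != "JJ":
        return []
    fs = ["bias_JJ"]
    if "-" in word:
        fs.append("has_hyphen")
    if word.endswith("al"):
        fs.append("ends_in_al")
    return fs

def prob(word, cls, classes=("JJ", "other")):
    def score(c):
        return math.prod(alphas[f] for f in features(word, c))
    z = sum(score(c) for c in classes)               # the normalizing constant Z
    return score(cls) / z

print(round(prob("marginal", "JJ"), 3))          # 0.6: the -al suffix pushes toward JJ
print(round(prob("run-of-the-mill", "JJ"), 3))   # 0.5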


38
The standard models are loglinear
 All the widely used generative probability models in StatNLP are loglinear, because they’re done as a product of probabilities decomposed by the chain rule (Naive Bayes, HMMs, PCFGs, decomposable models, Charniak (1997), Collins (1997), . . . )
 The simpler ones (Naive Bayes, HMMs, . . . ) can also easily be interpreted as Bayes Nets/“graphical models” (Pearl 1988), as in the pictures earlier

39
Beyond augmented PCFGs
 For branching process models, relative frequency probability estimates give ML estimates on observed data
 But because of the rich feature dependencies in language, linguists like to use richer constraint models:
  [Tree for She has hurt herself, with feature constraints on nodes: has [VP[SU: 3sg, VBN]], hurt [SU: NP, OB: NP], herself [refl]]
 Abney (1997) and Johnson et al. (1999) develop loglinear Markov Random Field/Gibbs models
40


But the whole NLP world isn’t loglinear
 Other methods, e.g., non-parametric instance-based learning methods, are also used
 The Memory-Based Learning approach (Daelemans et al. 1996) has achieved good results for many NLP problems
 Also quantitative but non-probabilistic methods, such as vector spaces
 Latent semantic indexing via singular value decomposition is often effective for dimensionality reduction and unsupervised clustering
 E.g., Schütze (1997) for learning parts of speech, word clusters, and word sense clusters
41


What we don’t know how to do yet: Where are we at on meaning?
 [Figure: NLP tasks plotted by complexity versus scale – GOFAI story understanding, language understanding, anaphora resolution, text categorization, and IR]
42
Miller et al. (1996) [BBN]
 System over ATIS air travel domain
 Discourse-embedded meaning processing:
  U: I want to fly from Boston to Denver
  S: OK ⟨flights are displayed⟩
  U: Which flights are available on Tuesday
   [interpret as flights from Boston to Denver]
  S: ⟨displays appropriate flights⟩
 End-to-end statistical model from words to discourse-embedded meaning: cross-sentence discourse model.

43
Miller et al. (1996) [BBN]
 Three-stage n-best pipeline:
 Pragmatic interpretation D from words W and discourse history H via sentence meaning M and parse tree T:

   D̂ = argmax_D P(D | W, H)
      = argmax_D Σ_{M,T} P(D | W, H, M, T) P(M, T | W, H)
      = argmax_D Σ_{M,T} P(D | H, M) P(M, T | W)

 Possible because of annotated language resources that allow supervised ML at all stages (and a rather simple slot-filler meaning representation)
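
A sketch of the final decision over an n-best list: sum P(D|H,M) · P(M,T|W) over hypotheses and take the argmax over D; every candidate structure, the stub discourse model, and all probabilities below are invented for illustration:

from collections import defaultdict

nbest = [  # (meaning M, parse T, P(M, T | W)) for "Which flights are available on Tuesday"
    ("flights(boston, denver, tuesday)", "T1", 0.5),
    ("flights(boston, denver, tuesday)", "T2", 0.2),
    ("flights(any, any, tuesday)",       "T3", 0.3),
]
candidates_D = [
    "show_flights(boston, denver, tuesday)",
    "show_flights(any, any, tuesday)",
]

def p_d_given_h_m(d, m):
    """Stub discourse model P(D | H, M): prefer a D whose arguments match M's."""
    return 0.8 if d.split("(")[1] == m.split("(")[1] else 0.2

scores = defaultdict(float)
for d in candidates_D:
    for m, t, p_mt in nbest:
        scores[d] += p_d_given_h_m(d, m) * p_mt       # sum over (M, T)

print(max(scores, key=scores.get))   # -> show_flights(boston, denver, tuesday)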
44
From structure to meaning
 Syntactic structures aren’t meanings, but having heads and dependents essentially gives one relations:
  orders(president, review(spectrum(wireless)))
 We don’t yet resolve (noun phrase) scope, but that’s probably too hard for robust broad-coverage NLP
 Main remaining problems: synonymy and polysemy:
  Words have multiple meanings
  Several words can mean the same thing
 But there are well-performing methods of statistically disambiguating and clustering words as well
 So the goal of transforming a text into meaning relations or “facts” is close
45


Integrating probabilistic reasoning about context with probabilistic language processing
 Paek and Horvitz (2000) treats conversation as inference and decision making under uncertainty
 Quartet: a framework for spoken dialog which models and exploits uncertainty in:
  conversational control   intentions
  maintenance (notices lack of understanding, etc.)
 Attempts to model the development of mutual understanding in a dialog
 But language model is very simple
 Much more to do in incorporating knowledge into NLP models
46
Learning and transferring knowledge
 We can do well iff we can train our models on supervised data from the same domain
 We have adequate data for very few domains/genres
 In general, there have been modest to poor results in learning rich NLP models from unannotated data
 It is underexplored how one can adapt or bootstrap with knowledge from one domain to another where data is more limited or only available unannotated
 Perhaps we need to more intelligently design models that use fewer parameters (but the right conditioning)?
 These are vital questions for making StatNLP interesting to cognitive science
47


Sometimes an approach where probabilities annotate a symbolic grammar isn’t sufficient
 There’s lots of evidence that our representations should also be squishy. E.g.:
  What part of speech do “marginal prepositions” have? concerning, supposing, considering, regarding, following
  Transitive verb case: Asia’s other cash-rich countries are following Japan’s lead.
  Marginal preposition (VP modifier, sense of after): U.S. chip makers are facing continued slack demand following a traditionally slow summer.
  Penn Treebank tries to mandate that they are verbs
48
 But some have already moved to become only prepositions: during (originally a verb, cf. endure) and notwithstanding (a compound from a verb)
 And others seem well on their way:
  According to this, industrial production declined
  They’re in between being verbs and prepositions
 Conversely, standard probabilistic models don’t explain why language is ‘almost categorical’: categorical grammars have been used for thousands of years because they just about work. . . .
 In many places there is a very steep drop-off between ‘grammatical’ and ‘ungrammatical’ strings that our probabilistic models often don’t model well
49
Envoi
 Statistical methods have brought a new level of performance in robust, accurate, broad-coverage NLP
 They provide a fair degree of disambiguation and interpretation, integrable with other systems
 To avoid plateauing, we need to keep developing richer and more satisfactory representational models
 The time seems ripe to combine sophisticated yet robust NLP models (which do more with meaning) with richer probabilistic contextual models

Thanks for listening!

50
Bibliography

Abney, S. P. 1997. Stochastic attribute-value grammars. Computational Linguistics 23(4):597–618.

Benello, J., A. W. Mackie, and J. A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech and Language 3:203–217.

Brill, E. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In ACL 31, 259–265.

Brill, E. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 21(4):543–565.

Carpenter, B. 1999. Type-Logical Semantics. Cambridge, MA: MIT Press.

Charniak, E. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI ’97), 598–603.

Charniak, E. 2000. A maximum-entropy-inspired parser. In NAACL 1, 132–139.

Charniak, E., C. Hendrickson, N. Jacobson, and M. Perkowitz. 1993. Equations for part-of-speech tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 784–789, Menlo Park, CA.

Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton.

Chomsky, N. 1969. Quine’s empirical assumptions. In D. Davidson and J. Hintikka (Eds.), Words and Objections: Essays on the Work of W.V. Quine, 53–68. Dordrecht: D. Reidel.

Collins, M. J. 1997. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, 16–23.

Daelemans, W., J. Zavrel, P. Berck, and S. Gillis. 1996. MBT: A memory-based part of speech tagger generator. In WVLC 4, 14–27.

Dempster, A., N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B 39:1–38.

Gale, W. A., K. W. Church, and D. Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities 26:415–439.

Henderson, J., and P. Lane. 1998. A connectionist architecture for learning to parse. In ACL 36/COLING 17, 531–537.

Johnson, M., S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999. Estimators for stochastic “unification-based” grammars. In ACL 37, 535–541.

Lambek, J. 1958. The mathematics of sentence structure. American Mathematical Monthly 65:154–170. Also in Buzkowski, W., W. Marciszewski and J. van Benthem, eds., Categorial Grammar. Amsterdam: John Benjamin.

Lyons, J. 1968. Introduction to Theoretical Linguistics. Cambridge: Cambridge University Press.

Magerman, D. M. 1995. Review of ‘Statistical language learning’. Computational Linguistics 11(1):103–111.

Manning, C. D., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Boston, MA: MIT Press.

Miller, S., D. Stallard, and R. Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL 34, 55–61.

Montague, R. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka, J. Moravcsik, and P. Suppes (Eds.), Approaches to Natural Language. Dordrecht: D. Reidel.

Mooney, R. J. 1996. Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In EMNLP 1, 82–91.

Ng, H. T., and H. B. Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In ACL 34, 40–47.

Paek, T., and E. Horvitz. 2000. Conversation as action under uncertainty. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-2000).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In EMNLP 1, 133–142.

Ratnaparkhi, A. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report IRCS Report 97-08, Institute for Research in Cognitive Science, Philadelphia, PA.

Ratnaparkhi, A., J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the ARPA Workshop on Human Language Technology, 250–255, Plainsboro, NJ.

Sapir, E. 1921. Language: An Introduction to the Study of Speech. New York: Harcourt Brace.

Schmid, H. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, 44–49, Manchester, England.

Schütze, H. 1997. Ambiguity Resolution in Language Learning. Stanford, CA: CSLI Publications.

Weaver, W. 1955. Translation. In W. N. Locke and A. D. Booth (Eds.), Machine Translation of Languages: Fourteen Essays, 15–23. New York: John Wiley & Sons.

Zavrel, J., W. Daelemans, and J. Veenstra. 1997. Resolving PP attachment ambiguities with memory-based learning. In Proceedings of the Workshop on Computational Natural Language Learning, 136–144, Somerset, NJ. Association for Computational Linguistics.
