Probabilistic Models in Computational Linguistics

Christopher Manning
Depts. of Computer Science and Linguistics
Stanford University
[Link]
1
Aims and Goals of Computational Linguistics
 To be able to understand and act on human languages
 To be able to fluently produce human languages
 Applied goals: machine translation, question answering, information retrieval, speech-driven personal assistants, text mining, report generation, . . .

The big questions for linguistic science
 What kinds of things do people say?
 What do these things say/ask/request about the world?

I will argue that answering these involves questions of frequency, probability, and likelihood

2
Natural language understanding traditions
 The logical tradition
  Gave up the goal of dealing with imperfect natural languages in the development of formal logics
  But the tools were taken and re-applied to natural languages (Lambek 1958, Montague 1973, etc.)
  These tools give rich descriptions of natural language structure, and particularly the construction of sentence meanings (e.g., Carpenter 1999)
    e.g., NP:α together with NP\S:β yields S:β(α)
  They don’t tell us about word meaning or use

3
Natural language understanding traditions
 The formal language theory tradition (Chomsky 1957)
  Languages are generated by a grammar, which defines the strings that are members of the language (others are ungrammatical)
    e.g., NP → Det Adj* N    Adj → clever
  The generation process of the grammar puts structures over these language strings
  This process is reversed in parsing the language
  These ideas are still usually present in the symbolic backbone of most statistical NLP systems
  Often insufficient attention to meaning
4
Why Probabilistic Language Understanding?
 Language use is situated in a world context
 People write or say the little that is needed to be understood in a certain discourse situation
 Consequently
  Language is highly ambiguous
  Tasks like interpretation and translation involve (probabilistically) reasoning about meaning, using world knowledge not in the source text
 We thus need to explore quantitative techniques that move away from the unrealistic categorical assumptions of much of formal linguistic theory (and earlier computational linguistics)
5
Why probabilistic linguistics?
 Categorical grammars aren’t predictive: their notions of grammaticality and ambiguity do not accord with human perceptions
 They don’t tell us what “sounds natural”
  Grammatical but unnatural, e.g.: In addition to this, she insisted that women were regarded as a different existence from men unfairly.
 Need to account for variation of languages across speech communities and across time
 People are creative: they bend language ‘rules’ as needed to achieve their novel communication needs
 Consequently, “All grammars leak” (Sapir 1921:39)
6
Psycholinguistics in one slide
 Humans rapidly and incrementally accumulate and integrate information from world and discourse context and the current utterance so as to interpret what someone is saying in real time. Often commit early.
 They can often finish each other’s sentences!
 If a human starts hearing Pick up the yellow plate and there is only one yellow item around, they’ll already have locked on to it before the word yellow is finished
 Our NLP models don’t incorporate context into recognition like this, or disambiguate without having heard whole words (and often following context as well)
7
StatNLP: Relation to wider context
 Matches move from logic-based AI to probabilistic AI
  Knowledge → probability distributions
  Inference → conditional distributions
 Probabilities give opportunity to unify reasoning, planning, and learning, with communication
 There is now widespread use of machine learning (ML) methods in NLP (perhaps even overuse?)
 Now, an emphasis on empirical validation and the use of approximation for hard problems

8
Speech and NLP: A probabilistic view
 A: acoustic signal    W: words
 T: syntactic (tree) structures    M: meanings
 In spoken language use, we have a distribution: P(A, W, T, M)
 In written language, just: P(W, T, M)
 Speech people have usually looked at: P(W|A) – the rest of the hidden structure is ignored
 NLP people interested in the ‘more hidden’ structure – T and often M – but sometimes W is observable
 E.g., there is much work looking at the parsing problem P(T|W). Language generation is P(W|M).
9
Why is NLU difficult? The hidden structure of language is hugely ambiguous
 Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 17 May 2000)
 [Tree diagram: [S [NP [NNP Fed]] [VP [V raises] [NP [NN interest] [NN rates]] [NP [CD 0.5] [NN %]] [PP [P in] [NP [NN effort] [VP to [V control] [NP [NN inflation]]]]]]]]
10
Where are the ambiguities?
 Part-of-speech ambiguities: [diagram of alternative tags (NNP, VB, VBZ, VBP, NN, NNS, CD) over the words Fed raises interest rates 0.5 % in effort to control inflation]
 Syntactic attachment ambiguities
 Word sense ambiguities: Fed → “federal agent”; interest → a feeling of wanting to know or learn more
 Semantic interpretation ambiguities above the word level
11
The bad effects of V/N ambiguities (1)
 [Tree: [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]] – raises read as the verb]
12
The bad effects of V/N ambiguities (2)
 [Tree: [S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]] – interest read as the verb]
13
The bad effects of V/N ambiguities (3)
 [Tree: [S [NP [N Fed] [N raises] [N interest]] [VP [V rates] [NP [CD 0.5] [N %]]]] – rates read as the verb]
14
Phrasal attachment ambiguities
 [Two tree diagrams for Fed raises interest rates 0.5 % in effort to control inflation: in one, the PP in effort to control inflation attaches as a unit to the VP headed by raises; in the other, in effort attaches to the NP 0.5 % and to control inflation attaches to the VP]
15
The many meanings of interest [n.]
 Readiness to give attention to or to learn about something
 Quality of causing attention to be given
 Activity, subject, etc., which one gives time and attention to
 The advantage, advancement or favor of an individual or group
 A stake or share (in a company, business, etc.)
 Money paid regularly for the use of money

Converse: words or senses that mean (almost) the same: image, likeness, portrait, facsimile, picture

16
Hidden Markov Models – POS example
 [Diagram: hidden states X1 … X5 connected by transition probabilities aij and emitting outputs via emission probabilities bik; states ⟨s⟩ NNP VBZ NN NNS generate the observations ⟨s⟩ Fed raises interest rates]
 Top row is unobserved states, interpreted as POS tags
 Bottom row is observed output observations
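
To make the picture concrete, here is a minimal Viterbi decoding sketch in Python for a toy HMM over these tags and words; all transition and emission probabilities are invented for illustration, not estimated from any corpus.

# Toy HMM: states (tags), transition probabilities a_ij, emission probabilities b_ik
states = ["NNP", "VBZ", "NN", "NNS"]

trans = {  # P(tag | previous tag); "<s>" is the start state
    "<s>": {"NNP": 0.6, "VBZ": 0.1, "NN": 0.2, "NNS": 0.1},
    "NNP": {"NNP": 0.1, "VBZ": 0.5, "NN": 0.2, "NNS": 0.2},
    "VBZ": {"NNP": 0.1, "VBZ": 0.1, "NN": 0.5, "NNS": 0.3},
    "NN":  {"NNP": 0.1, "VBZ": 0.2, "NN": 0.2, "NNS": 0.5},
    "NNS": {"NNP": 0.2, "VBZ": 0.3, "NN": 0.3, "NNS": 0.2},
}
emit = {   # P(word | tag), with a tiny floor used for unlisted words
    "NNP": {"Fed": 0.9},
    "VBZ": {"raises": 0.7, "rates": 0.2},
    "NN":  {"interest": 0.7, "rates": 0.2},
    "NNS": {"raises": 0.3, "rates": 0.6},
}

def viterbi(words):
    """Most probable tag sequence under the toy HMM."""
    best = [{t: (trans["<s>"][t] * emit[t].get(words[0], 1e-6), None) for t in states}]
    for w in words[1:]:
        prev_col, col = best[-1], {}
        for t in states:
            p, back = max((prev_col[s][0] * trans[s][t] * emit[t].get(w, 1e-6), s)
                          for s in states)
            col[t] = (p, back)
        best.append(col)
    prob, tag = max((best[-1][t][0], t) for t in states)
    tags = [tag]
    for col in reversed(best[1:]):          # follow backpointers right to left
        tag = col[tag][1]
        tags.append(tag)
    return list(reversed(tags)), prob

print(viterbi(["Fed", "raises", "interest", "rates"]))  # -> NNP VBZ NN NNS on these numbers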

17
Attachment ambiguities
 [Two tree diagrams for I saw the man with a telescope: in one, the PP with a telescope attaches to the NP the man; in the other, it attaches to the VP headed by saw]
 Schematically: v n1 p n2 – does the preposition p attach to the verb v or to the noun n1?
18
Likelihood ratios for PP attachment
 Likely attachment chosen by a (log) likelihood ratio:

   λ(v, n, p) = log2 [ P(Attach(p) = v | v, n) / P(Attach(p) = n | v, n) ]
              = log2 [ P(VAp = 1 | v) P(NAp = 0 | v) / P(NAp = 1 | n) ]

 If (large) positive, decide verb attachment [e.g., below]; if (large) negative, decide noun attachment.
 Moscow sent more than 100,000 soldiers into Afghanistan

   λ(send, soldiers, into) ≈ log2 [ (0.049 × 0.9993) / 0.0007 ] ≈ 6.13

 Attachment to verb is about 70 times more likely.
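
The arithmetic on this slide can be checked directly; a small Python snippet using only the three probabilities quoted above:

import math

p_verb = 0.049 * 0.9993          # P(VAp = 1 | v) * P(NAp = 0 | v)
p_noun = 0.0007                  # P(NAp = 1 | n)

lam = math.log2(p_verb / p_noun)
print(round(lam, 2))             # ~6.13, so decide verb attachment
print(round(2 ** lam))           # ~70: verb attachment about 70 times more likely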
19
(Multinomial) Naive Bayes classifiers for WSD
 x⃗ is the context (something like a 100 word window)
 ck is a sense of the word to be disambiguated

   Choose c′ = argmax_ck P(ck | x⃗)
             = argmax_ck P(x⃗ | ck) P(ck) / P(x⃗)
             = argmax_ck [ log P(x⃗ | ck) + log P(ck) ]
             = argmax_ck [ Σ_{vj in x⃗} log P(vj | ck) + log P(ck) ]

 An effective method in practice, but also an example of a structure-blind ‘bag of words’ model
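
A minimal sketch of such a classifier in Python, assuming a tiny invented training set of sense-labelled contexts and add-one smoothing (neither is from the slide):

import math
from collections import Counter, defaultdict

train = [  # (sense, context words) -- invented toy data for the noun "interest"
    ("money",   "the bank paid interest on the deposit".split()),
    ("money",   "interest rates rose at the bank".split()),
    ("feeling", "she showed great interest in linguistics".split()),
    ("feeling", "his interest in the topic grew".split()),
]

prior = Counter(sense for sense, _ in train)          # counts for P(c_k)
word_counts = defaultdict(Counter)                    # counts for P(v_j | c_k)
for sense, words in train:
    word_counts[sense].update(words)
vocab = {w for c in word_counts.values() for w in c}

def classify(context, alpha=1.0):
    """argmax_c [ sum_j log P(v_j | c) + log P(c) ], with add-alpha smoothing."""
    best, best_score = None, float("-inf")
    for sense in prior:
        score = math.log(prior[sense] / sum(prior.values()))
        denom = sum(word_counts[sense].values()) + alpha * len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + alpha) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

print(classify("rates at the bank rose sharply".split()))   # -> "money" on this toy data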


20
Statistical Computational Linguistic Methods
 Many (related) techniques are used:
  n-grams, history-based models, decision trees / decision lists, memory-based learning, loglinear models, HMMs, neural networks, vector spaces, graphical models, decomposable models, PCFGs, probabilistic LSI, . . .
 Predictive and robust
 Good for learning (well, supervised learning works well; unsupervised learning is still hard)
 The list looks pretty similar to speech work . . . because we copied from them


21
NLP as a classification problem
 Central to recent advances in NLP has been reconceptualizing NLP as a statistical classification problem
 We – preferably someone else – hand-annotate data, and then learn using standard ML methods
 Annotated data items are feature vectors x⃗i with a classification ci
 Our job is to assign an unannotated data item x⃗ to one of the classes ck (or possibly to the doubt D or outlier O categories, though in practice rarely used)

22
Simple Bayesian Inference for NLP
 Central conception in early work: the “noisy channel” model. We want to determine English text given an acoustic signal, OCRed text, French text, . . .
 [Diagram: Generator –p(i)→ I → Noisy Channel –p(o|i)→ O → Decoder → Î]

   I           O          Î
   words       speech     words
   POS tags    words      POS tags
   L1 words    L2 words   L1 words

 î = argmax_i P(i|o) = argmax_i P(i) × P(o|i)

23
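
A sketch of the decoder as a bare argmax over candidate inputs; the OCR-style candidates and all probabilities below are invented for illustration:

def decode(o, candidates, prior, channel):
    """Pick the input i maximizing P(i) * P(o | i)."""
    return max(candidates, key=lambda i: prior[i] * channel[i].get(o, 0.0))

prior = {"interest": 0.7, "interests": 0.3}             # source model P(i)
channel = {                                             # channel model P(o | i)
    "interest":  {"interost": 0.02, "interest": 0.90},
    "interests": {"interost": 0.001, "interests": 0.90},
}
print(decode("interost", list(prior), prior, channel))  # -> "interest"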
Probabilistic inference in more generality
 Overall there is a joint distribution of all the variables
  e.g., P(s, t, m, d)
 We assume a generative or causal model that factorizes the joint distribution:
  e.g., P(t) P(s|t) P(m|t) P(d|m)
 This allows the distribution to be represented compactly
 Some items in this distribution are observed
 We do inference to find other parts:
  P(Hidden | Obs = o1)
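
A minimal sketch of inference by enumeration in such a factorized joint, with binary toy variables and made-up tables (not any particular NLP model):

from itertools import product

P_t = {0: 0.6, 1: 0.4}                                   # P(t)
P_s = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}         # P(s | t)
P_m = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}         # P(m | t)
P_d = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}         # P(d | m)

def joint(t, s, m, d):
    return P_t[t] * P_s[t][s] * P_m[t][m] * P_d[m][d]

def posterior_m(d_obs):
    """P(m | d = d_obs): sum out t and s, then normalize."""
    unnorm = {m: sum(joint(t, s, m, d_obs) for t, s in product((0, 1), repeat=2))
              for m in (0, 1)}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

print(posterior_m(1))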

24
Machine Learning for NLP

  Method \ Problem    POS tagging              WSD                  Parsing
  Naive Bayes                                  Gale et al. (1992)
  (H)MM               Charniak et al. (1993)
  Decision Trees      Schmid (1994)            Mooney (1996)        Magerman (1995)
                      Ringuette (1994)
  Decision List/TBL   Brill (1995)                                  Brill (1993)
  kNN/MBL             Daelemans et al. (1996)  Ng and Lee (1996)    Zavrel et al. (1997)
  Maximum entropy     Ratnaparkhi (1996)                            Ratnaparkhi et al. (1994)
  Neural networks     Benello et al. (1989)                         Henderson and Lane (1998)

25
Distinctiveness of NLP as an ML problem
 Language allows the complex compositional encoding of thoughts, ideas, feelings, . . . , intelligence.
 We are minimally dealing with hierarchical structures (branching processes), and often want to allow more complex forms of information sharing (dependencies).
 Enormous problems with data sparseness
 Both features and assigned classes regularly involve multinomial distributions over huge numbers of values (often in the tens of thousands)
 Generally dealing with discrete distributions though!
 The distributions are very uneven, and have fat tails


26
The obligatory Zipf’s law slide: Zipf’s law for the Brown corpus
 [Figure: log-log plot of word frequency against frequency rank for the Brown corpus; the points fall along a roughly straight line, from frequencies near 100,000 at rank 1 down to frequency 1 around rank 100,000]
 f ∝ 1/r, or: there is a k such that f · r = k
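
A quick way to eyeball Zipf’s law on a corpus is to tabulate f · r for the top-ranked words; the file name corpus.txt below is a stand-in for whatever tokenized text you have, not the Brown corpus itself:

from collections import Counter

def zipf_table(tokens, top=10):
    """(rank, word, frequency, frequency * rank) for the most frequent words."""
    freqs = Counter(tokens).most_common(top)
    return [(r, w, f, f * r) for r, (w, f) in enumerate(freqs, start=1)]

tokens = open("corpus.txt").read().lower().split()   # assumed plain-text corpus file
for rank, word, f, k in zipf_table(tokens):
    print(rank, word, f, k)    # on a real corpus the last column is roughly constant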
27
Simple linear models of language
 Markov models a.k.a. n-gram models:
  [Diagram: a word chain W1 → W2 → W3 → W4 with transition probabilities aij, e.g. ⟨s⟩ In both ??]
 Word sequence is predicted via a conditional distribution
 Conditional Probability Table (CPT): e.g., P(X | both)
   e.g., P(of | both) = 0.066    P(to | both) = 0.041
   (see the sketch after this list)
 Amazingly successful as a simple engineering model
 Hidden Markov Models (above, for POS tagging)
 Linear models panned by Chomsky (1957)
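
A sketch of estimating such a CPT by relative frequency from a token list; the toy sentence is invented, whereas the P(of | both) figures above came from a real corpus:

from collections import Counter, defaultdict

def bigram_cpt(tokens):
    """P(w_n | w_{n-1}) by relative frequency, with <s> as the start symbol."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(["<s>"] + tokens[:-1], tokens):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

cpt = bigram_cpt("in both cases the model works in both directions".split())
print(cpt["both"])    # {'cases': 0.5, 'directions': 0.5}
print(cpt["in"])      # {'both': 1.0}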


28
Why we need recursive structure
 The velocity of the seismic waves rises to . . .
  [Tree: [NPsg [DT The] [NN velocity] [PP [IN of] [NPpl the seismic waves]]] [VPsg rises to . . .] – the singular verb agrees with the distant head noun velocity, not the adjacent plural waves]
 Or you can use dependency grammar representations – isomorphisms exist. (Ditto link grammar.)


29
Probabilistic context-free grammars (PCFGs)

A PCFG G consists of:
 A set of terminals, {w^k}
 A set of nonterminals, {N^i}, with a start symbol, N^1
 A set of rules, {N^i → ζ^j} (where ζ^j is a sequence of terminals and nonterminals)
 A set of probabilities on rules such that:  ∀i  Σ_j P(N^i → ζ^j) = 1

 A generalization of HMMs to tree structures
 A similar algorithm to the Viterbi algorithm is used for finding the most probable parse
30
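
A minimal sketch of a PCFG as a data structure – rules grouped by left-hand side, probabilities checked to sum to one, and the probability of one hand-built tree; the grammar and numbers are illustrative, and this is not a parser:

# Each left-hand side maps to a list of (right-hand side, probability) pairs.
rules = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("DT", "NN"), 1.0)],
    "VP":  [(("VBD", "NP"), 0.7), (("VBD",), 0.3)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("dog",), 0.5), (("cat",), 0.5)],
    "VBD": [(("saw",), 1.0)],
}

# For every nonterminal N^i the rule probabilities must sum to 1
assert all(abs(sum(p for _, p in rhss) - 1.0) < 1e-9 for rhss in rules.values())

def rule_prob(lhs, rhs):
    return dict(rules[lhs])[rhs]

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used in it.
    A tree is (label, children); a leaf child is a plain string."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(label, rhs)
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
           ("VP", [("VBD", ["saw"]), ("NP", [("DT", ["the"]), ("NN", ["cat"])])])])
print(tree_prob(t))   # 1 * 1 * 1 * 0.5 * 0.7 * 1 * 1 * 1 * 0.5 = 0.175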
Expectation Maximization (EM) algorithm
 For both HMMs and PCFGs, we can use EM estimation to learn the ‘hidden’ structure from plain text data
 We start with initial probability estimates
 E-step: we work out the expectation of the hidden variables, given the current parameters for the model
 M-step: (assuming these are right), we calculate the maximum likelihood estimates for the parameters
 Repeat until convergence. . . (Dempster et al. 1977)
 It’s an iterative hill-climbing algorithm that can get stuck in local maxima
 Frequently not effective if we wish to imbue the hidden states with meanings the algorithm doesn’t know
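
A toy sketch of the E-step/M-step loop, here for a two-component mixture of unigram distributions over a few invented “documents”; the real HMM and PCFG cases compute the E-step expectations with the forward–backward and inside–outside algorithms instead:

import random

docs = [["a", "a", "b"], ["a", "b", "b"], ["c", "c", "d"], ["c", "d", "d"]]
vocab = sorted({w for d in docs for w in d})

random.seed(0)
pi = [0.5, 0.5]                                   # mixture weights
theta = []                                        # per-component word distributions
for _ in range(2):
    raw = {w: random.random() for w in vocab}     # random initial estimates
    z = sum(raw.values())
    theta.append({w: v / z for w, v in raw.items()})

for _ in range(50):
    # E-step: posterior responsibility of each component for each document
    resp = []
    for d in docs:
        scores = []
        for k in range(2):
            p = pi[k]
            for w in d:
                p *= theta[k][w]
            scores.append(p)
        z = sum(scores)
        resp.append([s / z for s in scores])
    # M-step: maximum likelihood re-estimates given those responsibilities
    pi = [sum(r[k] for r in resp) / len(docs) for k in range(2)]
    for k in range(2):
        counts = {w: 1e-9 for w in vocab}         # tiny constant avoids zeros
        for r, d in zip(resp, docs):
            for w in d:
                counts[w] += r[k]
        z = sum(counts.values())
        theta[k] = {w: c / z for w, c in counts.items()}

print([round(p, 3) for p in pi])
print([{w: round(p, 2) for w, p in th.items()} for th in theta])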


31
Modern Statistical Parsers
 A greatly increased ability to do accurate, robust, broad-coverage parsing (Charniak 1997, Collins 1997, Ratnaparkhi 1997, Charniak 2000)
 Achieved by converting parsing into a classification task and using statistical/machine learning methods
 Statistical methods (fairly) accurately resolve structural and real-world ambiguities
 Much faster: rather than being cubic in the sentence length or worse, for modern statistical parsers parsing time is made linear (by using beam search)
 Provide probabilistic language models that can be integrated with speech recognition systems.
32


Parsing as classification decisions
E.g., Charniak (1997)
 A very simple, conservative model of lexicalized PCFG
  [Tree: [S(rose) [NP(profits) [JJ(corporate) corporate] [NNS(profits) profits]] [VP(rose) [V(rose) rose]]] – each nonterminal is annotated with its lexical head]
 Probabilistic conditioning is “top-down” (but actual computation is bottom-up)
33
Charniak (1997) example
 [Diagram of the expansion S(rose) → NP(profits) VP(rose):]
  A. h = profits; c = NP
  B. ph = rose; pc = S
  C. P(h | ph, c, pc)
  D. P(r | h, c, pc)
 [Trees: S(rose) dominating NP(profits) and VP(rose); then NP(profits) expanding to JJ NNS(profits)]
34
Charniak (1997) linear interpolation/shrinkage

   P̂(h | ph, c, pc) = λ1(e) P_MLE(h | ph, c, pc)
                     + λ2(e) P_MLE(h | C(ph), c, pc)
                     + λ3(e) P_MLE(h | c, pc)
                     + λ4(e) P_MLE(h | c)

 λi(e) is here a function of how much one would expect to see a certain occurrence, given the amount of training data, word counts, etc.
 C(ph) is semantic class of parent headword
 Techniques like these for dealing with data sparseness are vital to successful model construction


35
Charniak (1997) shrinkage example

                          P(prft | rose, NP, S)   P(corp | prft, JJ, NP)
   P(h | ph, c, pc)       0                       0.245
   P(h | C(ph), c, pc)    0.00352                 0.0150
   P(h | c, pc)           0.000627                0.00533
   P(h | c)               0.000557                0.00418

 Allows utilization of rich highly conditioned estimates, but smoothes when sufficient data is unavailable
 One can’t just use MLEs: one commonly sees previously unseen events, which would have probability 0.
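
A sketch of the interpolation itself, using the four estimates from the P(corp | prft, JJ, NP) column above; the λ weights here are invented, whereas in Charniak (1997) they depend on how much relevant training data was seen:

def interpolate(estimates, lambdas):
    """Weighted combination of MLEs conditioned on progressively less context."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, estimates))

estimates = [0.245, 0.0150, 0.00533, 0.00418]   # most- to least-conditioned, from the table
lambdas   = [0.4, 0.3, 0.2, 0.1]                # assumed weights, not from the paper
print(interpolate(estimates, lambdas))          # a smoothed estimate of P(h | ph, c, pc)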
36
Unifying different approaches
 Most StatNLP work is using loglinear/exponential models
 For discrete distributions – common in NLP! – we can build a contingency table model of the joint distribution of the data.
 Example contingency table: predicting POS JJ (N = 150)

                  f1: +hyphen       f1: −hyphen       total
   f2: + -al      Y: 8   N: 2       Y: 18  N: 27      Y: 26  N: 29
   f2: − -al      Y: 10  N: 20      Y: 3   N: 62      Y: 13  N: 82
   total          Y: 18  N: 22      Y: 21  N: 89      Y: 39  N: 111
37
Loglinear/exponential (“maxent”) models
 Most common modeling choice is a loglinear model:

   log P(X1 = x1, . . . , Xp = xp) = Σ_C λ_C(x_C),   where C ⊂ {1, . . . , p}

 Maximum entropy loglinear models:

   p(x⃗, c) = (1/Z) Π_{i=1..K} α_i^{f_i(x⃗, c)}

  K is the number of features, α_i is the weight for feature f_i, and Z is a normalizing constant. Log form:

   log p(x⃗, c) = −log Z + Σ_{i=1..K} f_i(x⃗, c) × log α_i

 Generalized iterative scaling gives unique ML solution
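
A sketch of evaluating such a model with two binary features like those in the contingency table (hyphenation and the -al suffix); the weights α_i are invented and nothing here performs iterative scaling:

import math

alphas = {"bias_JJ": 0.5, "has_hyphen": 2.0, "ends_in_al": 3.0}   # toy feature weights

def features(word, cls):
    """Binary features f_i(x, c); in this toy model they fire only for class JJ."""
    if cls != "JJ":
        return []
    fs = ["bias_JJ"]
    if "-" in word:
        fs.append("has_hyphen")
    if word.endswith("al"):
        fs.append("ends_in_al")
    return fs

def prob(word, cls, classes=("JJ", "other")):
    def score(c):
        return math.prod(alphas[f] for f in features(word, c))
    z = sum(score(c) for c in classes)               # the normalizing constant Z
    return score(cls) / z

print(round(prob("marginal", "JJ"), 3))          # 0.6: the -al suffix pushes toward JJ
print(round(prob("run-of-the-mill", "JJ"), 3))   # 0.5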


38
The standard models are loglinear
 All the widely used generative probability models in StatNLP are loglinear, because they’re done as a product of probabilities decomposed by the chain rule (Naive Bayes, HMMs, PCFGs, decomposable models, Charniak (1997), Collins (1997), . . . )
 The simpler ones (Naive Bayes, HMMs, . . . ) can also easily be interpreted as Bayes Nets/“graphical models” (Pearl 1988), as in the pictures earlier

39
Beyond augmented PCFGs
 For branching process models, relative frequency probability estimates give ML estimates on observed data
 But because of the rich feature dependencies in language, linguists like to use richer constraint models:
  [Tree for She has hurt herself, with feature constraints on nodes: has [VP[SU: 3sg, VBN]], hurt [SU: NP, OB: NP], herself [refl]]
 Abney (1997) and Johnson et al. (1999) develop loglinear Markov Random Field/Gibbs models
40


But the whole NLP world isn’t loglinear
 Other methods, e.g., non-parametric instance-based learning methods, are also used
 The Memory-Based Learning approach (Daelemans et al. 1996) has achieved good results for many NLP problems
 Also quantitative but non-probabilistic methods, such as vector spaces
 Latent semantic indexing via singular value decomposition is often effective for dimensionality reduction and unsupervised clustering
 E.g., Schütze (1997) for learning parts of speech, word clusters, and word sense clusters
41


What we don’t know how to do yet: Where are we at on meaning?
 [Figure: NLP tasks plotted by complexity versus scale – GOFAI story understanding, language understanding, anaphora resolution, text categorization, and IR]
42
Miller et al. (1996) [BBN]
 System over ATIS air travel domain
 Discourse-embedded meaning processing:
  U: I want to fly from Boston to Denver
  S: OK ⟨flights are displayed⟩
  U: Which flights are available on Tuesday
   [interpret as flights from Boston to Denver]
  S: ⟨displays appropriate flights⟩
 End-to-end statistical model from words to discourse-embedded meaning: cross-sentence discourse model.

43
Miller et al. (1996) [BBN]
 Three-stage n-best pipeline:
 Pragmatic interpretation D from words W and discourse history H via sentence meaning M and parse tree T:

   D̂ = argmax_D P(D | W, H)
      = argmax_D Σ_{M,T} P(D | W, H, M, T) P(M, T | W, H)
      = argmax_D Σ_{M,T} P(D | H, M) P(M, T | W)

 Possible because of annotated language resources that allow supervised ML at all stages (and a rather simple slot-filler meaning representation)
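
A sketch of the final decision over an n-best list: sum P(D|H,M) · P(M,T|W) over hypotheses and take the argmax over D; every candidate structure, the stub discourse model, and all probabilities below are invented for illustration:

from collections import defaultdict

nbest = [  # (meaning M, parse T, P(M, T | W)) for "Which flights are available on Tuesday"
    ("flights(boston, denver, tuesday)", "T1", 0.5),
    ("flights(boston, denver, tuesday)", "T2", 0.2),
    ("flights(any, any, tuesday)",       "T3", 0.3),
]
candidates_D = [
    "show_flights(boston, denver, tuesday)",
    "show_flights(any, any, tuesday)",
]

def p_d_given_h_m(d, m):
    """Stub discourse model P(D | H, M): prefer a D whose arguments match M's."""
    return 0.8 if d.split("(")[1] == m.split("(")[1] else 0.2

scores = defaultdict(float)
for d in candidates_D:
    for m, t, p_mt in nbest:
        scores[d] += p_d_given_h_m(d, m) * p_mt       # sum over (M, T)

print(max(scores, key=scores.get))   # -> show_flights(boston, denver, tuesday)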
44
From structure to meaning
 Syntactic structures aren’t meanings, but having heads and dependents essentially gives one relations:
  orders(president, review(spectrum(wireless)))
 We don’t yet resolve (noun phrase) scope, but that’s probably too hard for robust broad-coverage NLP
 Main remaining problems: synonymy and polysemy:
  Words have multiple meanings
  Several words can mean the same thing
 But there are well-performing methods of statistically disambiguating and clustering words as well
 So the goal of transforming a text into meaning relations or “facts” is close
45


Integrating probabilistic reasoning about context with probabilistic language processing
 Paek and Horvitz (2000) treats conversation as inference and decision making under uncertainty
 Quartet: a framework for spoken dialog which models and exploits uncertainty in:
  conversational control   intentions
  maintenance (notices lack of understanding, etc.)
 Attempts to model the development of mutual understanding in a dialog
 But language model is very simple
 Much more to do in incorporating knowledge into NLP models
46
Learning and transferring knowledge
 We can do well iff we can train our models on supervised data from the same domain
 We have adequate data for very few domains/genres
 In general, there have been modest to poor results in learning rich NLP models from unannotated data
 It is underexplored how one can adapt or bootstrap with knowledge from one domain to another where data is more limited or only available unannotated
 Perhaps we need to more intelligently design models that use fewer parameters (but the right conditioning)?
 These are vital questions for making StatNLP interesting to cognitive science
47


Sometimes an approach where probabilities annotate a symbolic grammar isn’t sufficient
 There’s lots of evidence that our representations should also be squishy. E.g.:
  What part of speech do “marginal prepositions” have? concerning, supposing, considering, regarding, following
  Transitive verb case: Asia’s other cash-rich countries are following Japan’s lead.
  Marginal preposition (VP modifier, sense of after): U.S. chip makers are facing continued slack demand following a traditionally slow summer.
  Penn Treebank tries to mandate that they are verbs
48
 But some have already moved to become only prepositions: during (originally a verb, cf. endure) and notwithstanding (a compound from a verb)
 And others seem well on their way:
  According to this, industrial production declined
  They’re in between being verbs and prepositions
 Conversely, standard probabilistic models don’t explain why language is ‘almost categorical’: categorical grammars have been used for thousands of years because they just about work. . . .
 In many places there is a very steep drop-off between ‘grammatical’ and ‘ungrammatical’ strings that our probabilistic models often don’t model well
49
Envoi
 Statistical methods have brought a new level of performance in robust, accurate, broad-coverage NLP
 They provide a fair degree of disambiguation and interpretation, integrable with other systems
 To avoid plateauing, we need to keep developing richer and more satisfactory representational models
 The time seems ripe to combine sophisticated yet robust NLP models (which do more with meaning) with richer probabilistic contextual models

Thanks for listening!

50
Bibliography

Abney, S. P. 1997. Stochastic attribute-value grammars. Computational Linguistics 23(4):597–618.

Benello, J., A. W. Mackie, and J. A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech and Language 3:203–217.

Brill, E. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In ACL 31, 259–265.

Brill, E. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 21(4):543–565.

Carpenter, B. 1999. Type-Logical Semantics. Cambridge, MA: MIT Press.

Charniak, E. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI ’97), 598–603.

Charniak, E. 2000. A maximum-entropy-inspired parser. In NAACL 1, 132–139.

Charniak, E., C. Hendrickson, N. Jacobson, and M. Perkowitz. 1993. Equations for part-of-speech tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 784–789, Menlo Park, CA.

Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton.

Chomsky, N. 1969. Quine’s empirical assumptions. In D. Davidson and J. Hintikka (Eds.), Words and Objections: Essays on the Work of W.V. Quine, 53–68. Dordrecht: D. Reidel.

Collins, M. J. 1997. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, 16–23.

Daelemans, W., J. Zavrel, P. Berck, and S. Gillis. 1996. MBT: A memory-based part of speech tagger generator. In WVLC 4, 14–27.

Dempster, A., N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B 39:1–38.

Gale, W. A., K. W. Church, and D. Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities 26:415–439.

Henderson, J., and P. Lane. 1998. A connectionist architecture for learning to parse. In ACL 36/COLING 17, 531–537.

Johnson, M., S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999. Estimators for stochastic “unification-based” grammars. In ACL 37, 535–541.

Lambek, J. 1958. The mathematics of sentence structure. American Mathematical Monthly 65:154–170. Also in Buzkowski, W., W. Marciszewski and J. van Benthem, eds., Categorial Grammar. Amsterdam: John Benjamin.

Lyons, J. 1968. Introduction to Theoretical Linguistics. Cambridge: Cambridge University Press.

Magerman, D. M. 1995. Review of ‘Statistical language learning’. Computational Linguistics 11(1):103–111.

Manning, C. D., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Boston, MA: MIT Press.

Miller, S., D. Stallard, and R. Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL 34, 55–61.

Montague, R. 1973. The proper treatment of quantification in ordinary English. In J. Hintikka, J. Moravcsik, and P. Suppes (Eds.), Approaches to Natural Language. Dordrecht: D. Reidel.

Mooney, R. J. 1996. Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In EMNLP 1, 82–91.

Ng, H. T., and H. B. Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In ACL 34, 40–47.

Paek, T., and E. Horvitz. 2000. Conversation as action under uncertainty. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-2000).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In EMNLP 1, 133–142.

Ratnaparkhi, A. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report IRCS Report 97-08, Institute for Research in Cognitive Science, Philadelphia, PA.

Ratnaparkhi, A., J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the ARPA Workshop on Human Language Technology, 250–255, Plainsboro, NJ.

Sapir, E. 1921. Language: An Introduction to the Study of Speech. New York: Harcourt Brace.

Schmid, H. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, 44–49, Manchester, England.

Schütze, H. 1997. Ambiguity Resolution in Language Learning. Stanford, CA: CSLI Publications.

Weaver, W. 1955. Translation. In W. N. Locke and A. D. Booth (Eds.), Machine Translation of Languages: Fourteen Essays, 15–23. New York: John Wiley & Sons.

Zavrel, J., W. Daelemans, and J. Veenstra. 1997. Resolving PP attachment ambiguities with memory-based learning. In Proceedings of the Workshop on Computational Natural Language Learning, 136–144, Somerset, NJ. Association for Computational Linguistics.
