Natural Language Processing and
Computational Linguistics:
from Theory to Application
Professor Mark Johnson
Director of the Macquarie Centre for Language Sciences (CLaS)
Department of Computing
October 2012
Computational linguistics and
Natural language processing
Computational linguistics is a scientific discipline that studies
linguistic processes from a computational perspective
language comprehension (computational psycholinguistics)
language production
language acquisition
Natural language processing is an engineering discipline that uses
computers to do useful things with language
information retrieval
topic detection and document clustering
document summarisation
sentiment analysis
machine translation
speech recognition
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Phonetics, phonology and the lexicon
Phonetics studies the sounds of a language
E.g., English aspirates stop consonants in certain positions
Phonology studies the distributional properties of these sounds
E.g., the English noun plural is [s] following unvoiced segments and [z] following voiced segments (sketched in code at the end of this slide)
A language has a lexicon, which lists for each word and morpheme
how it is pronounced (phonology)
what it means (semantics)
its distributional properties (morphology and syntax)
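The plural alternation above can be sketched in a couple of lines of Python; the phoneme set is a simplified assumption for illustration (real English also has an [ɪz] allomorph after sibilants, as in "buses"):

# Toy sketch of the English plural voicing alternation (illustration only;
# the phoneme set below is a simplified assumption).
VOICELESS = {"p", "t", "k", "f"}

def plural_allomorph(stem_final_segment: str) -> str:
    """Return [s] after unvoiced segments and [z] after voiced ones."""
    return "s" if stem_final_segment in VOICELESS else "z"

print(plural_allomorph("t"))  # cat + s -> [s]
print(plural_allomorph("g"))  # dog + s -> [z]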
Learning the lexicon
Speech does not have pauses in between words
children have to learn how to segment utterances into words
As part of an ARC project, we've built a computational model that performs word segmentation using:
utterance boundaries
non-linguistic context
syllable structure and rhythmic patterns
these are language-specific, so they must be learned
other known words and longer-range linguistic context
With our model, we can
measure the contribution of each information source
predict the effect of changing the percept or changing the input
Morphology and syntax
rich hierarchical structure is pervasive in language
morphology studies the structure of words
E.g., re+structur+ing, un+remark+able
syntax studies the ways words combine to form phrases and
sentences
phrase structure helps identify who did what to whom
Example (bracketed tree): (S (NP (D the) (N cat)) (VP (V chased) (NP (D the) (N dog))))
Parsing identifies phrase structure
Ambiguity is pervasive in human languages
[Figure: two parse trees for "saw the man with the telescope", one attaching the PP "with the telescope" inside the object NP "the man", the other attaching it to the VP; bracketed versions are sketched at the end of this slide]
Recover English phrase structure with over 90% accuracy
We have an ARC project to parse running speech
coupled with a speech recogniser
our models are robust against speech disfluencies
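The two attachments can be written in bracketed form and pretty-printed; a minimal sketch assuming NLTK is installed (the node labels follow the simple D/N/V/P convention of the previous slide, not the Penn Treebank tags shown next):

# The PP-attachment ambiguity above, written as bracketed trees
# (assumes NLTK is installed; nltk.Tree is used only for display).
from nltk import Tree

vp_attach = Tree.fromstring(
    "(VP (V saw) (NP (D the) (N man)) (PP (P with) (NP (D the) (N telescope))))")
np_attach = Tree.fromstring(
    "(VP (V saw) (NP (NP (D the) (N man)) (PP (P with) (NP (D the) (N telescope)))))")

vp_attach.pretty_print()  # the seeing is done with the telescope
np_attach.pretty_print()  # the man has the telescope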
Phrase structure of real sentences
[Figure: Penn Treebank parse tree for "The new options carry out part of an agreement that the pension fund, under pressure, reached with the SEC in December to relax its strict participation rules and provide more investment options."]
Semantics and pragmatics
Semantics studies the meaning of words, phrases and sentences
E.g., I ate the oysters in/for an hour.
E.g., Who do you want to talk to /him?
Pragmatics studies how we use language to do things in the world
E.g., Can you pass the salt?
E.g., a letter of recommendation:
Sam is punctual and extremely well-groomed.
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
The statistical revolution
Hand-written rule-based approach
linguist crafts patterns or rules to solve problem
complicated and expensive to construct
hand-written systems are often brittle
Statistical machine learning approach
collect statistics from large corpora
combine a variety of information sources using machine-learning
statistical models tend to be easier to maintain and less brittle
Statistical models of language are scientifically interesting
humans are very sensitive to statistical properties
statistical models make quantitative predictions
Machine learning vs. statistical analysis
Machine learning and statistical analysis often use similar
mathematical models
E.g., linear models, least squares, logistic regression
The goals of statistical analysis and machine learning are different
Statistical analysis:
goal is hypothesis testing or identifying predictors
E.g., Does coffee cause cancer?
size of data and number of factors are small (thousands)
Machine learning:
goal is prediction by generalising from examples
E.g., Will person X get cancer?
size of data and number of factors can be huge (billions)
let learning algorithm decide which factors to use
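To make the contrast concrete, here is a minimal sketch of the machine-learning side of the comparison, assuming scikit-learn; the data and features are synthetic stand-ins:

# Machine-learning workflow: fit a model on examples, judge it by how well
# it predicts held-out cases (assumes scikit-learn; the data is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # 1000 examples, 20 candidate factors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# What matters is generalisation to unseen examples, not which factor is "significant".
print("held-out accuracy:", model.score(X_test, y_test))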
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Named entity recognition
Named entity recognition finds the named entities and their classes
in a text
Example:
Sam Spade [person] bought 300 [number] shares in Acme Corp [corporation] in 2006 [time]
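Named entity recognition can now be run off the shelf; a minimal sketch assuming spaCy and its small English model are installed (this is not the system discussed in the talk, and its label set differs slightly from the classes above):

# Off-the-shelf NER (assumes: pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sam Spade bought 300 shares in Acme Corp in 2006")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Sam Spade PERSON", "2006 DATE"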
Noun phrase coreference
Noun phrase coreference tracks mentions of entities within documents
Example:
Julia Gillard met with the president of Indonesia yesterday. Ms.
Gillard told him that she . . .
Cross-document coreference identifies mentions of the same entity in different documents
We're doing this on speech data as part of our ARC project
Relation extraction
Relation extraction mines texts to find instances of specific
relationships between named entities
Name              Role              Organisation
Steven Schwartz   Vice Chancellor   Macquarie University
...               ...               ...
Has been applied to mining bio-medical literature
Event extraction and role labelling
Event extraction identifies the events described by a text
Role labelling identifies who did what to whom
Example:
Involvement of p70(S6)-kinase activation in IL-10 up-regulation in
human monocytes by gp41 envelope protein of human
immunodeficiency virus type 1 . . .
involvement
  Theme: up-regulation
    Theme: IL-10
    Cause: gp41
  Cause: activation
    Theme: p70(S6)-kinase
    Site: human monocyte
Opinion mining and sentiment analysis
Used to analyse social media (Web 2.0)
Classify message along a subjective-objective scale
Identify polarity of message
in some genres, simple keyword-based approaches work well (sketched at the end of this slide)
but in others it's necessary to model syntactic structure as well
E.g., I doubt she had a very good experience . . .
Often combined with
topic modelling to cluster messages with similar opinions
multi-document summarisation to present comprehensible results
We're currently applying this to financial announcements with the CMCRC
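A minimal sketch of such a keyword-based polarity scorer (the word lists are invented, and the example deliberately shows why syntactic structure matters):

# Toy lexicon-based polarity scorer; the word lists are illustrative assumptions.
POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "unhappy"}

def polarity(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Scores +1 (positive) even though the sentence expresses doubt:
# keyword counting ignores the syntactic context of "good".
print(polarity("I doubt she had a very good experience"))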
Topic models for document processing
Topic models cluster documents into
one or more topics
usually unsupervised (i.e., topics aren't given in training data)
Important for document analysis and
information extraction
Example: clustering news stories for
information retrieval
Example: tracking evolution of a
research topic over time
Topic modelling task
Given just a collection of documents, simultaneously identify:
which topic(s) each document discusses
the words that are characteristic of each topic
Example: TASA collection of 37,000 passages on language and
arts, social studies, health, sciences, etc.
Topic 247            Topic 5              Topic 43              Topic 56
word        prob.    word        prob.    word         prob.    word        prob.
DRUGS       .069     RED         .202     MIND         .081     DOCTOR      .074
DRUG        .060     BLUE        .099     THOUGHT      .066     DR.         .063
MEDICINE    .027     GREEN       .096     REMEMBER     .064     PATIENT     .061
EFFECTS     .026     YELLOW      .073     MEMORY       .037     HOSPITAL    .049
BODY        .023     WHITE       .048     THINKING     .030     CARE        .046
MEDICINES   .019     COLOR       .048     PROFESSOR    .028     MEDICAL     .042
PAIN        .016     BRIGHT      .030     FELT         .025     NURSE       .031
PERSON      .016     COLORS      .029     REMEMBERED   .022     PATIENTS    .029
MARIJUANA   .014     ORANGE      .027     THOUGHTS     .020     DOCTORS     .028
LABEL       .012     BROWN       .027     FORGOTTEN    .020     HEALTH      .025
ALCOHOL     .012     PINK        .017     MOMENT       .020     MEDICINE    .017
DANGEROUS   .011     LOOK        .017     THINK        .019     NURSING     .017
ABUSE       .009     BLACK       .016     THING        .016     DENTAL      .015
EFFECT      .009     PURPLE      .015     WONDER       .014     NURSES      .013
KNOWN       .008     CROSS       .011     FORGET       .012     PHYSICIAN   .012
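Topic-word lists like the ones above are what an off-the-shelf topic model produces; a minimal usage sketch, assuming gensim is installed (the tiny corpus is an invented stand-in, not the TASA collection):

# Minimal LDA usage sketch (assumes gensim; the corpus is a toy stand-in).
from gensim import corpora, models

texts = [
    "drugs medicine effects body pain".split(),
    "red blue green yellow color".split(),
    "mind thought remember memory thinking".split(),
    "doctor patient hospital care medical".split(),
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50, random_state=0)
for topic_id, words in lda.print_topics(num_words=3):
    print(topic_id, words)   # each topic is a distribution over words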
Mixture versus admixture topic models
In a mixture model, each document has a single topic
all words in the document come from this topic
In admixture models, each document has a distribution over topics
a single document can have multiple topics (number of topics in a
document controlled by prior)
can capture more complex relationships between documents than
a mixture model
Both mixture and admixture topic models typically use a bag of
words representation of a document
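A bag-of-words representation just counts word occurrences and discards word order; a minimal sketch (the example sentence is invented):

# Bag of words: a document becomes an unordered multiset of word counts.
from collections import Counter

doc = "the cat chased the dog and the dog ran"
bag = Counter(doc.split())
print(bag)   # e.g. Counter({'the': 3, 'dog': 2, 'cat': 1, ...})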
Example: documents from NIPS corpus
Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .
Example (cont): ignore function words
[The same four NIPS passages as on the previous slide, here shown with function words ignored]
Example (cont): mixture topic model
[The same four NIPS passages, each assigned a single topic by a mixture model]
Example (cont): admixture topic model
[The same four NIPS passages, with individual words assigned to topics by an admixture model, so a single document can mix topics]
Finding topics in document collections
If we're not told word-topic and document-topic mappings, this is
an unsupervised learning problem
Can be solved using Bayesian inference with a sparse prior
most documents discuss few topics
most topics have a small vocabulary
Simple iterative learning algorithm (sketched in code at the end of this slide)
randomly assign words to topics
repeat until converged:
assign topics to documents based on word-topic assignments
assign words to topics based on document-topic assignments
Nothing language-specific: these models can be applied to other domains
search for hidden causes in a sea of data
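One standard instantiation of that loop is collapsed Gibbs sampling for an LDA-style admixture model; a minimal sketch, where the tiny corpus, the number of topics and the hyperparameters alpha and beta are illustrative assumptions rather than values from the talk:

# Collapsed Gibbs sampling for an LDA-style admixture topic model.
# The corpus, K, alpha and beta below are illustrative assumptions.
import random
random.seed(0)

docs = [["drug", "medicine", "pain", "drug"],
        ["red", "blue", "green", "red"],
        ["doctor", "patient", "medicine", "hospital"]]
K = 2                                    # number of topics
alpha, beta = 0.1, 0.01                  # sparse Dirichlet priors
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Randomly assign every word token to a topic and record the counts.
z = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [[0] * K for _ in docs]          # topic counts per document
topic_word = [[0] * V for _ in range(K)]     # word counts per topic
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k, wi = z[d][i], vocab.index(w)
        doc_topic[d][k] += 1
        topic_word[k][wi] += 1
        topic_total[k] += 1

for _ in range(200):                         # "repeat until converged"
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, wi = z[d][i], vocab.index(w)
            # Remove this token's current assignment from the counts ...
            doc_topic[d][k] -= 1; topic_word[k][wi] -= 1; topic_total[k] -= 1
            # ... and resample its topic given all the other assignments.
            weights = [(doc_topic[d][t] + alpha) *
                       (topic_word[t][wi] + beta) / (topic_total[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            doc_topic[d][k] += 1; topic_word[k][wi] += 1; topic_total[k] += 1

print(doc_topic)   # per-document topic counts after sampling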
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Unsupervised word segmentation
Input: phoneme sequences with sentence boundaries
Task: identify word boundaries, and hence words
j u w ɑ n t t u s i ð ə b ʊ k     (input: phonemes, no word boundaries)
ju wɑnt tu si ðə bʊk              (output: word boundaries identified)
you want to see the book          (orthographic gloss)
Ignoring phonology and morphology, this involves learning the
pronunciations of the lexical items in the language
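To see the shape of the computation, here is a much simpler sketch than the unsupervised model: if a lexicon with word probabilities were already known, the best segmentation could be found by dynamic programming. The toy lexicon and the orthographic (rather than phonemic) input are invented assumptions:

# Dynamic-programming segmentation given a *known* probabilistic lexicon.
# The real task is harder: the lexicon itself must be learned, unsupervised.
import math

lexicon = {"you": 0.1, "want": 0.05, "to": 0.1, "see": 0.05, "the": 0.15, "book": 0.02}

def segment(s):
    # best[i] = (log probability, word list) of the best parse of s[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(i):
            word = s[j:i]
            if word in lexicon and best[j][1] is not None:
                score = best[j][0] + math.log(lexicon[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(s)][1]

print(segment("youwanttoseethebook"))   # ['you', 'want', 'to', 'see', 'the', 'book']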
Mapping words to referents
Input to learner:
word sequence: Is that the pig?
objects in nonlinguistic context: dog, pig
Learning objectives:
identify utterance topic: pig
identify word-topic mapping: "pig" ↦ pig (the object)
Word learning as a kind of topic modelling
Learning to map words to referents is like topic modelling: we
identify referential words by clustering them with their contexts
Word learning also involves finding sequences of phonemes (word
pronunciations) that cluster with a context
Idea: apply the word learning model with words (rather than
phonemes) as the basic units
An extended topic model that learns topical multi-word expressions
Currently negotiating a contract with NICTA to develop this idea
Learning topical multi-word expressions
[The same four NIPS passages, with topical multi-word expressions identified]
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Conclusion
Computational linguistics studies language production, comprehension and acquisition
what information is used, and how is it used?
may give insights into language disorders, and suggest possible
therapies
Natural language processing uses computers to process speech and
texts
information retrieval, extraction and summarisation
machine translation
human-computer interface
Statistical models and machine learning play a central role in both
Theory and practical applications interact in a productive way!