
Natural Language Processing and

Computational Linguistics:
from Theory to Application
Professor Mark Johnson
Director of the Macquarie Centre for Language Sciences (CLaS)
Department of Computing

October 2012


Computational linguistics and natural language processing
Computational linguistics is a scientific discipline that studies

linguistic processes from a computational perspective

language comprehension (computational psycholinguistics)


language production
language acquisition

Natural language processing is an engineering discipline that uses

computers to do useful things with language

information retrieval
topic detection and document clustering
document summarisation
sentiment analysis
machine translation
speech recognition

Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion


Phonetics, phonology and the lexicon


Phonetics studies the sounds of a language
E.g., English aspirates stop consonants in certain positions
Phonology studies the distributional properties of these sounds
E.g., the English noun plural is [s] following unvoiced segments
and [z] following voiced segments
A language has a lexicon, which lists for each word and morpheme
how it is pronounced (phonology)
what it means (semantics)
its distributional properties (morphology and syntax)
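The voicing-based plural rule above can be sketched in code. This is a toy illustration only: the voicing classification is a crude assumption stated over spellings, not a real phonological analysis, and the [əz] allomorph after sibilants (as in "buses") is ignored.

```python
# Toy sketch of English plural allomorphy: [s] after an unvoiced final
# segment, [z] after a voiced one. VOICED is a crude, assumed voicing
# classification over letters, not real phonology.
VOICED = set("aeiouybdgvzmnlrw")

def plural_allomorph(word):
    return "z" if word[-1] in VOICED else "s"

print(plural_allomorph("cat"))  # 't' is unvoiced -> s
print(plural_allomorph("dog"))  # 'g' is voiced   -> z
```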


Learning the lexicon


Speech does not have pauses in between words
children have to learn how to segment utterances into words
As part of an ARC project, we've built a computational model that

performs word segmentation using:

utterance boundaries
non-linguistic context
syllable structure and rhythmic patterns (language-specific, so must be learned)

other known words and longer-range linguistic context

With our model, we can


measure the contribution of each information source
predict the effect of changing the percept or changing the input


Morphology and syntax


rich hierarchical structure is pervasive in language
morphology studies the structure of words
E.g., re+structur+ing, un+remark+able
syntax studies the ways words combine to form phrases and

sentences

phrase structure helps identify who did what to whom

Example tree (bracket notation):
[S [NP [D the] [N cat]] [VP [V chased] [NP [D the] [N dog]]]]

Parsing identifies phrase structure


Ambiguity is pervasive in human languages

Example: "saw the man with the telescope" has two structures (bracket notation):
[VP [V saw] [NP [NP the man] [PP with the telescope]]]      (the man has the telescope)
[VP [VP [V saw] [NP the man]] [PP with the telescope]]      (the seeing is done with the telescope)
Recover English phrase structure with over 90% accuracy


We have an ARC project to parse running speech
coupled with a speech recogniser
our models are robust against speech disfluencies
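The attachment ambiguity above can be made concrete with a toy chart parser. The grammar and lexicon below are illustrative assumptions, and the CKY-style chart simply counts how many distinct trees the grammar licenses for a sentence.

```python
from collections import defaultdict

# Toy CKY parser that counts parse trees, illustrating PP-attachment
# ambiguity. Grammar and lexicon are invented for this example.
LEXICON = {"I": {"NP"}, "saw": {"V"}, "the": {"D"},
           "man": {"N"}, "telescope": {"N"}, "with": {"P"}}
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
          ("NP", "D", "N"), ("NP", "NP", "PP"), ("PP", "P", "NP")]

def count_parses(words):
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))  # (i, j) -> {symbol: tree count}
    for i, w in enumerate(words):
        for sym in LEXICON[w]:
            chart[(i, i + 1)][sym] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for parent, left, right in BINARY:
                    c = chart[(i, k)][left] * chart[(k, j)][right]
                    if c:
                        chart[(i, j)][parent] += c
    return chart[(0, n)]["S"]

print(count_parses("I saw the man with the telescope".split()))  # -> 2
```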

Phrase structure of real sentences


[Penn Treebank parse tree, too large to reproduce here, of the sentence:
"The new options carry out part of an agreement that the pension fund, under
pressure, reached with the SEC in December to relax its strict participation
rules and to provide more investment options."]

Semantics and pragmatics

Semantics studies the meaning of words, phrases and sentences


E.g., I ate the oysters in/for an hour.
E.g., Who do you want to talk to /him?
Pragmatics studies how we use language to do things in the world
E.g., Can you pass the salt?
E.g., a letter of recommendation:
Sam is punctual and extremely well-groomed.


Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion


The statistical revolution


Hand-written rule-based approach
linguist crafts patterns or rules to solve problem
complicated and expensive to construct
hand-written systems are often brittle
Statistical machine learning approach
collect statistics from large corpora
combine a variety of information sources using machine learning
statistical models tend to be easier to maintain and less brittle
Statistical models of language are scientifically interesting
humans are very sensitive to statistical properties
statistical models make quantitative predictions


Machine learning vs. statistical analysis


Machine learning and statistical analysis often use similar

mathematical models

E.g., linear models, least squares, logistic regression

The goals of statistical analysis and machine learning are different


Statistical analysis:
goal is hypothesis testing or identifying predictors
E.g., Does coffee cause cancer?

size of data and number of factors is small (thousands)

Machine learning:
goal is prediction by generalising from examples
E.g., Will person X get cancer?

size of data and number of factors can be huge (billions)


let learning algorithm decide which factors to use
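As a minimal sketch of one of the mathematical tools shared by both fields, here is ordinary least squares for a one-variable linear model in plain Python (the data points are invented for illustration):

```python
# Closed-form ordinary least squares for y = a*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx                 # slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                         # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(round(a, 6), round(b, 6))           # -> 2.0 1.0
```

The statistician asks whether the slope is significantly non-zero; the machine learner just uses it to predict the next y.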


Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion


Named entity recognition

Named entity recognition finds the named entities and their classes

in a text

Example:
Sam Spade [person] bought 300 [number] shares in Acme Corp [corporation] in 2006 [time]


Noun phrase coreference


Noun phrase coreference tracks mentions to entities within

documents

Example:
Julia Gillard met with the president of Indonesia yesterday. Ms.
Gillard told him that she . . .

Cross-document coreference identifies mentions to same entity in

different documents
We're doing this on speech data as part of our ARC project


Relation extraction

Relation extraction mines texts to find instances of specific

relationships between named entities


Name             | Role            | Organisation
Steven Schwartz  | Vice Chancellor | Macquarie University
...              | ...             | ...
Has been applied to mining bio-medical literature


Event extraction and role labelling


Event extraction identifies the events described by a text
Role labelling identifies who did what to whom
Example:
Involvement of p70(S6)-kinase activation in IL-10 up-regulation in
human monocytes by gp41 envelope protein of human
immunodeficiency virus type 1 . . .

involvement:   Theme = up-regulation, Cause = activation
up-regulation: Theme = IL-10, Cause = gp41
activation:    Theme = p70(S6)-kinase, Site = human monocyte


Opinion mining and sentiment analysis


Used to analyse social media (Web 2.0)
Classify message along a subjective-objective scale
Identify polarity of message
in some genres, simple keyword-based approaches work well
but in others it's necessary to model syntactic structure as well
E.g., I doubt she had a very good experience . . .

Often combined with


topic modelling to cluster messages with similar opinions
multi-document summarisation to present comprehensible results
We're currently applying this to financial announcements with the CMCRC
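A minimal sketch of the keyword-based approach (with an invented keyword list) shows both why it works and exactly how it fails on the example above: negation and hedging words like "doubt" are invisible to it.

```python
# Toy keyword polarity scorer: positive-keyword count minus
# negative-keyword count. Keyword lists are invented for illustration.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def keyword_polarity(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(keyword_polarity("a great product"))                        # -> 1
print(keyword_polarity("terrible service"))                       # -> -1
print(keyword_polarity("I doubt she had a very good experience")) # -> 1 (wrong:
# "doubt" scopes over "good", but a bag of keywords cannot see that)
```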


Topic models for document processing


Topic models cluster documents into

one or more topics

usually unsupervised (i.e., topics aren't given in the training data)

Important for document analysis and

information extraction

Example: clustering news stories for


information retrieval
Example: tracking evolution of a
research topic over time


Topic modelling task


Given just a collection of documents, simultaneously identify:
which topic(s) each document discusses
the words that are characteristic of each topic
Example: TASA collection of 37,000 passages on language and

arts, social studies, health, sciences, etc.


Topic 247          Topic 5            Topic 43            Topic 56
word       prob.   word       prob.   word        prob.   word       prob.
DRUGS      .069    RED        .202    MIND        .081    DOCTOR     .074
DRUG       .060    BLUE       .099    THOUGHT     .066    DR.        .063
MEDICINE   .027    GREEN      .096    REMEMBER    .064    PATIENT    .061
EFFECTS    .026    YELLOW     .073    MEMORY      .037    HOSPITAL   .049
BODY       .023    WHITE      .048    THINKING    .030    CARE       .046
MEDICINES  .019    COLOR      .048    PROFESSOR   .028    MEDICAL    .042
PAIN       .016    BRIGHT     .030    FELT        .025    NURSE      .031
PERSON     .016    COLORS     .029    REMEMBERED  .022    PATIENTS   .029
MARIJUANA  .014    ORANGE     .027    THOUGHTS    .020    DOCTORS    .028
LABEL      .012    BROWN      .027    FORGOTTEN   .020    HEALTH     .025
ALCOHOL    .012    PINK       .017    MOMENT      .020    MEDICINE   .017
DANGEROUS  .011    LOOK       .017    THINK       .019    NURSING    .017
ABUSE      .009    BLACK      .016    THING       .016    DENTAL     .015
EFFECT     .009    PURPLE     .015    WONDER      .014    NURSES     .013
KNOWN      .008    CROSS      .011    FORGET      .012    PHYSICIAN  .012

Mixture versus admixture topic models


In a mixture model, each document has a single topic
all words in the document come from this topic
In admixture models, each document has a distribution over topics
a single document can have multiple topics (number of topics in a
document controlled by prior)
can capture more complex relationships between documents than
a mixture model
Both mixture and admixture topic models typically use a bag-of-words representation of a document
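A bag-of-words representation can be sketched in a couple of lines; the key property is that word order is discarded, so two documents with the same words in different orders become indistinguishable.

```python
from collections import Counter

# Bag of words: keep only word counts, discard order.
def bag_of_words(text):
    return Counter(text.lower().split())

doc = "the cat chased the dog"
print(bag_of_words(doc))  # counts: 'the' twice, the rest once
# Same bag, despite reversed roles of cat and dog:
print(bag_of_words("the dog chased the cat") == bag_of_words(doc))  # -> True
```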


Example: documents from NIPS corpus


Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .


Example (cont): ignore function words


Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .


Example (cont): mixture topic model


Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .


Example (cont): admixture topic model


Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .


Finding topics in document collections


If we're not told word-topic and document-topic mappings, this is

an unsupervised learning problem


Can be solved using Bayesian inference with a sparse prior

most documents discuss few topics


most topics have a small vocabulary

Simple iterative learning algorithm


randomly assign words to topics
repeat until converged:
assign topics to documents based on word-topic assignments
assign words to topics based on document-topic assignments

Nothing here is language-specific: these models can be applied to other domains

a search for hidden causes in a sea of data
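The iterative algorithm above can be sketched as a collapsed Gibbs sampler for an admixture model. This is an illustrative minimal implementation with assumed hyperparameters and toy data, not the model from the slides: each sweep resamples every word's topic from a weight that combines the document-topic and topic-word counts.

```python
import random
from collections import defaultdict

# Minimal collapsed Gibbs sampler for an LDA-style admixture model.
# alpha/beta are assumed symmetric Dirichlet hyperparameters.
def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]   # topic of each token
    ndk = [[0] * K for _ in docs]                       # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]          # topic-word counts
    nk = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                              # remove assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # P(topic k) proportional to P(k | doc) * P(w | k)
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights)[0]    # resample
                z[d][i] = t; ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return z, ndk

docs = [["cat", "dog", "pet"], ["stock", "market", "trade"],
        ["dog", "pet", "cat"], ["market", "stock", "trade"]]
z, ndk = lda_gibbs(docs, K=2)
print(ndk)   # per-document topic counts after sampling
```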


Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion


Unsupervised word segmentation

Input: phoneme sequences with sentence boundaries


Task: identify word boundaries, and hence words
Input:  juwɑnttusiðəbʊk
Output: ju wɑnt tu si ðə bʊk
        (you want to see the book)


Ignoring phonology and morphology, this involves learning the

pronunciations of the lexical items in the language
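The models in these slides learn the lexicon itself, unsupervised. As a much simpler sketch, the boundary-recovery step alone, assuming an already-known toy lexicon (which is the part the real models must learn), is a short dynamic program:

```python
# Segment an unsegmented string using a KNOWN toy lexicon.
# The unsupervised models in the slides learn LEXICON too; this sketch
# only shows boundary recovery once a lexicon is available.
LEXICON = {"you", "want", "to", "see", "the", "book"}

def segment(s):
    best = {0: []}                    # prefix length -> a segmentation of it
    for j in range(1, len(s) + 1):
        for i in range(j):
            if i in best and s[i:j] in LEXICON:
                best[j] = best[i] + [s[i:j]]
                break
    return best.get(len(s))          # None if no segmentation exists

print(segment("youwanttoseethebook"))
# -> ['you', 'want', 'to', 'see', 'the', 'book']
```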


Mapping words to referents

Input to learner:
word sequence: Is that the pig?
objects in nonlinguistic context: dog, pig
Learning objectives:
identify utterance topic: pig
identify the word-to-referent mapping: "pig" → pig

Word learning as a kind of topic modelling


Learning to map words to referents is like topic modelling: we

identify referential words by clustering them with their contexts


Word learning also involves finding sequences of phonemes (word
pronunciations) that cluster with a context
Idea: apply the word learning model with words (rather than
phonemes) as the basic units
An extended topic model that learns topical multi-word expressions
Currently negotiating a contract with NICTA to develop this idea


Learning topical multi-word expressions


Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .


Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion


Conclusion
Computational linguistics studies language production,

comprehension and acquisition

what information is used, and how is it used?


may give insights into language disorders, and suggest possible
therapies

Natural language processing uses computers to process speech and

texts

information retrieval, extraction and summarisation


machine translation
human-computer interface

Statistical models and machine learning play a central role in both


Theory and practical applications interact in a productive way!
