Natural Language Processing and
Computational Linguistics:
from Theory to Application
Professor Mark Johnson
Director of the Macquarie Centre for Language Sciences (CLaS)
Department of Computing
October 2012
Computational linguistics and
Natural language processing
Computational linguistics is a scientific discipline that studies
linguistic processes from a computational perspective
language comprehension (computational psycholinguistics)
language production
language acquisition
Natural language processing is an engineering discipline that uses
computers to do useful things with language
information retrieval
topic detection and document clustering
document summarisation
sentiment analysis
machine translation
speech recognition
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Phonetics, phonology and the lexicon
Phonetics studies the sounds of a language
E.g., English aspirates stop consonants in certain positions
Phonology studies the distributional properties of these sounds
E.g., the English noun plural is [s] following unvoiced segments and [z] following voiced segments (sketched in code at the end of this slide)
A language has a lexicon, which lists for each word and morpheme
how it is pronounced (phonology)
what it means (semantics)
its distributional properties (morphology and syntax)
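The plural alternation above can be sketched in a couple of lines of Python; the phoneme set is a simplified assumption for illustration (real English also has an [ɪz] allomorph after sibilants, as in "buses"):

# Toy sketch of the English plural voicing alternation (illustration only;
# the phoneme set below is a simplified assumption).
VOICELESS = {"p", "t", "k", "f"}

def plural_allomorph(stem_final_segment: str) -> str:
    """Return [s] after unvoiced segments and [z] after voiced ones."""
    return "s" if stem_final_segment in VOICELESS else "z"

print(plural_allomorph("t"))  # cat + s -> [s]
print(plural_allomorph("g"))  # dog + s -> [z]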
Learning the lexicon
Speech does not have pauses in between words
children have to learn how to segment utterances into words
As part of an ARC project, we've built a computational model that performs word segmentation using:
utterance boundaries
non-linguistic context
syllable structure and rhythmic patterns
these are language-specific, so they must be learned
other known words and longer-range linguistic context
With our model, we can
measure the contribution of each information source
predict the effect of changing the percept or changing the input
Morphology and syntax
rich hierarchical structure is pervasive in language
morphology studies the structure of words
E.g., re+structur+ing, un+remark+able
syntax studies the ways words combine to form phrases and
sentences
phrase structure helps identify who did what to whom
Example (bracketed tree): (S (NP (D the) (N cat)) (VP (V chased) (NP (D the) (N dog))))
Parsing identifies phrase structure
Ambiguity is pervasive in human languages
[Figure: two parse trees for "saw the man with the telescope", one attaching the PP "with the telescope" inside the object NP "the man", the other attaching it to the VP; bracketed versions are sketched at the end of this slide]
Recover English phrase structure with over 90% accuracy
We have an ARC project to parse running speech
coupled with a speech recogniser
our models are robust against speech disfluencies
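The two attachments can be written in bracketed form and pretty-printed; a minimal sketch assuming NLTK is installed (the node labels follow the simple D/N/V/P convention of the previous slide, not the Penn Treebank tags shown next):

# The PP-attachment ambiguity above, written as bracketed trees
# (assumes NLTK is installed; nltk.Tree is used only for display).
from nltk import Tree

vp_attach = Tree.fromstring(
    "(VP (V saw) (NP (D the) (N man)) (PP (P with) (NP (D the) (N telescope))))")
np_attach = Tree.fromstring(
    "(VP (V saw) (NP (NP (D the) (N man)) (PP (P with) (NP (D the) (N telescope)))))")

vp_attach.pretty_print()  # the seeing is done with the telescope
np_attach.pretty_print()  # the man has the telescope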
Phrase structure of real sentences
[Figure: Penn Treebank parse tree for "The new options carry out part of an agreement that the pension fund, under pressure, reached with the SEC in December to relax its strict participation rules and provide more investment options."]
Semantics and pragmatics
Semantics studies the meaning of words, phrases and sentences
E.g., I ate the oysters in/for an hour.
E.g., Who do you want to talk to /him?
Pragmatics studies how we use language to do things in the world
E.g., Can you pass the salt?
E.g., a letter of recommendation:
Sam is punctual and extremely well-groomed.
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
The statistical revolution
Hand-written rule-based approach
linguist crafts patterns or rules to solve problem
complicated and expensive to construct
hand-written systems are often brittle
Statistical machine learning approach
collect statistics from large corpora
combine a variety of information sources using machine-learning
statistical models tend to be easier to maintain and less brittle
Statistical models of language are scientifically interesting
humans are very sensitive to statistical properties
statistical models make quantitative predictions
Machine learning vs. statistical analysis
Machine learning and statistical analysis often use similar
mathematical models
E.g., linear models, least squares, logistic regression
The goals of statistical analysis and machine learning are different
Statistical analysis:
goal is hypothesis testing or identifying predictors
E.g., Does coffee cause cancer?
size of data and number of factors are small (thousands)
Machine learning:
goal is prediction by generalising from examples
E.g., Will person X get cancer?
size of data and number of factors can be huge (billions)
let learning algorithm decide which factors to use
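To make the contrast concrete, here is a minimal sketch of the machine-learning side of the comparison, assuming scikit-learn; the data and features are synthetic stand-ins:

# Machine-learning workflow: fit a model on examples, judge it by how well
# it predicts held-out cases (assumes scikit-learn; the data is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # 1000 examples, 20 candidate factors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# What matters is generalisation to unseen examples, not which factor is "significant".
print("held-out accuracy:", model.score(X_test, y_test))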
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Named entity recognition
Named entity recognition finds the named entities and their classes
in a text
Example:
Sam Spade [person] bought 300 [number] shares in Acme Corp [corporation] in 2006 [time]
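Named entity recognition can now be run off the shelf; a minimal sketch assuming spaCy and its small English model are installed (this is not the system discussed in the talk, and its label set differs slightly from the classes above):

# Off-the-shelf NER (assumes: pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sam Spade bought 300 shares in Acme Corp in 2006")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Sam Spade PERSON", "2006 DATE"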
Noun phrase coreference
Noun phrase coreference tracks mentions of entities within documents
Example:
Julia Gillard met with the president of Indonesia yesterday. Ms.
Gillard told him that she . . .
Cross-document coreference identifies mentions of the same entity in different documents
We're doing this on speech data as part of our ARC project
Relation extraction
Relation extraction mines texts to find instances of specific
relationships between named entities
Name              Role              Organisation
Steven Schwartz   Vice Chancellor   Macquarie University
...               ...               ...
Has been applied to mining bio-medical literature
Event extraction and role labelling
Event extraction identifies the events described by a text
Role labelling identifies who did what to whom
Example:
Involvement of p70(S6)-kinase activation in IL-10 up-regulation in
human monocytes by gp41 envelope protein of human
immunodeficiency virus type 1 . . .
involvement
  Theme: up-regulation
    Theme: IL-10
    Cause: gp41
  Cause: activation
    Theme: p70(S6)-kinase
    Site: human monocyte
Opinion mining and sentiment analysis
Used to analyse social media (Web 2.0)
Classify message along a subjective-objective scale
Identify polarity of message
in some genres, simple keyword-based approaches work well (sketched at the end of this slide)
but in others it's necessary to model syntactic structure as well
E.g., I doubt she had a very good experience . . .
Often combined with
topic modelling to cluster messages with similar opinions
multi-document summarisation to present comprehensible results
We're currently applying this to financial announcements with the CMCRC
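A minimal sketch of such a keyword-based polarity scorer (the word lists are invented, and the example deliberately shows why syntactic structure matters):

# Toy lexicon-based polarity scorer; the word lists are illustrative assumptions.
POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "unhappy"}

def polarity(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Scores +1 (positive) even though the sentence expresses doubt:
# keyword counting ignores the syntactic context of "good".
print(polarity("I doubt she had a very good experience"))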
Topic models for document processing
Topic models cluster documents into
one or more topics
usually unsupervised (i.e., topics aren't given in training data)
Important for document analysis and
information extraction
Example: clustering news stories for
information retrieval
Example: tracking evolution of a
research topic over time
Topic modelling task
Given just a collection of documents, simultaneously identify:
which topic(s) each document discusses
the words that are characteristic of each topic
Example: TASA collection of 37,000 passages on language and
arts, social studies, health, sciences, etc.
Topic 247            Topic 5              Topic 43              Topic 56
word        prob.    word        prob.    word         prob.    word        prob.
DRUGS       .069     RED         .202     MIND         .081     DOCTOR      .074
DRUG        .060     BLUE        .099     THOUGHT      .066     DR.         .063
MEDICINE    .027     GREEN       .096     REMEMBER     .064     PATIENT     .061
EFFECTS     .026     YELLOW      .073     MEMORY       .037     HOSPITAL    .049
BODY        .023     WHITE       .048     THINKING     .030     CARE        .046
MEDICINES   .019     COLOR       .048     PROFESSOR    .028     MEDICAL     .042
PAIN        .016     BRIGHT      .030     FELT         .025     NURSE       .031
PERSON      .016     COLORS      .029     REMEMBERED   .022     PATIENTS    .029
MARIJUANA   .014     ORANGE      .027     THOUGHTS     .020     DOCTORS     .028
LABEL       .012     BROWN       .027     FORGOTTEN    .020     HEALTH      .025
ALCOHOL     .012     PINK        .017     MOMENT       .020     MEDICINE    .017
DANGEROUS   .011     LOOK        .017     THINK        .019     NURSING     .017
ABUSE       .009     BLACK       .016     THING        .016     DENTAL      .015
EFFECT      .009     PURPLE      .015     WONDER       .014     NURSES      .013
KNOWN       .008     CROSS       .011     FORGET       .012     PHYSICIAN   .012
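Topic-word lists like the ones above are what an off-the-shelf topic model produces; a minimal usage sketch, assuming gensim is installed (the tiny corpus is an invented stand-in, not the TASA collection):

# Minimal LDA usage sketch (assumes gensim; the corpus is a toy stand-in).
from gensim import corpora, models

texts = [
    "drugs medicine effects body pain".split(),
    "red blue green yellow color".split(),
    "mind thought remember memory thinking".split(),
    "doctor patient hospital care medical".split(),
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50, random_state=0)
for topic_id, words in lda.print_topics(num_words=3):
    print(topic_id, words)   # each topic is a distribution over words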
Mixture versus admixture topic models
In a mixture model, each document has a single topic
all words in the document come from this topic
In admixture models, each document has a distribution over topics
a single document can have multiple topics (number of topics in a
document controlled by prior)
can capture more complex relationships between documents than
a mixture model
Both mixture and admixture topic models typically use a bag of
words representation of a document
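A bag-of-words representation just counts word occurrences and discards word order; a minimal sketch (the example sentence is invented):

# Bag of words: a document becomes an unordered multiset of word counts.
from collections import Counter

doc = "the cat chased the dog and the dog ran"
bag = Counter(doc.split())
print(bag)   # e.g. Counter({'the': 3, 'dog': 2, 'cat': 1, ...})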
Example: documents from NIPS corpus
Annotating an unlabeled dataset is one of the bottlenecks in using supervised
learning to build good predictive models. Getting a dataset labeled by experts can
be expensive and time consuming. With the advent of crowdsourcing services . . .
The task of recovering intrinsic images is to separate a given input image into its
material-dependent properties, known as reflectance or albedo, and its
light-dependent properties, such as shading, shadows, specular highlights, . . .
In each trial of a standard visual short-term memory experiment, subjects are first
presented with a display containing multiple items with simple features (e.g. colored
squares) for a brief duration and then, after a delay interval, their memory for . . .
Many studies have uncovered evidence that visual cortex contains specialized regions
involved in processing faces but not other object classes. Recent electrophysiology
studies of cells in several of these specialized regions revealed that at least some . . .
Example (cont): ignore function words
[The same four NIPS passages as on the previous slide, here shown with function words ignored]
Example (cont): mixture topic model
[The same four NIPS passages, each assigned a single topic by a mixture model]
Example (cont): admixture topic model
[The same four NIPS passages, with individual words assigned to topics by an admixture model, so a single document can mix topics]
Finding topics in document collections
If we're not told word-topic and document-topic mappings, this is
an unsupervised learning problem
Can be solved using Bayesian inference with a sparse prior
most documents discuss few topics
most topics have a small vocabulary
Simple iterative learning algorithm (sketched in code at the end of this slide)
randomly assign words to topics
repeat until converged:
assign topics to documents based on word-topic assignments
assign words to topics based on document-topic assignments
Nothing language-specific: these models can be applied to other domains
search for hidden causes in a sea of data
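One standard instantiation of that loop is collapsed Gibbs sampling for an LDA-style admixture model; a minimal sketch, where the tiny corpus, the number of topics and the hyperparameters alpha and beta are illustrative assumptions rather than values from the talk:

# Collapsed Gibbs sampling for an LDA-style admixture topic model.
# The corpus, K, alpha and beta below are illustrative assumptions.
import random
random.seed(0)

docs = [["drug", "medicine", "pain", "drug"],
        ["red", "blue", "green", "red"],
        ["doctor", "patient", "medicine", "hospital"]]
K = 2                                    # number of topics
alpha, beta = 0.1, 0.01                  # sparse Dirichlet priors
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Randomly assign every word token to a topic and record the counts.
z = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [[0] * K for _ in docs]          # topic counts per document
topic_word = [[0] * V for _ in range(K)]     # word counts per topic
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k, wi = z[d][i], vocab.index(w)
        doc_topic[d][k] += 1
        topic_word[k][wi] += 1
        topic_total[k] += 1

for _ in range(200):                         # "repeat until converged"
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, wi = z[d][i], vocab.index(w)
            # Remove this token's current assignment from the counts ...
            doc_topic[d][k] -= 1; topic_word[k][wi] -= 1; topic_total[k] -= 1
            # ... and resample its topic given all the other assignments.
            weights = [(doc_topic[d][t] + alpha) *
                       (topic_word[t][wi] + beta) / (topic_total[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            doc_topic[d][k] += 1; topic_word[k][wi] += 1; topic_total[k] += 1

print(doc_topic)   # per-document topic counts after sampling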
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Unsupervised word segmentation
Input: phoneme sequences with sentence boundaries
Task: identify word boundaries, and hence words
j u w ɑ n t t u s i ð ə b ʊ k     (input: phonemes, no word boundaries)
ju wɑnt tu si ðə bʊk              (output: word boundaries identified)
you want to see the book          (orthographic gloss)
Ignoring phonology and morphology, this involves learning the
pronunciations of the lexical items in the language
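To see the shape of the computation, here is a much simpler sketch than the unsupervised model: if a lexicon with word probabilities were already known, the best segmentation could be found by dynamic programming. The toy lexicon and the orthographic (rather than phonemic) input are invented assumptions:

# Dynamic-programming segmentation given a *known* probabilistic lexicon.
# The real task is harder: the lexicon itself must be learned, unsupervised.
import math

lexicon = {"you": 0.1, "want": 0.05, "to": 0.1, "see": 0.05, "the": 0.15, "book": 0.02}

def segment(s):
    # best[i] = (log probability, word list) of the best parse of s[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(i):
            word = s[j:i]
            if word in lexicon and best[j][1] is not None:
                score = best[j][0] + math.log(lexicon[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(s)][1]

print(segment("youwanttoseethebook"))   # ['you', 'want', 'to', 'see', 'the', 'book']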
Mapping words to referents
Input to learner:
word sequence: Is that the pig?
objects in nonlinguistic context: dog, pig
Learning objectives:
identify utterance topic: pig
identify word-topic mapping: "pig" ↦ pig (the object)
Word learning as a kind of topic modelling
Learning to map words to referents is like topic modelling: we
identify referential words by clustering them with their contexts
Word learning also involves finding sequences of phonemes (word
pronunciations) that cluster with a context
Idea: apply the word learning model with words (rather than
phonemes) as the basic units
An extended topic model that learns topical multi-word expressions
Currently negotiating a contract with NICTA to develop this idea
Learning topical multi-word expressions
[The same four NIPS passages, with topical multi-word expressions identified]
Outline
A crash course in linguistics
Machine learning and data mining
Brief survey of NLP applications
Word segmentation and topic models
Conclusion
Conclusion
Computational linguistics studies language production, comprehension and acquisition
what information is used, and how is it used?
may give insights into language disorders, and suggest possible
therapies
Natural language processing uses computers to process speech and
texts
information retrieval, extraction and summarisation
machine translation
human-computer interface
Statistical models and machine learning play a central role in both
Theory and practical applications interact in a productive way!