SEMINARS OF THE ESCOLA DE INFORMÁTICA & COMPUTAÇÃO
RECENT ADVANCES IN
NATURAL LANGUAGE
PROCESSING WITH DEEP
LEARNING
Prof. Eduardo Bezerra
ebezerra@[Link]
October 8, 2020
Summary
2
Introduction
NLP Periods
Symbolic-based
Corpus-based
Neural-based
Conclusions
3
Introduction
What is NLP?
4
At the intersection of linguistics,
computer science, and artificial
intelligence.
It deals with processing and analyzing
large amounts of natural language data.
Here, “processing and analyzing” means extracting
context and meaning.
NLP is popular, but it is hard!
5
Homonymy, polysemy, …
Jaguar is a luxury vehicle brand of Jaguar Land Rover.
The jaguar is an animal of the genus Panthera native to the Americas.
Natural languages are unstructured,
redundant and ambiguous.
Enraged cow injures farmer with ax.
NLP Tasks/Applications
6
Text classification, clustering, summarization
Machine translation
Conversational chatbots
Question answering
Speech synthesis & recognition
Text generation
Auto-correcting
7
NLP Periods
NLP periods
8
9
Symbolic-based NLP (1950s-1990s)
Symbolic-based NLP
10
Georgetown Experiment (1954)
ELIZA (1964-1966)
Cyc Project (1984)
WordNet (1985)
1950s-1990s
Georgetown-IBM experiment
11
Machine translation: automatic
translation of Russian sentences into
English.
“The experiment was considered a
success and encouraged
governments to invest in
computational linguistics. The
project managers claimed that
machine translation would be a
reality in three to five years.”
1954
ELIZA
12
“Natural language” conversation through
pattern matching.
“...sister...” → “Tell me more about your
family.”
1966
Cyc Project
13
1984-
WordNet
14
155,327 words
organized in
175,979 synsets for
a total of 207,016
word-sense pairs
1985-
WordNet – graph fragment
15
[Figure: fragment of the WordNet graph. Nodes include chicken, hen, duck, goose, hawk, bird, animal, creature, egg, feather, wing, beak, claw, leg, meat, and sound; edges carry relations such as Is_a, Part, Purpose, Means, Typ_obj, Typ_subj, Caused_by, and Not_is_a.]
16
Corpus-based NLP (1990s-2010s)
Corpus-based NLP
17
1990s-2010s
Corpus-based NLP (aka ML-based)
18
Successful applications of ML methods to
text data
e.g., SVM, HMM
1990s-2010s
Corpus-based NLP (aka ML-based)
19
Text Mining
1990s-2010s
20
Neural-based NLP (2010s-present)
21
Conception, gestation, …, birth!
22
"There is a moment of conception and
a moment of birth, but between them
there is a long period of gestation."
Jonas Salk, 1914-1995
Distributional Hypothesis
23
“The more semantically similar two words are, the more
distributionally similar they will be in turn, and thus the more
that they will tend to occur in similar linguistic contexts.”
“words that are similar in meaning occur in similar contexts”
1950s
Distributional Hypothesis
24
“words that are similar in meaning occur in similar contexts”
It would be marvelous to watch a match between Kasparov and Fischer.
It would be fantastic to watch a match between Kasparov and Fischer.
(similar words: “marvelous” / “fantastic”)
Zellig Harris, 1909-1992
1950s
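As an illustration (my own addition, not on the original slides), the sketch below builds simple co-occurrence counts from a toy two-sentence corpus: words used in similar contexts, such as “marvelous” and “fantastic”, end up with similar context counts. The corpus and the window size are arbitrary choices.

```python
from collections import Counter, defaultdict

# Toy corpus echoing the slide's example sentences.
corpus = [
    "it would be marvelous to watch a match between kasparov and fischer",
    "it would be fantastic to watch a match between kasparov and fischer",
]

window = 2                     # how many neighbors on each side count as "context"
cooc = defaultdict(Counter)    # word -> counts of its context words

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[word][tokens[j]] += 1

# "marvelous" and "fantastic" have identical context counts in this toy corpus.
print(cooc["marvelous"])
print(cooc["fantastic"])
```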
Vector Space Model (for Information Retrieval)
25
SMART Information Retrieval System
term-doc matrix
First attempt to model text elements as vectors
1960s
Gerard Salton, 1927-1995
Vector Space Model
26
Similarity between docs (sentences, words)
1960s
Gerard Salton, 1927-1995
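A minimal sketch of the vector space model idea (an illustration of mine, not the SMART system itself): build a term-document count matrix and compare documents with cosine similarity. The library and the toy corpus are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# Term-document matrix: one row per document, one column per term.
X = CountVectorizer().fit_transform(docs)

# Cosine similarity between document vectors: the first two documents share
# many terms, so their similarity is much higher than with the third one.
print(cosine_similarity(X))
```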
Distributed Representations
27
1986
Latent semantic analysis (LSA)
28
1988
Latent semantic analysis (LSA)
29
1988
Latent semantic analysis (LSA)
30
LSA creates context vectors
1988
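A hedged sketch of the LSA idea: a truncated SVD of the term-document matrix yields low-dimensional “context vectors” for the documents. The corpus and the number of components are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "doctors treat patients in the hospital",
    "nurses and doctors work in hospitals",
    "the striker scored a goal in the match",
    "the team won the football match",
]

X = CountVectorizer().fit_transform(docs)     # term-document counts
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)            # dense, low-dimensional "context vectors"

# Medical documents end up close to each other, and so do the sports ones.
print(doc_vectors)
```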
Distributed representation – an example
31
Image by Garrett Hoffman
Distributed representation – an example
32
Image by Garrett Hoffman
Distributed representation – an example
33
Image by Garrett Hoffman
Conception, gestation, …, birth!
34
Conception, gestation
Distributional hypothesis
Vector Space model
LSA
Distributed representations
Now, for the birth of Deep Learning based
NLP…
Neural-based NLP (aka Deep Learning based)
35
Most SOTA results in NLP today are
obtained through Deep Learning
methods.
One of the main achievements of this
period is related to building rich
distributed representations of text
objects through deep neural networks.
2010s-present
word2vec
36
Efficient Estimation of Word Representations in Vector Space, September 7th, 2013.
Distributed Representations of Words and Phrases and their Compositionality, October 16th, 2013. (20K+ citations)
Tomas Mikolov
2013
Idea: each word can be represented by a fixed-length numeric vector. Words of similar meanings have similar vectors.
word2vec
37
In word2vec, a single hidden layer NN is trained to
perform a certain “fake” task.
Skip-gram: predicting surrounding context words
given a center word.
CBOW: predicting a center word from the surrounding
context.
But this NN is not actually used after training!
Instead, the goal is to learn the weights of the hidden
layer: these weights are the “word vectors”.
word2vec: skip-gram alternative
38
The task: given a specific word w in the middle
of a sentence (the input word), look at the
words nearby and pick one word at random.
The solution: train an ANN to produce the
probability (for every word in the vocabulary) of
being nearby w.
“nearby” is defined by a "window size" hyperparameter (typical value: 5).
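To make the skip-gram task concrete, here is a small sketch (my own, not from the talk) that generates the (center word, context word) pairs the network is trained on, using the window-size hyperparameter described above.

```python
def skipgram_pairs(tokens, window=5):
    """Generate (center, context) training pairs as in skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
```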
word2vec
39
Each word in the vocabulary is
represented using one-hot encoding (a.k.a. a
local representation!).
Credits: Marco Bonzanini
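A tiny sketch of one-hot (local) encoding for a toy vocabulary; the vocabulary and its ordering are arbitrary.

```python
import numpy as np

vocab = ["ants", "bee", "cat", "dog"]           # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}     # word -> position in the vocabulary

def one_hot(word):
    """Local representation: all zeros except a 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("cat"))   # [0. 0. 1. 0.]
```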
word2vec
40
Credits: Marco Bonzanini
word2vec
41
Credits: Marco Bonzanini
word2vec
42
Skip-gram NN
architecture
The number of neurons in the hidden layer (a hyperparameter) determines the size of the embedding.
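A minimal PyTorch sketch of this architecture (an assumption of mine, not code from the talk): the hidden layer's width is the embedding size, and the output layer scores every word in the vocabulary.

```python
import torch
import torch.nn as nn

vocab_size = 10_000
embedding_dim = 300   # number of hidden neurons = size of the word vectors

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.hidden = nn.Embedding(vocab_size, embedding_dim)  # hidden-layer weights
        self.output = nn.Linear(embedding_dim, vocab_size)     # scores over the vocabulary

    def forward(self, center_word_ids):
        h = self.hidden(center_word_ids)   # look up the center word's vector
        return self.output(h)              # logits for every possible context word

model = SkipGram(vocab_size, embedding_dim)
print(model(torch.tensor([42])).shape)     # torch.Size([1, 10000])
# After training, model.hidden.weight holds the word vectors (the embeddings).
```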
word2vec
43
word2vec
44
word2vec captures context similarity:
If words wj and wk have similar contexts, then the
model needs to output very similar results for
them.
One way for the network to do this is to make the word
vectors for wj and wk very similar.
So, if two words have similar contexts, the network
is motivated to learn similar word vectors for
them.
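In practice, word2vec is usually trained with an off-the-shelf library; a hedged sketch using gensim (not mentioned on the slides), on a toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger).
sentences = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the dog chases the cat".split(),
    "the cat chases the mouse".split(),
]

# sg=1 selects skip-gram; window=5 as on the slides; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=100)

# Words that occur in similar contexts end up with similar vectors.
print(model.wv.most_similar("king", topn=3))
```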
word2vec
45
Credits: [Link]
Embedding models
46
Word2Vec
GloVe
SkipThoughts
Paragraph2Vec
Doc2Vec
FastText
Currently, the distributional hypothesis, realized through vector embedding models generated by ANNs, is used pervasively in NLP.
Encoder-Decoder models (aka seq2seq models)
47
Encoder
Decoder
[Link]
“Classical” Encoder-Decoder model
48
“The idea is to use one LSTM to read the input sequence,
one timestep at a time, to obtain large fixed-dimensional
vector representation, and then to use another LSTM to
extract the output sequence from that vector.” (Sutskever et al., 2014)
2014
recurrent architecture
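A minimal PyTorch sketch of this classical encoder-decoder pattern (my own simplification, with made-up dimensions): one LSTM reads the input into a fixed-size state, and another LSTM unrolls the output sequence from that state.

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hidden = 1000, 1000, 64, 128

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the whole input and summarizes it in the state (h, c).
        _, state = self.encoder(self.src_emb(src_ids))
        # The decoder starts from that fixed-size state and produces the output sequence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)   # logits for each target position

model = Seq2Seq()
logits = model(torch.randint(0, src_vocab, (2, 7)),   # batch of 2 source sequences
               torch.randint(0, tgt_vocab, (2, 5)))   # batch of 2 target prefixes
print(logits.shape)                                   # torch.Size([2, 5, 1000])
```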
Encoder-Decoder model with Attention
49
2015
recurrent architecture
Attention mechanisms in recurrent NNs
50
2015
Bahdanau et al. (2015)
Transformers
51
ATTENTION
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
2017
feedforward architecture!
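As a concrete illustration of the attention mechanism the Transformer is built on, here is a NumPy sketch of scaled dot-product attention (shapes are illustrative; this is not the paper's code).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 8)
```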
Transformers
52
Transformers are the current SOTA neural
architecture when it comes to producing text
representations for use in most NLP tasks.
From Vaswani et al. (2017)
Famous Transformer Models
53
BERT (Bidirectional Encoder Representations from Transformers)
GPT-2 (Generative Pre-trained Transformer 2)
GPT-3
2018-2020
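These pretrained models are straightforward to try out; a hedged sketch using the Hugging Face transformers library (not cited on the slides), which downloads the model weights on first use:

```python
from transformers import pipeline

# BERT-style masked language modeling: fill in the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The jaguar is a [MASK] native to the Americas."))

# GPT-2: open-ended text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_length=30))
```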
54
Conclusions
Takeaway notes
55
SOTA results in most NLP tasks are currently
obtained with neural-based methods.
Neural-based NLP is recent, but it relies on
older ideas.
The attention mechanism is a novel and very
promising idea.
Pretrained models
56
[Link]
[Link]
Neural Nets need a Vapnik!
57
The theory behind the generalization properties of ANNs is not yet completely understood.
TODO: Natural Language Understanding
58
Headlines:
Enraged Cow Injures Farmer With Ax
Hospitals Are Sued by 7 Foot Doctors
Ban on Nude Dancing on Governor’s Desk
Iraqi Head Seeks Arms
Local HS Dropouts Cut in Half
Juvenile Court to Try Shooting Defendant
Stolen Painting Found by Tree
Humans use their underlying understanding of the world as context.
Source: CS188
TODO: Common Sense Knowledge
59
"If a mother has a son, then the son is younger than
the mother and remains younger for his entire life."
"If President Trump is in Washington, then his left foot
is also in Washington,"
Food for thought
60
“There’ll be a lot of people who argue against
it, who say you can’t capture a thought like
that. But there’s no reason why not. I think
you can capture a thought by a vector.”
Geoff Hinton
These slides are available at
[Link]
Eduardo Bezerra (ebezerra@cefet-[Link])
62
Backup slides
Language Models (Unigrams, Bigrams, etc.)
63
A model that assigns a probability to a
sequence of tokens.
A good language model assigns...
...high probability to (syntactically and semantically)
valid sentences.
...low probability to nonsense.
Language Models (Unigrams, Bigrams, etc.)
64
Mathematically, we can apply an LM to
any given sequence of n words:
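The equation did not survive the export, so here is the standard formulation it presumably showed: the chain rule factorization of the joint probability of the sequence.

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```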
Language Models (Unigrams, Bigrams, etc.)
65
An example:
"The quick brown fox jumps over the lazy
dog."
Another example:
"The quik brown lettuce over jumps the
lazy dog.“
Language Models (Unigrams, Bigrams, etc.)
66
Unigram model
Bigram model
But how can we learn these
probabilities?
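For reference (the slide's formulas are not in the export), the unigram and bigram approximations, with the conditional probabilities typically estimated from corpus counts (maximum likelihood):

```latex
\text{Unigram: } P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i)
\qquad
\text{Bigram: } P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\quad
P(w_i \mid w_{i-1}) \approx \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}
```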
Transfer Learning
67
68
Neural Nets
Artificial Neural Net
69
It is possible to build arbitrarily complex
networks using the artificial neuron as
the basic component.
Artificial Neural Net
70
Feedforward Neural Network
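A tiny NumPy sketch of a feedforward network's forward pass (layer sizes and activation are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer with 4 neurons (ReLU) and a single linear output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input (3 features) -> hidden (4 neurons)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden (4 neurons) -> output (1 value)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)   # hidden layer with ReLU activation
    return h @ W2 + b2               # output layer (linear)

print(forward(np.array([0.5, -1.2, 3.0])))
```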
Training
71
Given a training set of (input, target) pairs of the form
{(x(1), y(1)), ..., (x(N), y(N))},
training an ANN corresponds to using this set to
adjust the parameters of the network, so that the
training error is minimized.
So, training an ANN is an optimization problem.
Training
72
The error signal (computed with a cost function)
is used during training to gradually change the
weights (parameters), so that the predictions become
more accurate.
The training loop:
1. Pick a batch of training examples.
2. Propagate them through the layers from input to output (forward pass).
3. Backpropagate the error signal through the layers from the output to the input (backward pass).
4. Update the parameters W, b for all hidden layers.
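A minimal PyTorch sketch of this loop (my own illustration, on toy regression data): forward pass, error computation, backpropagation, and parameter update.

```python
import torch
import torch.nn as nn

# Toy regression data: y = 2x + 1 plus noise.
X = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2 * X + 1 + 0.1 * torch.randn_like(X)

model = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    pred = model(X)            # propagate the batch from input to output
    loss = loss_fn(pred, y)    # error signal from the cost function
    loss.backward()            # backpropagate the error through the layers
    optimizer.step()           # update the parameters W, b

print(loss.item())             # training error after optimization
```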