Natural Language Processing
with Deep Learning
Language Models and
Recurrent Neural Networks
Overview
Today we will:
• Introduce a new NLP task
  • Language Modeling
  …which motivates…
• Introduce a new family of neural networks
  • Recurrent Neural Networks (RNNs)
These are two of the most important ideas for the rest of the class!
Language Modeling
• Language Modeling is the task of predicting what word comes next.
  [Example: “the students opened their ___” → books? laptops? exams? minds?]
• More formally: given a sequence of words x^{(1)}, x^{(2)}, \ldots, x^{(t)}, compute the probability distribution of the next word x^{(t+1)}:
  P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})
  where x^{(t+1)} can be any word in the vocabulary V = \{w_1, \ldots, w_{|V|}\}
• A system that does this is called a Language Model.
Language Modeling
• You can also think of a Language Model as a system that
assigns probability to a piece of text.
• For example, if we have some text x^{(1)}, \ldots, x^{(T)}, then the probability of this text (according to the Language Model) is:
  P(x^{(1)}, \ldots, x^{(T)}) = P(x^{(1)}) \times P(x^{(2)} \mid x^{(1)}) \times \dots \times P(x^{(T)} \mid x^{(T-1)}, \ldots, x^{(1)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})
  (each conditional factor is what our LM provides)
You use Language Models every day!
n-gram Language Models
the students opened their ___
• Question: How to learn a Language Model?
• Answer (pre-Deep Learning): learn an n-gram Language Model!
• Definition: An n-gram is a chunk of n consecutive words.
  • unigrams: “the”, “students”, “opened”, “their”
  • bigrams: “the students”, “students opened”, “opened their”
  • trigrams: “the students opened”, “students opened their”
  • 4-grams: “the students opened their”
• Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
n-gram Language Models
• First we make a simplifying assumption: x^{(t+1)} depends only on the preceding n-1 words.

  P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}) \approx P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)})   (assumption: condition only on the preceding n-1 words)

  = \frac{P(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)})}{P(x^{(t)}, \ldots, x^{(t-n+2)})}   (definition of conditional prob: prob of an n-gram divided by prob of an (n-1)-gram)

• Question: How do we get these n-gram and (n-1)-gram probabilities?
• Answer: By counting them in some large corpus of text!

  \approx \frac{\mathrm{count}(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)})}{\mathrm{count}(x^{(t)}, \ldots, x^{(t-n+2)})}   (statistical approximation)
n-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.

  as the proctor started the clock, the students opened their ___

The 4-gram model discards everything except the last 3 words and conditions on “students opened their”:

  P(w \mid \text{students opened their}) = \frac{\mathrm{count}(\text{students opened their } w)}{\mathrm{count}(\text{students opened their})}

For example, suppose that in the corpus:
• “students opened their” occurred 1000 times
• “students opened their books” occurred 400 times
  → P(books | students opened their) = 0.4
• “students opened their exams” occurred 100 times
  → P(exams | students opened their) = 0.1
(Should we have discarded the “proctor” context?)
(A small counting sketch follows below.)
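The counting recipe above is straightforward to implement. Here is a minimal sketch (not the lecture's code; the function names and toy corpus are invented for illustration) of estimating 4-gram probabilities by counting:

```python
from collections import Counter, defaultdict

def train_ngram_lm(tokens, n):
    """Count n-grams and their (n-1)-gram contexts in a list of tokens."""
    context_counts = Counter()
    next_word_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        next_word = tokens[i + n - 1]
        context_counts[context] += 1
        next_word_counts[context][next_word] += 1
    return context_counts, next_word_counts

def prob(next_word, context, context_counts, next_word_counts):
    """P(next_word | context) = count(context + next_word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # sparsity problem: this context was never observed
    return next_word_counts[context][next_word] / context_counts[context]

corpus = "the students opened their books . the students opened their exams .".split()
counts = train_ngram_lm(corpus, n=4)
print(prob("books", ("students", "opened", "their"), *counts))   # 0.5 in this toy corpus
```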
Sparsity Problems with n-gram Language Models
Sparsity Problem 1
Problem: What if “students opened their w” never occurred in the data? Then w has probability 0!
(Partial) Solution: Add a small 𝛿 to the count for every w ∈ V. This is called smoothing.

Sparsity Problem 2
Problem: What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w!
(Partial) Solution: Just condition on “opened their” instead. This is called backoff.
(See the sketch of both fixes below.)

Note: Increasing n makes sparsity problems worse.
Typically we can’t have n bigger than 5.
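Both partial solutions are small modifications of the counting estimator. A rough sketch, assuming the hypothetical count tables from the previous sketch plus a `vocab` set:

```python
def smoothed_prob(next_word, context, context_counts, next_word_counts, vocab, delta=0.1):
    """Add-delta smoothing: (count + delta) / (count(context) + delta * |V|)."""
    context = tuple(context)
    numer = next_word_counts[context][next_word] + delta
    denom = context_counts[context] + delta * len(vocab)
    return numer / denom

def backoff_prob(next_word, context, models):
    """models: (context_counts, next_word_counts) pairs for decreasing n,
    e.g. [4-gram, trigram, bigram]. Back off to a shorter context whenever
    the longer one was never observed."""
    for i, (context_counts, next_word_counts) in enumerate(models):
        ctx = tuple(context[i:])              # drop words from the left as we back off
        if context_counts[ctx] > 0:
            return next_word_counts[ctx][next_word] / context_counts[ctx]
    return 0.0
```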
Storage Problems with n-gram Language Models
Storage: Need to store counts for all n-grams you saw in the corpus.
Increasing n or increasing the corpus increases model size!
n-gram Language Models in practice
• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters: business and financial news) in a few seconds on your laptop*

Condition on “today the” and get a probability distribution over the next word:
  company 0.153
  bank    0.153
  price   0.077
  italian 0.039
  emirate 0.039
  …

Sparsity problem: not much granularity in the probability distribution. Otherwise, seems reasonable!
* Try for yourself: https://nlpforhackers.io/language-models/
Generating text with an n-gram Language Model
• You can also use a Language Model to generate text.

Condition on “today the”, get the probability distribution, and sample the next word:
  company 0.153
  bank    0.153
  price   0.077
  italian 0.039
  emirate 0.039
  …
Generating text with an n-gram Language Model
• You can also use a Language Model to generate text.

Condition on “today the price”, get the probability distribution, and sample the next word:
  of  0.308
  for 0.050
  it  0.046
  to  0.046
  is  0.031
  …
Generating text with an n-gram Language Model
• You can also use a Language Model to generate text.

Condition on “today the price of”, get the probability distribution, and sample the next word:
  the  0.072
  18   0.043
  oil  0.043
  its  0.036
  gold 0.018
  …
Generating text with an n-gram Language Model
• You can also use a Language Model to generate text.

  today the price of gold
Generating text with an n-gram Language Model
• You can also use a Language Model to generate text.

  today the price of gold per ton , while production of shoe
  lasts and shoe industry , the bank intervened just after it
  considered and rejected an imf demand to rebuild depleted
  european stocks , sept 30 end primary 76 cts a share .

Surprisingly grammatical!
…but incoherent. We need to consider more than three words at a time if we want to model language well.
But increasing n worsens the sparsity problem and increases model size…
(A small sampling sketch follows below.)
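Here is a rough sketch of that sampling loop (hypothetical, reusing the count tables from the earlier n-gram sketch): condition on the last n-1 words, sample the next word from the resulting distribution, append it, and repeat.

```python
import random

def generate(context_counts, next_word_counts, seed, num_words=15):
    words = list(seed)                         # e.g. ("today", "the") for a trigram LM
    n_minus_1 = len(seed)
    for _ in range(num_words):
        context = tuple(words[-n_minus_1:])    # condition on the last n-1 words
        dist = next_word_counts[context]
        if not dist:                           # unseen context: stop (or back off)
            break
        candidates = list(dist.keys())
        weights = list(dist.values())          # proportional to counts, i.e. to P(w | context)
        words.append(random.choices(candidates, weights=weights)[0])
    return " ".join(words)
```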
How to build a neural Language Model?
• Recall the Language Modeling task:
• Input: sequence of words x^{(1)}, x^{(2)}, \ldots, x^{(t)}
• Output: prob dist of the next word P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})
• How about a window-based neural model?
• We saw this applied to Named Entity Recognition in Lecture 3:
  [Figure: a window-based classifier over “museums in Paris are amazing” predicting LOCATION for the center word “Paris”]
A fixed-window neural Language Model
  as the proctor started the clock | the students opened their ___
  (discard everything to the left of the fixed window; condition only on “the students opened their”)
A fixed-window neural Language Model
[Figure: the fixed-window neural LM over “the students opened their”:
  words / one-hot vectors x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}
  concatenated word embeddings e = [e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}]
  hidden layer h = f(W e + b_1)
  output distribution \hat{y} = \mathrm{softmax}(U h + b_2) \in \mathbb{R}^{|V|}, e.g. placing high probability on “books”, “laptops”, …]
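A minimal PyTorch sketch of this kind of architecture (the class name and dimensions are invented, not the lecture's code): embed each word in the window, concatenate the embeddings, pass them through a hidden layer, and apply a softmax over the vocabulary.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=200, window_size=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # e^(i) = E x^(i)
        self.hidden = nn.Linear(window_size * embed_dim, hidden_dim)   # h = f(W e + b1)
        self.output = nn.Linear(hidden_dim, vocab_size)                # softmax(U h + b2)

    def forward(self, window):           # window: (batch, window_size) word indices
        e = self.embed(window)           # (batch, window_size, embed_dim)
        e = e.reshape(e.size(0), -1)     # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.output(h), dim=-1)  # log prob dist over the next word
```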
A fixed-window neural Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don’t need to store all observed n-grams

Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• Window can never be large enough!
• x^{(1)} and x^{(2)} are multiplied by completely different weights in W. No symmetry in how the inputs are processed.

We need a neural architecture that can process any length input.
Recurrent Neural Networks (RNN)
A family of neural architectures
Core idea: Apply the same weights W repeatedly
[Figure: an input sequence x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}, … of any length feeds a chain of hidden states h^{(0)}, h^{(1)}, h^{(2)}, …; each hidden state is computed from the previous hidden state and the current input using the same weights W, and an output \hat{y}^{(t)} (optional) can be produced at every step]
An RNN Language Model
[Figure: the RNN-LM unrolled over “the students opened their”:
  words / one-hot vectors x^{(t)} \in \mathbb{R}^{|V|}
  word embeddings e^{(t)} = E x^{(t)}
  hidden states h^{(t)} = \sigma(W_h h^{(t-1)} + W_e e^{(t)} + b_1), where h^{(0)} is the initial hidden state
  output distribution \hat{y}^{(t)} = \mathrm{softmax}(U h^{(t)} + b_2) \in \mathbb{R}^{|V|}, e.g. placing high probability on “books”, “laptops”, …]
Note: this input sequence could be much longer, but this slide doesn’t have space!
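A minimal PyTorch sketch of these equations (hypothetical class name and dimensions; a real implementation would use nn.RNN or batched operations), showing the same weights W_h and W_e applied at every timestep:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # e^(t) = E x^(t)
        self.W_e = nn.Linear(embed_dim, hidden_dim)
        self.W_h = nn.Linear(hidden_dim, hidden_dim)
        self.U = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq_len) word indices
        batch, seq_len = tokens.shape
        h = torch.zeros(batch, self.W_h.out_features)   # h^(0): initial hidden state
        logits = []
        for t in range(seq_len):                   # same weights applied on every timestep
            e_t = self.embed(tokens[:, t])
            h = torch.sigmoid(self.W_h(h) + self.W_e(e_t))
            logits.append(self.U(h))               # unnormalized scores over the next word
        return torch.stack(logits, dim=1)          # (batch, seq_len, vocab_size)
```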
An RNN Language Model
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn’t increase for longer input
• Same weights applied on every timestep, so there is symmetry in how inputs are processed.
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
(More on these later in the course)
Training an RNN Language Model
• Get a big corpus of text which is a sequence of words x^{(1)}, \ldots, x^{(T)}
• Feed into the RNN-LM; compute the output distribution \hat{y}^{(t)} for every step t.
  • i.e. predict the probability dist of every word, given the words so far
• Loss function on step t is the cross-entropy between the predicted probability distribution \hat{y}^{(t)} and the true next word y^{(t)} (one-hot for x^{(t+1)}):
  J^{(t)}(\theta) = CE(y^{(t)}, \hat{y}^{(t)}) = -\sum_{w \in V} y^{(t)}_w \log \hat{y}^{(t)}_w = -\log \hat{y}^{(t)}_{x_{t+1}}
• Average this to get the overall loss for the entire training set:
  J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)
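In code, this loss is just the cross-entropy between the step-t prediction and the word at step t+1. A rough sketch, assuming the hypothetical RNNLM class from the previous sketch:

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens):
    """tokens: (batch, seq_len) word indices for one batch of sentences."""
    logits = model(tokens[:, :-1])       # predict a distribution at steps 1..T-1
    targets = tokens[:, 1:]              # the true "next word" at each of those steps
    # cross_entropy averages -log yhat_{x^(t+1)} over all steps and sentences: J(theta)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```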
Training an RNN Language Model
[Figure: the corpus “the students opened their exams …” is fed into the RNN-LM; at each step the model outputs a predicted prob dist \hat{y}^{(t)}, and the step-t loss is the negative log probability the model assigned to the true next word:
  J^{(1)}(\theta) = negative log prob of “students”
  J^{(2)}(\theta) = negative log prob of “opened”
  J^{(3)}(\theta) = negative log prob of “their”
  J^{(4)}(\theta) = negative log prob of “exams”
  …
  Loss: J^{(1)}(\theta) + J^{(2)}(\theta) + J^{(3)}(\theta) + J^{(4)}(\theta) + … , averaged to give J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)]
Training an RNN Language Model
• However: Computing the loss and gradients across the entire corpus x^{(1)}, \ldots, x^{(T)} is too expensive!
• In practice, consider x^{(1)}, \ldots, x^{(T)} as a sentence (or a document)
• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data, and update.
• Compute loss J(\theta) for a sentence (actually a batch of sentences), compute gradients and update weights. Repeat. (A sketch of this loop follows below.)
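A rough sketch of that loop (the `batches` iterable of token-index tensors is hypothetical, and `lm_loss` is the sketch from above):

```python
import torch

def train(model, batches, epochs=1, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens in batches:               # tokens: (batch, seq_len) word indices
            optimizer.zero_grad()
            loss = lm_loss(model, tokens)    # cross-entropy loss for this batch of sentences
            loss.backward()                  # backpropagation through time
            optimizer.step()                 # update the weights, then repeat
```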
Backpropagation for RNNs
Question: What’s the derivative of J^{(t)}(\theta) w.r.t. the repeated weight matrix W_h?
Answer:
  \frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}
“The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears”
Why?
Multivariable Chain Rule
For a function f(x, y) where x(t) and y(t) are themselves functions of t:
  \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}
Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
Backpropagation for RNNs: Proof sketch
In our example, W_h appears once at every timestep i, and J^{(t)} depends on W_h through each of these appearances. Apply the multivariable chain rule:
  \frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)} \frac{\partial W_h\big|_{(i)}}{\partial W_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)} \quad \text{since } \frac{\partial W_h\big|_{(i)}}{\partial W_h} = 1
Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
Backpropagation for RNNs
Question: How do we calculate this?
Answer: Backpropagate over timesteps i = t, …, 0, summing gradients as you go.
This algorithm is called “backpropagation through time”.
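A tiny scalar example (not from the lecture) that can be checked by hand: the same weight w is used at two timesteps, and the gradient autograd computes equals the sum of the per-appearance contributions.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
h0 = torch.tensor(1.0)
h1 = w * h0          # w used at step 1
h2 = w * h1          # w used again at step 2, so h2 = w^2 * h0
h2.backward()
# dh2/dw = 2*w*h0 = 4.0
#   = (contribution from the step-2 appearance, holding h1 fixed: h1 = 2.0)
#   + (contribution through step 1: dh2/dh1 * dh1/dw = w * h0 = 2.0)
print(w.grad)        # tensor(4.)
```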
Generating text with an RNN Language Model
Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. The sampled output is the next step’s input.
[Figure: starting from “my”, each step samples the next word from \hat{y}^{(t)} and feeds it back in as the next input: my → favorite → season → is → spring]
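A rough sketch of this repeated-sampling loop (the word↔index mappings are hypothetical, and the model is the RNNLM sketch from earlier); re-running the model over the whole prefix at each step is wasteful but keeps the sketch simple:

```python
import torch

def generate_text(model, word2id, id2word, seed="my", num_words=10):
    tokens = [word2id[seed]]
    for _ in range(num_words):
        inp = torch.tensor([tokens])                  # (1, current_length)
        logits = model(inp)[0, -1]                    # scores for the next word
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()  # sample from yhat^(t)
        tokens.append(next_id)                        # sampled output is the next input
    return " ".join(id2word[i] for i in tokens)
```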
Generating text with an RNN Language Model
• Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
  Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
• RNN-LM trained on Harry Potter:
  Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
• RNN-LM trained on recipes:
  Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
• RNN-LM trained on paint color names:
  This is an example of a character-level RNN-LM (predicts what character comes next)
  Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network
Evaluating Language Models
• The standard evaluation metric for Language Models is perplexity:
  \text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{LM}(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})} \right)^{1/T}
  (the inverse probability of the corpus, according to the Language Model, normalized by the number of words)
• This is equal to the exponential of the cross-entropy loss J(\theta):
  \prod_{t=1}^{T} \left( \frac{1}{\hat{y}^{(t)}_{x_{t+1}}} \right)^{1/T} = \exp\left( \frac{1}{T} \sum_{t=1}^{T} -\log \hat{y}^{(t)}_{x_{t+1}} \right) = \exp(J(\theta))
• Lower perplexity is better!
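Since perplexity = exp(J(θ)), it can be computed directly from the average cross-entropy loss. A rough sketch reusing the hypothetical `lm_loss` from above:

```python
import math
import torch

def perplexity(model, batches):
    total_loss, total_words = 0.0, 0
    with torch.no_grad():
        for tokens in batches:
            n_words = tokens[:, 1:].numel()                   # number of predicted words
            total_loss += lm_loss(model, tokens).item() * n_words
            total_words += n_words
    return math.exp(total_loss / total_words)                 # exp of average cross-entropy
```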
RNNs have greatly improved perplexity
[Figure: perplexity results on a benchmark, from an n-gram model down through increasingly complex RNNs; perplexity improves (lower is better)]
Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/
Why should we care about Language Modeling?
• Language Modeling is a benchmark task that helps us
measure our progress on understanding language
• Language Modeling is a subcomponent of many NLP tasks,
especially those involving generating text or
estimating the probability of text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Authorship identification
• Machine translation
• Summarization
• Dialogue
• etc.
Recap
• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
• Take sequential input of any length
• Apply the same weights on each step
• Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We’ve shown that RNNs are a great way to build an LM.
• But RNNs are useful for much more!
RNNs can be used for tagging
e.g. part-of-speech tagging, named entity recognition
[Figure: an RNN over “the startled cat knocked over the vase” predicts a tag at each position: DT JJ NN VBN IN DT NN]
RNNs can be used for sentence classification
e.g. sentiment classification
[Figure: an RNN runs over “overall I enjoyed the movie a lot”; a sentence encoding computed from its hidden states is fed to a classifier, which predicts “positive”]
How to compute the sentence encoding?
• Basic way: use the final hidden state
• Usually better: take an element-wise max or mean of all hidden states
(A sketch of both options follows below.)
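A minimal PyTorch sketch (hypothetical class name and dimensions) contrasting the two choices of sentence encoding:

```python
import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_dim=200,
                 pooling="mean"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.pooling = pooling

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(tokens))
        if self.pooling == "final":
            encoding = hidden_states[:, -1]           # basic way: final hidden state
        else:
            encoding = hidden_states.mean(dim=1)      # usually better: mean of all states
        return self.classifier(encoding)              # class scores, e.g. positive/negative
```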
RNNs can be used as an encoder module
e.g. question answering, machine translation, many other tasks!
[Figure: question answering example.
  Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
  Question: what nationality was Beethoven ?
  Answer: German]
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system.
RNN-LMs can be used to generate text
e.g. speech recognition, machine translation, summarization
[Figure: an RNN-LM generates “what’s the weather”, conditioned on an Input (audio); starting from <START>, each generated word is fed back in as the next input]
This is an example of a conditional language model.
We’ll see Machine Translation in much more detail later.
A note on terminology
RNN described in this lecture = “vanilla RNN”
Next lecture: You will learn about other RNN flavors
like GRU and LSTM and multi-layer RNNs
By the end of the course: You will understand phrases like
“stacked bidirectional LSTM with residual connections and self-attention”
Next time
• Problems with RNNs!
  • Vanishing gradients
  …which motivates…
• Fancy RNN variants!
  • LSTM
  • GRU
  • multi-layer
  • bidirectional