Sequence-to-sequence Models
CIS 530, Computational Linguistics: Spring 2018
John Hewitt & Reno Kriz
University of Pennsylvania
Some concepts drawn a bit transparently from Graham Neubig’s excellent
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
[Link]
We’ve already seen RNNs for language modeling
- The memory vector, or “state”.
- The “word vector” representation of the word.
- The RNN function, which combines the word vector and the previous state to
  create a new state.
[Figure: an RNN reading the example sentence “Only use neural nets” one word at a time.]
How does the RNN function work?
The RNN function takes the current RNN state and a word vector and produces a
subsequent RNN state that “encodes” the sentence so far.
The learned weights represent how to combine past information (the RNN memory)
and current information (the new word vector).
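To make the recurrence concrete, here is a minimal sketch of a single RNN step in plain numpy. The weight names (W_hx, W_hh, b_h) mirror the formalization later in these slides; the sizes are toy values chosen only for illustration.

```python
import numpy as np

def rnn_step(W_hx, W_hh, b_h, x_t, h_prev):
    """One RNN step: combine the word vector x_t with the previous state h_prev."""
    return np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)

# Toy sizes: 4-dimensional word vectors, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_hx = rng.normal(size=(3, 4))   # integrates input vector information
W_hh = rng.normal(size=(3, 3))   # integrates information from the previous timestep
b_h  = np.zeros(3)               # bias term

h = np.zeros(3)                       # initial state
for x_t in rng.normal(size=(5, 4)):   # five stand-in "word vectors"
    h = rnn_step(W_hx, W_hh, b_h, x_t, h)
```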
How does the prediction function work?
We’ve seen how RNNs “encode” word sequences. But how do they produce
probability distributions over a vocabulary?
A probability distribution over the vocab is constructed from the RNN memory and
one last learned transformation. The softmax function turns “scores” into a
probability distribution.
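A minimal sketch of that prediction step, continuing the toy numpy setup above; W_dh and b_d are hypothetical output-layer parameters mapping the hidden state to one score per vocabulary word.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into a probability distribution."""
    exps = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exps / exps.sum()

vocab = ["only", "use", "neural", "nets", "</s>"]
rng = np.random.default_rng(1)
W_dh = rng.normal(size=(len(vocab), 3))   # hidden state (size 3) -> one score per word
b_d  = np.zeros(len(vocab))

h = rng.normal(size=3)                    # some RNN memory vector
probs = softmax(W_dh @ h + b_d)           # distribution over the vocabulary
print(dict(zip(vocab, probs.round(2))))
```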
Want to predict things other than the next word?
The model architecture (read: “design”) we’ve seen so far is frequently used in
tasks other than language modeling, because modeling sequential information is
useful in language, apparently.
[Figure: the RNN encoder represents the sentence “Only use neural nets”; from each
state we can predict, e.g., parts of speech (ADV VB ADJ NNS), or even syntax.]
General idea: build a representation
The method of building the representation is called an Encoder and is frequently
an RNN.
Each memory vector in the encoder attempts to represent the sentence so far, but
mostly represents the word most recently input.
General idea: generate the output one token at a time
The model that takes the encoded representation and generates the output is
called the Decoder, and, errrr, is also generally an RNN.
[Figure: the Encoder (seq) reads “Only use neural nets”; the Decoder (2seq)
generates the translation “Jiri naanị netwọk nụ” one token at a time.]
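A minimal sketch of that generation loop, in the same toy Python style as above. Here decoder_step and predict_probs are hypothetical stand-ins for the decoder RNN and the softmax prediction layer; greedy argmax decoding is shown, though real systems often search more broadly.

```python
import numpy as np

def greedy_decode(decoder_step, predict_probs, h_enc_final, vocab, max_len=20):
    """Generate output tokens one at a time, feeding each prediction back in.

    decoder_step(prev_token, s_prev) -> new decoder state
    predict_probs(s) -> probability distribution over vocab (a numpy array)
    """
    s = h_enc_final                # initialize the decoder state from the encoder
    output, token = [], "<s>"
    for _ in range(max_len):
        s = decoder_step(token, s)
        probs = predict_probs(s)
        token = vocab[int(np.argmax(probs))]   # greedy: pick the most probable token
        if token == "</s>":
            break
        output.append(token)
    return output
```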
How is it trained?
In practice, training for a single sentence is done by “forcing” the decoder to generate gold
sequences, and penalizing it for assigning the sequence a low probability. Losses for each
token in the sequence are summed. Then, the summed loss is used to take a step in the right
direction in all model parameters (including word embeddings!) (stochastic gradient descent.)
[Figure: the decoder assigns probability .7 to the gold first token “Jiri”,
incurring a loss of -log(.7).]
Sentence-level training
Almost all such networks are trained using cross-entropy loss. At each step, the network
produces a probability distribution over possible next tokens. This distribution is
penalized for being different from the true distribution (e.g., a probability of 1 on the
actual next token.)
[Figure: for the gold output “Jiri naanị netwọk nụ”, the per-token losses are -log(.7),
-log(.5), -log(.6), and -log(.4). Minimize this:
sum( -log(.7) -log(.5) -log(.6) -log(.4) ) = 1.07]
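For concreteness, here is the arithmetic from the figure as a tiny sketch. The log base only rescales the loss by a constant factor, so it does not change which parameters minimize it; natural logs are the usual choice in practice, while base-10 logs reproduce the figure’s total.

```python
import math

# Probabilities the decoder assigned to each gold token (from the figure).
gold_probs = [0.7, 0.5, 0.6, 0.4]      # Jiri, naanị, netwọk, nụ

# Cross-entropy loss for the sentence: sum of -log p(gold token).
loss_natural = sum(-math.log(p) for p in gold_probs)     # natural log, ≈ 2.48
loss_base10  = sum(-math.log10(p) for p in gold_probs)   # base-10 log, ≈ 1.07
print(loss_natural, loss_base10)
```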
How is it formalized?
Let ht be the RNN hidden state at timestep t.
Let xt be the input vector at timestep t.
The RNN equation posits 2 matrices and 1 vector as parameters:
- Whx integrates input vector information.
- Whh integrates information from the previous timestep.
- bh is a bias term. (What function does this perform?)
The RNN equation is: ht = tanh(Whxxt + Whhht−1 + bh)
How is it formalized?
(Glossing over this slide is totally reasonable. Also feel free to check your phone, ping
your Bitcoin investment, see if your The Boring Company® (Not a) Flamethrower has shipped.)
For prediction, we take the current hidden state, and use it as features in what is
more or less a linear regression.
Let dt be our decision (e.g., word, POS tag) at timestep t. Let D be the set of all
possible decisions. Let st-1 be the most recent decoder hidden state.
dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)
p( * | x1:n, d1:t-1) = softmaxD(WDhst-1 + bD)
Note that WDhst-1 + bD produces a vector of scores. The softmax function
normalizes scores to a probability distribution by exponentiating each dimension
and normalizing by the sum. For some choice k of K, p(k) = escore(k) / ∑k’ ∈ K escore(k’)
The information bottleneck and latent structure
Given the diagram below, what problem do you foresee when translating
progressively longer sentences?
[Figure: the encoder-decoder translating “Only use neural nets” into “Jiri naanị netwọk nụ”.]
The information bottleneck and latent structure
We are trying to encode variable-length structure (e.g., variable-length sentences)
in a fixed-length memory (e.g., only the 300 dimensions of your hidden state.)
The last encoder hidden state is the bottleneck -- all information in the
source sentence must pass through it to get to the decoder.
Finding a solution to this problem was the final advance that made neural
MT competitive with previous approaches.
The information bottleneck and latent structure
The key insight is related to the word alignment work we did last week. We allow
the decoder to look at any encoder state, and let it learn which are important at
each time step!
Learning to pay attention
Attention summarizes the encoder, focusing on specific parts/words.
Step 1: Take the decoder state, and compute an affinity αi with each encoder state.
The affinity function is a dot product, or something similar.
Step 2: Normalize the affinities α0 ... α4 to sum to 1 with the softmax function,
giving weights a0 ... a4. (Note that ∑i ai = 1.)
Step 3: Average the encoder states, weighted by the a distribution. This weighted
average is called the context vector. In this example, since “Jiri” means “use”,
the attention will focus on the vectors around “use”.
Step 4: Use the context vector at prediction, concatenating it to the decoder state.
The resulting vector has the current decoder information, but also a focused summary
of the encoder; the softmax of one last transformation of it gives the probability
distribution over the vocabulary.
[Figure: focus of the context vector over the encoder states while predicting “Jiri”.]
Attention Formalization
(Glossing over this slide is totally reasonable. Feel free to check your phone, ping your
Bitcoin investment, see if your The Boring Company® (Not a) Flamethrower has shipped.)
Attention computes the affinity between the decoder state and all encoder states.
There are many affinity computation methods, but they’re all like a dot product.
Let there be n encoder states. The affinity between encoder state i and the
decoder state is αi. The encoder states are h1:n, and the decoder state is st-1.
Let αi = f(hi, st-1) = hiTst-1
Let weights a = softmax(α).
Let the context c = ∑i=1:n hi ai. (Note that this is a weighted average.)
Attention Formalization
Attention is used as extra information in the final prediction.
Reminder: we let the context c = ∑i=1:n hi ai. (Weighted average of encoder states.)
Let the notation [s;c] mean the concatenation of vectors s and c.
dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)    (same as before, without attention)
p( * | x1:n, d1:t-1) = softmaxD(WD(2h)[st-1;c] + bD)
So, the only difference is that the final prediction uses the context vector
concatenated to the decoder state to make the prediction.
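A minimal numpy sketch of dot-product attention as formalized above; the encoder states, decoder state, and output-layer parameters are random stand-ins chosen only to show the shapes.

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - v.max())
    return exps / exps.sum()

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 3))      # n=4 encoder states h_1:n, each of size 3
s = rng.normal(size=3)           # most recent decoder state s_{t-1}

alpha = H @ s                    # affinities: alpha_i = h_i^T s_{t-1}
a = softmax(alpha)               # attention weights; they sum to 1
c = a @ H                        # context vector: weighted average of encoder states

# Prediction with attention: score the concatenation [s; c].
vocab_size = 5
W = rng.normal(size=(vocab_size, 6))   # maps [s; c] (size 2h = 6) to vocab scores
b = np.zeros(vocab_size)
probs = softmax(W @ np.concatenate([s, c]) + b)
```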
Empirical considerations
There are a lot of “hyperparameter” choices that can greatly affect the quality of
your model. In short, take parameters from papers/tutorials, and grid search
(try many combinations of parameters) around them; a minimal grid-search sketch
follows this list.
- RNN variants: LSTMs have a different (much better) recurrent equation.
- Hidden state size: larger means more memory, but requires more data.
- Embedding size: larger means more representational power, but requires more data.
- Learning rate: the step size you take in learning your parameters. Start this
  “large”, and cut it in half when training stops improving development-set
  performance.
- Regularization: “dropout” prevents overfitting by making each node in your
  hidden state unavailable for an observation with a given probability. Try
  some values around .2 to .3.
- Batch size: the number of observations to group together before performing a
  parameter update step. Larger batches mean less fine-grained training but many
  more observations per minute, especially on a GPU.
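The sketch below shows one way to grid search around values taken from a paper or tutorial; train_and_eval_dev is a hypothetical placeholder for whatever trains your model and returns development-set performance.

```python
import itertools

def train_and_eval_dev(hidden_size, embed_size, learning_rate, dropout, batch_size):
    """Hypothetical: train a model with these hyperparameters, return dev-set score."""
    return 0.0   # placeholder; replace with real training + dev evaluation

grid = {
    "hidden_size":   [256, 512],
    "embed_size":    [128, 300],
    "learning_rate": [1.0, 0.5],   # cut in half when dev performance plateaus
    "dropout":       [0.2, 0.3],
    "batch_size":    [32, 64],
}

best_score, best_config = float("-inf"), None
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_eval_dev(**config)
    if score > best_score:
        best_score, best_config = score, config
```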
Case study: text simplification
Text simplification is the process in which a text is transformed into an equivalent
text that can be more easily read by a broader audience (Saggion, 2017).
Simplification can be used as a preprocessing tool for improving performance of
many NLP end-tasks such as parsing, SRL, summarization, information retrieval, etc.
“There’s just one major hitch: the primary purpose of education is to develop
citizens with a wide variety of skills.”
“The purpose of education is to develop many skills.”
Case study: text simplification
Text simplification can be thought of in part as monolingual machine translation.
Problem: The most common rewrite operation is copying from the complex
sentence to the simple sentence.
- One solution: add in reinforcement learning (Zhang and Lapata, 2017) to
  encourage the model to use other rewrite operations, such as deletion,
  substitution, and word reordering.
A brief introduction to Reinforcement Learning
The reinforcement learning framework (Sutton and Barto, 1998)
Case study: text simplification
Basic encoder-decoder model, from (Zhang and Lapata, 2017).
Case study: text simplification
Encoder-Decoder model with reinforcement learning (Zhang and Lapata, 2017).
Derivational morphology
- Process of generating new words from existing words
- Changes semantic meaning
- Often a new part-of-speech
employ       V -> N, Agent            employer
employ       V -> N, Passive          employee
employ       V -> N, Result           employment
employ       V -> Adj, Potential      employable
employable   V -> Adj -> N, Stative   employability
Derivational morphology
[Figure: a character-level Encoder (seq) / Decoder (2seq) model reads
“c o m p o s e” plus the tag VERB-NOM and generates the derived form character
by character (“c o m p o s i …”).]
Derivational morphology: search
[Figure: during decoding, the model considers multiple candidate continuations of
the generated prefix (e.g., “e r”, “n g”, “f i c a t i o n”, “g r o u n d”,
“i q”, “m e n t”, “s”).]
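The search over continuations can be made concrete with a small beam-search sketch; score_next is a hypothetical function standing in for the decoder’s log-probabilities over the next character.

```python
def beam_search(score_next, start="<s>", end="</s>", beam_size=3, max_len=20):
    """Keep the beam_size best partial outputs, extending each one step at a time.

    score_next(prefix) -> dict mapping each possible next symbol to its log-probability.
    """
    beam = [([start], 0.0)]                     # (symbols so far, total log-probability)
    for _ in range(max_len):
        candidates = []
        for symbols, logp in beam:
            if symbols[-1] == end:              # finished hypotheses stay as they are
                candidates.append((symbols, logp))
                continue
            for sym, lp in score_next(symbols).items():
                candidates.append((symbols + [sym], logp + lp))
        # Keep only the highest-scoring beam_size hypotheses.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(s[-1] == end for s, _ in beam):
            break
    return beam
```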
Reference Sheet
Legend (from the figures):
- A learned parameter (matrix color denotes whether it belongs to the encoder or
  the decoder).
- The memory vector, or “state”.
- The “word vector” representation of the word.
- The RNN function, which combines the word vector and the previous state to
  create a new state.
Equations:
- Whx integrates input vector information.
- Whh integrates information from the previous timestep.
- bh is a bias term.
- The RNN equation is: ht = tanh(Whxxt + Whhht−1 + bh)
- dt is our decision at timestep t.
- dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)
- p( * | x1:n, d1:t-1) = softmaxD(WDhst-1 + bD)