h0 ← 0
for i ← 1 to LENGTH(x) do
    hi ← g(U hi−1 + W xi)
    yi ← f(V hi)
return y
Figure 8.3 Forward inference in a simple recurrent network. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
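A minimal NumPy sketch of this forward pass; the activation functions g and f and the weight matrices U, W, V are placeholders for a trained network's parameters:

import numpy as np

def forward_rnn(x, U, W, V, g=np.tanh, f=lambda z: z):
    """Forward inference in a simple RNN (Figure 8.3).
    x: list of input vectors x_1..x_T, each of shape (d_in,)
    U: (d, d) recurrent weights, W: (d, d_in) input weights, V: (d_out, d) output weights."""
    h = np.zeros(U.shape[0])          # h_0 = 0
    y = []
    for x_i in x:                     # for i = 1 .. LENGTH(x)
        h = g(U @ h + W @ x_i)        # h_i = g(U h_{i-1} + W x_i)
        y.append(f(V @ h))            # y_i = f(V h_i)
    return y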
Training in simple RNNs
• a training set,
• a loss function,
• backpropagation

Unlike feedforward networks:
1. To compute the loss for the output at time t we need the hidden layer from time t − 1.
2. The hidden layer at time t influences both the output at time t and the hidden layer at time t + 1.

So: to measure the error accruing to ht, we need to know its influence on both the current output as well as the ones that follow.
Figure: panels (a) and (b) contrasting a feedforward network (FFN) with a recurrent network (RNN); both map a hidden layer ht to an output ŷt through V, but the RNN's hidden layer also receives the previous hidden state through U.
Forward inference in the RNN language model

Given an input X = [x1; ...; xt; ...; xN] of N tokens, each represented as a one-hot vector of size |V| × 1:

• Use the embedding matrix E to retrieve the embedding for the current token xt:
  et = Ext
• Multiply it by the weight matrix W and add it to the hidden layer from the previous step (weighted by the weight matrix U) to compute a new hidden layer:
  ht = g(Uht−1 + Wet)
• The hidden layer is then used to generate an output layer, which is passed through a softmax to produce a probability distribution over the entire vocabulary, predicting the next word:
  ŷt = softmax(Vht)
Shapes
• ŷt : |V| × 1
• V : |V| × d
• ht : d × 1
• U : d × d
• W : d × d
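To make these shapes concrete, here is a hedged NumPy sketch of one step of the RNN language model; |V|, d, and the randomly initialized matrices are illustrative stand-ins for a trained model:

import numpy as np

V_size, d = 10000, 128                     # |V| and the hidden/embedding size (illustrative)
E = np.random.randn(d, V_size) * 0.01      # embedding matrix, [d x |V|]
U = np.random.randn(d, d) * 0.01           # recurrent weights, [d x d]
W = np.random.randn(d, d) * 0.01           # input weights, [d x d]
V = np.random.randn(V_size, d) * 0.01      # output matrix, [|V| x d]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_step(word_id, h_prev):
    e_t = E[:, word_id]                    # e_t = E x_t                  (d x 1)
    h_t = np.tanh(U @ h_prev + W @ e_t)    # h_t = g(U h_{t-1} + W e_t)   (d x 1)
    y_t = softmax(V @ h_t)                 # y^_t = softmax(V h_t)        (|V| x 1)
    return y_t, h_t

y_hat, h = lm_step(42, np.zeros(d))
assert y_hat.shape == (V_size,) and h.shape == (d,)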
• Self-supervision
• take a corpus of text as training material
• at each time step t
• ask the model to predict the next word.
• Why called self-supervised: we don't need human labels;
the text is its own supervision signal
• We train the model to
• minimize the error
• in predicting the true next word in the training sequence,
• using cross-entropy as the loss function.
Figure 8.6 Training RNNs as language models.
Cross-entropy loss

LCE = −Σw∈V yt[w] log ŷt[w]

The cross-entropy loss measures the difference between:
• a predicted probability distribution ŷt, and
• the correct distribution yt.

CE loss for LMs is simpler!

In the case of language modeling, the correct distribution yt comes from knowing the next word. It is represented as a one-hot vector over the vocabulary, where the entry for the actual next word is 1 and all the other entries are 0.

So the CE loss for LMs is determined only by the probability the model assigns to the correct next word in the training sequence. At time t, the CE loss is:

LCE(ŷt, yt) = −log ŷt[wt+1]
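A small sketch showing that the full cross-entropy sum and the simplified next-word form agree; y_hat and next_word_id are placeholder names for the model's softmax output at time t and the index of the gold next word:

import numpy as np

def ce_loss_full(y_hat, next_word_id):
    """L_CE = -sum_w y_t[w] * log y^_t[w], with y_t a one-hot vector."""
    y_true = np.zeros_like(y_hat)
    y_true[next_word_id] = 1.0
    return -np.sum(y_true * np.log(y_hat))

def ce_loss_lm(y_hat, next_word_id):
    """Simplified LM form: only the gold next word matters."""
    return -np.log(y_hat[next_word_id])

y_hat = np.array([0.1, 0.7, 0.2])          # toy distribution over a 3-word vocabulary
assert np.isclose(ce_loss_full(y_hat, 1), ce_loss_lm(y_hat, 1))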
Teacher forcing
We always give the model the correct history to predict the next word (rather than feeding the model the possibly buggy guess from the prior time step).
This is called teacher forcing (in training we force the context to be correct based on the gold words).

What teacher forcing looks like:
• At word position t
• the model takes as input the correct word wt together with ht−1, and computes a probability distribution over possible next words
• That gives the loss for the next token wt+1
• Then we move on to the next word, ignore what the model predicted for the next word, and instead use the correct word wt+1 along with the prior history encoded to estimate the probability of token wt+2.
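A sketch of one teacher-forced pass over a training sequence, reusing the hypothetical lm_step and ce_loss_lm helpers from the sketches above as parameters; the key point is that the input at each step is the gold word, never the model's own prediction:

import numpy as np

def teacher_forced_loss(word_ids, lm_step, ce_loss_lm, d=128):
    """word_ids: gold training sequence [w_1, ..., w_T] as vocabulary indices."""
    h = np.zeros(d)
    total = 0.0
    for t in range(len(word_ids) - 1):
        # Input the *correct* word w_t, regardless of what the model
        # predicted at the previous step (teacher forcing).
        y_hat, h = lm_step(word_ids[t], h)
        # The loss is measured against the correct next word w_{t+1}.
        total += ce_loss_lm(y_hat, word_ids[t + 1])
    return total / (len(word_ids) - 1)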
Weight tying

The input embedding matrix E and the final layer matrix V are similar:
• The columns of E represent the word embeddings for each word in the vocab. E is [d × |V|].
• The final layer matrix V gives a score (logit) for each word in the vocab. V is [|V| × d].
That is, the rows of V are shaped like a transpose of E, meaning that V provides a second set of learned word embeddings.

Instead of having two separate matrices, we just tie them together, using Eᵀ instead of V. The weight-tied equations for an RNN language model then become:

et = Ext
ht = g(Uht−1 + Wet)
ŷt = softmax(Eᵀht)
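A sketch of the weight-tied output layer: the logits come from Eᵀht rather than a separate V (the matrices here are illustrative random initializations):

import numpy as np

V_size, d = 10000, 128
E = np.random.randn(d, V_size) * 0.01      # [d x |V|]: columns are the word embeddings
U = np.random.randn(d, d) * 0.01
W = np.random.randn(d, d) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tied_lm_step(word_id, h_prev):
    e_t = E[:, word_id]                    # e_t = E x_t
    h_t = np.tanh(U @ h_prev + W @ e_t)    # h_t = g(U h_{t-1} + W e_t)
    y_t = softmax(E.T @ h_t)               # y^_t = softmax(E^T h_t): no separate V
    return y_t, h_t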
RNNs as Language Models
RNNs for Sequences
RNNs for sequence labeling
Assign a label to each element of a sequence
Part-of-speech tagging
Figure: at each time step, the word embeddings e feed RNN layer(s) whose hidden state h is mapped by V to a softmax over tags; taking the argmax yields the tag sequence (e.g. NNP MD VB DT NN).
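A hedged sketch of RNN sequence labeling: the same recurrence, with the hidden state at each step mapped to a softmax over tags followed by an argmax; the tagset and the weight matrices U, W, V_tags are placeholders:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_tagger(embeddings, U, W, V_tags, tagset):
    """embeddings: list of word-embedding vectors; V_tags: [n_tags x d]."""
    h = np.zeros(U.shape[0])
    tags = []
    for e_t in embeddings:
        h = np.tanh(U @ h + W @ e_t)                  # RNN layer
        p_tags = softmax(V_tags @ h)                  # softmax over tags
        tags.append(tagset[int(np.argmax(p_tags))])   # argmax -> tag (e.g. NNP, MD, VB)
    return tags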
Text classification
Figure 8.8 Sequence classification using a simple RNN combined with a feedforward network. The final hidden state from the RNN is used as the input to a feedforward network that performs the classification.
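A sketch of this setup: run the RNN over the input, keep only the final hidden state hn, and pass it through a (single-layer) feedforward network and softmax; W_ffn and b_ffn are assumed classifier parameters:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(embeddings, U, W, W_ffn, b_ffn):
    """Return a class distribution computed from the RNN's final hidden state."""
    h = np.zeros(U.shape[0])
    for e_t in embeddings:                 # run the RNN over x_1 .. x_n
        h = np.tanh(U @ h + W @ e_t)
    return softmax(W_ffn @ h + b_ffn)      # h_n -> FFN -> softmax over classes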
Figure: stacked RNNs. The embedded inputs x1 ... xn feed RNN 1; its output sequence is the input to RNN 2, whose outputs feed RNN 3, which produces y1 ... yn (through a softmax).
Bidirectional RNNs

In the left-to-right RNNs we've discussed so far, the hidden state at a given time t represents everything the network knows about the sequence up to that point: it is a function of the inputs x1, ..., xt and represents the context of the network to the left of the current time.

h^f_t = RNNforward(x1, ..., xt)    (8.16)

Here h^f_t corresponds to the normal hidden state at time t, representing everything the network has gleaned from the sequence so far.

To take advantage of context to the right of the current input, we can train an RNN on a reversed input sequence. With this approach, the hidden state at time t represents all the information we have discerned about the sequence from t to the end of the sequence:

h^b_t = RNNbackward(xt, ..., xn)    (8.17)

A bidirectional RNN (Schuster and Paliwal, 1997) combines two independent RNNs, one where the input is processed from the start to the end, and the other from the end to the start. We then concatenate the two representations computed by the two networks into a single vector that captures both the left and right contexts of an input at each point in time. Here we use the semicolon ";" (or the equivalent symbol ⊕) for concatenation:

ht = [h^f_t ; h^b_t]    (8.18)

Figure: a bidirectional RNN runs RNN 1 left-to-right and RNN 2 right-to-left over x1 ... xn, concatenating their states to produce y1 ... yn; for sequence classification, the final forward state →hn and the final backward state ←h1 are concatenated and passed to a FFN.
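A sketch of the bidirectional combination in equations 8.16-8.18; run_rnn is a hypothetical helper standing in for either directional RNN:

import numpy as np

def run_rnn(inputs, U, W):
    """Return the hidden states h_1..h_n of a simple RNN for one direction."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(U @ h + W @ x_t)
        states.append(h)
    return states

def birnn(inputs, Uf, Wf, Ub, Wb):
    hf = run_rnn(inputs, Uf, Wf)                # h^f_t = RNNforward(x_1, ..., x_t)
    hb = run_rnn(inputs[::-1], Ub, Wb)[::-1]    # h^b_t = RNNbackward(x_t, ..., x_n)
    # h_t = [h^f_t ; h^b_t]: concatenate the two directions at each position
    return [np.concatenate([f, b]) for f, b in zip(hf, hb)]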
The LSTM
Motivating the LSTM: dealing with distance
• It's hard to assign probabilities accurately when context is very far away:
• The flights the airline was canceling were full.
• Hidden layers are being forced to do two things:
• Provide information useful for the current decision,
• Update and carry forward information required for future decisions.
• Another problem: during backprop, we have to repeatedly multiply gradients through time and many h's
• The "vanishing gradient" problem
The LSTM: Long short-term memory network
LSTMs divide the context management problem into two subproblems:
• removing information no longer needed from the context,
• adding information likely to be needed for later decision making
• LSTMs add:
• an explicit context layer
• neural circuits with gates to control information flow
Forget gate
Deletes information from the context that is no longer needed:

ft = σ(Uf ht−1 + Wf xt)    (8.20)
kt = ct−1 ⊙ ft    (8.21)

Element-wise multiplication of two vectors (represented by the operator ⊙, sometimes called the Hadamard product) is the vector of the same dimension as the two input vectors, where each element i is the product of element i in the two input vectors.
The next task is to compute the actual information we need to extract from the previous hidden state and current inputs, using the same basic computation we've been using in our recurrent networks:

gt = tanh(Ug ht−1 + Wg xt)    (8.22)

Add gate
Selects the information to add to the current context:

it = σ(Ui ht−1 + Wi xt)
jt = gt ⊙ it

We then add this to the modified context vector to get our new context vector:

ct = jt + kt
Output gate
Decides what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions):

ot = σ(Uo ht−1 + Wo xt)
ht = ot ⊙ tanh(ct)
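A hedged NumPy sketch of a single LSTM step implementing the gate equations above; the weight matrices in p are illustrative, and * plays the role of the element-wise (Hadamard) product ⊙:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """p: dict with weight matrices Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo."""
    f_t = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x_t)   # forget gate
    k_t = c_prev * f_t                                # k_t = c_{t-1} (.) f_t
    g_t = np.tanh(p["Ug"] @ h_prev + p["Wg"] @ x_t)   # candidate information
    i_t = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x_t)   # add gate
    j_t = g_t * i_t                                   # j_t = g_t (.) i_t
    c_t = j_t + k_t                                   # new context vector
    o_t = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x_t)   # output gate
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t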
Figure: a single LSTM unit. The forget gate (f), add gate (i) with candidate (g), and output gate (o) combine ht−1 and xt to update the context vector ct−1 → ct and compute the new hidden state ht.
LSTM Units
Figure: basic recurrent units compared with the LSTM unit; both map xt and ht−1 to ht, but the LSTM also carries and outputs the context vector ct.
Figure: from language model to encoder-decoder. An RNN language model reads x1 x2 ... xt−1 and predicts x2 x3 ... xt; in the encoder-decoder architecture, an encoder RNN reads x1 ... xn and passes a context to a decoder RNN, which generates y1 ... ym.
Encoder-decoder models for translation

(We describe encoder-decoder models here with RNNs, but we'll see in Chapter 13 how to apply them to transformers as well.)

Recall the RNN language model p(y), the probability of a sequence y. In any language model, we can break down the probability as follows:

Regular language modeling:
p(y) = p(y1) p(y2|y1) p(y3|y1, y2) ... p(ym|y1, ..., ym−1)

In RNN language modeling, at a particular time t, we pass the prefix of t tokens through the language model, using forward inference to produce a sequence of hidden states, ending with the hidden state corresponding to the last word of the prefix. We then use the final hidden state of the prefix as our starting point to generate the next token.

More formally, if g is an activation function like tanh or ReLU, a function of the input at time t and the hidden state at time t − 1, and the softmax is over the set of possible vocabulary items, then at time t the output ŷt and hidden state ht are:

ht = g(ht−1, xt)
ŷt = softmax(ht)

We only have to make one slight change to turn this language model with autoregressive generation into an encoder-decoder model that is a translation model: add a sentence separation marker at the end of the source text, and then simply concatenate the target text.
Encoder-decoder for translation

Let's use <s> for our sentence separator token, and let's think about translating an English source text ("the green witch arrived") to a Spanish sentence ("llegó la bruja verde", which can be glossed word-by-word as 'arrived the witch green'). We could also illustrate encoder-decoder models with a question-answer pair, or a text-summarization pair.

Let x be the source text plus the separator token <s>, and y the target text:

Let x = the green witch arrived <s>
Let y = llegó la bruja verde

The encoder-decoder model computes the probability p(y|x) as follows:

p(y|x) = p(y1|x) p(y2|y1, x) p(y3|y1, y2, x) ... p(ym|y1, ..., ym−1, x)
Fig. 8.17 shows the setup for a simplified version of the encoder-decoder model (we'll see the full model, which requires the new concept of attention, below).

Encoder-decoder simplified
Figure: the source text and separator <s> pass through the embedding and hidden layer(s); the final hidden state hn bridges into the decoder, which generates the target text.
Encoder-decoder equations

The context c is made available as a parameter to the computation of the current hidden state at every decoding time step. Recall that g is a stand-in for some flavor of RNN and ŷt−1 is the embedding for the output sampled from the softmax at the previous step:

c = h^e_n
h^d_0 = c
h^d_t = g(ŷt−1, h^d_t−1, c)
ŷt = softmax(h^d_t)    (8.33)

Figure: the encoder-decoder model with the context available at each decoding time step; the output is ignored during encoding, and the decoder generates y1 y2 y3 y4 </s>.
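A sketch of greedy decoding with the context-augmented equations above. The helpers encode and embed, the weight matrices Ud, Wd, Wc, and the output matrix V_out are assumptions standing in for the trained encoder, embedding table, and decoder; the chapter writes ŷt = softmax(h^d_t) abstractly, so the explicit output matrix here is an added illustrative detail:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(source_vectors, encode, embed, Ud, Wd, Wc, V_out, sep_id, eos_id, max_len=30):
    c = encode(source_vectors)[-1]       # c = h^e_n: the encoder's final hidden state
    h = c                                # h^d_0 = c
    y_prev = embed(sep_id)               # decoding starts from the separator <s>
    output = []
    for _ in range(max_len):
        # h^d_t = g(y^_{t-1}, h^d_{t-1}, c): previous output, previous state, and context
        h = np.tanh(Ud @ h + Wd @ y_prev + Wc @ c)
        y_hat = softmax(V_out @ h)       # distribution over the target vocabulary
        w = int(np.argmax(y_hat))        # greedy choice of the next word
        if w == eos_id:                  # stop at </s>
            break
        output.append(w)
        y_prev = embed(w)
    return output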
Figure: training the encoder-decoder. The concatenated source and target ("the green witch arrived <s> llegó la bruja verde") is run through the network; at each decoder step the gold next word (llegó, la, bruja, verde, </s>) supplies the answer for the cross-entropy loss (teacher forcing).
The LSTM Encoder-Decoder Architecture

LSTM Attention
Problem with passing context c only from the end

Requiring the context c to be only the encoder's final hidden state forces all the information from the entire source sentence to pass through this representational bottleneck.

Figure: the encoder-decoder bottleneck; everything the decoder knows about the source must flow through the single final encoder state.
Solution: attention

In the attention mechanism, as in the vanilla encoder-decoder model, the context vector c is a function of the hidden states of the encoder. But instead of being taken only from the last hidden state, the context is a weighted average of all the hidden states of the encoder. This weighted average is also informed by part of the decoder state as well: the state of the decoder right before the current token i.

That is, ci = f(h^e_1 ... h^e_n, h^d_{i−1}). The weights focus on ("attend to") the part of the source text that is relevant for the token i that the decoder is currently producing.

Attention thus replaces the static context vector with one that is dynamically derived from the encoder hidden states, but also informed by, and hence different for, each token in decoding.
Attention

The context vector ci is used during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous output generated by the decoder):

h^d_i = g(ŷi−1, h^d_{i−1}, ci)

Figure: each decoder step i has its own context vector ci, computed from all the encoder hidden states and the previous decoder state h^d_{i−1}.
How to compute c?

The simplest such score, called dot-product attention, implements relevance as similarity: measuring how similar the decoder hidden state is to an encoder hidden state. We'll create a score that tells us how relevant each encoder state is to the decoder state, by computing the dot product between them:

score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j    (8.35)

The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors. The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.

To make use of these scores, we'll normalize them with a softmax to create a vector of weights, αij, that tells us the proportional relevance of each encoder hidden state j to the prior decoder hidden state h^d_{i−1}:

αij = softmax(score(h^d_{i−1}, h^e_j))
    = exp(score(h^d_{i−1}, h^e_j)) / Σk exp(score(h^d_{i−1}, h^e_k))    (8.36)

Finally, given the distribution in α, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

ci = Σj αij h^e_j    (8.37)
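A sketch of dot-product attention for one decoder step, following equations 8.35-8.37; h_dec_prev and encoder_states are placeholder arrays for h^d_{i−1} and h^e_1 ... h^e_n:

import numpy as np

def attention_context(h_dec_prev, encoder_states):
    """h_dec_prev: (d,) previous decoder state; encoder_states: (n, d) array of h^e_j."""
    scores = encoder_states @ h_dec_prev          # score(h^d_{i-1}, h^e_j) = dot product
    alphas = np.exp(scores - scores.max())
    alphas = alphas / alphas.sum()                # alpha_ij = softmax over the scores
    c_i = alphas @ encoder_states                 # c_i = sum_j alpha_ij h^e_j
    return c_i, alphas

enc = np.random.randn(5, 8)                       # 5 encoder states of dimension 8
c, a = attention_context(np.random.randn(8), enc)
assert c.shape == (8,) and np.isclose(a.sum(), 1.0)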
Encoder-decoder with attention, focusing on the computation of ci

Figure: at decoder step i, attention weights αij (e.g. .4 .3 .1 .2) are computed from dot products between the previous decoder state and each encoder state h^e_j; the context ci = Σj αij h^e_j is then used, together with yi−1, to produce yi.