
Simple Recurrent Networks

(RNNs or Elman Nets)


RNNs and
LSTMs
Modeling Time in Neural Networks

Language is inherently temporal.

Yet the simple NLP classifiers we've seen (for example for
sentiment analysis) mostly ignore time.
• (Feedforward neural LMs, and the transformers we'll see later,
use a "moving window" approach to time.)
Here we introduce a deep learning architecture with a different
way of representing time:
• RNNs and their variants like LSTMs
Recurrent Neural Networks (RNNs)

Any network that contains a cycle within its network connections.

The value of some unit is directly, or indirectly,
dependent on its own earlier outputs as an input.
Simple Recurrent Nets (Elman nets)

[Figure: a simple RNN unit with input xt, hidden layer ht, and output yt]

The hidden layer has a recurrence as part of its input.

The activation value ht depends on xt but also on ht-1!
Forward inference in simple RNNs

Very similar to the feedforward networks we've seen!


[Figure: a simple recurrent neural network illustrated as a feedforward network]

The hidden layer from the prior time step, ht-1, is multiplied by the weight
matrix U and added to the feedforward component from the current time step,
W xt:

    ht = g(U ht-1 + W xt)
    yt = f(V ht)

Writing the input, hidden, and output layer dimensions as din, dh, and dout,
our three parameter matrices are W of shape [dh × din], U of shape [dh × dh],
and V of shape [dout × dh].

We compute yt via a softmax computation that gives a probability distribution
over the possible output classes:

    yt = softmax(V ht)
Inference has to be incremental

Computing h at time t requires that we first computed h at the
previous time step!

The matrices U, V, and W are shared across time, while new values
for h and y are calculated with each time step.

function FORWARD-RNN(x, network) returns output sequence y

  h0 ← 0
  for i ← 1 to LENGTH(x) do
    hi ← g(U hi-1 + W xi)
    yi ← f(V hi)
  return y

Figure 8.3 Forward inference in a simple recurrent network. The matrices U, V and W are
shared across time, while new values for h and y are calculated with each time step.
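The pseudocode above maps directly onto a few lines of numpy. Here is a minimal sketch (not from the slides), assuming tanh for g, softmax for f, and randomly initialized weights only so the example runs:

import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_rnn(X, U, W, V, g=np.tanh, f=softmax):
    """Forward inference in a simple RNN (as in Fig. 8.3).
    X: sequence of input vectors x_1..x_T, each of shape (d_in,).
    U: (d_h, d_h), W: (d_h, d_in), V: (d_out, d_h), shared across time."""
    h = np.zeros(U.shape[0])      # h_0 <- 0
    ys = []
    for x in X:
        h = g(U @ h + W @ x)      # h_t = g(U h_{t-1} + W x_t)
        ys.append(f(V @ h))       # y_t = f(V h_t)
    return ys

# Toy usage with random weights
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 5, 6
U = rng.normal(size=(d_h, d_h))
W = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_out, d_h))
X = rng.normal(size=(T, d_in))
print(forward_rnn(X, U, W, V)[0])   # distribution over d_out classes at t=1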
Training in simple RNNs

Just like feedforward training:
• a training set,
• a loss function,
• backpropagation

Weights that need to be updated:


• W, the weights from the input layer to the hidden layer,
• U, the weights from the previous hidden layer to the current hidden layer,
• V, the weights from the hidden layer to the output layer.
Training in simple RNNs: unrolling in time

Unlike feedforward networks:
1. To compute the loss function for the output at time t we need the
hidden layer from time t − 1.
2. The hidden layer at time t influences the output at time t and the
hidden layer at time t+1 (and hence the output and loss at t+1).

So: to measure the error accruing to ht, we
• need to know its influence on both the current output as well as
the ones that follow.
Unrolling in time (2)

[Figure: the network unrolled for three time steps, with inputs x1–x3,
hidden states h0–h3, outputs y1–y3, and the weights U, V, W shared at
every step]
We unroll a recurrent network into a feedforward
computational graph, eliminating recurrence:
1. Given an input sequence,
2. Generate an unrolled feedforward network specific to that input
3. Use the graph to train weights directly via ordinary backprop (or
can do forward inference)
Simple Recurrent Networks
(RNNs or Elman Nets)
RNNs and
LSTMs
RNNs as Language Models
RNNs and
LSTMs
Reminder: Language Modeling

Language models predict the next word in a sequence given some preceding
context. For example, if the preceding context is "Thanks for all the" and we
want to know how likely the next word is "fish", we would compute:

    P(fish | Thanks for all the)

Language models give us the ability to assign such a conditional probability
to every possible next word, giving us a distribution over the entire
vocabulary. We can also assign probabilities to entire sequences by combining
these conditional probabilities with the chain rule:

    P(w1:n) = ∏_{i=1}^{n} P(wi | w<i)

The n-gram language models of Chapter 3 compute the probability of a word
given counts of its occurrence with the n − 1 prior words; the context is thus
of size n − 1. For the feedforward language models of Chapter 7, the context
is the window size.
The size of the conditioning context for different LMs

The n-gram LM:
• Context size is the n − 1 prior words we condition on.

The feedforward LM:
• Context is the window size.

The RNN LM:
• No fixed context size; ht-1 represents the entire history
FFN LMs vs RNN LMs

[Figure: a) a feedforward LM, where each output ŷt is computed from a fixed
window of embeddings et-2, et-1, et; b) an RNN LM, where ŷt is computed from
ht, which is carried forward from step to step]
Forward inference in the RNN LM

Given an input X of N tokens represented as one-hot vectors: the input
sequence X = [x1; ...; xt; ...; xN] consists of one-hot vectors of size
|V| × 1, and the output is a probability distribution over the entire
vocabulary.

Use the embedding matrix E to get the embedding for the current token xt:

    et = E xt

Combine: the hidden layer from the previous step (weighted by the weight
matrix U) is added to the current embedding (weighted by W) to compute a new
hidden layer, which is then used to generate an output layer:

    ht = g(U ht-1 + W et)
    ŷt = softmax(V ht)
Shapes

    ŷt    |V| × 1
    V     |V| × d
    U     d × d
    ht    d × 1
    W     d × d
    et    d × 1
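A minimal numpy sketch of one forward step of the RNN LM with these shapes. The sizes (|V| = 10, d = 4) and the random weights are illustrative assumptions, not values from the slides:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V_size, d = 10, 4                      # assumed toy sizes: |V| = 10, d = 4
rng = np.random.default_rng(1)
E = rng.normal(size=(d, V_size))       # embedding matrix, d x |V|
W = rng.normal(size=(d, d))            # embedding-to-hidden
U = rng.normal(size=(d, d))            # hidden-to-hidden
V = rng.normal(size=(V_size, d))       # hidden-to-output, |V| x d

h_prev = np.zeros(d)
x_t = np.zeros(V_size); x_t[3] = 1.0   # one-hot vector for the current token

e_t = E @ x_t                          # e_t = E x_t                -> (d,)
h_t = np.tanh(U @ h_prev + W @ e_t)    # h_t = g(U h_{t-1} + W e_t) -> (d,)
y_hat = softmax(V @ h_t)               # yhat_t = softmax(V h_t)    -> (|V|,)
assert y_hat.shape == (V_size,) and abs(y_hat.sum() - 1.0) < 1e-9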


Computing ŷt: the probability that the next word is word k

The probability that a particular word k in the vocabulary is the next word is
the kth component of ŷt:

    P(wt+1 = k | w1, ..., wt) = ŷt[k]

The probability of an entire sequence is just the product of the probabilities
of each word in the sequence, where we'll use ŷi[wi] to mean the probability
of the true word wi at step i:

    P(w1:n) = ∏_{i=1}^{n} P(wi | w1:i-1)
            = ∏_{i=1}^{n} ŷi[wi]
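A small helper showing the same product in (log) probability space, assuming we already have the per-step distributions ŷi and the indices of the words that actually occurred:

import numpy as np

def sequence_log_prob(y_hats, word_ids):
    """log P(w_{1:n}) = sum_i log yhat_i[w_i].
    y_hats: list of predicted distributions, one per position.
    word_ids: index of the word that actually occurred at each position."""
    return sum(np.log(y_hat[w]) for y_hat, w in zip(y_hats, word_ids))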
Training RNN LM

• Self-supervision
• take a corpus of text as training material
• at each time step t
• ask the model to predict the next word.
• Why called self-supervised: we don't need human labels;
the text is its own supervision signal
• We train the model to
• minimize the error
• in predicting the true next word in the training sequence,
• using cross-entropy as the loss function.
Cross-entropy loss

The cross-entropy loss measures the difference between:
• a predicted probability distribution
• the correct distribution.

    LCE = − Σ_{w∈V} yt[w] log ŷt[w]

CE loss for LMs is simpler!

In the case of language modeling, the correct distribution yt comes from
knowing the next word. It is represented as a one-hot vector over the
vocabulary, where the entry for the actual next word is 1 and all the other
entries are 0.

• So the CE loss for LMs is determined only by the probability the model
assigns to the correct next word.
• So at time t, the CE loss is:

    LCE(ŷt, yt) = − log ŷt[wt+1]

(Figure 8.6: Training RNNs as language models.)
Teacher forcing

We always give the model the correct history to predict the next word (rather
than feeding the model its possibly buggy guess from the prior time step).

This is called teacher forcing (in training we force the context to be correct,
based on the gold words).

What teacher forcing looks like:
• At word position t
• the model takes as input the correct word wt together with ht−1, and computes
a probability distribution over possible next words
• That gives the loss for the next token wt+1
• Then we move on to the next word, ignore what the model predicted for the
next word, and instead use the correct word wt+1 along with the prior history
encoded to estimate the probability of token wt+2.
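A sketch of how teacher forcing plays out in code, reusing the shapes from the earlier sketches (E is d × |V|; the gradient step itself is not shown, since a framework would backprop through the unrolled graph):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def teacher_forced_loss(word_ids, E, U, W, V):
    """Average cross-entropy loss over a training sequence with teacher forcing:
    at each step the *gold* previous word (not the model's own prediction)
    is fed in as the input."""
    h = np.zeros(U.shape[0])
    losses = []
    for t in range(len(word_ids) - 1):
        e_t = E[:, word_ids[t]]                  # embedding of the correct word w_t
        h = np.tanh(U @ h + W @ e_t)             # h_t
        y_hat = softmax(V @ h)                   # distribution over the next word
        losses.append(-np.log(y_hat[word_ids[t + 1]]))   # -log yhat_t[w_{t+1}]
    return np.mean(losses)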
Weight tying

The input embedding matrix E and the final layer matrix V are similar:
• The columns of E represent the word embeddings for each word in the
vocab. E is [d × |V|].
• The final layer matrix V gives a score (logit) for each word in the
vocab. V is [|V| × d].

Instead of having separate E and V, we just tie them together, using E^T
instead of V:

    et = E xt
    ht = g(U ht-1 + W et)
    ŷt = softmax(E^T ht)
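A minimal sketch of the tied version, assuming the same toy sizes as before; the only change is that E^T plays the role of V at the output:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V_size = 4, 10
rng = np.random.default_rng(2)
E = rng.normal(size=(d, V_size))        # single embedding matrix, d x |V|
U = rng.normal(size=(d, d))
W = rng.normal(size=(d, d))
h_prev, w_t = np.zeros(d), 3            # previous hidden state, current word id

e_t   = E[:, w_t]                       # input side: a column of E is the embedding
h_t   = np.tanh(U @ h_prev + W @ e_t)
y_hat = softmax(E.T @ h_t)              # output side: E^T replaces the separate V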
RNNs as Language Models
RNNs and
LSTMs
RNNs for Sequences
RNNs and
LSTMs
RNNs for sequence labeling

Assign a label to each element of a sequence, e.g. part-of-speech tagging.

[Figure: the words "Janet will back the bill" are mapped to embeddings e,
passed through RNN layer(s) to hidden states h, then Vh and a softmax over
tags at each position, with an argmax choosing the tags NNP MD VB DT NN]


RNNs for sequence classification

Text classification: the final hidden state hn from the RNN is used as the
input to a feedforward network (with a softmax) that performs the
classification.

Figure 8.8 Sequence classification using a simple RNN combined with a
feedforward network. The final hidden state from the RNN is used as the input
to a feedforward network that performs the classification.

Instead of taking the last state, we could use some pooling function of all
the output states, like mean pooling, which pools all the n hidden states by
taking their element-wise mean:

    hmean = (1/n) Σ_{i=1}^{n} hi
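A sketch of both choices (last-state vs. mean pooling), where W_ffn stands in for a hypothetical single-layer classifier of shape (n_classes, d_h):

import numpy as np

def classify_sequence(hidden_states, W_ffn, pooling="last"):
    """Sequence classification from a list of RNN hidden states."""
    H = np.stack(hidden_states)                              # (n, d_h)
    pooled = H[-1] if pooling == "last" else H.mean(axis=0)  # h_n or h_mean
    logits = W_ffn @ pooled
    e = np.exp(logits - logits.max())
    return e / e.sum()                                       # softmax over classes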
Autoregressive generation

[Figure: starting from the input word <s>, the RNN LM samples a word from the
softmax at each step ("So", "long", "and", ...) and feeds it back in as the
next input word]
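A sketch of this sampling loop. Here step_fn(word_id, h) is a hypothetical single RNN-LM step (not defined in the slides) that returns a distribution over the vocabulary and the new hidden state:

import numpy as np

def generate(step_fn, h0, start_id, end_id, max_len=20, greedy=True, rng=None):
    """Autoregressive generation: each chosen word is fed back as the next input."""
    rng = rng or np.random.default_rng()
    h, w, out = h0, start_id, []
    for _ in range(max_len):
        y_hat, h = step_fn(w, h)                 # distribution over next word
        w = int(np.argmax(y_hat)) if greedy else int(rng.choice(len(y_hat), p=y_hat))
        if w == end_id:                          # stop at the end-of-sequence token
            break
        out.append(w)
    return out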


Stacked RNNs

[Figure: three stacked RNN layers: the outputs of RNN 1 (over inputs
x1 ... xn) serve as the inputs to RNN 2, whose outputs feed RNN 3, which
produces y1 ... yn]
Bidirectional RNNs

In the left-to-right RNNs we've discussed so far, the hidden state at a given
time t represents everything the network has gleaned from the sequence so far:
it is a function of the inputs x1, ..., xt and represents the context of the
network to the left of the current time.

    h^f_t = RNN_forward(x1, ..., xt)                    (8.16)

To take advantage of context to the right of the current input, we can train a
second RNN on the reversed input sequence. With this approach, the hidden
state at time t represents all the information about the sequence to the right
of the current input:

    h^b_t = RNN_backward(xt, ..., xn)                   (8.17)

A bidirectional RNN (Schuster and Paliwal, 1997) combines two independent
RNNs, one where the input is processed from the start to the end, and the
other from the end to the start. We then concatenate the two representations
computed by the networks into a single vector that captures both the left and
right contexts of an input at each point in time (here ";" denotes
concatenation):

    h_t = [h^f_t ; h^b_t]                               (8.18)

Other simple ways to combine the forward and backward contexts include
element-wise addition or multiplication. The output at each time step thus
captures information to the left and to the right of the current input. In
sequence labeling applications, these concatenated outputs can serve as the
basis for a local labeling decision.

[Figure: a bidirectional network: a forward RNN 1 and a backward RNN 2 run
over inputs x1 ... xn, and their outputs are concatenated to produce
y1 ... yn]
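A sketch of the forward pass of Eqs. 8.16–8.18, where forward_step and backward_step stand in for hypothetical single RNN steps (x, h) -> h:

import numpy as np

def bidirectional_states(X, forward_step, backward_step, d_h):
    """Run two independent RNNs, one left-to-right and one right-to-left,
    and concatenate their states: h_t = [h^f_t ; h^b_t]."""
    n = len(X)
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], [None] * n
    for t in range(n):                 # left-to-right pass
        hf = forward_step(X[t], hf)
        fwd.append(hf)
    for t in reversed(range(n)):       # right-to-left pass
        hb = backward_step(X[t], hb)
        bwd[t] = hb
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # each (2*d_h,)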
Bidirectional RNNs for classification

[Figure: the final state of the forward RNN (→hn) and the final state of the
backward RNN (←h1) are concatenated and passed to a feedforward network and
softmax that perform the classification]
RNNs for Sequences
RNNs and
LSTMs
The LSTM
RNNs and
LSTMs
Motivating the LSTM: dealing with distance

• It's hard to assign probabilities accurately when context is very far away:
• The flights the airline was canceling were full.
• Hidden layers are being forced to do two things:
• Provide information useful for the current decision,
• Update and carry forward information required for future decisions.
• Another problem: during backprop, we have to repeatedly multiply
gradients through time and many h's
• The "vanishing gradient" problem
The LSTM: Long short-term memory network

LSTMs divide the context management problem into two subproblems:
• removing information no longer needed from the context,
• adding information likely to be needed for later decision making
LSTMs add:
• an explicit context layer
• neural circuits with gates to control information flow
Forget gate

Deletes information from the context that is no longer needed.

Element-wise multiplication of two vectors (represented by the operator ⊙,
sometimes called the Hadamard product) is the vector of the same dimension as
the two input vectors, where each element i is the product of element i in the
two input vectors:

    ft = σ(Uf ht-1 + Wf xt)                             (8.20)
    kt = ct-1 ⊙ ft                                      (8.21)

Regular passing of information

The next task is to compute the actual information we need to extract from the
previous hidden state and current inputs, the same basic computation we've
been using for our recurrent networks:

    gt = tanh(Ug ht-1 + Wg xt)                          (8.22)
Add gate

Selecting information to add to the current context. Generate the mask for the
add gate to select the information to add to the current context:

    it = σ(Ui ht-1 + Wi xt)                             (8.23)
    jt = gt ⊙ it                                        (8.24)

Add this to the modified context vector to get our new context vector:

    ct = jt + kt                                        (8.25)
Output gate

Decide what information is required for the current hidden state (as opposed
to what information needs to be preserved for future decisions):

    ot = σ(Uo ht-1 + Wo xt)                             (8.26)
    ht = ot ⊙ tanh(ct)                                  (8.27)

This is the complete computation for a single LSTM unit: given the appropriate
weights for the various gates, an LSTM accepts as input the context and hidden
layers from the previous time step, along with the current input vector.
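A numpy sketch of one LSTM step following Eqs. 8.20–8.27 (bias terms are omitted, as in the slides; the parameter names are just the gate matrices above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params = (Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo)."""
    Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo = params
    f_t = sigmoid(Uf @ h_prev + Wf @ x_t)     # forget gate
    k_t = c_prev * f_t                        # erase stale context (Hadamard product)
    g_t = np.tanh(Ug @ h_prev + Wg @ x_t)     # candidate information
    i_t = sigmoid(Ui @ h_prev + Wi @ x_t)     # add gate
    j_t = g_t * i_t                           # select what to add
    c_t = j_t + k_t                           # new context vector
    o_t = sigmoid(Uo @ h_prev + Wo @ x_t)     # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t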
The LSTM

[Figure: a single LSTM unit, showing how the forget gate (f), add gate (i),
candidate g, and output gate (o) combine ct-1, ht-1, and xt to produce the new
context ct and hidden state ht]
LSTM Units

[Figure: (a) a feedforward unit, (b) a simple recurrent (SRN) unit, and (c) an
LSTM unit; each computes h from its inputs, and the LSTM additionally takes
and produces a context vector c]

FFN        SRN        LSTM


The LSTM
RNNs and
LSTMs
The LSTM Encoder-Decoder
Architecture
RNNs and
LSTMs
Four architectures for NLP tasks with RNNs

[Figure:
a) sequence labeling: an output yi for each input xi
b) sequence classification: a single output y from the final hidden state
c) language modeling: predict each next word x2 ... xt from x1 ... xt-1
d) encoder-decoder: an encoder RNN reads x1 ... xn and passes a context to a
decoder RNN that generates y1 ... ym]


3 components of an encoder-decoder

1. An encoder that accepts an input sequence, x1:n, and generates a
corresponding sequence of contextualized representations, h1:n.
2. A context vector, c, which is a function of h1:n, and conveys the
essence of the input to the decoder.
3. A decoder, which accepts c as input and generates an arbitrary-length
sequence of hidden states h1:m, from which a corresponding sequence of
output states y1:m can be obtained.
Encoder-decoder

[Figure: an encoder reads x1 x2 ... xn and produces a context, from which a
decoder generates y1 y2 ... ym]
Encoder-decoder models for translation

Regular language modeling: recall that in any language model p(y), the
probability of a sequence y, we can break down the probability as follows:

    p(y) = p(y1) p(y2|y1) p(y3|y1, y2) ... p(ym|y1, ..., ym-1)

In RNN language modeling, at a particular time t, we pass the prefix of t
tokens through the language model, using forward inference to produce a
sequence of hidden states, ending with the hidden state corresponding to the
last word of the prefix. We then use the final hidden state of the prefix as
our starting point to generate the next token.

More formally, if g is an activation function like tanh or ReLU, a function of
the input at time t and the hidden state at time t − 1, and the softmax is
over the set of possible vocabulary items, then at time t the output ŷt and
hidden state ht are:

    ht = g(ht-1, xt)
    ŷt = softmax(ht)

We only have to make one slight change to turn this language model into an
encoder-decoder model that is a translator: add a sentence separation marker
at the end of the source text, and then simply concatenate the target text.
Encoder-decoder for translation

Let's use <s> for our sentence separator token, and let's think about
translating an English source text ("the green witch arrived") to a Spanish
sentence ("llegó la bruja verde", which can be glossed word-by-word as
'arrived the witch green').

Let x refer to the source text plus the separator token <s>, and y to the
target text:

    Let x = the green witch arrived <s>
    Let y = llegó la bruja verde

(We could also illustrate encoder-decoder models with a question-answer pair
or a text-summarization pair.)

The encoder-decoder model computes the probability p(y|x) as follows:

    p(y|x) = p(y1|x) p(y2|y1, x) p(y3|y1, y2, x) ... p(ym|y1, ..., ym-1, x)

Fig. 8.17 shows the setup for a simplified version of the encoder-decoder
model; we'll see the full model, which requires the new concept of attention,
later.
Encoder-decoder simplified

[Figure: the source text "the green witch arrived" is followed by the
separator <s> and then the target text "llegó la bruja verde </s>"; the same
embedding, hidden, and softmax layers are used throughout, and the output over
the source portion is ignored]
Encoder-decoder showing context

[Figure: the encoder-decoder model with the context available at each decoding
time step; the encoder's final hidden state becomes the context,
h^e_n = c = h^d_0, the decoder computes h^d_t = g(ŷt-1, h^d_t-1, c), and the
output during encoding is ignored]
Encoder-decoder equations

Here g is a stand-in for some flavor of RNN, and ŷt-1 is the embedding for the
output sampled from the softmax at the previous step:

    c = h^e_n
    h^d_0 = c
    h^d_t = g(ŷt-1, h^d_t-1, c)
    ŷt = softmax(h^d_t)                                 (8.33)

ŷt is a vector of probabilities over the vocabulary, representing the
probability of each word occurring at time t. To generate text, we sample from
this distribution ŷt. For example, the greedy choice is simply to choose the
most probable word to generate at each time step. (We'll introduce more
sophisticated sampling methods later.)
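A sketch of greedy decoding in this simplified setup. Here decoder_step(y_embed, h_prev, c) is a hypothetical RNN cell standing in for g, embed is an assumed d × |V| embedding matrix, and V_out is a hypothetical output projection from the hidden state to vocabulary logits (Eq. 8.33 abstracts this projection away):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(encoder_states, decoder_step, embed, V_out, bos_id, eos_id, max_len=30):
    """Greedy decoding with the context c passed at every step."""
    c = encoder_states[-1]          # c = h^e_n
    h = c                           # h^d_0 = c
    y, out = bos_id, []
    for _ in range(max_len):
        h = decoder_step(embed[:, y], h, c)        # h^d_t = g(yhat_{t-1}, h^d_{t-1}, c)
        y = int(np.argmax(softmax(V_out @ h)))     # greedy choice from yhat_t
        if y == eos_id:
            break
        out.append(y)
    return out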
Training the encoder-decoder with teacher forcing

[Figure: the decoder is fed the gold target words "<s> llegó la bruja verde",
and at each step the per-word loss against the gold answers "llegó la bruja
verde </s>" is Li = −log P(yi)]

The total loss is the average cross-entropy loss per target word.
The LSTM Encoder-Decoder
Architecture
RNNs and
LSTMs
LSTM Attention
RNNs and
LSTMs
Problem with passing context c only from the end

Requiring the context c to be only the encoder's final hidden state forces all
the information from the entire source sentence to pass through this
representational bottleneck.

[Figure: the encoder-decoder bottleneck: everything the decoder sees must pass
through the single context vector at the end of the encoder]
Solution: attention

In the attention mechanism, as in the vanilla encoder-decoder, the context
vector c is a single vector that is a function of the hidden states of the
encoder. But instead of being taken from the last hidden state, it's a
weighted average of all the hidden states of the encoder.

This weighted average is also informed by part of the decoder state as well:
the state of the decoder right before the current token i.

That is, ci = f(h^e_1 ... h^e_n, h^d_i-1). The weights focus on the part of
the source text that is relevant for the token i that the decoder is currently
producing.

Attention thus replaces the static context vector with one that is dynamically
derived from the encoder hidden states, but also informed by, and hence
different for, each token in decoding.
Attention

During decoding, we condition the computation of the current decoder hidden
state on the context ci (along with the prior hidden state and the previous
output generated by the decoder), as we see in this equation (and Fig. 8.21):

    h^d_i = g(ŷi-1, h^d_i-1, ci)

[Figure: at each decoder step i, a separate context vector ci is computed and
fed into the decoder along with h^d_i-1 and ŷi-1]
How to compute c?

We'll create a score that tells us how much to focus on each encoder state:
how relevant each encoder state is to the decoder state. The simplest such
score, called dot-product attention, implements relevance as similarity,
measuring how similar the decoder hidden state is to an encoder hidden state
by computing the dot product between them:

    score(h^d_i-1, h^e_j) = h^d_i-1 · h^e_j             (8.35)

The score that results from this dot product is a scalar that reflects the
degree of similarity between the two vectors. The vector of these scores
across all the encoder hidden states gives us the relevance of each encoder
state to the current step of the decoder.

To make use of these scores, we'll normalize them with a softmax to create a
vector of weights, αij, that tells us the proportional relevance of each
encoder hidden state j to the prior hidden decoder state, h^d_i-1:

    αij = softmax(score(h^d_i-1, h^e_j))
        = exp(score(h^d_i-1, h^e_j)) / Σ_k exp(score(h^d_i-1, h^e_k))   (8.36)

Finally, given the distribution in α, we can compute a fixed-length context
vector for the current decoder state by taking a weighted average over all the
encoder hidden states:

    ci = Σ_j αij h^e_j                                  (8.37)
Encoder-decoder with attention, focusing on the computation of c

[Figure: the attention weights αij are computed from the dot products
h^d_i-1 · h^e_j, normalized with a softmax, and used to form the weighted
average ci = Σ_j αij h^e_j, which is fed to the decoder at step i]
LSTM Attention
RNNs and
LSTMs
