h0 ← 0
for i ← 1 to LENGTH(x) do
    hi ← g(U hi−1 + W xi)
    yi ← f(V hi)
return y
Figure 8.3 Forward inference in a simple recurrent network. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
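A minimal NumPy sketch of this forward pass; the activation functions g and f and the weight matrices U, W, V are placeholders for a trained network's parameters:

import numpy as np

def forward_rnn(x, U, W, V, g=np.tanh, f=lambda z: z):
    """Forward inference in a simple RNN (Figure 8.3).
    x: list of input vectors x_1..x_T, each of shape (d_in,)
    U: (d, d) recurrent weights, W: (d, d_in) input weights, V: (d_out, d) output weights."""
    h = np.zeros(U.shape[0])          # h_0 = 0
    y = []
    for x_i in x:                     # for i = 1 .. LENGTH(x)
        h = g(U @ h + W @ x_i)        # h_i = g(U h_{i-1} + W x_i)
        y.append(f(V @ h))            # y_i = f(V h_i)
    return y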
Training in simple RNNs
• a training set,
• a loss function,
• backpropagation

Unlike feedforward networks:
1. To compute the loss for the output at time t we need the hidden layer from time t − 1.
2. The hidden layer at time t influences both the output at time t and the hidden layer at time t + 1.

So: to measure the error accruing to ht, we need to know its influence on both the current output as well as the ones that follow.
Figure: panels (a) and (b) contrasting a feedforward network (FFN) with a recurrent network (RNN); both map a hidden layer ht to an output ŷt through V, but the RNN's hidden layer also receives the previous hidden state through U.
Forward inference in the RNN language model

Given an input X = [x1; ...; xt; ...; xN] of N tokens, each represented as a one-hot vector of size |V| × 1:

• Use the embedding matrix E to retrieve the embedding for the current token xt:
  et = Ext
• Multiply it by the weight matrix W and add it to the hidden layer from the previous step (weighted by the weight matrix U) to compute a new hidden layer:
  ht = g(Uht−1 + Wet)
• The hidden layer is then used to generate an output layer, which is passed through a softmax to produce a probability distribution over the entire vocabulary, predicting the next word:
  ŷt = softmax(Vht)
Shapes
• ŷt : |V| × 1
• V : |V| × d
• ht : d × 1
• U : d × d
• W : d × d
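To make these shapes concrete, here is a hedged NumPy sketch of one step of the RNN language model; |V|, d, and the randomly initialized matrices are illustrative stand-ins for a trained model:

import numpy as np

V_size, d = 10000, 128                     # |V| and the hidden/embedding size (illustrative)
E = np.random.randn(d, V_size) * 0.01      # embedding matrix, [d x |V|]
U = np.random.randn(d, d) * 0.01           # recurrent weights, [d x d]
W = np.random.randn(d, d) * 0.01           # input weights, [d x d]
V = np.random.randn(V_size, d) * 0.01      # output matrix, [|V| x d]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_step(word_id, h_prev):
    e_t = E[:, word_id]                    # e_t = E x_t                  (d x 1)
    h_t = np.tanh(U @ h_prev + W @ e_t)    # h_t = g(U h_{t-1} + W e_t)   (d x 1)
    y_t = softmax(V @ h_t)                 # y^_t = softmax(V h_t)        (|V| x 1)
    return y_t, h_t

y_hat, h = lm_step(42, np.zeros(d))
assert y_hat.shape == (V_size,) and h.shape == (d,)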
• Self-supervision
• take a corpus of text as training material
• at each time step t
• ask the model to predict the next word.
• Why called self-supervised: we don't need human labels;
the text is its own supervision signal
• We train the model to
• minimize the error
• in predicting the true next word in the training sequence,
• using cross-entropy as the loss function.
Figure 8.6 Training RNNs as language models.
Cross-entropy loss

LCE = −Σw∈V yt[w] log ŷt[w]

The cross-entropy loss measures the difference between:
• a predicted probability distribution ŷt, and
• the correct distribution yt.

CE loss for LMs is simpler!

In the case of language modeling, the correct distribution yt comes from knowing the next word. It is represented as a one-hot vector over the vocabulary, where the entry for the actual next word is 1 and all the other entries are 0.

So the CE loss for LMs is determined only by the probability the model assigns to the correct next word in the training sequence. At time t, the CE loss is:

LCE(ŷt, yt) = −log ŷt[wt+1]
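A small sketch showing that the full cross-entropy sum and the simplified next-word form agree; y_hat and next_word_id are placeholder names for the model's softmax output at time t and the index of the gold next word:

import numpy as np

def ce_loss_full(y_hat, next_word_id):
    """L_CE = -sum_w y_t[w] * log y^_t[w], with y_t a one-hot vector."""
    y_true = np.zeros_like(y_hat)
    y_true[next_word_id] = 1.0
    return -np.sum(y_true * np.log(y_hat))

def ce_loss_lm(y_hat, next_word_id):
    """Simplified LM form: only the gold next word matters."""
    return -np.log(y_hat[next_word_id])

y_hat = np.array([0.1, 0.7, 0.2])          # toy distribution over a 3-word vocabulary
assert np.isclose(ce_loss_full(y_hat, 1), ce_loss_lm(y_hat, 1))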
Teacher forcing
We always give the model the correct history to predict the next word (rather than feeding the model the possibly buggy guess from the prior time step).
This is called teacher forcing (in training we force the context to be correct based on the gold words).

What teacher forcing looks like:
• At word position t
• the model takes as input the correct word wt together with ht−1, and computes a probability distribution over possible next words
• That gives the loss for the next token wt+1
• Then we move on to the next word, ignore what the model predicted for the next word, and instead use the correct word wt+1 along with the prior history encoded to estimate the probability of token wt+2.
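A sketch of one teacher-forced pass over a training sequence, reusing the hypothetical lm_step and ce_loss_lm helpers from the sketches above as parameters; the key point is that the input at each step is the gold word, never the model's own prediction:

import numpy as np

def teacher_forced_loss(word_ids, lm_step, ce_loss_lm, d=128):
    """word_ids: gold training sequence [w_1, ..., w_T] as vocabulary indices."""
    h = np.zeros(d)
    total = 0.0
    for t in range(len(word_ids) - 1):
        # Input the *correct* word w_t, regardless of what the model
        # predicted at the previous step (teacher forcing).
        y_hat, h = lm_step(word_ids[t], h)
        # The loss is measured against the correct next word w_{t+1}.
        total += ce_loss_lm(y_hat, word_ids[t + 1])
    return total / (len(word_ids) - 1)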
Weight tying

The input embedding matrix E and the final layer matrix V are similar:
• The columns of E represent the word embeddings for each word in the vocab. E is [d × |V|].
• The final layer matrix V gives a score (logit) for each word in the vocab. V is [|V| × d].
That is, the rows of V are shaped like a transpose of E, meaning that V provides a second set of learned word embeddings.

Instead of having two separate matrices, we just tie them together, using Eᵀ instead of V. The weight-tied equations for an RNN language model then become:

et = Ext
ht = g(Uht−1 + Wet)
ŷt = softmax(Eᵀht)
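A sketch of the weight-tied output layer: the logits come from Eᵀht rather than a separate V (the matrices here are illustrative random initializations):

import numpy as np

V_size, d = 10000, 128
E = np.random.randn(d, V_size) * 0.01      # [d x |V|]: columns are the word embeddings
U = np.random.randn(d, d) * 0.01
W = np.random.randn(d, d) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tied_lm_step(word_id, h_prev):
    e_t = E[:, word_id]                    # e_t = E x_t
    h_t = np.tanh(U @ h_prev + W @ e_t)    # h_t = g(U h_{t-1} + W e_t)
    y_t = softmax(E.T @ h_t)               # y^_t = softmax(E^T h_t): no separate V
    return y_t, h_t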
RNNs as Language Models
RNNs for Sequences
RNNs for sequence labeling
Assign a label to each element of a sequence
Part-of-speech tagging
Figure: at each time step, the word embeddings e feed RNN layer(s) whose hidden state h is mapped by V to a softmax over tags; taking the argmax yields the tag sequence (e.g. NNP MD VB DT NN).
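A hedged sketch of RNN sequence labeling: the same recurrence, with the hidden state at each step mapped to a softmax over tags followed by an argmax; the tagset and the weight matrices U, W, V_tags are placeholders:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_tagger(embeddings, U, W, V_tags, tagset):
    """embeddings: list of word-embedding vectors; V_tags: [n_tags x d]."""
    h = np.zeros(U.shape[0])
    tags = []
    for e_t in embeddings:
        h = np.tanh(U @ h + W @ e_t)                  # RNN layer
        p_tags = softmax(V_tags @ h)                  # softmax over tags
        tags.append(tagset[int(np.argmax(p_tags))])   # argmax -> tag (e.g. NNP, MD, VB)
    return tags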
Text classification
Figure 8.8 Sequence classification using a simple RNN combined with a feedforward network. The final hidden state from the RNN is used as the input to a feedforward network that performs the classification.
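A sketch of this setup: run the RNN over the input, keep only the final hidden state hn, and pass it through a (single-layer) feedforward network and softmax; W_ffn and b_ffn are assumed classifier parameters:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(embeddings, U, W, W_ffn, b_ffn):
    """Return a class distribution computed from the RNN's final hidden state."""
    h = np.zeros(U.shape[0])
    for e_t in embeddings:                 # run the RNN over x_1 .. x_n
        h = np.tanh(U @ h + W @ e_t)
    return softmax(W_ffn @ h + b_ffn)      # h_n -> FFN -> softmax over classes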
Figure: stacked RNNs. The embedded inputs x1 ... xn feed RNN 1; its output sequence is the input to RNN 2, whose outputs feed RNN 3, which produces y1 ... yn (through a softmax).
Bidirectional RNNs

In the left-to-right RNNs we've discussed so far, the hidden state at a given time t represents everything the network knows about the sequence up to that point: it is a function of the inputs x1, ..., xt and represents the context of the network to the left of the current time.

h^f_t = RNNforward(x1, ..., xt)    (8.16)

Here h^f_t corresponds to the normal hidden state at time t, representing everything the network has gleaned from the sequence so far.

To take advantage of context to the right of the current input, we can train an RNN on a reversed input sequence. With this approach, the hidden state at time t represents all the information we have discerned about the sequence from t to the end of the sequence:

h^b_t = RNNbackward(xt, ..., xn)    (8.17)

A bidirectional RNN (Schuster and Paliwal, 1997) combines two independent RNNs, one where the input is processed from the start to the end, and the other from the end to the start. We then concatenate the two representations computed by the two networks into a single vector that captures both the left and right contexts of an input at each point in time. Here we use the semicolon ";" (or the equivalent symbol ⊕) for concatenation:

ht = [h^f_t ; h^b_t]    (8.18)

Figure: a bidirectional RNN runs RNN 1 left-to-right and RNN 2 right-to-left over x1 ... xn, concatenating their states to produce y1 ... yn; for sequence classification, the final forward state →hn and the final backward state ←h1 are concatenated and passed to a FFN.
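A sketch of the bidirectional combination in equations 8.16-8.18; run_rnn is a hypothetical helper standing in for either directional RNN:

import numpy as np

def run_rnn(inputs, U, W):
    """Return the hidden states h_1..h_n of a simple RNN for one direction."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(U @ h + W @ x_t)
        states.append(h)
    return states

def birnn(inputs, Uf, Wf, Ub, Wb):
    hf = run_rnn(inputs, Uf, Wf)                # h^f_t = RNNforward(x_1, ..., x_t)
    hb = run_rnn(inputs[::-1], Ub, Wb)[::-1]    # h^b_t = RNNbackward(x_t, ..., x_n)
    # h_t = [h^f_t ; h^b_t]: concatenate the two directions at each position
    return [np.concatenate([f, b]) for f, b in zip(hf, hb)]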
The LSTM
Motivating the LSTM: dealing with distance
• It's hard to assign probabilities accurately when context is very far away:
• The flights the airline was canceling were full.
• Hidden layers are being forced to do two things:
• Provide information useful for the current decision,
• Update and carry forward information required for future decisions.
• Another problem: during backprop, we have to repeatedly multiply gradients through time and many h's
• The "vanishing gradient" problem
The LSTM: Long short-term memory network
LSTMs divide the context management problem into two subproblems:
• removing information no longer needed from the context,
• adding information likely to be needed for later decision making
• LSTMs add:
• an explicit context layer
• neural circuits with gates to control information flow
Forget gate
Deletes information from the context that is no longer needed:

ft = σ(Uf ht−1 + Wf xt)    (8.20)
kt = ct−1 ⊙ ft    (8.21)

Element-wise multiplication of two vectors (represented by the operator ⊙, sometimes called the Hadamard product) is the vector of the same dimension as the two input vectors, where each element i is the product of element i in the two input vectors.
The next task is to compute the actual information we need to extract from the previous hidden state and current inputs, using the same basic computation we've been using in our recurrent networks:

gt = tanh(Ug ht−1 + Wg xt)    (8.22)

Add gate
Selects the information to add to the current context:

it = σ(Ui ht−1 + Wi xt)
jt = gt ⊙ it

We then add this to the modified context vector to get our new context vector:

ct = jt + kt
Output gate
Decides what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions):

ot = σ(Uo ht−1 + Wo xt)
ht = ot ⊙ tanh(ct)
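A hedged NumPy sketch of a single LSTM step implementing the gate equations above; the weight matrices in p are illustrative, and * plays the role of the element-wise (Hadamard) product ⊙:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """p: dict with weight matrices Uf, Wf, Ug, Wg, Ui, Wi, Uo, Wo."""
    f_t = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x_t)   # forget gate
    k_t = c_prev * f_t                                # k_t = c_{t-1} (.) f_t
    g_t = np.tanh(p["Ug"] @ h_prev + p["Wg"] @ x_t)   # candidate information
    i_t = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x_t)   # add gate
    j_t = g_t * i_t                                   # j_t = g_t (.) i_t
    c_t = j_t + k_t                                   # new context vector
    o_t = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x_t)   # output gate
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t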
Figure: a single LSTM unit. The forget gate (f), add gate (i) with candidate (g), and output gate (o) combine ht−1 and xt to update the context vector ct−1 → ct and compute the new hidden state ht.
LSTM Units
Figure: basic recurrent units compared with the LSTM unit; both map xt and ht−1 to ht, but the LSTM also carries and outputs the context vector ct.
Figure: from language model to encoder-decoder. An RNN language model reads x1 x2 ... xt−1 and predicts x2 x3 ... xt; in the encoder-decoder architecture, an encoder RNN reads x1 ... xn and passes a context to a decoder RNN, which generates y1 ... ym.
Encoder-decoder models for translation

(We describe encoder-decoder models here with RNNs, but we'll see in Chapter 13 how to apply them to transformers as well.)

Recall the RNN language model p(y), the probability of a sequence y. In any language model, we can break down the probability as follows:

Regular language modeling:
p(y) = p(y1) p(y2|y1) p(y3|y1, y2) ... p(ym|y1, ..., ym−1)

In RNN language modeling, at a particular time t, we pass the prefix of t tokens through the language model, using forward inference to produce a sequence of hidden states, ending with the hidden state corresponding to the last word of the prefix. We then use the final hidden state of the prefix as our starting point to generate the next token.

More formally, if g is an activation function like tanh or ReLU, a function of the input at time t and the hidden state at time t − 1, and the softmax is over the set of possible vocabulary items, then at time t the output ŷt and hidden state ht are:

ht = g(ht−1, xt)
ŷt = softmax(ht)

We only have to make one slight change to turn this language model with autoregressive generation into an encoder-decoder model that is a translation model: add a sentence separation marker at the end of the source text, and then simply concatenate the target text.
Encoder-decoder for translation

Let's use <s> for our sentence separator token, and let's think about translating an English source text ("the green witch arrived") to a Spanish sentence ("llegó la bruja verde", which can be glossed word-by-word as 'arrived the witch green'). We could also illustrate encoder-decoder models with a question-answer pair, or a text-summarization pair.

Let x be the source text plus the separator token <s>, and y the target text:

Let x = the green witch arrived <s>
Let y = llegó la bruja verde

The encoder-decoder model computes the probability p(y|x) as follows:

p(y|x) = p(y1|x) p(y2|y1, x) p(y3|y1, y2, x) ... p(ym|y1, ..., ym−1, x)
Fig. 8.17 shows the setup for a simplified version of the encoder-decoder model (we'll see the full model, which requires the new concept of attention, below).

Encoder-decoder simplified
Figure: the source text and separator <s> pass through the embedding and hidden layer(s); the final hidden state hn bridges into the decoder, which generates the target text.
Encoder-decoder equations

The context c is made available as a parameter to the computation of the current hidden state at every decoding time step. Recall that g is a stand-in for some flavor of RNN and ŷt−1 is the embedding for the output sampled from the softmax at the previous step:

c = h^e_n
h^d_0 = c
h^d_t = g(ŷt−1, h^d_t−1, c)
ŷt = softmax(h^d_t)    (8.33)

Figure: the encoder-decoder model with the context available at each decoding time step; the output is ignored during encoding, and the decoder generates y1 y2 y3 y4 </s>.
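A sketch of greedy decoding with the context-augmented equations above. The helpers encode and embed, the weight matrices Ud, Wd, Wc, and the output matrix V_out are assumptions standing in for the trained encoder, embedding table, and decoder; the chapter writes ŷt = softmax(h^d_t) abstractly, so the explicit output matrix here is an added illustrative detail:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(source_vectors, encode, embed, Ud, Wd, Wc, V_out, sep_id, eos_id, max_len=30):
    c = encode(source_vectors)[-1]       # c = h^e_n: the encoder's final hidden state
    h = c                                # h^d_0 = c
    y_prev = embed(sep_id)               # decoding starts from the separator <s>
    output = []
    for _ in range(max_len):
        # h^d_t = g(y^_{t-1}, h^d_{t-1}, c): previous output, previous state, and context
        h = np.tanh(Ud @ h + Wd @ y_prev + Wc @ c)
        y_hat = softmax(V_out @ h)       # distribution over the target vocabulary
        w = int(np.argmax(y_hat))        # greedy choice of the next word
        if w == eos_id:                  # stop at </s>
            break
        output.append(w)
        y_prev = embed(w)
    return output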
Figure: training the encoder-decoder. The concatenated source and target ("the green witch arrived <s> llegó la bruja verde") is run through the network; at each decoder step the gold next word (llegó, la, bruja, verde, </s>) supplies the answer for the cross-entropy loss (teacher forcing).
The LSTM Encoder-Decoder Architecture

LSTM Attention
Problem with passing context c only from the end

Requiring the context c to be only the encoder's final hidden state forces all the information from the entire source sentence to pass through this representational bottleneck.

Figure: the encoder-decoder bottleneck; everything the decoder knows about the source must flow through the single final encoder state.
Solution: attention

In the attention mechanism, as in the vanilla encoder-decoder model, the context vector c is a function of the hidden states of the encoder. But instead of being taken only from the last hidden state, the context is a weighted average of all the hidden states of the encoder. This weighted average is also informed by part of the decoder state as well: the state of the decoder right before the current token i.

That is, ci = f(h^e_1 ... h^e_n, h^d_{i−1}). The weights focus on ("attend to") the part of the source text that is relevant for the token i that the decoder is currently producing.

Attention thus replaces the static context vector with one that is dynamically derived from the encoder hidden states, but also informed by, and hence different for, each token in decoding.
Attention

The context vector ci is used during decoding by conditioning the computation of the current decoder hidden state on it (along with the prior hidden state and the previous output generated by the decoder):

h^d_i = g(ŷi−1, h^d_{i−1}, ci)

Figure: each decoder step i has its own context vector ci, computed from all the encoder hidden states and the previous decoder state h^d_{i−1}.
How to compute c?

The simplest such score, called dot-product attention, implements relevance as similarity: measuring how similar the decoder hidden state is to an encoder hidden state. We'll create a score that tells us how relevant each encoder state is to the decoder state, by computing the dot product between them:

score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j    (8.35)

The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors. The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.

To make use of these scores, we'll normalize them with a softmax to create a vector of weights, αij, that tells us the proportional relevance of each encoder hidden state j to the prior decoder hidden state h^d_{i−1}:

αij = softmax(score(h^d_{i−1}, h^e_j))
    = exp(score(h^d_{i−1}, h^e_j)) / Σk exp(score(h^d_{i−1}, h^e_k))    (8.36)

Finally, given the distribution in α, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

ci = Σj αij h^e_j    (8.37)
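A sketch of dot-product attention for one decoder step, following equations 8.35-8.37; h_dec_prev and encoder_states are placeholder arrays for h^d_{i−1} and h^e_1 ... h^e_n:

import numpy as np

def attention_context(h_dec_prev, encoder_states):
    """h_dec_prev: (d,) previous decoder state; encoder_states: (n, d) array of h^e_j."""
    scores = encoder_states @ h_dec_prev          # score(h^d_{i-1}, h^e_j) = dot product
    alphas = np.exp(scores - scores.max())
    alphas = alphas / alphas.sum()                # alpha_ij = softmax over the scores
    c_i = alphas @ encoder_states                 # c_i = sum_j alpha_ij h^e_j
    return c_i, alphas

enc = np.random.randn(5, 8)                       # 5 encoder states of dimension 8
c, a = attention_context(np.random.randn(8), enc)
assert c.shape == (8,) and np.isclose(a.sum(), 1.0)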
Encoder-decoder with attention, focusing on the computation of ci

Figure: at decoder step i, attention weights αij (e.g. .4 .3 .1 .2) are computed from dot products between the previous decoder state and each encoder state h^e_j; the context ci = Σj αij h^e_j is then used, together with yi−1, to produce yi.