Lesson 4: Attention is all you need

Encoder and Decoder processes


A computer cannot learn words directly from raw data such as images, text files, sound files or videos, so it needs one process to transform the information into numbers and another to turn those numbers into output results; these two processes are called the encoder and the decoder (a short code sketch follows this list):
 Encoder: the phase that transforms the input into features suited to learning the task. In a plain neural network the encoder is the hidden layers; in a CNN it is the chain of convolutional and max-pooling layers; in an RNN it is the embedding and recurrent layers.
 Decoder: the decoder's input is the encoder's output. This phase turns the features learned by the encoder into a probability distribution and identifies the output label. The result may be a single label for a classification model, or a sequence of labels in time order for a seq2seq model.
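As a concrete illustration of this split, here is a minimal sketch (assuming TensorFlow/Keras; the layer sizes and the 28×28 input shape are arbitrary choices, not taken from the article) of a CNN classifier whose encoder is a chain of convolution and max-pooling layers and whose decoder is a stack of dense layers producing the label distribution:

```python
# Minimal sketch: splitting a CNN classifier into an encoder and a decoder.
# Assumes TensorFlow/Keras; shapes and sizes are arbitrary illustrations.
from tensorflow.keras import layers, models

# Encoder: convolution + max-pooling layers turn raw pixels into learned features.
encoder = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])

# Decoder: turns the encoded features into a probability distribution over labels.
decoder = models.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model = models.Sequential([encoder, decoder])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```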
RNNs, a family of seq2seq models, were extended with architectures such as LSTM and GRU to ease their limitation in learning long-range dependencies between words. However, because the improvement was still modest, a replacement technique called attention is applied to obtain better results.
A seq2seq model processes a series in time order. In a machine-translation task, an input word usually has the strongest connection with the output word at roughly the same position. Attention therefore helps the algorithm focus on (input, output) word pairs that are at or near the same location.
(image)
As illustrated in image 2, the English word "I" corresponds to the French word "Je", meaning that the attention layer gives it a larger weight in the context vector than the other words.
The blue area represents the encoder process and the red area the decoder. The blue cells are the hidden states h_t produced at each unit (in Keras, an RNN returns the hidden state of every unit when the parameter return_sequences=True is set; otherwise only the hidden state of the final unit is extracted). The context vectors are linear combinations of these hidden states weighted by the attention weights. At the first position of the decoder phase, the context vector puts a higher weight on the first input position than on the remaining ones; in general, the context vector at each time step prioritizes, through higher weights, the words aligned with that time step. Attention is a "superposition" in the sense that it covers the whole sentence rather than just one input word, unlike the model without an attention layer shown in image 1. The attention layer is built as follows:
1. First, at time step t, we compute a list of scores, one for each pair formed by the decoder position t and an encoder position s, using a scoring function score(h_t, h_s). Here h_t is the hidden state of the t-th target word in the decoder phase and h_s is the hidden state of the s-th source word in the encoder phase. The scoring function can be, for example, a dot product or a cosine similarity, depending on the design choice.
2. The scores from step 1 are not yet normalized. To turn them into a probability distribution, we pass them through a softmax function and obtain the attention weights:

\alpha_{ts} = \frac{\exp(\mathrm{score}(h_t, h_s))}{\sum_{s'=1}^{S} \exp(\mathrm{score}(h_t, h_{s'}))}

where \alpha_{ts} is the attention weight of input word s with respect to the word at position t in the output (target).
3. The probability distribution \alpha_{ts} is combined with the encoder hidden states to obtain the context vector:

c_t = \sum_{s'=1}^{S} \alpha_{ts'} h_{s'}

4. Finally, we compute the attention vector used to decode the corresponding word in the target language. The attention vector is a combination of the context vector and the hidden state of the decoder phase, so it learns not only from the final unit's hidden state but from all input positions through the context vector. The equation is similar to the way a hidden state is computed inside an RNN cell, a tanh applied to a linear transformation (see the sketch right after these steps):

a_t = \tanh(W_c [c_t; h_t])
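The four steps can be sketched in a few lines of NumPy (a minimal sketch with dot-product scoring; the sequence length, hidden size and the matrix W_c are random stand-ins for values that would be learned in practice):

```python
import numpy as np

np.random.seed(0)
S, d = 5, 8                      # source length and hidden size (arbitrary)
h_s = np.random.randn(S, d)      # encoder hidden states, one per source word
h_t = np.random.randn(d)         # decoder hidden state at time step t

# Step 1: dot-product scores between the decoder state and every encoder state.
scores = h_s @ h_t               # shape (S,)

# Step 2: softmax turns the scores into attention weights alpha_ts.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Step 3: context vector = weighted sum of the encoder hidden states.
c_t = alpha @ h_s                # shape (d,)

# Step 4: attention vector combines the context vector with the decoder state.
W_c = np.random.randn(d, 2 * d)  # learned in practice; random here for the sketch
a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))
print(np.round(alpha, 3), a_t.shape)
```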
Transformer and Seq2seq model
This article is based on the paper "Attention Is All You Need" and introduces some of the attention models used in machine learning. Architectures built on the transformer are more advanced than the older RNN models, even though both belong to the seq2seq family, which transfers an input sentence in language A into an output sentence in language B. The transformation process is based on two phases: encoder and decoder.
RNNs were long considered the best model family for machine translation thanks to their ability to capture the temporal dependencies between words in a sentence; however, recent research shows that this performance can be improved significantly by using attention structures. One of the most outstanding improvements is the BERT model (see the reference Pre-training of Deep Bidirectional Transformers for Language Understanding).
So what exactly is a Transformer, and what does its architecture look like? Take a look at the following graph:
(image)
This architecture contains 2 main parts: the encoder on the left and the decoder on the right.
Encoder: a stack of 6 identical layers, each containing 2 sub-layers. The first sub-layer is multi-head self-attention, which we will look at later; the second is simply a fully connected feed-forward network, one of the most basic concepts in machine learning (see the Multi-layer Perceptron link for more information). Note that a residual connection is used around each sub-layer, followed immediately by a normalization layer, the same idea as the ResNet architecture in CNNs. The output of each sub-layer is LayerNorm(x + Sublayer(x)), and the number of dimensions is 512.
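A minimal sketch of the LayerNorm(x + Sublayer(x)) pattern (plain NumPy, omitting the learned scale and bias of layer normalization; the sub-layer here is a stand-in function, not a real attention or feed-forward layer):

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension (learned scale/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, d_model)           # 10 positions, 512 features each
out = add_and_norm(x, lambda h: 0.5 * h)   # stand-in for attention or feed-forward
print(out.shape)                           # (10, 512)
```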
Decoder: the decoder is also a stack of 6 layers with the same architecture as the encoder's sub-layers, except for one special sub-layer in the first position that represents the attention distribution. This layer is no different from a multi-head self-attention layer, except that it is altered so that future words are not brought into the attention. At step i of the decoder phase we can only know the words at positions lower than i, so the attention adjustment is applied only to words at those lower positions. The residual structure is applied in the same way as in the encoder phase.
Note that there is always a step that adds Positional Encoding to the encoder and decoder inputs, in order to inject the time (position) element into the model for better accuracy. This is simply the addition of a vector encoding the position of each word in the sentence to its word-representation vector. The positions can be encoded with [0, 1] values or with sin and cos functions as in the paper.
Attention mechanisms
Scaled dot-product attention
This is in fact a self-attention structure: it lets the model adjust the weight it gives to each word in the sentence, typically assigning larger weights to the most related (often nearby) words and vice versa. After passing through the embedding layer, the encoder and decoder inputs form a matrix X of size m × n, where m is the sentence length and n is the number of dimensions of an embedded word vector.
(image)
To compute the score for each pair (w_i, w_j), we take the dot product between the query and the key to measure the relationship between the words; in the paper this product is also divided by \sqrt{d_k} before the softmax. Since the raw score is not normalized, a softmax function is applied to obtain a probability distribution whose values represent how much attention a query word pays to each key word; a larger weight means that w_i attends more strongly to w_j. The softmax output is then multiplied by the words' value vectors to obtain the attention vector, which has learned from the whole input sentence.
(image)
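A minimal NumPy sketch of scaled dot-product attention (Q, K, V are random stand-ins; this illustrates the computation, it is not the paper's reference code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores between every query and every key, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # attention distribution per query
    return weights @ V, weights

m, d = 6, 64                             # sentence length and head size (arbitrary)
Q = np.random.randn(m, d)
K = np.random.randn(m, d)
V = np.random.randn(m, d)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)         # (6, 64) (6, 6)
```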
Multi-head Attention
After the scaled dot-product step we obtain an attention matrix whose learned parameters are W_q, W_k, W_v. Each such computation is called an attention head; repeating the process several times with different projections gives Multi-head Attention, as follows:
(image)
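A minimal NumPy sketch of multi-head attention (the projection matrices W_q, W_k, W_v and W_o are random stand-ins for learned weights; a compact copy of the scaled dot-product step is inlined so the snippet runs on its own):

```python
import numpy as np

def _attention(Q, K, V):
    # Scaled dot-product attention (same computation as the previous sketch).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, num_heads, d_model, rng=np.random):
    # Each head gets its own W_q, W_k, W_v projecting X into a smaller subspace.
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.randn(d_model, d_head) for _ in range(3))
        heads.append(_attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the heads and project back to d_model.
    W_o = rng.randn(d_model, d_model)
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.randn(6, 512)                                      # m x d_model
print(multi_head_attention(X, num_heads=8, d_model=512).shape)   # (6, 512)
```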
Encoder and Decoder processes
In the second sub-layer we pass everything through fully connected layers and obtain an output with the same shape as the input, so that this block can be repeated Nx times.
(image)
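As a rough illustration, this position-wise feed-forward sub-layer is two linear transformations with a ReLU in between, applied identically at every position, so the output keeps the input's shape (NumPy sketch; d_ff = 2048 follows the paper, the weights are random stand-ins):

```python
import numpy as np

def feed_forward(x, d_ff=2048, rng=np.random):
    # Two linear layers with a ReLU in between, applied to every position independently.
    d_model = x.shape[-1]
    W1, b1 = rng.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = rng.randn(d_ff, d_model), np.zeros(d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(6, 512)
print(feed_forward(x).shape)   # (6, 512): same shape as the input
```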
After these transformations, information about the original input, including word position, is easily diluted, so a residual connection is applied to add the initial input back into the output, the same idea as ResNet in CNNs. Another normalization layer is added right after the addition to keep the training process stable.
Repeating the above block 6 times and labelling the result an encoder, we can simplify the encoder process into the following graph:
(image)
The decoder process is similar to the encoder except for the following features:
 The decoder generates words in order from left to right, one at each time step.
 We seed the first step with the <start> token, which can be seen as the trigger that starts the decoder. Decoding continues until the <end> token is produced, ending the translated sentence.
 The encoder's final output matrix is fed into each decoder block as an input of its multi-head attention layer.
 A Masked Multi-head Attention layer is added at the beginning of each block; it is the same as Multi-head Attention except that attention to future words is not counted (a small sketch of the mask follows the figure below).
(image)
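The masking itself can be sketched as setting the scores of future positions to negative infinity before the softmax, so their attention weights become zero (a minimal NumPy sketch):

```python
import numpy as np

def causal_mask(length):
    # mask[i, j] is True when position j is in the future relative to position i.
    return np.triu(np.ones((length, length), dtype=bool), k=1)

scores = np.random.randn(4, 4)                 # raw attention scores for 4 positions
scores[causal_mask(4)] = -np.inf               # forbid attention to future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                    # upper triangle is all zeros
```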
At each time step t, the decoder takes as input the encoder's final output together with the decoder's word at position t−1. After passing through the decoder's 6 block layers, the model returns a representation vector for the predicted word, whose probability distribution is computed with a linear layer followed by a softmax. In the paper the authors use label smoothing with \epsilon_{ls} = 0.1 to improve accuracy and the BLEU score: the value at the target label position is reduced to slightly less than 1 and the other positions are raised slightly above 0. This hurts the model's certainty but increases accuracy, since in reality one sentence can have many valid translations.
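One common way to implement label smoothing looks like this (a minimal NumPy sketch; the paper's exact variant may distribute the mass slightly differently):

```python
import numpy as np

def smooth_labels(target_index, vocab_size, eps=0.1):
    # The target keeps 1 - eps of the probability mass; the rest is spread
    # uniformly over the other labels.
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target_index] = 1.0 - eps
    return dist

print(smooth_labels(target_index=2, vocab_size=5))
# [0.025 0.025 0.9   0.025 0.025]  -- target < 1, other positions > 0
```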
BLEU score
Measuring machine-translation quality is not as easy as other classification tasks, because each input sentence has many valid translations, so it is impossible to use a single fixed label for all of them as in precision or recall.
How is the BLEU score (bilingual evaluation understudy) calculated?
As introduced above, the BLEU score has the advantage of evaluating the machine-translation result by scoring how close it is in meaning to one or more reference translations. The BLEU score can be calculated from P1, P2, P3 and P4, where P_n is the logarithm of the modified n-gram precision, through the exponential:

\mathrm{BLEU} = \exp\left(\frac{P_1 + P_2 + P_3 + P_4}{4}\right)
However, one of the most notable restrictions of the BLEU score is that shorter sentences tend to receive higher scores. The reason is that shorter sentences contain fewer n-grams, and those n-grams appear with higher frequency within the reference translations. A penalty factor for short outputs, called the Brevity Penalty (BP), fixes this issue:
(image)
where BP equals 1 when the candidate translation is longer than the reference and exp(1 − r/c) otherwise (c being the candidate length and r the reference length), giving:

\mathrm{BLEU} = BP \cdot \exp\left(\frac{P_1 + P_2 + P_3 + P_4}{4}\right)

Today, many packages for multilingual machine learning support BLEU score calculation; in Python, for example, the nltk package can be used as follows:
(image)
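For example (a minimal sketch; the sentences are made-up tokenized examples, and smoothing is enabled to avoid zero n-gram counts on short sentences):

```python
# Minimal BLEU example with nltk (pip install nltk); sentences are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "is", "on", "the", "mat"],
              ["there", "is", "a", "cat", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to the P1..P4 terms above.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```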
Practicing with Attention Layer
Code for a Multi-head Attention layer can be found in several public sources on the internet; one of the best known is Transformer – Kyubyong. We will not re-derive the algorithm here, but only explain the functions used in the reference code.
The scale_dot_product_attention function returns the result of equation (1) given the three matrices Q, K and V, passing them through a mask function as follows:
(image)
Example of the mask function's result applied to the key.
(image)
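Kyubyong's exact implementation is in the linked repository; as a rough stand-in for what key masking does (not his code), padded key positions are pushed to a large negative score so the softmax assigns them almost no attention:

```python
import numpy as np

def mask_keys(scores, key_is_pad, neg=-1e9):
    # key_is_pad[j] is True where key position j is padding; those columns are
    # pushed to a large negative value so softmax gives them ~zero weight.
    masked = scores.copy()
    masked[:, key_is_pad] = neg
    return masked

scores = np.random.randn(4, 6)                       # 4 queries x 6 keys
key_is_pad = np.array([False, False, False, False, True, True])
print(mask_keys(scores, key_is_pad)[:, 4:])          # last two key columns are -1e9
```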
Example of the scale_dot_product_attention function.
(image)
The multi-head attention function is calculated based on scale_dot_product_attention.
(image)
The main point to note in the multi-head code is that the three matrices Q, K and V must be reshaped to create h transformed sub-matrices Q', K', V' of the same size. Each of these transformations becomes one head of the multi-head attention.
Looking at the results reported in the paper "Attention Is All You Need", it is easy to see that the accuracy is higher than with other machine-translation model families, while the computational cost is much lower.
(image)
Conclusions
We can take away the following main points from this article:
 A machine-translation algorithm contains 2 phases: encoder and decoder.
 The structure of the Transformer network.
 The scale_dot_product_attention transformation used to find the attention weight matrix.
 Multi-head attention layers built from single heads.
 A metric for measuring the accuracy of a machine-translation model: the BLEU score (bilingual evaluation understudy).
This article draws on many reference documents. I am truly grateful to the authors of the original papers and blogs for sharing valuable information about this topic, and I welcome your feedback on any mistakes in this article.
Reference documents
1. Attention Is All You Need - the Google Brain authors
2. Effective Approaches to Attention-based Neural Machine Translation - Minh-Thang Luong, Hieu Pham, Christopher D. Manning
3. A TensorFlow implementation of the Transformer: Attention Is All You Need - Kyubyong
4. Tensor2Tensor project
5. BLEU score - machinelearningmastery
6. Minsuk Heo - YouTube channel
7. What is a Transformer? - Maxime Allard
8. Deep Learning for NLP - CS224d, Stanford
