Lesson 4: Attention is all you need

Encoder and Decoder processes


A computer cannot learn words directly from raw data such as images, text files, sound files or videos, so it needs one process to transform the information into numbers and another to turn those numbers into output results; these two processes are called the encoder and the decoder (a short code sketch follows this list):
 Encoder: the phase that transforms the input into features suited to learning the task. In a plain neural network the encoder is the hidden layers; in a CNN it is the chain of convolutional and max-pooling layers; in an RNN it is the embedding and recurrent layers.
 Decoder: the decoder's input is the encoder's output. This phase turns the features learned by the encoder into a probability distribution and identifies the output label. The result may be a single label for a classification model, or a sequence of labels in time order for a seq2seq model.
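As a concrete illustration of this split, here is a minimal sketch (assuming TensorFlow/Keras; the layer sizes and the 28×28 input shape are arbitrary choices, not taken from the article) of a CNN classifier whose encoder is a chain of convolution and max-pooling layers and whose decoder is a stack of dense layers producing the label distribution:

```python
# Minimal sketch: splitting a CNN classifier into an encoder and a decoder.
# Assumes TensorFlow/Keras; shapes and sizes are arbitrary illustrations.
from tensorflow.keras import layers, models

# Encoder: convolution + max-pooling layers turn raw pixels into learned features.
encoder = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])

# Decoder: turns the encoded features into a probability distribution over labels.
decoder = models.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model = models.Sequential([encoder, decoder])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```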
RNNs, a family of seq2seq models, were extended with architectures such as LSTM and GRU to ease their limitation in learning long-range dependencies between words. However, because the improvement was still modest, a replacement technique called attention is applied to obtain better results.
A seq2seq model processes a series in time order. In a machine-translation task, an input word usually has the strongest connection with the output word at roughly the same position. Attention therefore helps the algorithm focus on (input, output) word pairs that are at or near the same location.
(image)
As illustrated in image 2, the English word "I" corresponds to the French word "Je", meaning that the attention layer gives it a larger weight in the context vector than the other words.
The blue area represents the encoder process and the red area the decoder. The blue cells are the hidden states h_t produced at each unit (in Keras, an RNN returns the hidden state of every unit when the parameter return_sequences=True is set; otherwise only the hidden state of the final unit is extracted). The context vectors are linear combinations of these hidden states weighted by the attention weights. At the first position of the decoder phase, the context vector puts a higher weight on the first input position than on the remaining ones; in general, the context vector at each time step prioritizes, through higher weights, the words aligned with that time step. Attention is a "superposition" in the sense that it covers the whole sentence rather than just one input word, unlike the model without an attention layer shown in image 1. The attention layer is built as follows:
1. First, at time step t, we compute a list of scores, one for each pair formed by the decoder position t and an encoder position s, using a scoring function score(h_t, h_s). Here h_t is the hidden state of the t-th target word in the decoder phase and h_s is the hidden state of the s-th source word in the encoder phase. The scoring function can be, for example, a dot product or a cosine similarity, depending on the design choice.
2. The scores from step 1 are not yet normalized. To turn them into a probability distribution, we pass them through a softmax function and obtain the attention weights:

\alpha_{ts} = \frac{\exp(\mathrm{score}(h_t, h_s))}{\sum_{s'=1}^{S} \exp(\mathrm{score}(h_t, h_{s'}))}

where \alpha_{ts} is the attention weight of input word s with respect to the word at position t in the output (target).
3. The probability distribution \alpha_{ts} is combined with the encoder hidden states to obtain the context vector:

c_t = \sum_{s'=1}^{S} \alpha_{ts'} h_{s'}

4. Finally, we compute the attention vector used to decode the corresponding word in the target language. The attention vector is a combination of the context vector and the hidden state of the decoder phase, so it learns not only from the final unit's hidden state but from all input positions through the context vector. The equation is similar to the way a hidden state is computed inside an RNN cell, a tanh applied to a linear transformation (see the sketch right after these steps):

a_t = \tanh(W_c [c_t; h_t])
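The four steps can be sketched in a few lines of NumPy (a minimal sketch with dot-product scoring; the sequence length, hidden size and the matrix W_c are random stand-ins for values that would be learned in practice):

```python
import numpy as np

np.random.seed(0)
S, d = 5, 8                      # source length and hidden size (arbitrary)
h_s = np.random.randn(S, d)      # encoder hidden states, one per source word
h_t = np.random.randn(d)         # decoder hidden state at time step t

# Step 1: dot-product scores between the decoder state and every encoder state.
scores = h_s @ h_t               # shape (S,)

# Step 2: softmax turns the scores into attention weights alpha_ts.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Step 3: context vector = weighted sum of the encoder hidden states.
c_t = alpha @ h_s                # shape (d,)

# Step 4: attention vector combines the context vector with the decoder state.
W_c = np.random.randn(d, 2 * d)  # learned in practice; random here for the sketch
a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))
print(np.round(alpha, 3), a_t.shape)
```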
Transformer and Seq2seq model
This article is based on the paper "Attention Is All You Need" and introduces some of the attention models used in machine learning. Architectures built on the transformer are more advanced than the older RNN models, even though both belong to the seq2seq family, which transfers an input sentence in language A into an output sentence in language B. The transformation process is based on two phases: encoder and decoder.
RNNs were long considered the best model family for machine translation thanks to their ability to capture the temporal dependencies between words in a sentence; however, recent research shows that this performance can be improved significantly by using attention structures. One of the most outstanding improvements is the BERT model (see the reference Pre-training of Deep Bidirectional Transformers for Language Understanding).
So what exactly is a Transformer, and what does its architecture look like? Take a look at the following graph:
(image)
This architecture contains 2 main parts: the encoder on the left and the decoder on the right.
Encoder: a stack of 6 identical layers, each containing 2 sub-layers. The first sub-layer is multi-head self-attention, which we will look at later; the second is simply a fully connected feed-forward network, one of the most basic concepts in machine learning (see the Multi-layer Perceptron link for more information). Note that a residual connection is used around each sub-layer, followed immediately by a normalization layer, the same idea as the ResNet architecture in CNNs. The output of each sub-layer is LayerNorm(x + Sublayer(x)), and the number of dimensions is 512.
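A minimal sketch of the LayerNorm(x + Sublayer(x)) pattern (plain NumPy, omitting the learned scale and bias of layer normalization; the sub-layer here is a stand-in function, not a real attention or feed-forward layer):

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension (learned scale/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, d_model)           # 10 positions, 512 features each
out = add_and_norm(x, lambda h: 0.5 * h)   # stand-in for attention or feed-forward
print(out.shape)                           # (10, 512)
```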
Decoder: the decoder is also a stack of 6 layers with the same architecture as the encoder's sub-layers, except for one special sub-layer in the first position that represents the attention distribution. This layer is no different from a multi-head self-attention layer, except that it is altered so that future words are not brought into the attention. At step i of the decoder phase we can only know the words at positions lower than i, so the attention adjustment is applied only to words at those lower positions. The residual structure is applied in the same way as in the encoder phase.
Note that there is always a step that adds Positional Encoding to the encoder and decoder inputs, in order to inject the time (position) element into the model for better accuracy. This is simply the addition of a vector encoding the position of each word in the sentence to its word-representation vector. The positions can be encoded with [0, 1] values or with sin and cos functions as in the paper.
Attention mechanisms
Scaled dot-product attention
This is in fact a self-attention structure: it lets the model adjust the weight it gives to each word in the sentence, typically assigning larger weights to the most related (often nearby) words and vice versa. After passing through the embedding layer, the encoder and decoder inputs form a matrix X of size m × n, where m is the sentence length and n is the number of dimensions of an embedded word vector.
(image)
To compute the score for each pair (w_i, w_j), we take the dot product between the query and the key to measure the relationship between the words; in the paper this product is also divided by \sqrt{d_k} before the softmax. Since the raw score is not normalized, a softmax function is applied to obtain a probability distribution whose values represent how much attention a query word pays to each key word; a larger weight means that w_i attends more strongly to w_j. The softmax output is then multiplied by the words' value vectors to obtain the attention vector, which has learned from the whole input sentence.
(image)
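A minimal NumPy sketch of scaled dot-product attention (Q, K, V are random stand-ins; this illustrates the computation, it is not the paper's reference code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores between every query and every key, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # attention distribution per query
    return weights @ V, weights

m, d = 6, 64                             # sentence length and head size (arbitrary)
Q = np.random.randn(m, d)
K = np.random.randn(m, d)
V = np.random.randn(m, d)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)         # (6, 64) (6, 6)
```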
Multi-head Attention
After the scaled dot-product step we obtain an attention matrix whose learned parameters are W_q, W_k, W_v. Each such computation is called an attention head; repeating the process several times with different projections gives Multi-head Attention, as follows:
(image)
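A minimal NumPy sketch of multi-head attention (the projection matrices W_q, W_k, W_v and W_o are random stand-ins for learned weights; a compact copy of the scaled dot-product step is inlined so the snippet runs on its own):

```python
import numpy as np

def _attention(Q, K, V):
    # Scaled dot-product attention (same computation as the previous sketch).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, num_heads, d_model, rng=np.random):
    # Each head gets its own W_q, W_k, W_v projecting X into a smaller subspace.
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.randn(d_model, d_head) for _ in range(3))
        heads.append(_attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the heads and project back to d_model.
    W_o = rng.randn(d_model, d_model)
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.randn(6, 512)                                      # m x d_model
print(multi_head_attention(X, num_heads=8, d_model=512).shape)   # (6, 512)
```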
Encoder and Decoder processes
In the second sub-layer we pass everything through fully connected layers and obtain an output with the same shape as the input, so that this block can be repeated Nx times.
(image)
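As a rough illustration, this position-wise feed-forward sub-layer is two linear transformations with a ReLU in between, applied identically at every position, so the output keeps the input's shape (NumPy sketch; d_ff = 2048 follows the paper, the weights are random stand-ins):

```python
import numpy as np

def feed_forward(x, d_ff=2048, rng=np.random):
    # Two linear layers with a ReLU in between, applied to every position independently.
    d_model = x.shape[-1]
    W1, b1 = rng.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = rng.randn(d_ff, d_model), np.zeros(d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(6, 512)
print(feed_forward(x).shape)   # (6, 512): same shape as the input
```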
After these transformations, information about the original input, including word position, is easily diluted, so a residual connection is applied to add the initial input back into the output, the same idea as ResNet in CNNs. Another normalization layer is added right after the addition to keep the training process stable.
Repeating the above block 6 times and labelling the result an encoder, we can simplify the encoder process into the following graph:
(image)
The decoder process is similar to the encoder except for the following features:
 The decoder generates words in order from left to right, one at each time step.
 We seed the first step with the <start> token, which can be seen as the trigger that starts the decoder. Decoding continues until the <end> token is produced, ending the translated sentence.
 The encoder's final output matrix is fed into each decoder block as an input of its multi-head attention layer.
 A Masked Multi-head Attention layer is added at the beginning of each block; it is the same as Multi-head Attention except that attention to future words is not counted (a small sketch of the mask follows the figure below).
(image)
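The masking itself can be sketched as setting the scores of future positions to negative infinity before the softmax, so their attention weights become zero (a minimal NumPy sketch):

```python
import numpy as np

def causal_mask(length):
    # mask[i, j] is True when position j is in the future relative to position i.
    return np.triu(np.ones((length, length), dtype=bool), k=1)

scores = np.random.randn(4, 4)                 # raw attention scores for 4 positions
scores[causal_mask(4)] = -np.inf               # forbid attention to future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                    # upper triangle is all zeros
```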
At each time step t, the decoder takes as input the encoder's final output together with the decoder's word at position t−1. After passing through the decoder's 6 block layers, the model returns a representation vector for the predicted word, whose probability distribution is computed with a linear layer followed by a softmax. In the paper the authors use label smoothing with \epsilon_{ls} = 0.1 to improve accuracy and the BLEU score: the value at the target label position is reduced to slightly less than 1 and the other positions are raised slightly above 0. This hurts the model's certainty but increases accuracy, since in reality one sentence can have many valid translations.
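One common way to implement label smoothing looks like this (a minimal NumPy sketch; the paper's exact variant may distribute the mass slightly differently):

```python
import numpy as np

def smooth_labels(target_index, vocab_size, eps=0.1):
    # The target keeps 1 - eps of the probability mass; the rest is spread
    # uniformly over the other labels.
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target_index] = 1.0 - eps
    return dist

print(smooth_labels(target_index=2, vocab_size=5))
# [0.025 0.025 0.9   0.025 0.025]  -- target < 1, other positions > 0
```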
BLEU score
Measuring machine-translation quality is not as easy as other classification tasks, because each input sentence has many valid translations, so it is impossible to use a single fixed label for all of them as in precision or recall.
How is the BLEU score (bilingual evaluation understudy) calculated?
As introduced above, the BLEU score has the advantage of evaluating the machine-translation result by scoring how close it is in meaning to one or more reference translations. The BLEU score can be calculated from P1, P2, P3 and P4, where P_n is the logarithm of the modified n-gram precision, through the exponential:

\mathrm{BLEU} = \exp\left(\frac{P_1 + P_2 + P_3 + P_4}{4}\right)
However, one of the most notable restrictions of the BLEU score is that shorter sentences tend to receive higher scores. The reason is that shorter sentences contain fewer n-grams, and those n-grams appear with higher frequency within the reference translations. A penalty factor for short outputs, called the Brevity Penalty (BP), fixes this issue:
(image)
where BP equals 1 when the candidate translation is longer than the reference and exp(1 − r/c) otherwise (c being the candidate length and r the reference length), giving:

\mathrm{BLEU} = BP \cdot \exp\left(\frac{P_1 + P_2 + P_3 + P_4}{4}\right)

Today, many packages for multilingual machine learning support BLEU score calculation; in Python, for example, the nltk package can be used as follows:
(image)
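For example (a minimal sketch; the sentences are made-up tokenized examples, and smoothing is enabled to avoid zero n-gram counts on short sentences):

```python
# Minimal BLEU example with nltk (pip install nltk); sentences are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "is", "on", "the", "mat"],
              ["there", "is", "a", "cat", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to the P1..P4 terms above.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```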
Practicing with Attention Layer
Code for a Multi-head Attention layer can be found in several public sources on the internet; one of the best known is Transformer – Kyubyong. We will not re-derive the algorithm here, but only explain the functions used in the reference code.
The scale_dot_product_attention function returns the result of equation (1) given the three matrices Q, K and V, passing them through a mask function as follows:
(image)
Example of the mask function's result applied to the key.
(image)
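Kyubyong's exact implementation is in the linked repository; as a rough stand-in for what key masking does (not his code), padded key positions are pushed to a large negative score so the softmax assigns them almost no attention:

```python
import numpy as np

def mask_keys(scores, key_is_pad, neg=-1e9):
    # key_is_pad[j] is True where key position j is padding; those columns are
    # pushed to a large negative value so softmax gives them ~zero weight.
    masked = scores.copy()
    masked[:, key_is_pad] = neg
    return masked

scores = np.random.randn(4, 6)                       # 4 queries x 6 keys
key_is_pad = np.array([False, False, False, False, True, True])
print(mask_keys(scores, key_is_pad)[:, 4:])          # last two key columns are -1e9
```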
Example of the scale_dot_product_attention function.
(image)
The multi-head attention function is calculated based on scale_dot_product_attention.
(image)
The main point to note in the multi-head code is that the three matrices Q, K and V must be reshaped to create h transformed sub-matrices Q', K', V' of the same size. Each of these transformations becomes one head of the multi-head attention.
Looking at the results reported in the paper "Attention Is All You Need", it is easy to see that the accuracy is higher than with other machine-translation model families, while the computational cost is much lower.
(image)
Conclusions
We can take away the following main points from this article:
 A machine-translation algorithm contains 2 phases: encoder and decoder.
 The structure of the Transformer network.
 The scale_dot_product_attention transformation used to find the attention weight matrix.
 Multi-head attention layers built from single heads.
 A metric for measuring the accuracy of a machine-translation model: the BLEU score (bilingual evaluation understudy).
This article draws on many reference documents. I am truly grateful to the authors of the original papers and blogs for sharing valuable information about this topic, and I welcome your feedback on any mistakes in this article.
Reference documents
1. Attention Is All You Need - the Google Brain authors
2. Effective Approaches to Attention-based Neural Machine Translation - Minh-Thang Luong, Hieu Pham, Christopher D. Manning
3. A TensorFlow implementation of the Transformer: Attention Is All You Need - Kyubyong
4. Tensor2Tensor project
5. BLEU score - machinelearningmastery
6. Minsuk Heo - YouTube channel
7. What is a Transformer? - Maxime Allard
8. Deep Learning for NLP - CS224d, Stanford
