Encoder-Decoder architecture and Transformers in MT
Rishu Kumar
August 9, 2024
Outline
1. Recap
2. Attention is all you need
3. Encoder-Decoder models
4. Transformer
Recap
RNN
Figure: This is an RNN¹
¹ Image credits: Jindřich Helcl, Jindřich Libovický, unless explicitly mentioned otherwise.
A Fancy image for RNN
Vanilla RNN
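The vanilla RNN slide itself is only an image; as a minimal NumPy sketch (the weight names and toy dimensions are my own illustrative assumptions), one recurrent step simply mixes the current input with the previous hidden state:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions, illustrative only: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                      # initial hidden state
for x_t in rng.normal(size=(5, 4)):  # a source "sentence" of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)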
LSTM
Figure: LSTM (Long Short-Term Memory)
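The cell itself is only shown as an image; for reference, the standard LSTM formulation (not reproduced from the slide) uses three gates and a memory cell:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)    (output gate)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

The additive update of the cell state c_t is what lets information and gradients survive over longer spans than in a vanilla RNN.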
Attention is all you need
But why?
Even an LSTM is still unable to keep the relevant information across a long input.
The attention mechanism was introduced to mitigate this shortcoming.
Transformer-based models are one family of models built around attention.
Recommended Reading:
1. Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2016²)
2. Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
² The original paper was first published in 2014.
Encoder-Decoder models
Encoder-Decoder
Figure: Encoder-Decoder overview
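A minimal sketch of the overall structure (function and variable names are my own, purely illustrative): the encoder reads the whole source sentence and hands the decoder a single fixed-size vector, which is exactly the bottleneck discussed on the next slides.

import numpy as np

def encode(source_vectors, W_x, W_h, b):
    # Run a vanilla RNN over the source and keep only the final hidden state.
    h = np.zeros(W_h.shape[0])
    for x_t in source_vectors:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h  # one fixed-size vector has to summarise the whole sentence

def decode_step(y_prev, s_prev, U_y, U_s, c):
    # One decoder step, conditioned on the previous output embedding and decoder state.
    return np.tanh(U_y @ y_prev + U_s @ s_prev + c)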
A Better Representation
Shortcomings
Figure: Losing information from words³
³ Image credit: [Link]
Shortcomings
Figure: Information and processing bottleneck
Let’s introduce Attention
Attention: probabilistic retrieval of encoder states when estimating the probability of each target word.
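A minimal sketch of that "probabilistic retrieval" view (dot-product scoring is used here only for illustration; Bahdanau et al. learn a small alignment network instead):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # Score every encoder state, turn the scores into a probability distribution,
    # and return the probability-weighted mixture as the context vector.
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # the "probabilistic retrieval"
    return weights @ encoder_states, weights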
RNN with Attention
Attention, but with Maths
Bahdanau et al. describe e_{ij} in a different manner:
e_{ij} is the score of an alignment model, measuring how well the inputs around position j and the output at position i match.
We are just expanding the equation as described in that paper; if you remember RNNs, it is computed from the outputs of the recurrent operations.
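Since the equations themselves live on the slide image, here is the attention from Bahdanau et al. written out: the context vector c_i for target position i is a weighted sum of the encoder annotations h_j, where the weights come from a softmax over the alignment scores, and a(\cdot) is a small feed-forward network applied to the previous decoder state s_{i-1} and each h_j.

e_{ij} = a(s_{i-1}, h_j)
\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})
c_i = \sum_j \alpha_{ij} h_j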
Visualisation of RNN with attention
Transformer
Finally, let's start with the Transformer
Throughout the discussion of Transformers, we will keep coming across three terms, namely key, query, and value, so let's define them (here in terms of the encoder-decoder attention we just saw):
key: vectors representing all the encoder inputs (what we match the query against)
query: the hidden state of the decoder (what we are looking for)
value: the encoder hidden states (what we retrieve)
The attention is defined in the Attention is all you need paper as:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_k}} \right) V
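A minimal single-head NumPy sketch of this scaled dot-product attention (no masking, no multi-head projections; the toy shapes at the bottom are illustrative assumptions):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n_queries, d_v)

# Toy example: 2 queries attending over 4 key/value pairs of width 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
context = scaled_dot_product_attention(Q, K, V)       # shape (2, 8)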
Transformer Visualized
The Illustrated Transformer
[Link]