
Encoder-Decoder architecture and Transformers in MT

Rishu Kumar

August 9, 2024




Outline

1 Recap

2 Attention is all you need

3 Encoder-Decoder models

4 Transformer



Recap

RNN

Figure: This is an RNN¹

¹ Image credits: Jindřich Helcl, Jindřich Libovický; unless explicitly mentioned.

A Fancy image for RNN




Vanilla RNN
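
As a minimal sketch, the vanilla RNN update h_t = tanh(W_x x_t + W_h h_{t-1} + b) can be written in NumPy as follows (the sizes and names are illustrative assumptions, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5                          # illustrative sizes
W_x = rng.normal(size=(d_h, d_in))        # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))         # hidden-to-hidden weights
b = np.zeros(d_h)

h = np.zeros(d_h)                         # initial hidden state
for x_t in rng.normal(size=(4, d_in)):    # a toy sequence of 4 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)     # h carries history forward
```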




LSTM

Figure: LSTM: Long Short-Term Memory
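
For reference, the standard LSTM cell equations (σ is the logistic sigmoid, ⊙ the elementwise product, and [h_{t-1}, x_t] denotes concatenation):

\[
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]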





Attention is all you need

But why?

Long Short-Term Memory is still unable to keep the relevant information over long inputs.

The attention mechanism was introduced to mitigate this shortcoming.

Transformer-based models are among the models that use attention.

Recommended Reading:
1 Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2016²)
2 Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)

² Original paper published in 2014.


Encoder-Decoder models

Encoder-Decoder

Figure: Encoder-Decoder overview




A Better Representation




Shortcomings

Figure: Losing information from words³

³ Image credit: [Link]

Shortcomings

Figure: Information and processing bottleneck




Let’s introduce Attention

Attention: probabilistic retrieval of encoder states for estimating the probability of target words.
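
As a minimal sketch of this retrieval for a single decoder state, using dot-product scoring (a Luong-style choice; all names here are illustrative):

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Probabilistic retrieval: a softmax distribution over encoder
    states, then their weighted average as the context vector."""
    scores = encoder_states @ decoder_state   # one score per encoder state
    weights = np.exp(scores - scores.max())   # stable softmax numerator
    weights /= weights.sum()                  # distribution over states
    context = weights @ encoder_states        # expected encoder state
    return context, weights
```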




RNN with Attention




Attention, but with Maths




Attention, but with Maths

Bahdanau et al. describe e_ij in a different manner:

e_ij is an alignment model which scores how well the inputs around position j and the output at position i match.

We are just expanding the equation as described in the paper; if you remember RNNs, it is the output of recurrent operations.
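
For reference, the equations as given in Bahdanau et al., where a is the alignment model, s_{i-1} the previous decoder state, h_j the j-th encoder state, and c_i the resulting context vector:

\[
\begin{aligned}
e_{ij} &= a(s_{i-1}, h_j) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij}\, h_j
\end{aligned}
\]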




Visualisation of RNN with attention





Transformer

Finally, let's start with the Transformer

Throughout the discussion about Transformers, we will keep coming across three words, namely key, query, and value, so let's define them:

key: vectors representing all inputs
query: hidden states of the decoder
value: encoder hidden states
Attention is defined in the Attention Is All You Need paper as:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]
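
As a concrete illustration, a minimal NumPy sketch of this scaled dot-product attention (the shapes and names are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # (n_q, d_v) contexts

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # e.g. 2 decoder queries
K = rng.normal(size=(4, 8))   # 4 encoder keys
V = rng.normal(size=(4, 8))   # 4 encoder values
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)
```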




Transformer Visualized




The Illustrated Transformer

[Link]

