
High Performance Natural Language Processing
EMNLP 2020

1
Presenters

Gabriel Ilharco (University of Washington), Cesar Ilharco (Google), Iulia Turc (Google)
Tim Dettmers (University of Washington), Felipe Ferreira (Google), Kenton Lee (Google)
2
High Performance NLP
Slides available at: bit.ly/2SmhKY7

3
01 Introduction
Agenda
02 Fundamentals

03 Core Techniques

04 Efficient Attention

05 Case Studies

06 Scaling in Practice

07 Closing Notes
4
01

Introduction

5
Motivation & Applications

Why do we need it ?
SCALE

● NEWS

○ Real-time: the majority of content is consumed within a few hours after publication [1]
○ Thousands of news articles per second
○ 40-80 sentences per article
● SOCIAL NETWORKS: ~6 thousand tweets per second [2]

● THE WEB: Orders of magnitude bigger

What could we do, if we had it?

[1] Tatar, A., Antoniadis, P., Amorim, M.D.d. et al. From popularity prediction to ranking online news. Soc. Netw. Anal. Min. 4, 174 (2014). https://doi.org/10.1007/s13278-014-0174-8
[2] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html
6
Summarization

7
Summarization

8
Facts Extraction

Highest-ever drafted Black player in history.


9
Facts Extraction

● Noteworthy Facts

● Trendiness

VS

10
Sentence Entailment

11
Recent years in Natural
Language Processing

12
Benchmarks through the years - SQuAD 1.1

Human Performance
91.2

The Stanford Question Answering Dataset, https://rajpurkar.github.io/SQuAD-explorer/


13
Benchmarks through the years - SQuAD 2.0

Human Performance
89.5

The Stanford Question Answering Dataset, https://rajpurkar.github.io/SQuAD-explorer/


14
Benchmarks through the years - GLUE

Human Performance

87.1

The GLUE Benchmark (Wang et al., 2018)


15
A brief recent history of scale in NLP

[Chart: GPT (110M), BERT (340M)]

“[...] scaling to extreme model sizes also leads to large improvements [...]”
(Devlin et al., 2018)

16
A brief recent history of scale in NLP

[Chart: Transformer, GPT (110M), BERT (340M), ELMo (465M), GPT-2 (1.5B), Grover (1.5B), MegatronLM (8.3B), T5 (11B)]

“[...] scaling the model size to 11 billion parameters was the most important ingredient for achieving our best performance.”
(Raffel et al., 2019)

17
A brief recent history of scale in NLP

[Chart: GPT (110M), BERT (340M), GPT-2 (1.5B), T5 (11B), Turing-NLG (17B), GPT-3 (175B), GShard (600B), DeepSpeed (1T)]

18
Scaling Laws

Henighan et al., 2020


19
The drawbacks of naive scaling

1) Disconnect with production systems

- Latency
- Hardware constraints
- Energy costs

Memory CPU/GPU/TPU Storage Battery

20
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

- Hardware

- 2048 TPU v3 accelerators (GShard, Lepikhin et al., 2020)

- 285,000 CPU cores, 10,000 GPUs (GPT-3, Brown et al., 2020)

- Financial

- GPT-3 training cost is estimated at 4.6 million dollars.

21
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

3) Accessibility

- Ever-larger hardware and financial requirements impose great barriers to many researchers and institutions

- This can have a serious impact on our research community

- For instance, 62% of PhD students have access to 4 or fewer GPUs, according to a recent poll.

22
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

3) Accessibility

Altogether, this is especially relevant to a field that scaled by


3 orders of magnitude in 2 years.

23
We should strive for
efficiency

24
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

Source: unsplash.com

25
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

- Quantization

Source: unsplash.com

26
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

- Quantization

- Pruning

Source: unsplash.com

27
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

28
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

29
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

- Kernels and Alternative Attention Mechanisms

30
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

- Alternative Attention Mechanisms

- Recurrence

31
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

- Efficient Language Models

32
Source: unsplash.com
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

- Efficient Language Models

- Retrieval

Source: unsplash.com
33
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models

34
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models


- Parallelism Techniques

Source: Microsoft Blog Post

35
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models

- Parallelism Techniques
- Methods to Reduce Memory Footprint

[Figure: CPU <-> GPU memory swapping, with the active layer swapped in to the GPU]

36
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models


- Parallelism Techniques
- Methods to Reduce Memory Footprint
- Mixture of Experts

37
02

Fundamentals

38
Sequence-to-sequence models

outputs

piles of sequential,
differentiable tensor
operations

inputs

https://xkcd.com/1838/
39
Sequence-to-sequence models

Machine Translation:   inputs: Hello, world            outputs: Olá, mundo
Sentiment Analysis:    inputs: Amazing movie, 10/10!   outputs: ★★★★★
Language modeling:     inputs: -                       outputs: The quick brown fox...
Speech recognition:    inputs: [audio]                 outputs: Hello, world

[Figure: piles of sequential, differentiable tensor operations map inputs to outputs]

40
RNNs

RNNs allow computations over sequences of arbitrary length

[Figure: a chain of RNN cells (tanh) forms the pile of sequential, differentiable tensor operations mapping inputs to outputs]
41
Encoders and
Decoders

RNNs allow computations over sequences of arbitrary length

[Figure: an Encoder chain of tanh RNN cells reads the inputs; a Decoder chain of tanh RNN cells produces the outputs]
42
The encoder-decoder bottleneck

[Figure: an English RNN encoder reads "The agreement on the European Economic Area was signed in August 1992 . <EOS>" and passes a single hidden state (an information bottleneck) to a French RNN decoder producing "l’ accord sur la zone économique européenne a été signé en août 1992 . <EOS>"]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

43
Attention

[Figure: when generating "été", an attention head over the English encoder states lets the French decoder attend to the relevant words of "The agreement on the European Economic Area was signed in August 1992 . <EOS>"]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

44
Attention
A summary of values, based on how similar their corresponding keys are with the query

Bahdanau et al. Neural machine translation by jointly learning to align and translate. 2014

Thang Luong et al. Effective approaches to attention-based neural machine translation. 2015

[Figure: attention head]

45
Dot product
attention

Thang Luong et al.


Effective approaches to
attention-based neural machine
translation. 2015

...

Attention head

46
Attention
mechanisms

[Figure: the attention matrix between the English sentence "Are you going to the hotel ?" and the Spanish sentence "¿ Vas al hotel ?"]

47
Transformers

48
Motivation

[Figure: the RNN encoder-decoder with an attention head from the previous example]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

49
Motivation

[Figure: recurrent (tanh) cells must process the sequence sequentially, one position after another, while attention processes all positions in parallel]

50
Scaled Dot-Product Attention

A summary of values, based on how similar their corresponding keys are with the query

Queries, keys and values

51
Scaled
Dot-Product
Attention

Queries, keys and values

For some similarity


function

52
Scaled Dot-Product Attention

Using dot-product similarity, we can vectorize nicely

= feature dim; normalization by its square root is for numerical stability

53
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

54
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

55
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

56
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

57
Scaled Dot-Product Attention

quadratic in sequence length!

Let’s dive into the dimensions (batch omitted for simplicity)

= sequence length
= feature dim
58
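As a concrete reference for the computation above, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V; the shapes and variable names are illustrative, and the batch dimension is omitted as in the slides.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [n, d] with n = sequence length, d = feature dim
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # [n, n]: the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # [n, d]

# Tiny example: 4 tokens with 8-dimensional features
n, d = 4, 8
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)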
Multi-head
attention
Scaled Dot-Product
Attention

= sequence length
= feature dim
= # of attention heads
linear linear linear

59
Multi-head
attention
Scaled Dot-Product
Attention

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

60
Multi-head
attention
Scaled Dot-Product
Attention each head:

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

61
Multi-head
attention
linear

Scaled Dot-Product
Attention each head:

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

62
Multi-head attention

[Figure: queries, keys and values each pass through a linear layer into several scaled dot-product attention heads, whose outputs are concatenated and passed through a final linear layer]

bottleneck is quadratic in sequence length due to QK^T!

= sequence length
= feature dim
= # of attention heads

63
Positional encodings

So far, attention has been a set operation.

Let’s add positional information!

[Figure: positional encodings are added to the inputs before the tanh/attention block]

64
Positional encodings

So far, attention has been a set operation. Let’s add positional information!

These can be either learned or fixed.

Fixed: for a position in the sequence and an index in the feature space

[Figure: sinusoidal positional encodings visualized over position and depth]

65
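A minimal NumPy sketch of the fixed (sinusoidal) encodings, following the standard formulation of Vaswani et al., 2017; max_len and d_model are illustrative parameters.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # pos: position in the sequence; i: index in the feature space (depth)
    pos = np.arange(max_len)[:, None]                    # [max_len, 1]
    i = np.arange(d_model)[None, :]                      # [1, d_model]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even feature indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd feature indices: cosine
    return pe                                            # added to the input embeddings

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)   # shape (128, 64)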
The transformer encoder

add & norm

dense

add & norm xN

multi-head
attention

positional
encoding

input
embedding Ba et al, 2016.

Vaswani et al., 2017

66
The transformer decoder

add & norm

dense

add & norm

multi-head
attention

K and V xN
from encoder

add & norm

masked
multi-head
attention
prevent model from
peeking at the
future by masking
positional
encoding attention weights
Ba et al, 2016.

Vaswani et al., 2017 input


embedding
67
Putting it all together

add & norm

dense

add & norm

multi-head
attention
add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

Vaswani et al., 2017

68
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

https://ai.facebook.com/blog/using-neural-networks-to-solve-advanced-mathematics-equations/
69
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

https://magenta.tensorflow.org/music-transformer
70
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

71
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

- Vision and Language (e.g. Tan et al., 2019, Lu et al., 2019, Chen et al., 2020)

Visual Question Answering


(Agrawal et al., 2015)
72
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

- Vision and Language (e.g. Tan et al., 2019, Lu et al., 2019, Chen et al., 2020)

- Computer Vision (e.g. Ramachandran et al., 2019, Dosovitskiy et al., 2020)

73
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

74
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus
(e.g. web crawled data)
- Typically unsupervised
(e.g. masked language
modeling)
- Usually runs in GPUs or
TPUs

75
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs in GPUs or TPUs

Fine-tuning
- Smaller corpus
- Typically supervised (e.g. question answering, natural language inference)
- Usually runs in GPUs or TPUs

76
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs in GPUs or TPUs

Fine-tuning
- Smaller corpus
- Typically supervised (e.g. question answering, natural language inference)
- Usually runs in GPUs or TPUs

Production
- Inference
- Usually runs in CPUs, sometimes in mobile devices

77
03

Core Techniques

78
Knowledge
Distillation

Source: unsplash.com

79
Knowledge
Distillation

Hinton et al., 2015
Distilling the Knowledge in a Neural Network

[Figure: the Teacher produces soft labels (x, y=0.8) for inputs x from the Data; the Student is trained on these soft labels together with the hard labels (x, y=1.0)]

80
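A minimal PyTorch sketch of the distillation objective from Hinton et al., 2015: the student is trained to match the teacher's softened predictions in addition to the hard labels. The temperature T and mixing weight alpha are illustrative hyperparameters, and the logits are assumed to come from teacher and student models defined elsewhere.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between the softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Tiny example with random logits for a 10-class problem
s, t = torch.randn(16, 10), torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
loss = distillation_loss(s, t, y)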
Knowledge Distillation for Pre-training

Sanh et al., 2019
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[Figure: the Teacher and Student both read masked inputs ("how [MASK] you") from the Data; the Student is trained to match the Teacher's predictions ("are", "do", "well", ...)]

81
Knowledge Distillation for Pre-training

Sun et al., 2019
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

[Figure: besides matching the Teacher's masked-token predictions, the Student also mimics the Teacher via feature map transfer and attention transfer]

82
Knowledge Distillation for Fine-Tuning

Turc et al., 2019
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

1. Regular pre-training of the Student (masked language modeling, e.g. "how [MASK] you", on unlabeled Data)
2. Fine-tuning via distillation: the Student learns from the Teacher's soft labels (x, y=0.8)
3. (Optional) regular fine-tuning of the Student on the hard labels (x, y=1.0)
83
Knowledge Distillation for Pre-training and Fine-tuning

Jiao et al., 2019
TinyBERT: Distilling BERT for Natural Language Understanding

1. Pre-training via distillation: the Student mimics the Teacher on masked inputs ("how [MASK] you"), with per-layer transfer and embeddings transfer
2. Fine-tuning via distillation: the Teacher's soft labels (x, y=0.8), per-layer transfer and embeddings transfer guide the Student on the labeled data (x, y=1)
84
Quantization

Source: unsplash.com

85
Quantization Definition

Q(z) = q_j for z ∈ (t_j, t_{j+1}], j = 0, …, 2^k - 1

z: real-valued tensor (activation or weight); Q: quantization operator; k: quantization precision (bits)

86
Quantization Definition

Q(z) = q_j for z ∈ (t_j, t_{j+1}], j = 0, …, 2^k - 1

z: real-valued tensor (activation or weight); Q: quantization operator; k: quantization precision (bits)

Linear Quantization

z = S (q_j - Z)

S: scaling factor; Z: zero point
87
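A minimal NumPy sketch of linear quantization with the scaling factor S and zero point Z defined above; the min/max calibration used here is just one simple way to pick S and Z.

import numpy as np

def linear_quantize(z, k=8):
    # Map real values z to integers q in [0, 2^k - 1] such that z ≈ S * (q - Z)
    qmax = 2 ** k - 1
    S = (z.max() - z.min()) / qmax                 # scaling factor
    Z = int(round(-z.min() / S))                   # zero point
    q = np.clip(np.round(z / S + Z), 0, qmax).astype(np.int32)
    return q, S, Z

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

w = np.random.randn(4, 4).astype(np.float32)       # a real-valued weight tensor
q, S, Z = linear_quantize(w, k=8)
w_hat = dequantize(q, S, Z)                        # approximate reconstruction of w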
Quantization: Quantization-Aware Training

Forward pass on the quantized weights ŵ; backward pass (and weight updates) on the real-valued weights w

Jacob et al., 2017
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

88
Quantization

● Q8BERT (Zafrir et al., 2019, Q8BERT: Quantized 8Bit BERT): symmetric linear quantization:
Q(z) = clamp(⌊z ✕ S_z⌉, -127, +127), where S_z is a statistic computed during or post-training.
● Q-BERT (Shen et al., 2019, Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT): uniform quantization to {0, …, 2^k - 1} with:
○ mixed precision (higher Hessian spectrum => higher precision for the layer)
○ group precision (each matrix W_k, W_q, W_v, W_o is its own group)

89
Quantization
with Distillation

Zhang et al., 2020


TernaryBERT: Distillation-aware Ultra-low Bit
BERT

90
Pruning

Source: unsplash.com

91
Pruning Definition

Pruning removes “unimportant” weights from a network:

a = (W ⊙ M) x

a: activation; W: model weight; M: pruning mask; x: input

Main Questions (Hassibi and Stork)

● Which weights should be eliminated?
● How should the remaining weights be adjusted?
● How can such network pruning be done in an efficient way?

92
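A minimal NumPy sketch of unstructured magnitude pruning: a binary mask M keeps only the largest-magnitude entries of W, and the activation is computed as a = (W ⊙ M) x. The 90% sparsity level is illustrative.

import numpy as np

def magnitude_prune_mask(W, sparsity=0.9):
    # Keep the (1 - sparsity) fraction of weights with the largest magnitude
    keep = int(np.ceil(W.size * (1.0 - sparsity)))
    threshold = np.sort(np.abs(W), axis=None)[-keep]
    return (np.abs(W) >= threshold).astype(W.dtype)       # the pruning mask M

W = np.random.randn(256, 256).astype(np.float32)          # model weight
x = np.random.randn(256).astype(np.float32)               # input
M = magnitude_prune_mask(W, sparsity=0.9)
a = (W * M) @ x                                           # activation with pruned weights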
Pruning, Early Work: Pruning based on second-order derivatives

LeCun et al., 1990 (OBD: Optimal Brain Damage)
Hassibi and Stork, 1993 (OBS: Second order derivatives for network pruning: Optimal Brain Surgeon)

Main idea:
● Start with a “reasonably large” network
● Train it to convergence
● Prune in multiple iterations, based on second-order derivatives:
○ OBD: prune and train
○ OBS: prune and update weights based on second-order statistics

93
Pruning, Early Work: Pruning based on second-order derivatives

LeCun et al., 1990 (OBD: Optimal Brain Damage)
Hassibi and Stork, 1993 (OBS: Second order derivatives for network pruning: Optimal Brain Surgeon)

Main idea:
● Start with a “reasonably large” network
● Train it to convergence
● Prune in multiple iterations, based on second-order derivatives:
○ OBD: prune and train
○ OBS: prune and update weights based on second-order statistics

Why do we not train this smaller architecture instead?

94
Pruning
The LTH
Searching for Tickets: One-Shot Magnitude Pruning

Frankle and Carbin, 2018


The Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 95
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning

Frankle and Carbin, 2018


The Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 96
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning with Rewinding

Frankle et al, 2019


Stabilizing the Lottery Ticket Hypothesis

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 97
Pruning
The LTH, ctd

Brix et al., 2020


Successfully Applying the Stabilized Lottery
Ticket Hypothesis to the Transformer
Architecture

MP = Magnitude Pruning
LT = Lottery Ticket
SLT = Stabilized Lottery Ticket
CLT = Constant Lottery Ticket

98
Pruning: Movement Pruning

Sanh et al., 2020
Movement Pruning: Adaptive Sparsity by Fine-Tuning

● First-order strategy: “instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process”
● The pruning mask M is learnt together with the model parameters.
○ hard version: M = Top_v(S), where the score S is learnt and v is a hyperparameter.
○ soft version: M = (S > τ), where the score S is learnt and the threshold τ is a hyperparameter.

99
Pruning & Hardware

Hooker, 2020
The Hardware Lottery

On standard hardware:

              Unstructured Pruning    Structured Pruning
Storage       ✅                       ✅
Inference     ❌                       ✅
Flexibility   ✅                       ❌

● “In many ways, hardware is catching up to the present state of ML research.”
● There is research for specialized software kernels to support unstructured sparsity (see paper for references).

100
04

Efficient Attention

101
Recap
add & norm

The Transformer architecture dense

add & norm

multi-head
attention
add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

102
Recap
add & norm

The Transformer architecture dense

add & norm

Quadratic bottleneck in sequence multi-head


attention
length due to multi-head attention add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

103
Recap
add & norm

The Transformer architecture dense

add & norm

Quadratic bottleneck in sequence multi-head


attention
length due to multi-head attention add & norm
xN
dense

This poses a serious problem when add & norm add & norm
large sequences are required, e.g.:
xN
masked
multi-head
multi-head
● Long-range dependencies attention
attention
● Character-level models
● Speech processing
● High-resolution image
processing

104
Efficient Attention

In the past months, there has been much progress in making self-attention more efficient

time

Sparse Transformer Routing Transformer Linformer Big Bird


(Child et al., 2019) (Roy et al, 2020) (Wang et al., 2020) (Zaheer et al., 2020)

Reformer Performer Linear Transformer


(Kitaev et al., 2020) (Choromanski et al., 2020) (Katharopoulos et al., 2020)

105
Efficient Attention

In the past months, there has been much progress in making self-attention more efficient

time

Sparse Transformer Routing Transformer Linformer Big Bird


(Child et al., 2019) (Roy et al, 2020) (Wang et al., 2020) (Zaheer et al., 2020)

Reformer Performer Linear Transformer


(Kitaev et al., 2020) (Choromanski et al., 2020) (Katharopoulos et al., 2020)

We are going to cover some ideas that make this possible

106
Beyond a Dense Attention Matrix

Goal: approximate the computation of attention via more efficient operations

[Figure: a dense Queries x Keys attention matrix]

107
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns

○ Blockwise Transformer (Qiu et al., 2019)


○ Sparse Transformer (Child et al., 2019)
○ Longformer (Beltagy et al., 2020)
○ Big Bird (Zaheer et al., 2020)

Taxonomy inspired by Tay et al., 2020


108
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns

○ Linformer (Wang et al., 2020)


○ Reformer (Kitaev et al., 2020)
○ Routing Transformer (Roy et al., 2020)
○ Clustered Attention (Vyas et al., 2020)
○ Sinkhorn Transformer (Tay et al., 2020)
Taxonomy inspired by Tay et al., 2020
109
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns
● Kernels and Alternative Attention Mechanisms

○ Linear Transformer (Katharopoulos et al., 2020)


○ Random Feature Attention (Anonymous, 2020)
○ Performer (Choromanski et al., 2020)
○ Synthesizer (Tay et al., 2020)

Taxonomy inspired by Tay et al., 2020


110
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns
● Alternative Attention Mechanisms
● Recurrence

○ Transformer XL (Dai et al., 2019)


○ Compressive Transformers
(Rae et al., 2019)

Taxonomy inspired by Tay et al., 2020


111
Data-Independent Patterns

112
Data-Independent Patterns
Keys

Blockwise Patterns Queries

Divide sequence into local blocks and


restrict attention within them

Examples:

Blockwise Transformer (Qiu et al., 2019)

Local Attention (Parmar et al., 2018)

113
Data-Independent Patterns
Keys

Strided Patterns Queries

Skip some query/key pairs.

Quadratic in sequence length / stride

Examples:

Sparse Transformer (Child et al., 2019)

Longformer (Beltagy et al, 2020)

114
Data-Independent Patterns
Keys

Diagonal Patterns Queries

Compute attention over the diagonal.

Linear in sequence length and window


size.

Examples:

Longformer (Beltagy et al, 2020)

Big Bird (Zaheer et al., 2020)

115
Data-Independent Patterns
Keys

Random Patterns Queries

Compute attention over random query/key


pairs.

Linear in number of points.

Examples:

Big Bird (Zaheer et al., 2020)

116
Data-Independent Patterns
Keys

Global Attention Queries

Applied to one or a few special tokens,


often prepended to the sequence.

Usually combined with other patterns

Examples:

Big Bird (Zaheer et al., 2020)

Longformer (Beltagy et al., 2020)

ETC (Ainslie et al., 2020)

117
Data-Independent Patterns
Keys

Combination of Patterns Queries

Combine multiple patterns

(e.g. Global + Diagonal + Random)

Examples:

Big Bird (Zaheer et al., 2020)

Longformer (Beltagy et al., 2020)

118
Data-Dependent Patterns

119
Data-Dependent Patterns

Buckets
Keys
Create buckets/clusters and compute
Queries
attention within.

Ideally, buckets should contain the


highests attention weights in the matrix

Examples:

Reformer (Kitaev et al., 2020)

Routing Transformer (Roy et al., 2020)


Attention
head

120
Data-Dependent Patterns

Buckets: Hashing

Locality-Sensitive Hashing (LSH)


Key idea: take a random projection

matrix , compute hash for a

vector through:

Examples:

Reformer (Kitaev et al., 2020)

121
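A minimal NumPy sketch of the LSH bucketing scheme used by the Reformer, hash(x) = argmax([xR ; -xR]) for a random projection matrix R; the number of buckets and the shapes here are illustrative. Positions that fall into the same bucket then attend to each other.

import numpy as np

def lsh_bucket(x, n_buckets, seed=0):
    # x: [n, d] query/key vectors; n_buckets must be even
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))   # random projection matrix
    proj = x @ R                                             # [n, n_buckets // 2]
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

x = np.random.randn(16, 64)
buckets = lsh_bucket(x, n_buckets=8)    # one bucket id per position; similar vectors tend to collide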
Data-Dependent Patterns

Buckets: Clustering

E.g. online k-means

Examples:

Routing Transformer (Roy et al., 2020)

Clustered Attention (Vyas et al., 2020)

122
Data-Dependent Patterns

Sorting and blocking

E.g. Sparse Sinkhorn Attention Queries


Key ideas:

● A differentiable sorting network


that learns to rearrange blocked Sorted keys
inputs, using the Sinkhorn
balancing mechanism to create a
permutation matrix
● Attention is computed only on local
neighborhoods (before and after
sorting)
Keys
Examples:

Sinkhorn Transformer (Tay et al., 2020)

123
Data-Dependent Patterns
Keys

Compression
Compressed Keys

E.g. pooling, strided convolution, low-rank


projections with learnable weights
Queries

Examples:

Compressed Attention (Liu et al., 2018)

Linformer (Wang et al., 2020)

Synthesizers (Tay et al., 2020)

124
Kernels and Alternative
Attention Mechanisms

125
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

Standard transformers use dot


product attention:

Attention head

126
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

Attention head

127
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

128
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

129
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

Independent of query!

130
Kernels and Alternative Attention Mechanisms

Kernels

This allows us to compute attention in


linear time!
In Katharopoulos et al., 2020:

Independent of query!

131
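A minimal NumPy sketch of this linearized attention with the feature map φ(x) = elu(x) + 1 used by Katharopoulos et al., 2020; shapes are illustrative, and batching and causal masking are omitted.

import numpy as np

def phi(x):
    # Positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: [n, d], V: [n, d_v]; cost is linear in the sequence length n
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # [d, d_v], independent of the query
    Z = Kp.sum(axis=0)                 # [d], the normalizer, also query-independent
    return (Qp @ KV) / (Qp @ Z)[:, None]

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)        # shape (1024, 64); no n x n attention matrix is ever formed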
Kernels and Alternative Attention Mechanisms

Kernels
Random Feature Attention (Anonymous, 2020)

Random features can be used to generate an


unbiased estimation of the standard softmax
function

Independent of query!

132
Kernels and Alternative Attention Mechanisms

Performer: Generalized Attention and FAVOR (Choromanski et al., 2020)

Rethink attention as , parametrized by a kernel

and functions and

This work presents an unbiased, low-variance approximation of attention via random feature map
decompositions, with linear time and space complexity.

133
Kernels and Alternative Attention Mechanisms

Synthesizers (Tay et al., 2020)

Are token-to-token interactions really


necessary?

Random attention matrices are surprisingly competitive!

Low-rank alternatives can be used

134
Recurrence

135
Recurrence

Transformer-XL (Dai et al., 2019)

How can models process long sequences


under limited hardware constraints?

A naive approach is to split the sequence


into multiple smaller ones and process
them separately

Current
segment

136
Recurrence

Transformer-XL (Dai et al., 2019)

A better way is to add a segment-level


recurrence mechanism

Representations from the previous segment


are cached and re-used (no gradients
flowing at training)

This increases receptive field proportionally


to the depth of the transformer

Previous Current
segment (fixed) segment

137
Recurrence

compression
Compressive Transformers
(Rae et al., 2019)

Dual memory system:

- Primary mem. contains activations


from previous segment

- Secondary mem. stores compressed activations from all previous segments

Secondary Primary Current


(compressed) memory segment
memory

138
Recurrence

compression
Compressive Transformers
(Rae et al., 2019)

When a new segment comes:

- Primary memory is updated with


activations from previous segment

- Secondary memory is updated with


the activations from the primary
memory, where a compression
function is applied (e.g. pooling,
convolutions, most used)
Secondary Primary Current
(compressed) memory segment
memory

139
Overview

140
Benchmarking

How do these models compare in practice?

The Long-Range Arena: a benchmark for


efficient transformers

141
Benchmarking

How do these models compare in practice? List operations example:

The Long-Range Arena: a benchmark for


efficient transformers

Longer sequences: 1K-16K

5 tasks:

- List operations (e.g. max, min, median)


- Byte-level text classification
- Byte-level document retrieval
- Image classification
- Long-range spatial dependency

142
Benchmarking

How do these models compare in practice? List operations example:

The Long-Range Arena: a benchmark for


efficient transformers
Long-range spatial dependency example:
Longer sequences: 1K-16K

5 tasks:

- List operations (e.g. max, min, median)


- Byte-level text classification
- Byte-level document retrieval
- Image classification
- Long-range spatial dependency

(positive) (negative)

143
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

144
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

145
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

146
Benchmarking

The Long-Range Arena: a benchmark for


efficient transformers

Putting it all together (size of circles


corresponds to memory footprint)

Note: these results might be sensitive to


implementation details, hardware and
hyper-parameters.

147
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

148
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

149
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

Future improvements in hardware, e.g. on the efficiency of sparse computations, may make these ideas
even more appealing in the long run (Hooker, 2020)

150
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

Future improvements in hardware, e.g. on the efficiency of sparse computations, may make these ideas
even more appealing in the long run (Hooker, 2020)

The ideas presented in this section are often orthogonal to each other and to other efforts presented in
this tutorial, and can be combined for more efficient models.

151
05

Case Studies

152
Efficient
Language models

153
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

Source: unsplash.com

154
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


NAS
- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer


(Neural Architecture
Search)

155
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

Source: unsplash.com

156
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

Source: unsplash.com

Source: unsplash.com

157
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

Source: unsplash.com

158
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

Source: unsplash.com

159
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


NAS
- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer


(Neural Architecture
Search)
- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

- Hardware-Aware Transformers

160
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

- Natural Language Processing with Small Feed-Forward Networks
- The Evolved Transformer
- PRADO + pQRNN
- MobileBERT
- Lite Transformer with Long-Short Range Attention
- MicroNet for Efficient Language Modeling
- Hardware-Aware Transformers
- SqueezeBERT

[Figure: PointwiseFullyConnected (more expensive operation) is equivalent to Convolution, implemented as GroupedConvolution (more efficient operation)]

161
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

- Hardware-Aware Transformers

- SqueezeBERT

- DeLighT: Very Deep and Light-weight Transformer


162
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

● Useful accuracies on a variety of tasks
● Great runtime and memory value in resource-constrained environments
● Features defined over character n-grams, embeddings learned from scratch
● Random feature mixing (hashing) for small feature vocabularies
● Quantization for embedding weight compression

163
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

[Figure: discrete features -> feature embedding matrix -> reshaping -> ReLU-activated fully connected layer -> softmax]

164
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

Example result: POS Tagging, compared to BTS (Gillick et al., 2016)
● +0.3% accuracy (95.4%, near state-of-the-art)
● 6x fewer parameters
● 36x fewer FLOPs

165
The Evolved Transformer

So et al., 2019
arxiv.org/abs/1901.11117

● Consistent improvement over Transformer on well-established WMT and LM1B.
● NAS to search Transformer alternatives
● Large search space from feed-forward sequence models
● Evolutionary architecture search

166
The Evolved
Transformer

So et al., 2019
arxiv.org/abs/1901.11117

167
The Evolved
Transformer

So et al., 2019
arxiv.org/abs/1901.11117

168
The Evolved Transformer

So et al., 2019
arxiv.org/abs/1901.11117

Same quality as the original “big” Transformer with 37.6% fewer parameters, and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of ~7M params.

169
PRADO + pQRNN

Kaliamoorthi et al., 2019-2020
www.aclweb.org/anthology/D19-1506/
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

PRADO: Projection Attention Networks for Document Classification On-Device
● Combines trainable projections with attention and convolutions
● With only 200 kilobytes in size, outperformed prior CNN and LSTM models and achieved near state-of-the-art performance on multiple long document classification tasks.

170
PRADO +
pQRNN

Kaliamoorthi et al., 2019-2020


www.aclweb.org/anthology/D19-1506/

https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

171
PRADO + pQRNN

Kaliamoorthi et al., 2019-2020
www.aclweb.org/anthology/D19-1506/
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

pQRNN
● A projection layer with a quasi-RNN encoder
● Same projection layer used in PRADO
● pQRNN is also quantized

172
PRADO +
pQRNN

Kaliamoorthi et al., 2019-2020


www.aclweb.org/anthology/D19-1506/

https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

173
MobileBERT

Sun et al., 2020
arxiv.org/abs/2004.02984

● Designed for running on mobile phones with acceptable latency
● Inverted-Bottleneck BERT-LARGE teacher
● Distilled into a compact MobileBERT student
● As deep as BERT-LARGE, but narrower
● Task-agnostic compression (task-specific fine-tuning performed directly on the compact model)

174
MobileBERT

Sun et al., 2020


arxiv.org/abs/2004.02984

175
MobileBERT

Sun et al., 2020


arxiv.org/abs/2004.02984

176
MobileBERT

Sun et al., 2020
arxiv.org/abs/2004.02984

● 4.3x smaller, 5.5x faster than BERT-BASE
● 77.7 GLUE score ~ BERT-BASE
● 90.0/79.2 SQuAD v1.1/v2.0 F1 ~ BERT-BASE
● 62 ms latency on a Pixel 4 phone

177
Lite Transformer with Long-Short Range Attention

Wu et al., 2020
arxiv.org/abs/2004.11886

● Long-Short Range Attention (LSRA)
○ Local context modeling by convolution
○ Long-distance modeling by attention
● 2.5× reduced computation vs Transformer base
● 18.2× smaller with pruning and quantization
● 0.5 higher BLEU compared to the Evolved Transformer, without the 250 GPU-year NAS cost.

178
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

NeurIPS 2019 MicroNet Challenge [1]

Language modeling track: train efficient word-level language models on the Wikitext-103 dataset [2] (word-level perplexity < 35)

Score = Normalized Parameter Storage + Normalized Math Operations
(normalized by the LSTM of Rae et al., 2018, arxiv.org/abs/1803.10049)

[1] Gale et al., 2019, micronet-challenge.github.io
[2] Merity et al., 2016, arxiv.org/abs/1609.07843

179
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

● Core Language Model
○ Transformer-XL
○ Short Context Group Joint Optimization
○ Adaptive Embedding and Softmax
○ Hebbian Updates
● Compression Techniques
○ Knowledge Distillation
○ Pruning
○ Quantization

180
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

● 90-fold reduction in parameter size and a 36-fold reduction in math operations compared to the MicroNet baseline

181
Hardware-Aware Transformers

Wang et al., 2020
arxiv.org/abs/2005.14187

● Neural Architecture Search
○ Train a SuperTransformer to cover a large space
○ Evolutionary search with a hardware latency constraint to find a specialized SubTransformer
● Speedup and smaller size over the baseline Transformer, and low search cost

182
Hardware-Aware
Transformers

Wang et al., 2020


arxiv.org/abs/2005.14187

183
Hardware-Aware Transformers: SubTransformer search

Wang et al., 2020
arxiv.org/abs/2005.14187

● Evolutionary search
● Find a satisfactory SubTransformer given a latency requirement
● Latency predictor trained for offline latency estimation (fast and accurate)

184
Hardware-Aware Transformers

Wang et al., 2020
arxiv.org/abs/2005.14187

WMT’14 results on Raspberry Pi-4:
● 3× speedup, 3.7× smaller size over the baseline Transformer
● 2.7× speedup, 3.6× smaller size over the Evolved Transformer, with 12,041× less search cost

185
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Replace several operations in self-attention layers with grouped convolutions
● Much faster inference on mobile devices

186
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Previous takeaways from CV into NLP (already adopted in MobileBERT)
○ Bottleneck layers
○ High-information flow residual connections
● New contributions from CV incorporated into SqueezeBERT’s self-attention
○ Convolutions
○ Grouped convolutions

187
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Results
○ 4.3x faster than BERT-base (while MobileBERT is reported as 3.0x faster than BERT-base) on a Pixel 3 phone.
○ GLUE score 76.9 (vs 79.0 for BERT-base)

188
DeLighT: Very Deep and Light-weight Transformer

Mehta et al., 2020
arxiv.org/abs/2008.00623

● More efficient parameter allocation within and across Transformer blocks
● Similar performance with substantially fewer parameters compared to baseline transformers.

189
DeLighT: Very
Deep and
Light-weight
Transformer

Mehta et al., 2020


arxiv.org/abs/2008.00623

190
DeLighT: Very
Deep and
Light-weight
Transformer

Mehta et al., 2020


arxiv.org/abs/2008.00623

191
Retrieval

192
Towards more efficient NLP

1) Core techniques
prediction
2) Efficient attention

3) Case studies .

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks
Encoder Encoder

193
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks

- Generalization through Memorization: Nearest Neighbor Language Models

Source: unsplash.com

194
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks

- Generalization through Memorization: Nearest Neighbor Language Models

- REALM: Retrieval-Augmented Language Model Pre-Training

Source: unsplash.com

195
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

● Cross-attention, single-tower models such as BERT have set state-of-the-art results on sentence-pair tasks such as STS.
● For sentence-retrieval tasks, a cross-attention model requires expensive re-encoding of the entire retrieval corpus.
● Sentence-BERT modifies the pretrained encoder to perform a single inference per input sentence, followed by cheap pairwise comparisons, e.g. cosine similarity.

196
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

[Figure: Cross-attentional (single tower): one Encoder over "[CLS] S_A [SEP] S_B [SEP]" produces the prediction. Dual-encoder (two tower): separate Encoders over "[CLS] S_A [SEP]" and "[CLS] S_B [SEP]" feed a regression layer that produces the prediction.]
197
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

● Finding the most similar sentence in a collection of 10,000 sentences on a V100 GPU
○ BERT (cross-attention): 65 hours
○ SBERT (dual encoder): 5 seconds
● Can also be combined with Maximum Inner Product Search tools for sublinear scaling
○ https://github.com/google-research/google-research/tree/master/scann
○ https://github.com/facebookresearch/faiss
○ https://github.com/spotify/annoy

198
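A minimal sketch of the dual-encoder retrieval pattern: the corpus is encoded once offline, and each query only needs one encoder call plus cheap cosine similarities. The encode function below is a stand-in (random vectors) for a real sentence encoder such as SBERT.

import numpy as np

def encode(sentences, dim=384, seed=0):
    # Placeholder for a sentence encoder (e.g. a pooled transformer); returns [n, dim]
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(sentences), dim))

corpus = ["sentence %d" % i for i in range(10000)]
corpus_emb = encode(corpus)                                     # computed once, offline
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

query_emb = encode(["my query"])[0]
query_emb /= np.linalg.norm(query_emb)
scores = corpus_emb @ query_emb                                 # cosine similarities
top5 = np.argsort(-scores)[:5]                                  # indices of the most similar sentences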
Generalization through Memorization: Nearest Neighbor Language Models

Khandelwal et al., 2019
arxiv.org/abs/1911.00172

● Introduces kNN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a k-nearest neighbors (kNN) model.
● Allows for efficiently scaling up to larger training sets and for effective domain adaptation

199
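A minimal NumPy sketch of the kNN-LM interpolation: the next-token distribution is a mixture of the base LM distribution and a distribution built from the retrieved neighbors (a softmax over negative distances, aggregated per target token). The weight lam and the inputs here are illustrative.

import numpy as np

def knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, lam=0.25):
    # p_lm: [V] next-token distribution from the base LM
    # neighbor_tokens / neighbor_dists: target tokens and distances of the retrieved neighbors
    weights = np.exp(-np.asarray(neighbor_dists, dtype=np.float64))
    weights /= weights.sum()                         # softmax over negative distances
    p_knn = np.zeros_like(p_lm)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w                              # aggregate probability mass per target token
    return lam * p_knn + (1 - lam) * p_lm

vocab = 50000
p_lm = np.full(vocab, 1.0 / vocab)                   # a dummy base-LM distribution
p = knn_lm_interpolate(p_lm, neighbor_tokens=[42, 42, 7], neighbor_dists=[0.1, 0.3, 0.9])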
200
201
202
Generalization
through
Memorization:
Nearest
Neighbor
Language
Models

Khandelwal et al., 2019


arxiv.org/abs/1911.00172

203
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● Language model pre-training can capture world knowledge by storing it implicitly in the network parameters, but storage space is limited by the network size (prompting ever-larger networks).
● REALM introduces a latent knowledge retriever to augment the language model, and shows for the first time how to pretrain it in an unsupervised manner.

204
REALM:
Retrieval-Augmented
Language Model
Pre-Training

Guu et al., 2020


arxiv.org/abs/2002.08909

205
REALM: ● Fine-tuning for open-domain question answering

Retrieval-Augmented
Language Model
Pre-Training

Guu et al., 2020


arxiv.org/abs/2002.08909

206
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11b while being 30 times smaller)

207
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11b while being 30 times smaller)

208
06

Scaling in Practice

209
Why Do We Need Scale?

210
Scale More Important Than Architecture

Kaplan et al., 2020: arXiv

211
Attention Size vs Model Size vs Test Loss

Kaplan et al., 2020: arXiv

212
Attention vs Fully Connected Time for Various Transformers

213
Conclusions From Measuring Scaling

● Performance increases further and further the more parameters a model has
● Attention is very important for efficiency: Transformers scale better than LSTMs
● Attention has diminishing returns (on general “internet data”)
● Size of data and model are more important than architecture

214
Practical Considerations

215
Experimental vs Theoretical Perspective

Theoretical:

● FLOPS/Operation Complexity/Memory: O(n) better than O(n^2); 100 FLOPS better than 1000
● (Possibly) Analysis of occupancy, memory access patterns for certain hardware

Experimental:

● Three criteria:
a. Does it fit into my GPU/TPU/Accelerator?
b. Is it faster than other methods?
c. Can most people use it (+62% of PhD students)?
● Device oriented walltime/memory: CPU for inference, GPU/TPU for training

216
Theory vs Practice

● Algorithm: (1) Divide matrix B into chunks of 128; (2) take the maximum element, set others to zero; (3) perform matrix multiply A*B=C and skip all zero elements

217
Theory vs Practice

Tan & Le, 2019: arXiv


218
GPU Architecture

Ampere Architecture (NVIDIA)

219
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for matrix multiplication vs convolution]

220
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for sparse matrix multiplication vs depthwise convolution]

221
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for a custom depthwise convolution implementation vs the standard depthwise convolution]

222
BERT Large vs BERT Base

BERT Base is 3.1x smaller than BERT Large but only trains 1.5x faster.

BERT Base is too small to saturate modern GPUs.

223
Better Performance at Lower Occupancy

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for standard matrix multiplication vs matrix multiplication with lower occupancy and higher instruction-level parallelism]

Volkov, 2010
224
Conclusion

● Occupancy, and FLOPS/memory bandwidth utilization are important for runtime performance
● Understanding of hardware needed for performance analysis
● Even with deep understanding of hardware, it is difficult to analyze performance theoretically
● Runtime performance of different algorithms can often only be understood if they are run on
the actual device

● Conclusion: To estimate deep neural network runtime performance, it is best to run the network
and measure its performance directly.

225
Memory Optimizations

226
Resources: Academia vs Industry

227
Memory Optimizations Overview

● Memory Swapping/Memory Paging


● FP16/BF16 training
● Gradient checkpointing
● Gradient accumulation
● Reversible residual connections

228
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


229
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


230
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


231
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
232
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
233
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
234
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

● Benefits:
○ 60-80% memory reduction
○ Network usually not slower. If it is slower, swap-int layers earlier (less memory
reduction)
○ Faster training due to larger batch size for very large models

235
Mixed Precision Training (FP16+FP32) / BF16 training

Mixed Precision Training:

● Keep 32-bit master weights


● Do forward pass with 16-bit
● Scale 16-bit loss to prevent under/overflow
● Compute gradients
● Update 32-bit weights; copy 32-bit weights to 16-bit buffers

BrainFloat-16 Training:

● Range: FP16 ±65504; BF16 & FP32 ±3e38


● Cast everything to BF16
● Train normally (no under/overflow due to larger range)

Benefits:

● Faster training, depending on the network about 2x speedup
● Usually saves some memory, especially if your activations are large

Micikevicius et al., 2018: arXiv

236
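A minimal PyTorch sketch of mixed precision training with loss scaling via torch.cuda.amp, assuming a CUDA device is available; the toy model and loss are illustrative.

import torch

model = torch.nn.Linear(512, 512).cuda()          # assumes a CUDA device
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()              # maintains the loss-scale factor

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass runs in FP16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                 # scaled loss prevents FP16 gradient underflow
    scaler.step(optimizer)                        # unscales gradients, updates FP32 master weights
    scaler.update()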
Gradient Checkpointing: Forward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet

Active
Layer

Chen et al., 2016: arXiv


237
Gradient Checkpointing: Forward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet

Active
Layer

Chen et al., 2016: arXiv


238
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Missing gradients! Active


Recompute from last checkpoint Layer
with forward pass
Chen et al., 2016: arXiv
239
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Forward pass to compute Active


activation gradient Layer

Chen et al., 2016: arXiv


240
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Active
Layer

Chen et al., 2016: arXiv


241
Gradient Checkpointing

Benefits:

● Trade computation to reduce memory footprint


● Best used for functions that are cheap to recompute but produce large activations (ReLU,
layer norm, softmax)
● Very beneficial for nonlinear activation functions
● Easy to use in PyTorch (torch.utils.checkpoint) and TensorFlow 2.0 (recompute_grad, nightly); a minimal PyTorch sketch follows this slide

246
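A minimal sketch with torch.utils.checkpoint, assuming a stack of arbitrary blocks; each block's intermediate activations are dropped after the forward pass and recomputed during backward:

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Only the block input is kept; everything inside `block`
            # is recomputed when gradients are needed.
            x = checkpoint(block, x)
        return x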
Reversible Residual Connections

● Split each layer's activations into two halves and couple them through the residual connections so that the
block becomes invertible:

[Diagram: forward computation of the reversible block and its inverse in the backward pass]

Benefits:

● Saves some memory for free (if your framework supports it, e.g. JAX)
● Usually, gradient checkpointing should be preferred:
○ Can save more memory due to being more general.
○ Easy to implement. Supported by major frameworks.

Gomez et al., 2017: arXiv


247
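A minimal sketch of one reversible block in the style of Gomez et al. (F and G are arbitrary sub-networks, an assumption here): the outputs are y1 = x1 + F(x2) and y2 = x2 + G(y1), and the inputs can be reconstructed exactly from the outputs, so they do not have to be stored.

import torch

class ReversibleBlock(torch.nn.Module):
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs during the backward pass.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2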
Gradient Accumulation

● Split larger mini-batches into “micro-batches”


● Do standard forward/backward passes with micro-batches, but do not update the weights right away (and
do not reset the gradient on the weights)
● Accumulate the gradient on the weights for all micro-batches
● Update the weights once enough micro-batches have been computed

Benefits / Tradeoffs:

● As long as your model runs with batch size 1, you can simulate any batch size (see the loop sketch after this slide)
● Easy to implement and can reduce memory footprint significantly
● Slow if the micro-batch size is very small
● Can improve data-parallel performance significantly (speedups), especially for very large models

248
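A minimal accumulation loop in PyTorch (model, optimizer, loss_fn and loader are assumed); the simulated batch size is the micro-batch size times accum_steps:

accum_steps = 8                                             # number of micro-batches per update

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps    # average over micro-batches
    loss.backward()                                         # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                    # one update per accum_steps micro-batches
        optimizer.zero_grad()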
Parallelism

249
Parallelism Overview

● Data parallelism
● Model parallelism
● Pipeline parallelism
● ZeRo parallelism optimizations
● 3D parallelism

250
Data Parallelism

Idea: Keep the same model parameters across multiple accelerators. Feed them different mini-batches and
average the gradient across accelerators.
[Diagram: forward and backward pass on two devices holding the same parameters; each device gets a different input mini-batch, and gradients are synchronized before the update]

Krizhevsky 2014 (arXiv)
251
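A minimal data-parallel sketch with PyTorch DistributedDataParallel (one process per GPU, launched e.g. with torchrun; MyModel, loader, loss_fn and optimizer are assumptions). DDP keeps identical parameters on every rank and all-reduces gradients during the backward pass:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])

for inputs, targets in loader:                   # each rank reads a different shard of the data
    inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
    loss = loss_fn(model(inputs), targets)
    loss.backward()                              # gradient all-reduce overlaps with backward
    optimizer.step()
    optimizer.zero_grad()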
Model Parallelism

Idea: Keep the same mini-batch across multiple accelerators; split each layer's parameters across all devices and
synchronize layer outputs after each layer.
[Diagram: forward and backward pass on two devices holding different slices of each layer's parameters; both devices process the same input mini-batch and synchronize layer outputs after every layer]

Krizhevsky 2014 (arXiv)
252
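A toy sketch of splitting a single linear layer's parameters over two GPUs (column-wise split; the class name is hypothetical): each device computes its slice of the output from the same input, and the concatenation plays the role of the "sync" in the diagram.

import torch

class TwoGPUColumnParallelLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        assert d_out % 2 == 0
        # Each device holds half of the weight columns.
        self.w0 = torch.nn.Parameter(torch.randn(d_in, d_out // 2, device="cuda:0"))
        self.w1 = torch.nn.Parameter(torch.randn(d_in, d_out // 2, device="cuda:1"))

    def forward(self, x):            # x is assumed to live on cuda:0
        y0 = x @ self.w0
        y1 = x.to("cuda:1") @ self.w1
        # "Sync": gather the partial outputs onto one device.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)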
Pipeline Parallelism

Idea: Split the network by depth into k pieces across k accelerators. Each accelerator holds 1/k-th of the layers. Use
micro-batches to overlap computation and communication (a naive two-stage sketch, without micro-batching, follows this slide).

Krizhevsky 2014 (arXiv); Harlap et al., 2018 (arXiv); Huang et al., 2018 (arXiv)
253
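A naive two-device model split as a starting point (no micro-batching, so each device idles while the other works); pipeline schedules such as GPipe or PipeDream add micro-batches on top of exactly this kind of split:

import torch

class TwoStagePipeline(torch.nn.Module):
    def __init__(self, stage0, stage1):
        super().__init__()
        # First half of the layers on GPU 0, second half on GPU 1.
        self.stage0 = stage0.to("cuda:0")
        self.stage1 = stage1.to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # activations cross the device boundary here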
ZeRO Parallelism Optimizations

Idea: Gradients, parameters, and optimizer state are only needed for the currently active layer. We distribute this state across all
GPUs and gather it when needed (when a layer becomes “active”).

Rajbhandari et al., 2020 (arXiv)


254
3D Parallelism

[Diagram: 3D parallelism combines data parallelism, model parallelism, and pipeline parallelism with ZeRO]

Microsoft Blog Post


(paper coming soon?) 255
Why 3D Parallelism?

● Model parallelism is inefficient if the batch size is too large: the synchronized layer outputs grow with the batch size, and this communication cannot be overlapped with computation.
● Data parallelism is inefficient if the per-device batch size is too small.
● Pipeline parallelism decreases the per-device batch size through micro-batches (which helps model parallelism).
● Pipeline parallelism increases the effective mini-batch size through aggregation of micro-batches (which helps data parallelism).
● Pipeline parallelism allows for simple overlap of communication and computation.

Microsoft Blog Post


(paper coming soon?) 256
Efficiency Optimizations

257
Larger Batch Size

● GPUs are more efficient if fully utilized. That usually only happens if the batch size is large
● GPUs run better if the mini-batch dimension is 32 or larger
● Often you can achieve faster training overall by using a memory-efficiency technique which slows
down each step but enables training with a larger batch size
● Larger batch sizes enable larger learning rates. While each step is slower, overall training might be
faster.

258
Fused Kernels

● Adam with 10^9 parameters:


○ 14 tensor-sized reads/writes per update
○ 10^9 parameters at 32 bits = 4 GB per tensor
○ Normal Adam on a GPU with 600 GB/s: 14 × 4 GB / 600 GB/s ≈ 100 ms
○ Fused Adam: 6 ms
259
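The estimate above as a back-of-the-envelope calculation, using the numbers on the slide (14 tensor-sized memory passes of 4 GB each on a 600 GB/s GPU):

params = 1e9
bytes_per_tensor = params * 4        # 32-bit state: 4 GB per tensor
memory_passes = 14                   # reads/writes of unfused Adam
bandwidth = 600e9                    # bytes per second

seconds = memory_passes * bytes_per_tensor / bandwidth
print(f"{seconds * 1000:.0f} ms")    # ~93 ms, i.e. roughly 100 ms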
Mixture of Experts

260
Mixture of Experts: Overview

Shazeer et al., 2017: arXiv

Lepikhin et al., 2020: arXiv

261
Transformers Mini-batch Time

262
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017):

Initialize W_g and W_noise with zeros, so outputs are driven by standard normal noise. This
guarantees balancing across experts at the start of training.

The noise also helps to decrease the early advantage of previously picked experts.

263
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017):

An additional balancing ("importance") loss assigns a high loss when some experts receive much more gate probability than others. This
prevents failure cases where one expert is always picked with 100% probability.

Coefficient of variation: CV(X) = std(X)/mean(X)

264
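A sketch of this importance loss, assuming gate_probs is the [batch, num_experts] output of the softmax gate; the per-expert importance is the gate probability summed over the batch, and the loss is the squared coefficient of variation times a weight w (0.1 here, an assumption):

import torch

def importance_loss(gate_probs, w=0.1, eps=1e-10):
    importance = gate_probs.sum(dim=0)                    # total gate probability per expert
    cv = importance.std() / (importance.mean() + eps)     # CV(X) = std(X) / mean(X)
    return w * cv ** 2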
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017)

Importance loss can be satisfied by picking a subset of experts. To prevent this degeneration we want
to pick all experts with roughly the same probability over time.

If we view the softplus term as something analogous to a standard deviation and the mean softmax
as the expected value, we can express an approximate probability for this with a CDF of the normal
distribution.

265
Mixture of Experts: Balancing and Specialization v2

Version 2 (Lepikhin et al., 2020):

No noise. Initialize layers normally. Keep track of c_e, the number of times expert e was picked for the
sequence of S tokens. With the mean gate probability m_e of each expert we can now define a balancing auxiliary loss:

L_aux = k · (1/E) · Σ_e (c_e / S) · m_e

where k is a constant loss weight (a good value is 0.1; usually between 0.01 and 1.0) and E is the number of experts.

266
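A sketch of this auxiliary loss, assuming gate_probs is the [S, num_experts] softmax gate output for one sequence of S tokens and top-1 routing; c_e is counted from the argmax and m_e is the mean gate probability:

import torch

def balancing_aux_loss(gate_probs, k=0.1):
    S, E = gate_probs.shape
    top1 = gate_probs.argmax(dim=-1)                           # expert picked for each token
    frac = torch.bincount(top1, minlength=E).float() / S       # c_e / S
    m_e = gate_probs.mean(dim=0)                               # mean gate probability per expert
    return k * (frac * m_e).mean()                             # k * (1/E) * sum_e (c_e/S) * m_e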
Mixture of Experts: Balancing and Specialization v2

Version 2 (Lepikhin et al., 2020):

● Random dispatch: use the 2nd expert with probability proportional to its softmax gate probability.
● Have a frequency cutoff (a token budget) for each expert. If this budget is exceeded, the
expert output degenerates to a zero matrix for the overflowing tokens. This effectively reduces the output of the MoE layer to zero,
and thus only the residual connection output around the MoE layer is fed to the next layer.

267
Mixture of Experts: Balancing and Specialization

Many cases of expert degeneration:


1. Overbalancing: All experts are approximately equally used. However, gate probability
approaches 1/#Experts. No expert is better than another expert.

2. Underbalancing: The same top-k experts are used for every token. This leads to two strong
experts, but all other experts do not learn anything and are “wasted capacity”.

3. Sequence-level degeneration: Model balances experts by using each expert for a particular
sequence index. For example, for indices 0, 1, 2, 3 always experts E3, E1, E2, E0. This leads to
sequence experts, but not content experts.

268
Mixture of Experts: Benefits

● Works well on diverse data like multilingual machine translation
● Can be difficult to train due to balancing/specialization issues
● Only faster than transformers if you can run it with a large enough batch size to saturate distributed experts
● If you scale the model across a cluster, you will need excellent interconnect performance (TPU v4 Pod, NVIDIA SuperPod)

Shazeer et al., 2017: arXiv


Lepikhin et al., 2020: arXiv
269
07

Closing Notes

270
Why we should strive for efficiency

Our field has seen a dramatic increase in scale in the past 2 years.

Striving for efficiency means caring about:

1) Costs

2) Accessibility

3) Production needs

4) The sustainability of this growth

271
Closing Notes

In this tutorial, we covered a wide range of ideas, applications, and practical considerations that help us
build more efficient systems, including:

1) Core efficiency techniques

2) Efficiency improvements to attention mechanisms

3) Case studies of efficient models

4) Practical considerations for scaling models

272
We hope you
enjoyed it and
learned something
new!

273
Thank you!
Slides available at: bit.ly/2SmhKY7

274
