High Performance Natural Language Processing
EMNLP 2020 Tutorial
1
Presenters
3
Agenda
01 Introduction
02 Fundamentals
03 Core Techniques
04 Efficient Attention
05 Case Studies
06 Scaling in Practice
07 Closing Notes
4
01
Introduction
5
Motivation & Applications
Why do we need it? SCALE
● NEWS
What could we do, if we had it?
[1] Tatar, A., Antoniadis, P., Amorim, M.D.d. et al. From popularity prediction to ranking online news. Soc. Netw. Anal. Min. 4, 174 (2014). https://doi.org/10.1007/s13278-014-0174-8
[2] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html
6
Summarization
7
Summarization
8
Facts Extraction
● Noteworthy Facts
● Trendiness
VS
10
Sentence Entailment
11
Recent years in Natural Language Processing
12
Benchmarks through the years - SQuAD 1.1
[Charts: leaderboard progress over time relative to human performance reference lines (91.2, 89.5, 87.1 on the respective benchmarks)]
16
A brief recent history of scale in NLP
Transformer, GPT (110M), ELMo (465M), BERT (340M), GPT-2 (1.5B), Grover (1.5B), MegatronLM (8.3B), T5 (11B)
“[...] scaling the model size to 11 billion parameters was the most important ingredient for achieving our best performance.” (Raffel et al., 2019)
17
A brief recent history of scale in NLP
GPT (110M), BERT (340M), GPT-3 (175B), GShard (600B), DeepSpeed (1T)
18
Scaling Laws
- Latency
- Hardware constraints
- Energy costs
20
The drawbacks of naive scaling
2) Costs
- Hardware
- Financial
21
The drawbacks of naive scaling
2) Costs
3) Accessibility
- For instance, 62% of PhD students have access to 4 or fewer GPUs, according to a recent poll.
22
The drawbacks of naive scaling
2) Costs
3) Accessibility
23
We should strive for efficiency
24
Towards more efficient NLP
1) Core techniques
- Knowledge Distillation
Source: unsplash.com
25
Towards more efficient NLP
1) Core techniques
- Knowledge Distillation
- Quantization
Source: unsplash.com
26
Towards more efficient NLP
1) Core techniques
- Knowledge Distillation
- Quantization
- Pruning
Source: unsplash.com
27
Towards more efficient NLP
1) Core techniques
2) Efficient attention
- Data-Independent Patterns
28
Towards more efficient NLP
1) Core techniques
2) Efficient attention
- Data-Independent Patterns
- Data-Dependent Patterns
29
Towards more efficient NLP
1) Core techniques
2) Efficient attention
- Data-Independent Patterns
- Data-Dependent Patterns
30
Towards more efficient NLP
1) Core techniques
2) Efficient attention
- Data-Independent Patterns
- Data-Dependent Patterns
- Recurrence
31
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
32
Source: unsplash.com
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- Retrieval
Source: unsplash.com
33
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
4) Scaling in Practice
34
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
4) Scaling in Practice
35
Towards more efficient NLP
1) Core techniques
CPU GPU
2) Efficient attention
3) Case studies
4) Scaling in Practice
36
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
4) Scaling in Practice
37
02
Fundamentals
38
Sequence-to-sequence models
[Diagram: inputs → “piles of sequential, differentiable tensor operations” → outputs]
https://xkcd.com/1838/
39
Sequence-to-sequence models
Machine Translation example: input “Hello, world” → output “Olá, mundo”
40
RNNs
[Diagram: inputs processed by a chain of RNN (tanh) cells, producing outputs; still piles of sequential, differentiable tensor operations]
41
Encoders and Decoders
[Diagram: inputs → encoder → decoder → outputs, each a pile of sequential, differentiable tensor operations]
42
The encoder-decoder bottleneck
[Diagram: an English encoder RNN reads “The agreement on the European Economic Area was signed in August 1992 . <EOS>” and compresses it into a single hidden state (an information bottleneck), from which a French decoder RNN generates “l’ accord sur la zone économique européenne a été signé en août 1992 . <EOS>”]
43
Attention
[Diagram: while generating “été”, an attention head lets the French decoder attend over all English encoder states for “The agreement on the European Economic Area was signed in August 1992 . <EOS>”]
44
Attention
An attention head produces a summary of the encoder states hᵢ, weighted by how similar they are to the query.
Bahdanau et al. Neural machine translation by jointly learning to align and translate. 2014
Thang Luong et al. Effective approaches to attention-based neural machine translation. 2015
45
Dot product attention
[Diagram: an attention head scoring each encoder state by its dot product with the query]
46
Attention mechanisms
[Figure: the attention matrix between the source sentence “¿ Vas al hotel ?” and the target sentence “Are you going to the hotel ?”]
47
Transformers
48
Motivation
[Diagram: the attentional RNN encoder-decoder from before, translating “The agreement on the European Economic Area was signed in August 1992 .” into French]
49
Motivation
[Diagram: RNN hidden states must be computed sequentially, position by position, whereas self-attention processes all positions in parallel]
50
Scaled Dot-Product Attention
A summary of values, based on how similar their corresponding keys are with the query.
51
Scaled Dot-Product Attention
(n = sequence length, d = feature dim)
Computing the attention matrix is quadratic in sequence length!
58
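For reference, the operation these slides build up is usually written as follows (the exact notation on the original slides is not recoverable from this transcript; here n denotes the sequence length and d the feature dim, matching the legend above):

```latex
% Scaled dot-product attention (Vaswani et al., 2017).
% Q, K, V are n x d matrices; QK^T is n x n, which is where the quadratic cost comes from.
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```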
Multi-head attention
[Diagram: queries, keys, and values are produced by separate linear layers, split across h heads, passed through Scaled Dot-Product Attention per head, concatenated, and projected by a final linear layer]
(n = sequence length, d = feature dim, h = # of attention heads; each head works on d/h features)
The bottleneck is quadratic in sequence length due to QKᵀ!
63
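A minimal NumPy sketch of multi-head attention as described above (an illustration, not the presenters' code; shapes follow the legend: n = sequence length, d = feature dim, h = heads). The (h, n, n) score tensor is the quadratic bottleneck:

```python
# Minimal multi-head attention sketch in NumPy, for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv                         # each: (n, d)
    split = lambda t: t.reshape(n, h, d // h).transpose(1, 0, 2)  # (h, n, d/h)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d // h)      # (h, n, n): quadratic in n
    out = softmax(scores) @ v                                # (h, n, d/h)
    out = out.transpose(1, 0, 2).reshape(n, d)               # concatenate the heads
    return out @ Wo                                          # final linear projection

n, d, h = 8, 16, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, h).shape)      # (8, 16)
```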
Positional encodings
So far, attention has been a set operation.
Positional encodings inject order information; they can be either learned or fixed.
[Figure: positional encoding values over position and depth]
65
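As one concrete instance of the "fixed" option, the sinusoidal encodings of Vaswani et al. (2017), written here for reference (pos = position, i = dimension index, d = model width):

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```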
The transformer encoder
[Diagram: input embedding + positional encoding → N× (multi-head attention → add & norm → dense → add & norm)]
Layer normalization: Ba et al., 2016.
66
The transformer decoder
[Diagram: output embedding + positional encoding → N× (masked multi-head attention → add & norm → multi-head attention over K and V from the encoder → add & norm → dense → add & norm)]
Masking prevents the model from peeking at the future by masking attention weights.
Ba et al, 2016.
[Diagram: the full transformer, with an encoder stack (multi-head attention and dense sublayers, ×N) and a decoder stack (masked multi-head attention, encoder-decoder multi-head attention, and dense sublayers, ×N), each followed by add & norm]
68
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)
https://ai.facebook.com/blog/using-neural-networks-to-solve-advanced-mathematics-equations/
69
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)
https://magenta.tensorflow.org/music-transformer
70
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)
71
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)
- Vision and Language (e.g. Tan et al., 2019, Lu et al., 2019, Chen et al., 2020)
73
Transformers in NLP
Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).
74
Transformers in NLP
Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).
Pre-training:
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs on GPUs or TPUs
75
Transformers in NLP
Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).
Pre-training:
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs on GPUs or TPUs
Fine-tuning:
- Smaller corpus
- Typically supervised (e.g. question answering, natural language inference)
- Usually runs on GPUs or TPUs
76
Transformers in NLP
Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).
77
03
Core Techniques
78
Knowledge
Distillation
Source: unsplash.com
79
Knowledge
Distillation
x (x, y=1.0)
Data
80
Knowledge Distillation for Pre-training
[Diagram: on the pre-training Data, the teacher’s predicted token distribution (e.g. “are”, “do”, “well”, ...) serves as a soft target for the Student]
81
Knowledge Distillation for Pre-training
[Diagram: Teacher → Student on the pre-training Data, with feature map transfer and attention transfer between corresponding layers]
Sun et al., 2019. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
82
Knowledge Distillation for Fine-Tuning
[Diagram: step 1, regular pre-training of the Student with masked language modeling (e.g. “how [MASK] you”) on unlabeled Data; the fine-tuning Data provides labeled pairs (x, y=1.0)]
83
Knowledge Distillation for Pre-training and Fine-tuning
[Diagram: (1) pre-training via distillation: Teacher → Student with per-layer transfer on masked text such as “how [MASK] you”; (2) fine-tuning via distillation: Teacher → Student with per-layer and embeddings transfer on labeled Data (x, y=1), the teacher providing soft targets such as (x, y=0.8)]
Jiao et al., 2019. TinyBERT: Distilling BERT for Natural Language Understanding
84
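A minimal sketch of the soft-target distillation loss underlying these setups, assuming a Hinton-style objective with temperature T and mixing weight alpha; the per-layer, attention, and embedding transfer terms used by MobileBERT/TinyBERT are omitted:

```python
# Generic soft-target distillation loss (illustrative sketch, not the papers' exact objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's tempered distribution via KL divergence.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```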
Quantization
Source: unsplash.com
85
Quantization Definition
quantization operator
86
Quantization Definition
quantization operator
Linear Quantization
z = S (q − Z)   (S: scaling factor, Z: zero point)
87
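A small illustrative sketch of the affine scheme above, z = S(q − Z); the range-based choices of S and Z here are assumptions, not the slide's exact recipe:

```python
# Linear (affine) quantization sketch: q = round(z / S) + Z, z_hat = S * (q - Z).
import numpy as np

def quantize(z, S, Z, bits=8):
    q = np.round(z / S) + Z
    return np.clip(q, 0, 2 ** bits - 1).astype(np.int32)

def dequantize(q, S, Z):
    return S * (q - Z)

z = np.array([-1.0, -0.1, 0.0, 0.5, 1.0])
S = (z.max() - z.min()) / 255.0          # scaling factor from the observed range (assumed)
Z = int(round(-z.min() / S))             # zero point so that 0.0 maps (almost) exactly
q = quantize(z, S, Z)
print(q, dequantize(q, S, Z))            # dequantized values approximate z
```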
Quantization: Quantization-Aware Training
Forward pass on ŵ, backward pass on w.
Jacob et al., 2017. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
88
Quantization
● Q8BERT: symmetric linear quantization:
  Q(z) = clamp(⌊z × S_z⌉, −127, +127), where S_z is a statistic computed during or post-training.
  Zafrir et al., 2019. Q8BERT: Quantized 8Bit BERT
● Q-BERT: uniform quantization to {0, …, 2^k − 1} with:
  ○ mixed precision (higher Hessian spectrum ⇒ higher precision for the layer)
  ○ group precision (each matrix W_k, W_q, W_v, W_o is its own group)
  Shen et al., 2019. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
89
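For concreteness, a sketch instantiating the Q8BERT-style symmetric operator quoted above; deriving S_z from a max-abs statistic is an assumption (the paper allows any statistic computed during or post-training):

```python
# Symmetric int8 quantization: Q(z) = clamp(round(z * S_z), -127, 127).
import numpy as np

def symmetric_quantize(z, bits=8):
    S_z = (2 ** (bits - 1) - 1) / np.max(np.abs(z))   # scale from a max-abs statistic (assumed)
    q = np.clip(np.round(z * S_z), -127, 127).astype(np.int8)
    return q, S_z

z = np.array([-0.8, -0.05, 0.0, 0.3, 0.9], dtype=np.float32)
q, S_z = symmetric_quantize(z)
print(q, q / S_z)   # dividing by S_z dequantizes back to approximately z
```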
Quantization
with Distillation
90
Pruning
Source: unsplash.com
91
Pruning Definition
a = (W ⊙ M) x, where x is the input, W the model weight, M the pruning mask, and a the activation
92
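A tiny sketch instantiating the definition above, with M obtained by simple magnitude pruning (one common choice):

```python
# a = (W ⊙ M) x with a magnitude-based pruning mask M (illustrative sketch).
import numpy as np

def magnitude_mask(W, sparsity=0.5):
    # Zero out the `sparsity` fraction of weights with the smallest magnitude.
    k = int(W.size * sparsity)
    threshold = np.sort(np.abs(W), axis=None)[k]
    return (np.abs(W) >= threshold).astype(W.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
x = rng.normal(size=(8,))
M = magnitude_mask(W, sparsity=0.5)
a = (W * M) @ x           # masked weights applied to the input
print(M.mean(), a.shape)  # roughly half of the weights survive
```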
Pruning: Early Work (pruning based on second-order derivatives)
Main idea:
● Start with a “reasonably large” network
● Train it to convergence
LeCun et al., 1990. OBD: Optimal Brain Damage
93
Pruning
The LTH
Searching for Tickets: One-Shot Magnitude Pruning
Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 95
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning
Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 96
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning with Rewinding
Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 97
Pruning
The LTH, ctd
MP = Magnitude Pruning
LT = Lottery Ticket
SLT = Stabilized Lottery Ticket
CLT = Constant Lottery Ticket
98
Pruning: Movement Pruning
● First-order strategy: “instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process”
● The pruning mask M is learnt together with the model parameters.
  ○ hard version: M = Top_v(S), where the score S is learnt and v is a hyperparameter.
Sanh et al., 2020. Movement Pruning: Adaptive Sparsity by Fine-Tuning
99
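A sketch of the hard variant, M = Top_v(S); the scores below are random stand-ins, whereas in the paper S is learned jointly with the weights via a straight-through estimator:

```python
# Top_v masking: keep the fraction v of weights with the highest learned scores S.
import numpy as np

def top_v_mask(S, v=0.10):
    k = max(1, int(round(v * S.size)))
    threshold = np.sort(S, axis=None)[-k]
    return (S >= threshold).astype(np.float32)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 8))    # stand-in for learned importance scores
M = top_v_mask(S, v=0.25)
print(M.sum())                 # about 25% of the entries are kept
```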
Pruning on standard hardware (Hooker, 2020. The Hardware Lottery):
[Table comparing two pruning setups: Storage ✅ / ✅; Inference ❌ / ✅; Flexibility ✅ / ❌]
100
04
Efficient Attention
101
Recap
[Diagram: the full transformer encoder-decoder, with multi-head attention, masked multi-head attention, dense sublayers, and add & norm, repeated ×N]
102
Recap
[Diagram: the same architecture, with the attention blocks highlighted]
103
Recap
This poses a serious problem when large sequences are required, e.g.:
● Long-range dependencies
● Character-level models
● Speech processing
● High-resolution image processing
104
Efficient Attention
In the past months, there has been much progress in making self-attention more efficient
time
105
Efficient Attention
In the past months, there has been much progress in making self-attention more efficient
time
106
Beyond a Dense Attention Matrix
Goal: Approximate the computation of attention
[Diagram: the dense Queries × Keys attention matrix]
107
Efficient Attention
● Data-Independent Patterns
● Data-Dependent Patterns
● Kernels and Alternative Attention Mechanisms
● Recurrence
112
Data-Independent Patterns
[Diagrams: fixed sparsity patterns over the Queries × Keys attention matrix, with example models for each pattern]
118
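As one concrete fixed pattern, a local sliding-window mask (used, in some form, by several models in this family); illustrative only:

```python
# Local sliding-window attention mask: each query attends only to keys within +/- w positions,
# so the number of non-zero entries grows linearly in n instead of quadratically.
import numpy as np

def sliding_window_mask(n, w):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w   # (n, n) boolean mask

mask = sliding_window_mask(n=8, w=2)
print(mask.astype(int))
# In practice, scores are only computed inside the band (or masked with -inf before the
# softmax); efficient implementations never materialize the full n x n matrix.
```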
Data-Dependent Patterns
119
Data-Dependent Patterns: Buckets
Create buckets/clusters over the Queries and Keys and compute attention within each bucket.
Examples:
120
Data-Dependent Patterns: Buckets via Hashing
Each query/key vector is assigned to a bucket through a hash function.
Examples:
121
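A sketch of hashing-based bucketing in the spirit of the LSH scheme popularized by Reformer; the random-rotation hash below is an assumption for illustration:

```python
# Assign token vectors to buckets via random projections; attention is then restricted
# to pairs that land in the same bucket.
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))   # random projection directions
    proj = x @ R
    # The argmax over [proj, -proj] gives an angular bucket id in [0, n_buckets).
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

x = np.random.default_rng(1).normal(size=(16, 64))       # 16 token vectors
buckets = lsh_buckets(x, n_buckets=4)
print(buckets)   # tokens that share a bucket id attend to each other
```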
Data-Dependent Patterns: Buckets via Clustering
Examples:
122
Data-Dependent Patterns
123
Data-Dependent Patterns: Compression
[Diagram: the Keys are compressed into a smaller set of Compressed Keys before attention]
Examples:
124
Kernels and Alternative
Attention Mechanisms
125
Kernels and Alternative Attention Mechanisms: Kernels
[Derivation: the softmax attention head is rewritten with a kernel feature map, so that the key-value summary can be computed once and reused; it is independent of the query]
Random Feature Attention (Anonymous, 2020)
132
Kernels and Alternative Attention Mechanisms
This work presents an unbiased, low-variance approximation of attention via random feature map
decompositions, with linear time and space complexity.
133
Kernels and Alternative Attention Mechanisms
134
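A sketch of the kernelized ("linear") attention idea behind these slides: softmax(QKᵀ)V is replaced by φ(Q)(φ(K)ᵀV), so the key-value summary is computed once and is independent of the query. The elu+1 feature map follows Katharopoulos et al. (2020); Performer and Random Feature Attention use random feature maps instead:

```python
# Kernelized / linear attention sketch: linear in sequence length n.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, strictly positive feature map

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)                      # (n, r)
    KV = Kp.T @ V                                # (r, d): computed once, independent of the query
    Z = Qp @ Kp.sum(axis=0)                      # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]                # (n, d)

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
print(linear_attention(Q, K, V).shape)           # (6, 4)
```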
Recurrence
135
Recurrence
[Diagram: attention over the current segment only]
136
Recurrence
[Diagram: attention over the current segment plus a cached previous segment (kept fixed)]
137
Recurrence
compression
Compressive Transformers
(Rae et al., 2019)
138
Recurrence
compression
Compressive Transformers
(Rae et al., 2019)
139
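A sketch of segment-level recurrence (Transformer-XL style): cached states from the previous segment are prepended to the keys and values of the current segment. The compression step of Compressive Transformers is not shown:

```python
# Attention over the current segment plus a fixed memory of the previous segment.
import numpy as np

def attend_with_memory(q, k, v, mem_k, mem_v):
    k_all = np.concatenate([mem_k, k], axis=0)        # (m + n, d)
    v_all = np.concatenate([mem_v, v], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])       # (n, m + n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_all                            # (n, d)

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
q, k, v = rng.normal(size=(3, n, d))
mem_k, mem_v = rng.normal(size=(2, m, d))             # cached from the previous segment (no gradient)
print(attend_with_memory(q, k, v, mem_k, mem_v).shape)    # (4, 8)
```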
Overview
140
Benchmarking
141
Benchmarking
5 tasks:
142
Benchmarking
5 tasks:
(positive) (negative)
143
Benchmarking
144
Benchmarking
145
Benchmarking
146
Benchmarking
147
Key Takeaways
There has been a surge in ideas for improving the efficiency of attention and transformers, especially for improving their capacity to handle long sequences.
There has been good progress in recent months: we are now able to compute attention in linear time with respect to sequence length, leading to large speed improvements without much performance drop for large sequences.
Future improvements in hardware, e.g. on the efficiency of sparse computations, may make these ideas even more appealing in the long run (Hooker, 2020).
The ideas presented in this section are often orthogonal to each other and to other efforts presented in this tutorial, and can be combined for more efficient models.
151
05
Case Studies
152
Efficient
Language models
153
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
Source: unsplash.com
154
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
155
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
Source: unsplash.com
156
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
- MobileBERT
Source: unsplash.com
Source: unsplash.com
157
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
- MobileBERT
Source: unsplash.com
158
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
- MobileBERT
Source: unsplash.com
159
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- MobileBERT
- Hardware-Aware Transformers
160
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
- MobileBERT
- Hardware-Aware Transformers
- SqueezeBERT (convolution: a more efficient operation)
161
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
- PRADO + pQRNN
- MobileBERT
- Hardware-Aware Transformers
- SqueezeBERT
163
Natural Language Processing with Small Feed-Forward Networks
[Diagram: discrete feature embedding matrix → reshaping → ReLU-activated fully connected layer → softmax]
164
Natural Language Processing with Small Feed-Forward Networks
Example result: POS tagging, compared to BTS (Gillick et al., 2016)
● +0.3% accuracy (95.4%, near state-of-the-art)
● 6x fewer parameters
● 36x fewer FLOPs
165
The Evolved Transformer
● Consistent improvement over the Transformer on the well-established WMT and LM1B benchmarks.
So et al., 2019. arxiv.org/abs/1901.11117
166
The Evolved Transformer
So et al., 2019. arxiv.org/abs/1901.11117
167
The Evolved Transformer
So et al., 2019. arxiv.org/abs/1901.11117
168
The Evolved Transformer
Same quality as the original “big” Transformer with 37.6% fewer parameters, and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of ~7M params.
So et al., 2019. arxiv.org/abs/1901.11117
169
PRADO + pQRNN
PRADO: Projection Attention Networks for Document Classification On-Device
● Combines trainable projections with attention and convolutions
● At only 200 kilobytes in size, it outperformed prior CNN and LSTM models and achieved near state-of-the-art performance on multiple long document classification tasks.
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html
170
PRADO + pQRNN
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html
171
PRADO + pQRNN
pQRNN:
● A projection layer with a quasi-RNN encoder
● Same projection layer used in PRADO
● pQRNN is also quantized
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html
172
PRADO + pQRNN
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html
173
MobileBERT
● Designed for running on mobile phones with acceptable latency
174
MobileBERT
175
MobileBERT
176
MobileBERT
● 4.3x smaller, 5.5x faster than BERT-base
● 77.7 GLUE score, comparable to BERT-base
● 90.0/79.2 SQuAD v1.1/v2.0 F1, comparable to BERT-base
● 62 ms latency on a Pixel 4 phone
177
Lite Transformer with Long-Short Range Attention
● Long-Short Range Attention (LSRA)
  ○ Local context modeling by convolution
  ○ Long-distance modeling by attention
178
MicroNet for Efficient Language Modeling
NeurIPS 2019 MicroNet Challenge, language modeling track: train efficient word-level language models on the WikiText-103 dataset (word-level perplexity < 35).
179
MicroNet for Efficient Language Modeling
● Core Language Model
  ○ Transformer-XL
  ○ Short Context Group Joint Optimization
  ○ Adaptive Embedding and Softmax
  ○ Hebbian Updates
● Compression Techniques
  ○ Knowledge Distillation
  ○ Pruning
  ○ Quantization
Yan et al., 2020. arxiv.org/abs/2005.07877
180
MicroNet for Efficient Language Modeling
● 90-fold reduction in parameter size and a 36-fold reduction in math operations compared to the MicroNet baseline
181
Hardware-Aware Transformers
● Neural Architecture Search
  ○ Train a SuperTransformer to cover a large space
  ○ Evolutionary search with a hardware latency constraint to find a specialized SubTransformer
182
Hardware-Aware Transformers
183
Hardware-Aware Transformers: SubTransformer search
● Evolutionary search
184
Hardware-Aware Transformers
WMT’14 results on Raspberry Pi 4:
● 3× speedup and 3.7× smaller size over the baseline Transformer
185
SqueezeBERT
● Replaces several operations in self-attention layers with grouped convolutions
186
SqueezeBERT
● Brings previous takeaways from CV into NLP (already adopted in MobileBERT):
  ○ Bottleneck layers
  ○ Convolutions
187
SqueezeBERT
● Results
  ○ 4.3x faster than BERT-base on a Pixel 3 phone (while MobileBERT is reported as 3.0x faster than BERT-base).
188
DeLighT: Very Deep and Light-weight Transformer
● More efficient parameter allocation within and across Transformer blocks
● Similar performance with substantially fewer parameters compared to baseline transformers.
189
DeLighT: Very Deep and Light-weight Transformer
190
DeLighT: Very Deep and Light-weight Transformer
191
Retrieval
192
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks
[Diagram: two Encoders in a two-tower setup feeding a prediction]
193
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks
Source: unsplash.com
194
Towards more efficient NLP
1) Core techniques
2) Efficient attention
3) Case studies
b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks
Source: unsplash.com
195
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
● Cross-attention, single-tower models such as BERT have set state-of-the-art results on sentence-pair tasks such as STS.
● For sentence-retrieval tasks, a cross-attention model requires expensive re-encoding of the entire retrieval corpus.
● Sentence-BERT modifies the pretrained encoder to perform a single inference per input sentence, followed by cheap pairwise comparisons, e.g. cosine similarity.
196
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[Diagram: cross-attentional (single tower) model encoding “[CLS] S_A [SEP] S_B [SEP]” with one Encoder vs. dual-encoder (two tower) model encoding “[CLS] S_A [SEP]” and “[CLS] S_B [SEP]” with separate Encoders, followed by a regression layer and prediction]
Reimers et al., 2019. arxiv.org/abs/1908.10084
197
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
● Finding the most similar sentence in a collection of 10,000 sentences on a V100 GPU:
  ○ BERT (cross-attention): 65 hours
  ○ SBERT (dual encoder): 5 seconds
● Can also be combined with Maximum Inner Product Search tools for sublinear scaling:
  ○ https://github.com/google-research/google-research/tree/master/scann
  ○ https://github.com/facebookresearch/faiss
  ○ https://github.com/spotify/annoy
198
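A sketch of the dual-encoder retrieval pattern described above; `encode` is a hypothetical stand-in for any sentence encoder such as an SBERT model:

```python
# Dual-encoder retrieval with cosine similarity: the corpus is encoded once offline,
# and each query needs only one forward pass plus cheap vector comparisons.
import numpy as np

def encode(sentences, dim=128, seed=0):
    # Placeholder encoder producing unit-norm vectors; replace with a real sentence encoder.
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(sentences), dim))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

corpus = ["first sentence", "second sentence", "third sentence"]
corpus_emb = encode(corpus)                  # pre-computed once, offline

query_emb = encode(["a new query"])          # one forward pass at query time
scores = corpus_emb @ query_emb[0]           # cosine similarity (vectors are unit norm)
best = int(np.argmax(scores))
print(corpus[best], float(scores[best]))
# For large corpora, replace the exact dot product with an ANN/MIPS index (ScaNN, FAISS, Annoy).
```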
Generalization through Memorization: Nearest Neighbor Language Models
● Introduces kNN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a k-nearest neighbors (kNN) model.
● Allows for efficiently scaling up to larger training sets and for effective domain adaptation.
199
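The interpolation described above can be written as (notation assumed, following Khandelwal et al., 2020):

```latex
p(y \mid x) \;=\; \lambda\, p_{\mathrm{kNN}}(y \mid x) \;+\; (1 - \lambda)\, p_{\mathrm{LM}}(y \mid x)
```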
Generalization through Memorization: Nearest Neighbor Language Models
203
REALM: Retrieval-Augmented Language Model Pre-Training
● Language model pre-training can capture world knowledge by storing it implicitly in the network parameters, but storage space is limited by the network size (prompting ever-larger networks).
● REALM introduces a latent knowledge retriever to augment the language model, and shows for the first time how to pretrain it in an unsupervised manner.
204
REALM: Retrieval-Augmented Language Model Pre-Training
205
REALM: Retrieval-Augmented Language Model Pre-Training
● Fine-tuning for open-domain question answering
206
REALM: Retrieval-Augmented Language Model Pre-Training
● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11B while being 30 times smaller)
207
REALM: Retrieval-Augmented Language Model Pre-Training
● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11B while being 30 times smaller)
208
06
Scaling in Practice
209
Why Do We Need Scale?
210
Scale More Important Than Architecture
211
Attention Size vs Model Size vs Test Loss
212
Attention vs Fully Connected Time for Various Transformers
213
Conclusions From Measuring Scaling
● Performance increases further and further the more parameters a model has
● Attention is very important for efficiency: Transformers scale better than LSTMs
● Attention has diminishing returns (on general “internet data”)
● Size of data and model are more important than architecture
214
Practical Considerations
215
Experimental vs Theoretical Perspective
Theoretical:
● FLOPS/Operation Complexity/Memory: O(n) better than O(n^2); 100 FLOPS better than 1000
● (Possibly) Analysis of occupancy, memory access patterns for certain hardware
Experimental:
● Three criteria:
a. Does it fit into my GPU/TPU/Accelerator?
b. Is it faster than other methods?
c. Can most people use it (+62% of PhD students)?
● Device oriented walltime/memory: CPU for inference, GPU/TPU for training
216
Theory vs Practice
● Algorithm: (1) divide matrix B into chunks of 128; (2) take the maximum element of each chunk and set the others to zero; (3) perform the matrix multiply A*B=C and skip all zero elements
217
Theory vs Practice
219
Occupancy vs Memory Bandwidth vs FLOPS
= Matrix multiplication
= Convolution
220
Occupancy vs Memory Bandwidth vs FLOPS
= Depthwise Convolution
221
Occupancy vs Memory Bandwidth vs FLOPS
= Depthwise Convolution
(custom implementation)
= Depthwise Convolution
222
BERT Large vs BERT Base
223
Better Performance at Lower Occupancy
= Matrix multiplication
= Matrix Multiplication
(lower occupancy, higher instruction parallelism)
Volkov, 2010
224
Conclusion
● Occupancy, and FLOPS/memory bandwidth utilization are important for runtime performance
● Understanding of hardware needed for performance analysis
● Even with deep understanding of hardware, it is difficult to analyze performance theoretically
● Runtime performance of different algorithms can often only be understood if they are run on
the actual device
● Conclusion: To estimate deep neural network runtime performance, it is best to run the network
and measure its performance directly.
225
Memory Optimizations
226
Resources: Academia vs Industry
227
Memory Optimizations Overview
228
CPU<->GPU Memory Swapping / Paging
[Diagrams: inactive layers are kept in CPU memory and swapped into GPU memory just before they become the active layer, then swapped back out as training moves on]
Pudipeddi et al., 2020 (arXiv)
234
CPU<->GPU Memory Swapping / Paging
● Benefits:
  ○ 60-80% memory reduction
  ○ The network is usually not slower. If it is slower, swap in layers earlier (at the cost of less memory reduction)
  ○ Faster training due to larger batch sizes for very large models
235
Mixed Precision Training (FP16+FP32) / BF16 training
BrainFloat-16 Training:
Benefits:
236
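A minimal sketch of mixed precision training with PyTorch's torch.cuda.amp, one common way to combine FP16 compute with FP32 master weights; the toy model and data are hypothetical and a CUDA device is assumed:

```python
# Mixed precision training sketch with autocast + gradient scaling.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()            # rescales the loss to avoid FP16 underflow

for _ in range(3):
    x = torch.randn(32, 512, device="cuda")
    with torch.cuda.amp.autocast():             # ops run in FP16/FP32 as appropriate
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()               # backward on the scaled loss
    scaler.step(optimizer)                      # unscales gradients, then steps
    scaler.update()
```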
Gradient Checkpointing: Forward
[Diagram: the forward pass proceeds layer by layer; only selected activations are stored as checkpoints, and the rest are recomputed during the backward pass]
Benefits:
246
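A minimal sketch of gradient checkpointing with torch.utils.checkpoint on a hypothetical toy model; only a few activations are stored and the rest are recomputed in the backward pass:

```python
# Gradient checkpointing: trade extra compute in the backward pass for lower activation memory.
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
                              for _ in range(8)])
x = torch.randn(16, 256, requires_grad=True)

out = checkpoint_sequential(model, 4, x)   # split into 4 segments; only segment inputs are stored
out.sum().backward()
print(x.grad.shape)
```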
Reversible Residual Connections
● Divide network output and residual connection into two halves. Compound them into a
reversible structure:
Forward Backward
Benefits:
● Saves some memory for free (if your framework supports it, e.g. JAX)
● Usually, gradient checkpointing should be preferred:
○ Can save more memory due to being more general.
○ Easy to implement. Supported by major frameworks.
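The reversible structure sketched above, written out (following Gomez et al., 2017; the split into halves x1, x2 is the one described on the slide):

```latex
% Forward: the two output halves are built from the two input halves.
y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)
% Backward: the inputs are reconstructed from the outputs, so activations need not be stored.
x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2)
```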
Gradient Accumulation (Micro-Batching)
Benefits / Tradeoffs:
● As long as your model runs with batch size 1 you can simulate any batch size
● Easy to implement and can reduce memory footprint significantly
● Slow if micro-batch size is very small
● Can improve data parallel performance significantly (speedups) especially for very large models
248
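A minimal sketch of gradient accumulation with micro-batches on a hypothetical toy model:

```python
# Run several micro-batches, summing gradients, before taking one optimizer step.
import torch

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 128)                     # micro-batch of 8
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()             # scale so the sum matches a big-batch average
    if (step + 1) % accum_steps == 0:
        optimizer.step()                        # effective batch size: 8 * accum_steps = 32
        optimizer.zero_grad()
```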
Parallelism
249
Parallelism Overview
● Data parallelism
● Model parallelism
● Pipeline parallelism
● ZeRO parallelism optimizations
● 3D parallelism
250
Data Parallelism
Idea: Keep the same model parameters across multiple accelerators. Feed them different mini-batches and average the gradients across accelerators.
[Diagram: Device0 and Device1 run the forward and backward passes on different input mini-batches, synchronize (average) the gradients, and then update]
Krizhevsky 2014 (arXiv)
251
Model Parallelism
Idea: Keep the same mini-batch across multiple accelerators; split the layers’ parameters across all devices and synchronize layer outputs after each layer.
[Diagram: Device0 and Device1 process the same input mini-batch, synchronizing layer outputs in the forward pass and gradients in the backward pass before updating]
Krizhevsky 2014 (arXiv)
252
Pipeline Parallelism
Idea: Split network by depth into k pieces onto k accelerators. Each accelerator holds 1/kth of layers. Use
micro-batches to overlap computation and communication.
Krizhevsky 2014 (arXiv); Harlap et al., 2018 (arXiv); Huang et al., 2018 (arXiv)
253
ZeRO Parallelism Optimizations
Idea: Gradients, parameters, and optimizer state are only needed for the active layer. We distribute this state across all GPUs and gather it together when we need it (when the layer becomes “active”).
[Diagram: comparison of Data, Model, Pipeline, and ZeRO parallelism]
● Model parallelism is bad if the batch size is too large: communication cannot be overlapped with computation.
● Data parallelism is bad if the batch size is too small.
● Pipeline parallelism decreases the mini-batch size through micro-batches.
● Pipeline parallelism increases the mini-batch size through aggregation of micro-batches.
● Pipeline parallelism allows for simple overlap of communication and computation.
257
Larger Batch Size
● GPUs are more efficient if fully utilized. That usually only happens if batch size is large
● GPUs run better if the mini-batch dimension is 32 or larger
● Often you can achieve faster training by using a memory efficiency technique which slows
down training but enables training with larger batch size
● Larger batch sizes enable larger learning rates. While computation per step is slower, training might be faster overall.
258
Fused Kernels
260
Mixture of Experts: Overview
261
Transformers Mini-batch Time
262
Mixture of Experts: Balancing and Specialization v1
Initialize W_g and W_noise with zeros, so outputs are driven by standard normal noise. This
guarantees balancing across experts at the start of training.
The noise also helps to decrease the early advantage of previously picked experts.
263
Mixture of Experts: Balancing and Specialization v1
An additional balancing loss assigns high loss to experts which have very high probability. This
prevents failure cases where an expert is always picked with 100% probability.
264
Mixture of Experts: Balancing and Specialization v1
Importance loss can be satisfied by picking a subset of experts. To prevent this degeneration we want
to pick all experts with roughly the same probability over time.
If we view the softplus term as something analogous to a standard deviation and the mean softmax
as the expected value, we can express an approximate probability for this with a CDF of the normal
distribution.
265
Mixture of Experts: Balancing and Specialization v2
No noise. Initialize layers normally. Keep track of how many times each expert was used for the sequence S. With the mean gate probability of each expert, we can now define a balancing auxiliary loss,
where k is a constant loss weight (a good value is 0.1; usually between 0.01 and 1.0).
266
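The exact formula on the slide is not recoverable here; below is a sketch in the style of the GShard/Switch balancing loss the text describes (f_e = fraction of tokens routed to expert e, P_e = mean gate probability of expert e):

```python
# Balancing auxiliary loss: k * n_experts * sum_e f_e * P_e, minimized by uniform routing.
import torch

def load_balancing_loss(gate_probs, expert_index, n_experts, k=0.1):
    # gate_probs: (tokens, n_experts) softmax outputs; expert_index: (tokens,) chosen expert ids.
    f = torch.zeros(n_experts).scatter_add_(
        0, expert_index, torch.ones_like(expert_index, dtype=torch.float)
    ) / expert_index.numel()                   # fraction of tokens routed to each expert
    P = gate_probs.mean(dim=0)                 # mean gate probability per expert
    return k * n_experts * torch.sum(f * P)

gate_probs = torch.softmax(torch.randn(32, 4), dim=-1)
expert_index = gate_probs.argmax(dim=-1)
print(load_balancing_loss(gate_probs, expert_index, n_experts=4))
```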
Mixture of Experts: Balancing and Specialization v2
● Random dispatch: use the 2nd expert proportionally to the softmax gate probability.
● Have a frequency cutoff (a token budget) for each expert. If this budget is exceeded, the expert degenerates to a zero matrix. This effectively reduces the output of the MoE layer to zero, and thus only the residual connection output around the MoE layer is fed to the next layer.
267
Mixture of Experts: Balancing and Specialization
2. Underbalancing: The same top-k experts are used for every token. This leads to two strong
experts, but all other experts do not learn anything and are “wasted capacity”.
3. Sequence-level degeneration: Model balances experts by using each expert for a particular
sequence index. For example, for indices 0, 1, 2, 3 always experts E3, E1, E2, E0. This leads to
sequence experts, but not content experts.
268
Mixture of Experts: Benefits
Closing Notes
270
Why we should strive for efficiency
Our field has seen a dramatic increase in scale in the past 2 years.
1) Costs
2) Accessibility
3) Production needs
271
Closing Notes
In this tutorial, we covered a wide range of ideas, applications, and practical considerations that help us build more efficient systems, including:
272
We hope you enjoyed it and learned something new!
273
Thank you!
Slides available at: bit.ly/2SmhKY7
274