
High Performance Natural Language Processing
EMNLP 2020

1
Presenters

Gabriel Ilharco (University of Washington), Cesar Ilharco (Google), Iulia Turc (Google)
Tim Dettmers (University of Washington), Felipe Ferreira (Google), Kenton Lee (Google)
2
High Performance NLP
Slides available at: bit.ly/2SmhKY7

3
01 Introduction
Agenda
02 Fundamentals

03 Core Techniques

04 Efficient Attention

05 Case Studies

06 Scaling in Practice

07 Closing Notes
4
01

Introduction

5
Motivation & Applications

Why do we need it ?
SCALE

● NEWS

○ Real-time: the majority of content is consumed within a few hours after publication [1]
○ Thousands of news articles per second
○ 40-80 sentences per article
● SOCIAL NETWORKS: ~6 thousand tweets per second [2]

● THE WEB: Orders of magnitude bigger

What could we do, if we had it?

[1] Tatar, A., Antoniadis, P., Amorim, M.D.d. et al. From popularity prediction to ranking online news. Soc. Netw. Anal. Min. 4, 174 (2014). https://doi.org/10.1007/s13278-014-0174-8
[2] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.html
6
Summarization

7
Summarization

8
Facts Extraction

Highest-ever drafted Black player in history.


9
Facts Extraction

● Noteworthy Facts

● Trendiness

VS

10
Sentence Entailment

11
Recent years in Natural
Language Processing

12
Benchmarks through the years - SQuAD 1.1

Human Performance
91.2

The Stanford Question Answering Dataset, https://rajpurkar.github.io/SQuAD-explorer/


13
Benchmarks through the years - SQuAD 2.0

Human Performance
89.5

The Stanford Question Answering Dataset, https://rajpurkar.github.io/SQuAD-explorer/


14
Benchmarks through the years - GLUE

Human Performance

87.1

The GLUE Benchmark (Wang et al., 2018)


15
A brief recent history of scale in NLP

[Chart: GPT (110M), BERT (340M)]

“[...] scaling to extreme model sizes also leads to large improvements [...]”
(Devlin et al., 2018)

16
A brief recent history of scale in NLP

[Chart: Transformer, GPT (110M), BERT (340M), ELMo (465M), GPT-2 (1.5B), Grover (1.5B), MegatronLM (8.3B), T5 (11B)]

“[...] scaling the model size to 11 billion parameters was the most important ingredient for achieving our best performance.”
(Raffel et al., 2019)

17
A brief recent history of scale in NLP

[Chart: GPT (110M), BERT (340M), GPT-2 (1.5B), T5 (11B), Turing-NLG (17B), GPT-3 (175B), GShard (600B), DeepSpeed (1T)]

18
Scaling Laws

Henighan et al., 2020


19
The drawbacks of naive scaling

1) Disconnect with production systems

- Latency
- Hardware constraints
- Energy costs

Memory CPU/GPU/TPU Storage Battery

20
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

- Hardware

- 2048 TPU v3 accelerators (GShard, Lepikhin et al., 2020)

- 285,000 CPU cores, 10,000 GPUs (GPT-3, Brown et al., 2020)

- Financial

- GPT-3 training cost is estimated at 4.6 million dollars.

21
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

3) Accessibility

- Ever-larger hardware and financial requirements impose great barriers to many researchers and institutions

- This can have a serious impact on our research community

- For instance, 62% of PhD students have access to 4 or fewer GPUs, according to a recent poll.

22
The drawbacks of naive scaling

1) Disconnect with production systems

2) Costs

3) Accessibility

Altogether, this is especially relevant to a field that scaled by


3 orders of magnitude in 2 years.

23
We should strive for
efficiency

24
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

Source: unsplash.com

25
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

- Quantization

Source: unsplash.com

26
Towards more efficient NLP

1) Core techniques

- Knowledge Distillation

- Quantization

- Pruning

Source: unsplash.com

27
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

28
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

29
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

- Kernels and Alternative Attention Mechanisms

30
Towards more efficient NLP

1) Core techniques

2) Efficient attention

- Data-Independent Patterns

- Data-Dependent Patterns

- Alternative Attention Mechanisms

- Recurrence

31
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

- Efficient Language Models

32
Source: unsplash.com
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

- Efficient Language Models

- Retrieval

Source: unsplash.com
33
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models

34
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models


- Parallelism Techniques

Source: Microsoft Blog Post

35
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models

- Parallelism Techniques
- Methods to Reduce Memory Footprint

[Figure: CPU <-> GPU memory swapping, with the active layer swapped in to the GPU]

36
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

4) Scaling in Practice

- Scaling Laws of Neural Language Models


- Parallelism Techniques
- Methods to Reduce Memory Footprint
- Mixture of Experts

37
02

Fundamentals

38
Sequence-to-sequence models

outputs

piles of sequential,
differentiable tensor
operations

inputs

https://xkcd.com/1838/
39
Sequence-to-sequence models

Machine Translation:   inputs: Hello, world            outputs: Olá, mundo
Sentiment Analysis:    inputs: Amazing movie, 10/10!   outputs: ★★★★★
Language modeling:     inputs: -                       outputs: The quick brown fox...
Speech recognition:    inputs: [audio]                 outputs: Hello, world

[Figure: piles of sequential, differentiable tensor operations map inputs to outputs]

40
RNNs

RNNs allow computations over sequences of arbitrary length

[Figure: a chain of RNN cells (tanh) forms the pile of sequential, differentiable tensor operations mapping inputs to outputs]
41
Encoders and
Decoders

RNNs allow computations over sequences of arbitrary length

[Figure: an Encoder chain of tanh RNN cells reads the inputs; a Decoder chain of tanh RNN cells produces the outputs]
42
The encoder-decoder bottleneck

[Figure: an English RNN encoder reads "The agreement on the European Economic Area was signed in August 1992 . <EOS>" and passes a single hidden state (an information bottleneck) to a French RNN decoder producing "l’ accord sur la zone économique européenne a été signé en août 1992 . <EOS>"]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

43
Attention

[Figure: when generating "été", an attention head over the English encoder states lets the French decoder attend to the relevant words of "The agreement on the European Economic Area was signed in August 1992 . <EOS>"]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

44
Attention
A summary of values, based on how similar their corresponding keys are with the query

Bahdanau et al. Neural machine translation by jointly learning to align and translate. 2014

Thang Luong et al. Effective approaches to attention-based neural machine translation. 2015

[Figure: attention head]

45
Dot product
attention

Thang Luong et al.


Effective approaches to
attention-based neural machine
translation. 2015

...

Attention head

46
Attention
mechanisms

[Figure: the attention matrix between the English sentence "Are you going to the hotel ?" and the Spanish sentence "¿ Vas al hotel ?"]

47
Transformers

48
Motivation

[Figure: the RNN encoder-decoder with an attention head from the previous example]

Example derived from Bahdanau et al., 2014 (https://arxiv.org/pdf/1409.0473.pdf)

49
Motivation

[Figure: recurrent (tanh) cells must process the sequence sequentially, one position after another, while attention processes all positions in parallel]

50
Scaled Dot-Product Attention

A summary of values, based on how similar their corresponding keys are with the query

Queries, keys and values

51
Scaled
Dot-Product
Attention

Queries, keys and values

For some similarity


function

52
Scaled Dot-Product Attention

Using dot-product similarity, we can vectorize nicely

= feature dim; normalization by its square root is for numerical stability

53
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

54
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

55
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

56
Scaled
Dot-Product
Attention

Let’s dive into the dimensions


(batch omitted for simplicity)

= sequence length
= feature dim

57
Scaled Dot-Product Attention

quadratic in sequence length!

Let’s dive into the dimensions (batch omitted for simplicity)

= sequence length
= feature dim
58
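As a concrete reference for the computation above, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V; the shapes and variable names are illustrative, and the batch dimension is omitted as in the slides.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [n, d] with n = sequence length, d = feature dim
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # [n, n]: the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # [n, d]

# Tiny example: 4 tokens with 8-dimensional features
n, d = 4, 8
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)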
Multi-head
attention
Scaled Dot-Product
Attention

= sequence length
= feature dim
= # of attention heads
linear linear linear

59
Multi-head
attention
Scaled Dot-Product
Attention

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

60
Multi-head
attention
Scaled Dot-Product
Attention each head:

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

61
Multi-head
attention
linear

Scaled Dot-Product
Attention each head:

= sequence length
each head:
= feature dim
= # of attention heads
linear linear linear

62
Multi-head attention

[Figure: queries, keys and values each pass through a linear layer into several scaled dot-product attention heads, whose outputs are concatenated and passed through a final linear layer]

bottleneck is quadratic in sequence length due to QK^T!

= sequence length
= feature dim
= # of attention heads

63
Positional encodings

So far, attention has been a set operation.

Let’s add positional information!

[Figure: positional encodings are added to the inputs before the tanh/attention block]

64
Positional encodings

So far, attention has been a set operation. Let’s add positional information!

These can be either learned or fixed.

Fixed: for a position in the sequence and an index in the feature space

[Figure: sinusoidal positional encodings visualized over position and depth]

65
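A minimal NumPy sketch of the fixed (sinusoidal) encodings, following the standard formulation of Vaswani et al., 2017; max_len and d_model are illustrative parameters.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # pos: position in the sequence; i: index in the feature space (depth)
    pos = np.arange(max_len)[:, None]                    # [max_len, 1]
    i = np.arange(d_model)[None, :]                      # [1, d_model]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even feature indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd feature indices: cosine
    return pe                                            # added to the input embeddings

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)   # shape (128, 64)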
The transformer encoder

add & norm

dense

add & norm xN

multi-head
attention

positional
encoding

input
embedding Ba et al, 2016.

Vaswani et al., 2017

66
The transformer decoder

add & norm

dense

add & norm

multi-head
attention

K and V xN
from encoder

add & norm

masked
multi-head
attention
prevent model from
peeking at the
future by masking
positional
encoding attention weights
Ba et al, 2016.

Vaswani et al., 2017 input


embedding
67
Putting it all together

add & norm

dense

add & norm

multi-head
attention
add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

Vaswani et al., 2017

68
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

https://ai.facebook.com/blog/using-neural-networks-to-solve-advanced-mathematics-equations/
69
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

https://magenta.tensorflow.org/music-transformer
70
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

71
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

- Vision and Language (e.g. Tan et al., 2019, Lu et al., 2019, Chen et al., 2020)

Visual Question Answering


(Agrawal et al., 2015)
72
Transformers in recent literature
Transformers have become successful in a wide range of domains and applications, including:
- Mathematics and theorem proving (e.g. Lample et al., 2019, Clark et al., 2020)

- Music generation (e.g. Anna Huang et al., 2019)

- Biology (e.g. Rives et al., 2019, Madani et al., 2020)

- Vision and Language (e.g. Tan et al., 2019, Lu et al., 2019, Chen et al., 2020)

- Computer Vision (e.g. Ramachandran et al., 2019, Dosovitskiy et al., 2020)

73
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

74
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus
(e.g. web crawled data)
- Typically unsupervised
(e.g. masked language
modeling)
- Usually runs in GPUs or
TPUs

75
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs in GPUs or TPUs

Fine-tuning
- Smaller corpus
- Typically supervised (e.g. question answering, natural language inference)
- Usually runs in GPUs or TPUs

76
Transformers in NLP

Transformers are ubiquitous in NLP.

Large-scale pre-training has been enormously successful (e.g. BERT, ALBERT, T5, GPT-3).

Models are typically used in 3 scenarios:

Pre-training
- Large corpus (e.g. web crawled data)
- Typically unsupervised (e.g. masked language modeling)
- Usually runs in GPUs or TPUs

Fine-tuning
- Smaller corpus
- Typically supervised (e.g. question answering, natural language inference)
- Usually runs in GPUs or TPUs

Production
- Inference
- Usually runs in CPUs, sometimes in mobile devices

77
03

Core Techniques

78
Knowledge
Distillation

Source: unsplash.com

79
Knowledge
Distillation

Hinton et al., 2015
Distilling the Knowledge in a Neural Network

[Figure: the Teacher produces soft labels (x, y=0.8) for inputs x from the Data; the Student is trained on these soft labels together with the hard labels (x, y=1.0)]

80
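A minimal PyTorch sketch of the distillation objective from Hinton et al., 2015: the student is trained to match the teacher's softened predictions in addition to the hard labels. The temperature T and mixing weight alpha are illustrative hyperparameters, and the logits are assumed to come from teacher and student models defined elsewhere.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between the softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Tiny example with random logits for a 10-class problem
s, t = torch.randn(16, 10), torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
loss = distillation_loss(s, t, y)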
Knowledge Distillation for Pre-training

Sanh et al., 2019
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[Figure: the Teacher and Student both read masked inputs ("how [MASK] you") from the Data; the Student is trained to match the Teacher's predictions ("are", "do", "well", ...)]

81
Knowledge Distillation for Pre-training

Sun et al., 2019
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

[Figure: besides matching the Teacher's masked-token predictions, the Student also mimics the Teacher via feature map transfer and attention transfer]

82
Knowledge Distillation for Fine-Tuning

Turc et al., 2019
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

1. Regular pre-training of the Student (masked language modeling, e.g. "how [MASK] you", on unlabeled Data)
2. Fine-tuning via distillation: the Student learns from the Teacher's soft labels (x, y=0.8)
3. (Optional) regular fine-tuning of the Student on the hard labels (x, y=1.0)
83
Knowledge Distillation for Pre-training and Fine-tuning

Jiao et al., 2019
TinyBERT: Distilling BERT for Natural Language Understanding

1. Pre-training via distillation: the Student mimics the Teacher on masked inputs ("how [MASK] you"), with per-layer transfer and embeddings transfer
2. Fine-tuning via distillation: the Teacher's soft labels (x, y=0.8), per-layer transfer and embeddings transfer guide the Student on the labeled data (x, y=1)
84
Quantization

Source: unsplash.com

85
Quantization Definition

Q(z) = q_j for z ∈ (t_j, t_{j+1}], j = 0, …, 2^k - 1

z: real-valued tensor (activation or weight); Q: quantization operator; k: quantization precision (bits)

86
Quantization Definition

Q(z) = q_j for z ∈ (t_j, t_{j+1}], j = 0, …, 2^k - 1

z: real-valued tensor (activation or weight); Q: quantization operator; k: quantization precision (bits)

Linear Quantization

z = S (q_j - Z)

S: scaling factor; Z: zero point
87
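A minimal NumPy sketch of linear quantization with the scaling factor S and zero point Z defined above; the min/max calibration used here is just one simple way to pick S and Z.

import numpy as np

def linear_quantize(z, k=8):
    # Map real values z to integers q in [0, 2^k - 1] such that z ≈ S * (q - Z)
    qmax = 2 ** k - 1
    S = (z.max() - z.min()) / qmax                 # scaling factor
    Z = int(round(-z.min() / S))                   # zero point
    q = np.clip(np.round(z / S + Z), 0, qmax).astype(np.int32)
    return q, S, Z

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

w = np.random.randn(4, 4).astype(np.float32)       # a real-valued weight tensor
q, S, Z = linear_quantize(w, k=8)
w_hat = dequantize(q, S, Z)                        # approximate reconstruction of w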
Quantization: Quantization-Aware Training

Forward pass on the quantized weights ŵ; backward pass (and weight updates) on the real-valued weights w

Jacob et al., 2017
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

88
Quantization

● Q8BERT (Zafrir et al., 2019, Q8BERT: Quantized 8Bit BERT): symmetric linear quantization:
Q(z) = clamp(⌊z ✕ S_z⌉, -127, +127), where S_z is a statistic computed during or post-training.
● Q-BERT (Shen et al., 2019, Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT): uniform quantization to {0, …, 2^k - 1} with:
○ mixed precision (higher Hessian spectrum => higher precision for the layer)
○ group precision (each matrix W_k, W_q, W_v, W_o is its own group)

89
Quantization
with Distillation

Zhang et al., 2020


TernaryBERT: Distillation-aware Ultra-low Bit
BERT

90
Pruning

Source: unsplash.com

91
Pruning Definition

Pruning removes “unimportant” weights from a network:

a = (W ⊙ M) x

a: activation; W: model weight; M: pruning mask; x: input

Main Questions (Hassibi and Stork)

● Which weights should be eliminated?
● How should the remaining weights be adjusted?
● How can such network pruning be done in an efficient way?

92
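A minimal NumPy sketch of unstructured magnitude pruning: a binary mask M keeps only the largest-magnitude entries of W, and the activation is computed as a = (W ⊙ M) x. The 90% sparsity level is illustrative.

import numpy as np

def magnitude_prune_mask(W, sparsity=0.9):
    # Keep the (1 - sparsity) fraction of weights with the largest magnitude
    keep = int(np.ceil(W.size * (1.0 - sparsity)))
    threshold = np.sort(np.abs(W), axis=None)[-keep]
    return (np.abs(W) >= threshold).astype(W.dtype)       # the pruning mask M

W = np.random.randn(256, 256).astype(np.float32)          # model weight
x = np.random.randn(256).astype(np.float32)               # input
M = magnitude_prune_mask(W, sparsity=0.9)
a = (W * M) @ x                                           # activation with pruned weights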
Pruning, Early Work: Pruning based on second-order derivatives

LeCun et al., 1990 (OBD: Optimal Brain Damage)
Hassibi and Stork, 1993 (OBS: Second order derivatives for network pruning: Optimal Brain Surgeon)

Main idea:
● Start with a “reasonably large” network
● Train it to convergence
● Prune in multiple iterations, based on second-order derivatives:
○ OBD: prune and train
○ OBS: prune and update weights based on second-order statistics

93
Pruning, Early Work: Pruning based on second-order derivatives

LeCun et al., 1990 (OBD: Optimal Brain Damage)
Hassibi and Stork, 1993 (OBS: Second order derivatives for network pruning: Optimal Brain Surgeon)

Main idea:
● Start with a “reasonably large” network
● Train it to convergence
● Prune in multiple iterations, based on second-order derivatives:
○ OBD: prune and train
○ OBS: prune and update weights based on second-order statistics

Why do we not train this smaller architecture instead?

94
Pruning
The LTH
Searching for Tickets: One-Shot Magnitude Pruning

Frankle and Carbin, 2018


The Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 95
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning

Frankle and Carbin, 2018


The Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 96
Pruning
The LTH
Searching for Tickets: Iterative Magnitude Pruning with Rewinding

Frankle et al, 2019


Stabilizing the Lottery Ticket Hypothesis

Source: https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/ 97
Pruning
The LTH, ctd

Brix et al., 2020


Successfully Applying the Stabilized Lottery
Ticket Hypothesis to the Transformer
Architecture

MP = Magnitude Pruning
LT = Lottery Ticket
SLT = Stabilized Lottery Ticket
CLT = Constant Lottery Ticket

98
Pruning: Movement Pruning

Sanh et al., 2020
Movement Pruning: Adaptive Sparsity by Fine-Tuning

● First-order strategy: “instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process”
● The pruning mask M is learnt together with the model parameters.
○ hard version: M = Top_v(S), where the score S is learnt and v is a hyperparameter.
○ soft version: M = (S > τ), where the score S is learnt and the threshold τ is a hyperparameter.

99
Pruning & Hardware

Hooker, 2020
The Hardware Lottery

On standard hardware:

              Unstructured Pruning    Structured Pruning
Storage       ✅                       ✅
Inference     ❌                       ✅
Flexibility   ✅                       ❌

● “In many ways, hardware is catching up to the present state of ML research.”
● There is research for specialized software kernels to support unstructured sparsity (see paper for references).

100
04

Efficient Attention

101
Recap
add & norm

The Transformer architecture dense

add & norm

multi-head
attention
add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

102
Recap
add & norm

The Transformer architecture dense

add & norm

Quadratic bottleneck in sequence multi-head


attention
length due to multi-head attention add & norm
xN
dense

add & norm add & norm

xN
masked
multi-head
multi-head
attention
attention

103
Recap
add & norm

The Transformer architecture dense

add & norm

Quadratic bottleneck in sequence multi-head


attention
length due to multi-head attention add & norm
xN
dense

This poses a serious problem when add & norm add & norm
large sequences are required, e.g.:
xN
masked
multi-head
multi-head
● Long-range dependencies attention
attention
● Character-level models
● Speech processing
● High-resolution image
processing

104
Efficient Attention

In the past months, there has been much progress in making self-attention more efficient

time

Sparse Transformer Routing Transformer Linformer Big Bird


(Child et al., 2019) (Roy et al, 2020) (Wang et al., 2020) (Zaheer et al., 2020)

Reformer Performer Linear Transformer


(Kitaev et al., 2020) (Choromanski et al., 2020) (Katharopoulos et al., 2020)

105
Efficient Attention

In the past months, there has been much progress in making self-attention more efficient

time

Sparse Transformer Routing Transformer Linformer Big Bird


(Child et al., 2019) (Roy et al, 2020) (Wang et al., 2020) (Zaheer et al., 2020)

Reformer Performer Linear Transformer


(Kitaev et al., 2020) (Choromanski et al., 2020) (Katharopoulos et al., 2020)

We are going to cover some ideas that make this possible

106
Beyond a Dense Attention Matrix

Goal: approximate the computation of attention via more efficient operations

[Figure: a dense Queries x Keys attention matrix]

107
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns

○ Blockwise Transformer (Qiu et al., 2019)


○ Sparse Transformer (Child et al., 2019)
○ Longformer (Beltagy et al., 2020)
○ Big Bird (Zaheer et al., 2020)

Taxonomy inspired by Tay et al., 2020


108
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns

○ Linformer (Wang et al., 2020)


○ Reformer (Kitaev et al., 2020)
○ Routing Transformer (Roy et al., 2020)
○ Clustered Attention (Vyas et al., 2020)
○ Sinkhorn Transformer (Tay et al., 2020)
Taxonomy inspired by Tay et al., 2020
109
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns
● Kernels and Alternative Attention Mechanisms

○ Linear Transformer (Katharopoulos et al., 2020)


○ Random Feature Attention (Anonymous, 2020)
○ Performer (Choromanski et al., 2020)
○ Synthesizer (Tay et al., 2020)

Taxonomy inspired by Tay et al., 2020


110
Efficient Attention

A wide range of recent techniques!

● Data-Independent Patterns
● Data-Dependent Patterns
● Alternative Attention Mechanisms
● Recurrence

○ Transformer XL (Dai et al., 2019)


○ Compressive Transformers
(Rae et al., 2019)

Taxonomy inspired by Tay et al., 2020


111
Data-Independent Patterns

112
Data-Independent Patterns
Keys

Blockwise Patterns Queries

Divide sequence into local blocks and


restrict attention within them

Examples:

Blockwise Transformer (Qiu et al., 2019)

Local Attention (Parmar et al., 2018)

113
Data-Independent Patterns
Keys

Strided Patterns Queries

Skip some query/key pairs.

Quadratic in sequence length / stride

Examples:

Sparse Transformer (Child et al., 2019)

Longformer (Beltagy et al, 2020)

114
Data-Independent Patterns
Keys

Diagonal Patterns Queries

Compute attention over the diagonal.

Linear in sequence length and window


size.

Examples:

Longformer (Beltagy et al, 2020)

Big Bird (Zaheer et al., 2020)

115
Data-Independent Patterns
Keys

Random Patterns Queries

Compute attention over random query/key


pairs.

Linear in number of points.

Examples:

Big Bird (Zaheer et al., 2020)

116
Data-Independent Patterns
Keys

Global Attention Queries

Applied to one or a few special tokens,


often prepended to the sequence.

Usually combined with other patterns

Examples:

Big Bird (Zaheer et al., 2020)

Longformer (Beltagy et al., 2020)

ETC (Ainslie et al., 2020)

117
Data-Independent Patterns
Keys

Combination of Patterns Queries

Combine multiple patterns

(e.g. Global + Diagonal + Random)

Examples:

Big Bird (Zaheer et al., 2020)

Longformer (Beltagy et al., 2020)

118
Data-Dependent Patterns

119
Data-Dependent Patterns

Buckets
Keys
Create buckets/clusters and compute
Queries
attention within.

Ideally, buckets should contain the


highests attention weights in the matrix

Examples:

Reformer (Kitaev et al., 2020)

Routing Transformer (Roy et al., 2020)


Attention
head

120
Data-Dependent Patterns

Buckets: Hashing

Locality-Sensitive Hashing (LSH)


Key idea: take a random projection

matrix , compute hash for a

vector through:

Examples:

Reformer (Kitaev et al., 2020)

121
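A minimal NumPy sketch of the LSH bucketing scheme used by the Reformer, hash(x) = argmax([xR ; -xR]) for a random projection matrix R; the number of buckets and the shapes here are illustrative. Positions that fall into the same bucket then attend to each other.

import numpy as np

def lsh_bucket(x, n_buckets, seed=0):
    # x: [n, d] query/key vectors; n_buckets must be even
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))   # random projection matrix
    proj = x @ R                                             # [n, n_buckets // 2]
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

x = np.random.randn(16, 64)
buckets = lsh_bucket(x, n_buckets=8)    # one bucket id per position; similar vectors tend to collide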
Data-Dependent Patterns

Buckets: Clustering

E.g. online k-means

Examples:

Routing Transformer (Roy et al., 2020)

Clustered Attention (Vyas et al., 2020)

122
Data-Dependent Patterns

Sorting and blocking

E.g. Sparse Sinkhorn Attention Queries


Key ideas:

● A differentiable sorting network


that learns to rearrange blocked Sorted keys
inputs, using the Sinkhorn
balancing mechanism to create a
permutation matrix
● Attention is computed only on local
neighborhoods (before and after
sorting)
Keys
Examples:

Sinkhorn Transformer (Tay et al., 2020)

123
Data-Dependent Patterns
Keys

Compression
Compressed Keys

E.g. pooling, strided convolution, low-rank


projections with learnable weights
Queries

Examples:

Compressed Attention (Liu et al., 2018)

Linformer (Wang et al., 2020)

Synthesizers (Tay et al., 2020)

124
Kernels and Alternative
Attention Mechanisms

125
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

Standard transformers use dot


product attention:

Attention head

126
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

Attention head

127
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

128
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

129
Kernels and Alternative Attention Mechanisms

Kernels

Recap: attention in its general


form uses a similarity function

However, we can simplify things


with a decomposable kernel:

Independent of query!

130
Kernels and Alternative Attention Mechanisms

Kernels

This allows us to compute attention in


linear time!
In Katharopoulos et al., 2020:

Independent of query!

131
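A minimal NumPy sketch of this linearized attention with the feature map φ(x) = elu(x) + 1 used by Katharopoulos et al., 2020; shapes are illustrative, and batching and causal masking are omitted.

import numpy as np

def phi(x):
    # Positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: [n, d], V: [n, d_v]; cost is linear in the sequence length n
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # [d, d_v], independent of the query
    Z = Kp.sum(axis=0)                 # [d], the normalizer, also query-independent
    return (Qp @ KV) / (Qp @ Z)[:, None]

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)        # shape (1024, 64); no n x n attention matrix is ever formed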
Kernels and Alternative Attention Mechanisms

Kernels
Random Feature Attention (Anonymous, 2020)

Random features can be used to generate an


unbiased estimation of the standard softmax
function

Independent of query!

132
Kernels and Alternative Attention Mechanisms

Performer: Generalized Attention and FAVOR (Choromanski et al., 2020)

Rethink attention as , parametrized by a kernel

and functions and

This work presents an unbiased, low-variance approximation of attention via random feature map
decompositions, with linear time and space complexity.

133
Kernels and Alternative Attention Mechanisms

Synthesizers (Tay et al., 2020)

Are token-to-token interactions really


necessary?

Random attention matrices are surprisingly competitive!

Low-rank alternatives can be used

134
Recurrence

135
Recurrence

Transformer-XL (Dai et al., 2019)

How can models process long sequences


under limited hardware constraints?

A naive approach is to split the sequence


into multiple smaller ones and process
them separately

Current
segment

136
Recurrence

Transformer-XL (Dai et al., 2019)

A better way is to add a segment-level


recurrence mechanism

Representations from the previous segment


are cached and re-used (no gradients
flowing at training)

This increases receptive field proportionally


to the depth of the transformer

Previous Current
segment (fixed) segment

137
Recurrence

compression
Compressive Transformers
(Rae et al., 2019)

Dual memory system:

- Primary mem. contains activations


from previous segment

- Secondary mem. stores compressed activations from all previous segments

Secondary Primary Current


(compressed) memory segment
memory

138
Recurrence

compression
Compressive Transformers
(Rae et al., 2019)

When a new segment comes:

- Primary memory is updated with


activations from previous segment

- Secondary memory is updated with


the activations from the primary
memory, where a compression
function is applied (e.g. pooling,
convolutions, most used)
Secondary Primary Current
(compressed) memory segment
memory

139
Overview

140
Benchmarking

How do these models compare in practice?

The Long-Range Arena: a benchmark for


efficient transformers

141
Benchmarking

How do these models compare in practice? List operations example:

The Long-Range Arena: a benchmark for


efficient transformers

Longer sequences: 1K-16K

5 tasks:

- List operations (e.g. max, min, median)


- Byte-level text classification
- Byte-level document retrieval
- Image classification
- Long-range spatial dependency

142
Benchmarking

How do these models compare in practice? List operations example:

The Long-Range Arena: a benchmark for


efficient transformers
Long-range spatial dependency example:
Longer sequences: 1K-16K

5 tasks:

- List operations (e.g. max, min, median)


- Byte-level text classification
- Byte-level document retrieval
- Image classification
- Long-range spatial dependency

(positive) (negative)

143
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

144
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

145
Benchmarking

The Long-Range Arena: a benchmark for efficient transformers

146
Benchmarking

The Long-Range Arena: a benchmark for


efficient transformers

Putting it all together (size of circles


corresponds to memory footprint)

Note: these results might be sensitive to


implementation details, hardware and
hyper-parameters.

147
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

148
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

149
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

Future improvements in hardware, e.g. on the efficiency of sparse computations, may make these ideas
even more appealing in the long run (Hooker, 2020)

150
Key Takeaways

There has been a surge in ideas for improving the efficiency of attention and transformers, especially for
improving their capacity to handle long sequences.

There has been good progress in recent months: we are now able to compute attention in linear time with
respect to sequence length, leading to large speed improvements without large performance drops for
large sequences.

Future improvements in hardware, e.g. on the efficiency of sparse computations, may make these ideas
even more appealing in the long run (Hooker, 2020)

The ideas presented in this section are often orthogonal to each other and to other efforts presented in
this tutorial, and can be combined for more efficient models.

151
05

Case Studies

152
Efficient
Language models

153
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

Source: unsplash.com

154
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


NAS
- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer


(Neural Architecture
Search)

155
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

Source: unsplash.com

156
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

Source: unsplash.com

Source: unsplash.com

157
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

Source: unsplash.com

158
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

Source: unsplash.com

159
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


NAS
- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer


(Neural Architecture
Search)
- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

- Hardware-Aware Transformers

160
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

- Natural Language Processing with Small Feed-Forward Networks
- The Evolved Transformer
- PRADO + pQRNN
- MobileBERT
- Lite Transformer with Long-Short Range Attention
- MicroNet for Efficient Language Modeling
- Hardware-Aware Transformers
- SqueezeBERT

[Figure: PointwiseFullyConnected (more expensive operation) is equivalent to Convolution, implemented as GroupedConvolution (more efficient operation)]

161
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models


- Natural Language Processing with Small Feed-Forward Networks

- The Evolved Transformer

- PRADO + pQRNN

- MobileBERT

- Lite Transformer with Long-Short Range Attention

- MicroNet for Efficient Language Modeling

- Hardware-Aware Transformers

- SqueezeBERT

- DeLighT: Very Deep and Light-weight Transformer


162
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

● Useful accuracies on a variety of tasks
● Great runtime and memory value in resource-constrained environments
● Features defined over character n-grams, embeddings learned from scratch
● Random feature mixing (hashing) for small feature vocabularies
● Quantization for embedding weight compression

163
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

[Figure: discrete features -> feature embedding matrix -> reshaping -> ReLU-activated fully connected layer -> softmax]

164
Natural Language Processing with Small Feed-Forward Networks

Botha et al., 2017
arxiv.org/abs/1708.00214

Example result: POS Tagging, compared to BTS (Gillick et al., 2016)
● +0.3% accuracy (95.4%, near state-of-the-art)
● 6x fewer parameters
● 36x fewer FLOPs

165
The Evolved Transformer

So et al., 2019
arxiv.org/abs/1901.11117

● Consistent improvement over Transformer on well-established WMT and LM1B.
● NAS to search Transformer alternatives
● Large search space from feed-forward sequence models
● Evolutionary architecture search

166
The Evolved
Transformer

So et al., 2019
arxiv.org/abs/1901.11117

167
The Evolved
Transformer

So et al., 2019
arxiv.org/abs/1901.11117

168
The Evolved Transformer

So et al., 2019
arxiv.org/abs/1901.11117

Same quality as the original “big” Transformer with 37.6% fewer parameters, and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of ~7M params.

169
PRADO + pQRNN

Kaliamoorthi et al., 2019-2020
www.aclweb.org/anthology/D19-1506/
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

PRADO: Projection Attention Networks for Document Classification On-Device
● Combines trainable projections with attention and convolutions
● With only 200 kilobytes in size, outperformed prior CNN and LSTM models and achieved near state-of-the-art performance on multiple long document classification tasks.

170
PRADO +
pQRNN

Kaliamoorthi et al., 2019-2020


www.aclweb.org/anthology/D19-1506/

https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

171
PRADO + pQRNN

Kaliamoorthi et al., 2019-2020
www.aclweb.org/anthology/D19-1506/
https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

pQRNN
● A projection layer with a quasi-RNN encoder
● Same projection layer used in PRADO
● pQRNN is also quantized

172
PRADO +
pQRNN

Kaliamoorthi et al., 2019-2020


www.aclweb.org/anthology/D19-1506/

https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html

173
MobileBERT

Sun et al., 2020
arxiv.org/abs/2004.02984

● Designed for running on mobile phones with acceptable latency
● Inverted-Bottleneck BERT-LARGE teacher
● Distilled into a compact MobileBERT student
● As deep as BERT-LARGE, but narrower
● Task-agnostic compression (task-specific fine-tuning performed directly on the compact model)

174
MobileBERT

Sun et al., 2020


arxiv.org/abs/2004.02984

175
MobileBERT

Sun et al., 2020


arxiv.org/abs/2004.02984

176
MobileBERT

Sun et al., 2020
arxiv.org/abs/2004.02984

● 4.3x smaller, 5.5x faster than BERT-BASE
● 77.7 GLUE score ~ BERT-BASE
● 90.0/79.2 SQuAD v1.1/v2.0 F1 ~ BERT-BASE
● 62 ms latency on a Pixel 4 phone

177
Lite Transformer with Long-Short Range Attention

Wu et al., 2020
arxiv.org/abs/2004.11886

● Long-Short Range Attention (LSRA)
○ Local context modeling by convolution
○ Long-distance modeling by attention
● 2.5× reduced computation vs Transformer base
● 18.2× smaller with pruning and quantization
● 0.5 higher BLEU compared to the Evolved Transformer, without the 250 GPU-year NAS cost.

178
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

NeurIPS 2019 MicroNet Challenge [1]

Language modeling track: train efficient word-level language models on the Wikitext-103 dataset [2] (word-level perplexity < 35)

Score = Normalized Parameter Storage + Normalized Math Operations
(normalized by the LSTM of Rae et al., 2018, arxiv.org/abs/1803.10049)

[1] Gale et al., 2019, micronet-challenge.github.io
[2] Merity et al., 2016, arxiv.org/abs/1609.07843

179
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

● Core Language Model
○ Transformer-XL
○ Short Context Group Joint Optimization
○ Adaptive Embedding and Softmax
○ Hebbian Updates
● Compression Techniques
○ Knowledge Distillation
○ Pruning
○ Quantization

180
MicroNet for Efficient Language Modeling

Yan et al., 2020
arxiv.org/abs/2005.07877

● 90-fold reduction in parameter size and a 36-fold reduction in math operations compared to the MicroNet baseline

181
Hardware-Aware Transformers

Wang et al., 2020
arxiv.org/abs/2005.14187

● Neural Architecture Search
○ Train a SuperTransformer to cover a large space
○ Evolutionary search with a hardware latency constraint to find a specialized SubTransformer
● Speedup and smaller size over the baseline Transformer, and low search cost

182
Hardware-Aware
Transformers

Wang et al., 2020


arxiv.org/abs/2005.14187

183
Hardware-Aware Transformers: SubTransformer search

Wang et al., 2020
arxiv.org/abs/2005.14187

● Evolutionary search
● Find a satisfactory SubTransformer given a latency requirement
● Latency predictor trained for offline latency estimation (fast and accurate)

184
Hardware-Aware Transformers

Wang et al., 2020
arxiv.org/abs/2005.14187

WMT’14 results on Raspberry Pi-4:
● 3× speedup, 3.7× smaller size over the baseline Transformer
● 2.7× speedup, 3.6× smaller size over the Evolved Transformer, with 12,041× less search cost

185
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Replace several operations in self-attention layers with grouped convolutions
● Much faster inference on mobile devices

186
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Previous takeaways from CV into NLP (already adopted in MobileBERT)
○ Bottleneck layers
○ High-information flow residual connections
● New contributions from CV incorporated into SqueezeBERT’s self-attention
○ Convolutions
○ Grouped convolutions

187
SqueezeBERT

Iandola et al., 2020
arxiv.org/abs/2006.11316

● Results
○ 4.3x faster than BERT-base (while MobileBERT is reported as 3.0x faster than BERT-base) on a Pixel 3 phone.
○ GLUE score 76.9 (vs 79.0 for BERT-base)

188
DeLighT: Very Deep and Light-weight Transformer

Mehta et al., 2020
arxiv.org/abs/2008.00623

● More efficient parameter allocation within and across Transformer blocks
● Similar performance with substantially fewer parameters compared to baseline transformers.

189
DeLighT: Very
Deep and
Light-weight
Transformer

Mehta et al., 2020


arxiv.org/abs/2008.00623

190
DeLighT: Very
Deep and
Light-weight
Transformer

Mehta et al., 2020


arxiv.org/abs/2008.00623

191
Retrieval

192
Towards more efficient NLP

1) Core techniques
prediction
2) Efficient attention

3) Case studies .

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks
Encoder Encoder

193
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks

- Generalization through Memorization: Nearest Neighbor Language Models

Source: unsplash.com

194
Towards more efficient NLP

1) Core techniques

2) Efficient attention

3) Case studies

a) Efficient Language Models

b) Retrieval
- Sentence Embeddings using Siamese BERT-Networks

- Generalization through Memorization: Nearest Neighbor Language Models

- REALM: Retrieval-Augmented Language Model Pre-Training

Source: unsplash.com

195
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

● Cross-attention, single-tower models such as BERT have set state-of-the-art results on sentence-pair tasks such as STS.
● For sentence-retrieval tasks, a cross-attention model requires expensive re-encoding of the entire retrieval corpus.
● Sentence-BERT modifies the pretrained encoder to perform a single inference per input sentence, followed by cheap pairwise comparisons, e.g. cosine similarity.

196
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

[Figure: Cross-attentional (single tower): one Encoder over "[CLS] S_A [SEP] S_B [SEP]" produces the prediction. Dual-encoder (two tower): separate Encoders over "[CLS] S_A [SEP]" and "[CLS] S_B [SEP]" feed a regression layer that produces the prediction.]
197
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers et al., 2019
arxiv.org/abs/1908.10084

● Finding the most similar sentence in a collection of 10,000 sentences on a V100 GPU
○ BERT (cross-attention): 65 hours
○ SBERT (dual encoder): 5 seconds
● Can also be combined with Maximum Inner Product Search tools for sublinear scaling
○ https://github.com/google-research/google-research/tree/master/scann
○ https://github.com/facebookresearch/faiss
○ https://github.com/spotify/annoy

198
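A minimal sketch of the dual-encoder retrieval pattern: the corpus is encoded once offline, and each query only needs one encoder call plus cheap cosine similarities. The encode function below is a stand-in (random vectors) for a real sentence encoder such as SBERT.

import numpy as np

def encode(sentences, dim=384, seed=0):
    # Placeholder for a sentence encoder (e.g. a pooled transformer); returns [n, dim]
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(sentences), dim))

corpus = ["sentence %d" % i for i in range(10000)]
corpus_emb = encode(corpus)                                     # computed once, offline
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

query_emb = encode(["my query"])[0]
query_emb /= np.linalg.norm(query_emb)
scores = corpus_emb @ query_emb                                 # cosine similarities
top5 = np.argsort(-scores)[:5]                                  # indices of the most similar sentences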
Generalization through Memorization: Nearest Neighbor Language Models

Khandelwal et al., 2019
arxiv.org/abs/1911.00172

● Introduces kNN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a k-nearest neighbors (kNN) model.
● Allows for efficiently scaling up to larger training sets and for effective domain adaptation

199
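A minimal NumPy sketch of the kNN-LM interpolation: the next-token distribution is a mixture of the base LM distribution and a distribution built from the retrieved neighbors (a softmax over negative distances, aggregated per target token). The weight lam and the inputs here are illustrative.

import numpy as np

def knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, lam=0.25):
    # p_lm: [V] next-token distribution from the base LM
    # neighbor_tokens / neighbor_dists: target tokens and distances of the retrieved neighbors
    weights = np.exp(-np.asarray(neighbor_dists, dtype=np.float64))
    weights /= weights.sum()                         # softmax over negative distances
    p_knn = np.zeros_like(p_lm)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w                              # aggregate probability mass per target token
    return lam * p_knn + (1 - lam) * p_lm

vocab = 50000
p_lm = np.full(vocab, 1.0 / vocab)                   # a dummy base-LM distribution
p = knn_lm_interpolate(p_lm, neighbor_tokens=[42, 42, 7], neighbor_dists=[0.1, 0.3, 0.9])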
200
201
202
Generalization
through
Memorization:
Nearest
Neighbor
Language
Models

Khandelwal et al., 2019


arxiv.org/abs/1911.00172

203
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● Language model pre-training can capture world knowledge by storing it implicitly in the network parameters, but storage space is limited by the network size (prompting ever-larger networks).
● REALM introduces a latent knowledge retriever to augment the language model, and shows for the first time how to pretrain it in an unsupervised manner.

204
REALM:
Retrieval-Augmented
Language Model
Pre-Training

Guu et al., 2020


arxiv.org/abs/2002.08909

205
REALM: ● Fine-tuning for open-domain question answering

Retrieval-Augmented
Language Model
Pre-Training

Guu et al., 2020


arxiv.org/abs/2002.08909

206
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11b while being 30 times smaller)

207
REALM: Retrieval-Augmented Language Model Pre-Training

Guu et al., 2020
arxiv.org/abs/2002.08909

● State-of-the-art Open-QA, with a relatively small model size (e.g. REALM outperforms T5-11b while being 30 times smaller)

208
06

Scaling in Practice

209
Why Do We Need Scale?

210
Scale More Important Than Architecture

Kaplan et al., 2020: arXiv

211
Attention Size vs Model Size vs Test Loss

Kaplan et al., 2020: arXiv

212
Attention vs Fully Connected Time for Various Transformers

213
Conclusions From Measuring Scaling

● Performance increases further and further the more parameters a model has
● Attention is very important for efficiency: Transformers scale better than LSTMs
● Attention has diminishing returns (on general “internet data”)
● Size of data and model are more important than architecture

214
Practical Considerations

215
Experimental vs Theoretical Perspective

Theoretical:

● FLOPS/Operation Complexity/Memory: O(n) better than O(n^2); 100 FLOPS better than 1000
● (Possibly) Analysis of occupancy, memory access patterns for certain hardware

Experimental:

● Three criteria:
a. Does it fit into my GPU/TPU/Accelerator?
b. Is it faster than other methods?
c. Can most people use it (+62% of PhD students)?
● Device oriented walltime/memory: CPU for inference, GPU/TPU for training

216
Theory vs Practice

● Algorithm: (1) Divide matrix B into chunks of 128; (2) take the maximum element, set others to zero; (3) perform matrix multiply A*B=C and skip all zero elements

217
Theory vs Practice

Tan & Le, 2019: arXiv


218
GPU Architecture

Ampere Architecture (NVIDIA)

219
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for matrix multiplication vs convolution]

220
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for sparse matrix multiplication vs depthwise convolution]

221
Occupancy vs Memory Bandwidth vs FLOPS

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for a custom depthwise convolution implementation vs the standard depthwise convolution]

222
BERT Large vs BERT Base

BERT Base is 3.1x smaller than BERT Large but only trains 1.5x faster.

BERT Base is too small to saturate modern GPUs.

223
Better Performance at Lower Occupancy

[Chart: performance as a function of occupancy, FLOPS and memory bandwidth, for standard matrix multiplication vs matrix multiplication with lower occupancy and higher instruction-level parallelism]

Volkov, 2010
224
Conclusion

● Occupancy, and FLOPS/memory bandwidth utilization are important for runtime performance
● Understanding of hardware needed for performance analysis
● Even with deep understanding of hardware, it is difficult to analyze performance theoretically
● Runtime performance of different algorithms can often only be understood if they are run on
the actual device

● Conclusion: To estimate deep neural network runtime performance, it is best to run the network
and measure its performance directly.

225
Memory Optimizations

226
Resources: Academia vs Industry

227
Memory Optimizations Overview

● Memory Swapping/Memory Paging


● FP16/BF16 training
● Gradient checkpointing
● Gradient accumulation
● Reversible residual connections

228
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


229
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


230
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Layer Swap in to GPU

Pupipeddi et al., 2020: arXiv


231
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
232
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
233
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

CPU GPU

Active
Swap in to GPU Layer
Pupipeddi et al., 2020: arXiv
234
CPU<->GPU Memory Swapping / Paging

● Swap-out activations / weights to CPU once a layer is completed


● Swap-in activations /weights to GPU before a layer is started
● Exact timing of swap-in/swap-out depends on layers size and layer forward/backward time

● Benefits:
○ 60-80% memory reduction
○ Network usually not slower. If it is slower, swap-int layers earlier (less memory
reduction)
○ Faster training due to larger batch size for very large models

235
Mixed Precision Training (FP16+FP32) / BF16 training

Mixed Precision Training:

● Keep 32-bit master weights


● Do forward pass with 16-bit
● Scale 16-bit loss to prevent under/overflow
● Compute gradients
● Update 32-bit weights; copy 32-bit weights to 16-bit buffers

BrainFloat-16 Training:

● Range: FP16 ±65504; BF16 & FP32 ±3e38


● Cast everything to BF16
● Train normally (no under/overflow due to larger range)

Benefits:

● Faster training, depending on the network about 2x speedup
● Usually saves some memory, especially if your activations are large

Micikevicius et al., 2018: arXiv

236
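A minimal PyTorch sketch of mixed precision training with loss scaling via torch.cuda.amp, assuming a CUDA device is available; the toy model and loss are illustrative.

import torch

model = torch.nn.Linear(512, 512).cuda()          # assumes a CUDA device
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()              # maintains the loss-scale factor

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass runs in FP16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                 # scaled loss prevents FP16 gradient underflow
    scaler.step(optimizer)                        # unscales gradients, updates FP32 master weights
    scaler.update()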
Gradient Checkpointing: Forward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet

Active
Layer

Chen et al., 2016: arXiv


237
Gradient Checkpointing: Forward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet

Active
Layer

Chen et al., 2016: arXiv


238
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Missing gradients! Active


Recompute from last checkpoint Layer
with forward pass
Chen et al., 2016: arXiv
239
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Forward pass to compute Active


activation gradient Layer

Chen et al., 2016: arXiv


240
Gradient Checkpointing: Backward

● Do not store activation gradients in the forward pass


● Recompute activation gradients in the backward pass by restarting a forward pass from a
checkpoint node

Dropped Checkpointed Not Computed Yet Has Gradient

Active
Layer

Chen et al., 2016: arXiv


241
Gradient Checkpointing

Benefits:

● Trade computation to reduce memory footprint


● Best used for functions that are cheap to recompute but produce large activations (ReLU,
layer norm, softmax)
● Very beneficial for nonlinear activation functions
● Easy to use in PyTorch (torch.utils.checkpoint) and TensorFlow 2.0 (recompute_grad, nightly); a minimal PyTorch sketch follows this slide

246
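A minimal sketch with torch.utils.checkpoint, assuming a stack of arbitrary blocks; each block's intermediate activations are dropped after the forward pass and recomputed during backward:

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Only the block input is kept; everything inside `block`
            # is recomputed when gradients are needed.
            x = checkpoint(block, x)
        return x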
Reversible Residual Connections

● Split each layer's activations into two halves and couple them through the residual connections so that the
block becomes invertible:

[Diagram: forward computation of the reversible block and its inverse in the backward pass]

Benefits:

● Saves some memory for free (if your framework supports it, e.g. JAX)
● Usually, gradient checkpointing should be preferred:
○ Can save more memory due to being more general.
○ Easy to implement. Supported by major frameworks.

Gomez et al., 2017: arXiv


247
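A minimal sketch of one reversible block in the style of Gomez et al. (F and G are arbitrary sub-networks, an assumption here): the outputs are y1 = x1 + F(x2) and y2 = x2 + G(y1), and the inputs can be reconstructed exactly from the outputs, so they do not have to be stored.

import torch

class ReversibleBlock(torch.nn.Module):
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs during the backward pass.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2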
Gradient Accumulation

● Split larger mini-batches into “micro-batches”


● Do standard forward/backward passes with micro-batches, but do not update the weights right away (and
do not reset the gradient on the weights)
● Accumulate the gradient on the weights for all micro-batches
● Update the weights once enough micro-batches have been computed

Benefits / Tradeoffs:

● As long as your model runs with batch size 1, you can simulate any batch size (see the loop sketch after this slide)
● Easy to implement and can reduce memory footprint significantly
● Slow if the micro-batch size is very small
● Can improve data-parallel performance significantly (speedups), especially for very large models

248
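A minimal accumulation loop in PyTorch (model, optimizer, loss_fn and loader are assumed); the simulated batch size is the micro-batch size times accum_steps:

accum_steps = 8                                             # number of micro-batches per update

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps    # average over micro-batches
    loss.backward()                                         # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                    # one update per accum_steps micro-batches
        optimizer.zero_grad()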
Parallelism

249
Parallelism Overview

● Data parallelism
● Model parallelism
● Pipeline parallelism
● ZeRo parallelism optimizations
● 3D parallelism

250
Data Parallelism

Idea: Keep the same model parameters across multiple accelerators. Feed them different mini-batches and
average the gradient across accelerators.
[Diagram: forward and backward pass on two devices holding the same parameters; each device gets a different input mini-batch, and gradients are synchronized before the update]

Krizhevsky 2014 (arXiv)
251
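A minimal data-parallel sketch with PyTorch DistributedDataParallel (one process per GPU, launched e.g. with torchrun; MyModel, loader, loss_fn and optimizer are assumptions). DDP keeps identical parameters on every rank and all-reduces gradients during the backward pass:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])

for inputs, targets in loader:                   # each rank reads a different shard of the data
    inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
    loss = loss_fn(model(inputs), targets)
    loss.backward()                              # gradient all-reduce overlaps with backward
    optimizer.step()
    optimizer.zero_grad()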
Model Parallelism

Idea: Keep the same mini-batch across multiple accelerators; split each layer's parameters across all devices and
synchronize layer outputs after each layer.
[Diagram: forward and backward pass on two devices holding different slices of each layer's parameters; both devices process the same input mini-batch and synchronize layer outputs after every layer]

Krizhevsky 2014 (arXiv)
252
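A toy sketch of splitting a single linear layer's parameters over two GPUs (column-wise split; the class name is hypothetical): each device computes its slice of the output from the same input, and the concatenation plays the role of the "sync" in the diagram.

import torch

class TwoGPUColumnParallelLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        assert d_out % 2 == 0
        # Each device holds half of the weight columns.
        self.w0 = torch.nn.Parameter(torch.randn(d_in, d_out // 2, device="cuda:0"))
        self.w1 = torch.nn.Parameter(torch.randn(d_in, d_out // 2, device="cuda:1"))

    def forward(self, x):            # x is assumed to live on cuda:0
        y0 = x @ self.w0
        y1 = x.to("cuda:1") @ self.w1
        # "Sync": gather the partial outputs onto one device.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)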
Pipeline Parallelism

Idea: Split the network by depth into k pieces across k accelerators. Each accelerator holds 1/k-th of the layers. Use
micro-batches to overlap computation and communication (a naive two-stage sketch, without micro-batching, follows this slide).

Krizhevsky 2014 (arXiv); Harlap et al., 2018 (arXiv); Huang et al., 2018 (arXiv)
253
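A naive two-device model split as a starting point (no micro-batching, so each device idles while the other works); pipeline schedules such as GPipe or PipeDream add micro-batches on top of exactly this kind of split:

import torch

class TwoStagePipeline(torch.nn.Module):
    def __init__(self, stage0, stage1):
        super().__init__()
        # First half of the layers on GPU 0, second half on GPU 1.
        self.stage0 = stage0.to("cuda:0")
        self.stage1 = stage1.to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # activations cross the device boundary here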
ZeRO Parallelism Optimizations

Idea: Gradients, parameters, and optimizer state are only needed for the currently active layer. We distribute this state across all
GPUs and gather it when needed (when a layer becomes “active”).

Rajbhandari et al., 2020 (arXiv)


254
3D Parallelism

[Diagram: 3D parallelism combines data parallelism, model parallelism, and pipeline parallelism with ZeRO]

Microsoft Blog Post


(paper coming soon?) 255
Why 3D Parallelism?

● Model parallelism is inefficient if the batch size is too large: the synchronized layer outputs grow with the batch size, and this communication cannot be overlapped with computation.
● Data parallelism is inefficient if the per-device batch size is too small.
● Pipeline parallelism decreases the per-device batch size through micro-batches (which helps model parallelism).
● Pipeline parallelism increases the effective mini-batch size through aggregation of micro-batches (which helps data parallelism).
● Pipeline parallelism allows for simple overlap of communication and computation.

Microsoft Blog Post


(paper coming soon?) 256
Efficiency Optimizations

257
Larger Batch Size

● GPUs are more efficient if fully utilized. That usually only happens if the batch size is large
● GPUs run better if the mini-batch dimension is 32 or larger
● Often you can achieve faster training overall by using a memory-efficiency technique which slows
down each step but enables training with a larger batch size
● Larger batch sizes enable larger learning rates. While each step is slower, overall training might be
faster.

258
Fused Kernels

● Adam with 10^9 parameters:


○ 14 tensor-sized reads/writes per update
○ 10^9 parameters at 32 bits = 4 GB per tensor
○ Normal Adam on a GPU with 600 GB/s: 14 × 4 GB / 600 GB/s ≈ 100 ms
○ Fused Adam: 6 ms
259
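The estimate above as a back-of-the-envelope calculation, using the numbers on the slide (14 tensor-sized memory passes of 4 GB each on a 600 GB/s GPU):

params = 1e9
bytes_per_tensor = params * 4        # 32-bit state: 4 GB per tensor
memory_passes = 14                   # reads/writes of unfused Adam
bandwidth = 600e9                    # bytes per second

seconds = memory_passes * bytes_per_tensor / bandwidth
print(f"{seconds * 1000:.0f} ms")    # ~93 ms, i.e. roughly 100 ms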
Mixture of Experts

260
Mixture of Experts: Overview

Shazeer et al., 2017: arXiv

Lepikhin et al., 2020: arXiv

261
Transformers Mini-batch Time

262
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017):

Initialize W_g and W_noise with zeros, so outputs are driven by standard normal noise. This
guarantees balancing across experts at the start of training.

The noise also helps to decrease the early advantage of previously picked experts.

263
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017):

An additional balancing ("importance") loss assigns a high loss when some experts receive much more gate probability than others. This
prevents failure cases where one expert is always picked with 100% probability.

Coefficient of variation: CV(X) = std(X)/mean(X)

264
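A sketch of this importance loss, assuming gate_probs is the [batch, num_experts] output of the softmax gate; the per-expert importance is the gate probability summed over the batch, and the loss is the squared coefficient of variation times a weight w (0.1 here, an assumption):

import torch

def importance_loss(gate_probs, w=0.1, eps=1e-10):
    importance = gate_probs.sum(dim=0)                    # total gate probability per expert
    cv = importance.std() / (importance.mean() + eps)     # CV(X) = std(X) / mean(X)
    return w * cv ** 2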
Mixture of Experts: Balancing and Specialization v1

Version 1 (Shazeer et al., 2017)

Importance loss can be satisfied by picking a subset of experts. To prevent this degeneration we want
to pick all experts with roughly the same probability over time.

If we view the softplus term as something analogous to a standard deviation and the mean softmax
as the expected value, we can express an approximate probability for this with a CDF of the normal
distribution.

265
Mixture of Experts: Balancing and Specialization v2

Version 2 (Lepikhin et al., 2020):

No noise. Initialize layers normally. Keep track of c_e, the number of times expert e was picked for the
sequence of S tokens. With the mean gate probability m_e of each expert we can now define a balancing auxiliary loss:

L_aux = k · (1/E) · Σ_e (c_e / S) · m_e

where k is a constant loss weight (a good value is 0.1; usually between 0.01 and 1.0) and E is the number of experts.

266
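A sketch of this auxiliary loss, assuming gate_probs is the [S, num_experts] softmax gate output for one sequence of S tokens and top-1 routing; c_e is counted from the argmax and m_e is the mean gate probability:

import torch

def balancing_aux_loss(gate_probs, k=0.1):
    S, E = gate_probs.shape
    top1 = gate_probs.argmax(dim=-1)                           # expert picked for each token
    frac = torch.bincount(top1, minlength=E).float() / S       # c_e / S
    m_e = gate_probs.mean(dim=0)                               # mean gate probability per expert
    return k * (frac * m_e).mean()                             # k * (1/E) * sum_e (c_e/S) * m_e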
Mixture of Experts: Balancing and Specialization v2

Version 2 (Lepikhin et al., 2020):

● Random dispatch: use the 2nd expert with probability proportional to its softmax gate probability.
● Have a frequency cutoff (a token budget) for each expert. If this budget is exceeded, the
expert output degenerates to a zero matrix for the overflowing tokens. This effectively reduces the output of the MoE layer to zero,
and thus only the residual connection output around the MoE layer is fed to the next layer.

267
Mixture of Experts: Balancing and Specialization

Many cases of expert degeneration:


1. Overbalancing: All experts are approximately equally used. However, gate probability
approaches 1/#Experts. No expert is better than another expert.

2. Underbalancing: The same top-k experts are used for every token. This leads to two strong
experts, but all other experts do not learn anything and are “wasted capacity”.

3. Sequence-level degeneration: Model balances experts by using each expert for a particular
sequence index. For example, for indices 0, 1, 2, 3 always experts E3, E1, E2, E0. This leads to
sequence experts, but not content experts.

268
Mixture of Experts: Benefits

● Works well on diverse data like multilingual machine translation
● Can be difficult to train due to balancing/specialization issues
● Only faster than transformers if you can run it with a large enough batch size to saturate distributed experts
● If you scale the model across a cluster, you will need excellent interconnect performance (TPU v4 Pod, NVIDIA SuperPod)

Shazeer et al., 2017: arXiv


Lepikhin et al., 2020: arXiv
269
07

Closing Notes

270
Why we should strive for efficiency

Our field has seen a dramatic increase in scale in the past 2 years.

Striving for efficiency means caring about:

1) Costs

2) Accessibility

3) Production needs

4) The sustainability of this growth

271
Closing Notes

In this tutorial, we covered a wide range of ideas, applications, and practical considerations that help us
build more efficient systems, including:

1) Core efficiency techniques

2) Efficiency improvements to attention mechanisms

3) Case studies of efficient models

4) Practical considerations for scaling models

272
We hope you
enjoyed it and
learned something
new!

273
Thank you!
Slides available at: bit.ly/2SmhKY7

274
